<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Tensor Relational Algebra for Distributed Machine Learning System Design</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10280658</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the VLDB Endowment</title>
<idno type="ISSN">2150-8097</idno>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue">8</biblScope>					

					<author>B. Yuan</author><author>D. Jankov</author><author>J. Zou</author><author>Y. Tang</author><author>D. Bourgeois</author><author>C. Jermaine</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This operational approach has certain drawbacks, namely that the system has limited latitude for automatic optimization. The programmer is responsible for making sure that the operations can run successfully using available resources (such as the amount of RAM on each GPU), and if the operations cannot run successfully, the programmer must figure out how to break the operations up into smaller pieces that can run. Tasks such as getting a computation to run on multiple GPUs, or on multiple machines in a distributed cluster so as to minimize communication, are left to the programmer.</p><p>Toward declarative tensor programming. There has been work on programming models that are more declarative. PyTorch and TensorFlow now both support variants of Einstein notation, a classical notation for specifying operations over tensors, and work on optimizing such computations has made its way into commercial systems <ref type="bibr">[8]</ref>. Researchers have proposed variants of the Ricci calculus as a specification language for ML <ref type="bibr">[32]</ref>. There have been other proposals for declarative tensor programming languages that allow for automatic generation of compute kernels that can be optimized to handle the data at hand, such as Tensor Comprehensions <ref type="bibr">[43]</ref>.</p><p>Nevertheless, while there has been attention paid to the question of how to specify ML computations declaratively, there has been little attention paid to the question of what the correct implementation abstraction for ML system design should be. That is, what target should an ML system back-end present to the front-end?<ref type="foot">foot_0</ref> There are several requirements for such a back-end interface. It should be able to express most/all important ML or numerical computations. It should be hardware agnostic, but computations specified using the interface should be easily scaled to multiple ASICs and multiple machines. 
It should allow for computations over very large input data. It should facilitate easy, automatic optimization. And it should provide for execution times that are as fast as a carefully-designed, "hand-built" computation on top of the very best tools such as ScaLAPACK, Dask, TensorFlow, or PyTorch.</p><p>The tensor relational algebra. In this paper, we argue that the implementation abstraction that should be offered by an ML system back-end is the tensor relational algebra, or TRA for short. The TRA is a simple variant of the classical relational algebra (RA), which serves as the standard implementation abstraction for modern relational database system design. The key difference is that in the TRA, the relations over which computations are performed are always binary relations between &#119896;-dimensional vectors (keys) and &#119903;-dimensional tensors.</p><p>Of course, it is widely understood that a &#119896;-dimensional tensor can be stored as a binary relation between &#119896;-dimensional keys and real number values, e.g., <ref type="bibr">[25]</ref>. Thus, why not use classical RA as the implementation abstraction? There is good evidence that a compute engine based upon such an abstraction will perform very poorly over the dense-tensor computations common in deep learning <ref type="bibr">[35]</ref>: the overhead associated with storing each entry in a high-dimensional tensor as a separate tuple and moving billions of tuples through a system can result in poor performance, especially when the competition is a high-performance CPU or GPU kernel. This is why the TRA specifically allows for tensors or "chunks" of a larger tensor to be stored in each tuple: the fixed, per-tuple cost is incurred by a much smaller number of tuples.</p><p>We argue that the TRA has three key advantages as an implementation abstraction: expressivity, easy optimization, and high performance. 
The joins and aggregations offered by the TRA can implement the indexed operations required by the Einstein notation and related specification languages. It is easy to implement and optimize relational operations in a distributed environment, as the decades of success enjoyed by relational database systems have demonstrated. And by design, an ML system back-end implementing the TRA is able to run classical HPC algorithms which rely on decomposing matrices into chunks, and it can run them almost as well as special-purpose HPC software such as ScaLAPACK.</p><p>Our Contributions. We propose the TRA as well as an implementation algebra (IA) that is easy to implement in a distributed system. We consider how computations in the TRA can be transformed into computations in the IA, and propose a set of transformations or equivalences that allow re-writes of computations in the IA. Finally, we implement a prototype of the IA, and show that it enables efficient, distributed implementations of ML computations, which can match or even significantly outperform existing HPC and ML systems including ScaLAPACK, Dask, TensorFlow, and PyTorch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">TRA OVERVIEW</head><head n="2.1">Motivation</head><p>There has been significant interest in tensor manipulation languages, which allow for declarative specification of ML and numerical computations. The simplest of these is the Einstein notation, which provides a way to write a tensor computation of the form:</p><p>C[&#119894;, &#119895;] = &#8721;&#119896; A[&#119894;, &#119896;] &#183; B[&#119896;, &#119895;].</p><p>This example describes matrix multiplication. The value in the &#119894;-th row and &#119895;-th column of the output is the dot product of the &#119894;-th row of input matrix A and the &#119895;-th column of input matrix B. Different proposals have different variations on this idea <ref type="bibr">[8,</ref><ref type="bibr">32,</ref><ref type="bibr">43]</ref>, but the common features are that (1) values from different tensors are fed into a scalar function (such as the multiplication above) by matching indices, and (2) some of the matched dimensions are summed over. Languages such as the Einstein notation provide an excellent, declarative interface for programming an ML system, much as SQL provides the interface for relational systems. But the question remains: what is the correct implementation abstraction for ML systems? That is, what is the interface that the back-end should export, which can be targeted by the ML programming system compiler and auto-differentiation engine that make up the ML system's front-end?</p></div>
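The Einstein-notation matrix multiply above can be checked concretely with NumPy's einsum (a minimal illustration, not part of the paper's system); the subscript string encodes exactly the two features just listed: indices shared between inputs are matched, and indices absent from the output are summed over.

```python
import numpy as np

# Einstein notation for matrix multiply: C[i, j] = sum_k A[i, k] * B[k, j].
# The shared index k is matched between the two inputs, and because k does
# not appear in the output spec "ij", it is summed over.
A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)

C = np.einsum("ik,kj->ij", A, B)

assert np.allclose(C, A @ B)   # identical to an ordinary matrix multiply
```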
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">TRA: The Basics</head><p>We propose the TRA as this ML implementation abstraction. The TRA operates over tensor relations containing pairs of the form:</p><p>(key, array).</p><p>Conceptually, these tensor relations store sets of arrays. Each key value serves, not surprisingly, as the key for the pair.</p><p>Tensors are decomposed into sets of sub-tensors to represent them as tensor relations. For example, consider a matrix A decomposed into four square blocks A00, A01, A10, and A11; we may store this as a tensor relation</p><p>R&#119860; = {(&#10216;0, 0&#10217;, A00), (&#10216;0, 1&#10217;, A01), (&#10216;1, 0&#10217;, A10), (&#10216;1, 1&#10217;, A11)}.</p><p>The TRA offers a similar set of operations to the RA: joins, aggregations, selections. It makes an excellent compilation target for tensor manipulation languages like the Einstein notation for several reasons. First, these languages typically match elements in tensors based on shared index values (which can be implemented as relational joins) and then sum out various dimensions (implemented as aggregations). The problem with using the RA as an implementation abstraction for tensors, where tensors are stored relationally as keys identifying a single non-zero value in a tensor, is performance. Tensors can have millions or billions of entries, and it seems impossible to build a back-end with acceptable performance if it has to process one tuple per non-zero value.</p><p>Thus, the TRA operates over sub-tensors or "chunks", and expects the ML system front-end to supply a kernel function (typically a high-performance CPU or GPU operation) to operate over the sub-tensors themselves. This does complicate the TRA compared to the RA, but as we show experimentally, this modification makes it possible for a TRA-based back-end to significantly outperform the back-ends offered by TensorFlow and PyTorch.</p></div>
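The chunked (key, array) representation can be sketched in a few lines (an illustrative sketch; `to_tensor_relation` is a hypothetical helper, not the paper's API): a matrix becomes a set of pairs mapping multi-dimensional keys to dense blocks.

```python
import numpy as np

def to_tensor_relation(M, block):
    """Decompose matrix M into a tensor relation, represented here as a
    dict mapping 2-dimensional keys to dense 'chunk' sub-arrays."""
    rows, cols = M.shape
    assert rows % block == 0 and cols % block == 0
    return {(i // block, j // block): M[i:i + block, j:j + block]
            for i in range(0, rows, block)
            for j in range(0, cols, block)}

A = np.arange(16, dtype=float).reshape(4, 4)
R_A = to_tensor_relation(A, 2)            # four tuples, one per 2x2 chunk
assert set(R_A) == {(0, 0), (0, 1), (1, 0), (1, 1)}
```

Note the tuple count: a 4 × 4 matrix stored one-entry-per-tuple would need 16 tuples, but only 4 chunked tuples, which is the per-tuple-overhead argument made above.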
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">TRA: the Implementation Algebra</head><p>One of the key advantages of the TRA is that like the RA, there are a number of re-writes, inherited from the RA (such as push-down of selections) that can be used to optimize TRA computations.</p><p>However, when designing an ML system back-end, one of our key goals is to distribute ML computations across multiple ASICs or multiple machines. Such distributed computations have many different implementations. Consider the matrix multiply required to push a batch of feature vectors (represented as one matrix) through the links into a fully connected layer in a neural network (the weights are represented as a second matrix). This could be implemented by decomposing the feature matrix into sub-matrices at each site (each containing a subset of the feature vectors) and broadcasting the weight matrix to all sites. This is the common "data parallel" implementation, in ML system parlance. Or, one could fully decompose both matrices and apply a complex distributed algorithm, such as the 3D algorithm <ref type="bibr">[9]</ref>. This would be a combined data and "model parallel" implementation in ML system parlance. Crucially, the TRA cannot express the distinctions among such implementations.</p><p>As such, we also propose an implementation algebra (IA) that can express these different distributed computations, as well as a simple compilation strategy from the TRA to the IA. Further, we propose an extensive set of re-writes for the IA that allow such disparate implementations to be reached from an initial TRA computation, and a simple cost model. Given this, an ML system back-end would export a simple, TRA-based interface which says nothing about distribution to multiple machines or ASICs. A TRA computation can then be compiled into a computation in the IA, and optimized into a high-performance, distributed computation.</p></div>
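The "data parallel" implementation described above can be sketched as follows (a simulation under assumed names, not the paper's IA): the feature matrix is partitioned row-wise across sites, the weight matrix is broadcast to every site, and each site multiplies locally.

```python
import numpy as np

def data_parallel_matmul(X, W, num_sites):
    """Data-parallel matrix multiply: partition the feature matrix X by
    rows across sites, broadcast the weight matrix W to every site,
    multiply locally, then concatenate the per-site results."""
    shards = np.array_split(X, num_sites, axis=0)   # one shard per site
    local = [shard @ W for shard in shards]         # W is available everywhere
    return np.vstack(local)

X = np.random.rand(8, 3)    # batch of 8 feature vectors
W = np.random.rand(3, 5)    # weights of a fully connected layer
assert np.allclose(data_parallel_matmul(X, W, num_sites=4), X @ W)
```

A model-parallel or 3D scheme would instead also decompose W; the point of the IA is that both choices are expressible and can be compared by a cost model.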
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">TENSOR RELATIONAL ALGEBRA</head><p>The TRA operates over (key, array) pairs. To define an array type, first define an ordering on index vectors: for two vectors u and v of the same arity, u &#8804; v holds if u is less than or equal to v in every position; define u &lt; v similarly. Informally, we say that an array of rank &#119903; is bounded by vector b if the array is &#119903;-dimensional, and for any &#119903;-dimensional index vector i that is less than the bound, array[i] returns a real number. Formally:</p><p>(1) For any index i &#8712; (Z*)^&#119903; with 0 &#8804; i &lt; b, the array maps i to a real number.</p><p>Subsequently, we denote the set of all arrays of rank &#119903; and bound b as &#119879;(&#119903;,b). Thus, &#119879;(&#119903;,b) defines an array type.</p><p>We denote the power set of (Z*)^&#119896; &#215; &#119879;(&#119903;,b) as &#119877;(&#119896;,&#119903;,b); this is the set of all possible tensor relations with &#119896;-dimensional keys, storing arrays of type &#119879;(&#119903;,b).</p></div>
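The typing discipline above can be phrased as a small validity check (an illustrative sketch; `well_typed` is a hypothetical name): a relation belongs to R(k, r, b) when every key is a k-vector of non-negative integers and every array has rank r and bound exactly b.

```python
import numpy as np

def well_typed(R, k, r, b):
    """Check membership in R(k, r, b): keys are k-dimensional non-negative
    integer vectors; every array has rank r and shape (bound) exactly b."""
    for key, arr in R.items():
        if len(key) != k or any((not isinstance(x, int)) or x < 0 for x in key):
            return False                       # key outside (Z*)^k
        if arr.ndim != r or arr.shape != tuple(b):
            return False                       # array not of type T(r, b)
    return True

R = {(0, 0): np.zeros((2, 2)), (0, 1): np.ones((2, 2))}
assert well_typed(R, k=2, r=2, b=(2, 2))
assert not well_typed(R, k=2, r=2, b=(3, 3))   # wrong bound
```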
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Operations in Tensor Relational Algebra</head><p>Given this, the TRA is essentially a set of higher-order functions over tensor relations. That is, each operation takes as input a kernel function defined over multi-dimensional arrays (in practice, this function is likely to be an array-based MKL, CUDA, or Verilog kernel) and returns a function over tensor relations.</p><p>We begin by giving an overview of the higher-order functions taking binary functions as input: aggregation (denoted using &#931;) and join (denoted using &#8904;).</p><p>(1) Aggregation is a function: &#931;(groupByKeys, aggOp) takes a list of key dimensions to group on and a kernel function, and returns a function over tensor relations that groups all pairs sharing the same key values on the dimensions in groupByKeys and reduces the arrays in each group using aggOp. For example, the call &#931;(&#10216;1&#10217;, matAdd) constructs an aggregation function that groups all pairs having the same value for the key in position 1, and sums them. Or we could sum all of the individual arrays into a single array using &#931;(&#10216;&#10217;, matAdd), which aggregates the entire relation into a single pair with an empty key.</p><p>(2) Join is a function: &#8904;(joinKeysL, joinKeysR, projOp) takes as input a set of key dimensions to join on from the left and from the right, as well as an operation to run over all (leftArray, rightArray) pairs that are created during the join, and returns a function that performs the join and applies projOp to all pairs. Similar to a natural join in classical database systems, the output key is all of the key values from the left input, with all of the key values from the right input appended to them, subject to the constraint that the joined-upon values from joinKeysR are not repeated a second time.</p><p>With join and aggregation we may implement matrix multiply over two matrices stored as tensor relations. Imagine that we want to implement A &#215; A for the matrix A defined previously, where A is stored as a tensor relation R&#119860;. 
This can be written as:</p><p>&#931;(&#10216;0, 2&#10217;, matAdd)(&#8904;(&#10216;1&#10217;, &#10216;0&#10217;, matMul)(R&#119860;, R&#119860;)).</p><p>This computes a matrix multiply of the matrix A because all of the pairs in R&#119860; are first joined on key index 1 from the first instance of R&#119860; equaling key index 0 from the second instance of R&#119860;. Each pair of arrays is then multiplied using the kernel matMul. For example, the input pairs with keys &#10216;0, 1&#10217; and &#10216;1, 0&#10217; join to produce an output pair with key &#10216;0, 1, 0&#10217;.</p><p>The index &#10216;0, 1, 0&#10217; in this output pair is a combination of &#10216;0, 1&#10217; and &#10216;1, 0&#10217; from the two input pairs, with the redundant index entry dropped (redundant because we know that the entries in positions 1 and 0, respectively, are repeated due to the join). Next, the arrays are aggregated using matAdd, summing out index 1 (keeping indices &#10216;0, 2&#10217; as groupByKeys), to complete the matrix multiply.</p><p>In contrast to join and aggregation, rekey, filter and transform are higher-order functions taking a unary function as input.</p><p>(3) ReKey allows manipulation of keys: ReKey(keyFunc) applies keyFunc to every key in the relation and generates a new key.</p><p>(4) Filter is a function: &#120590;(boolFunc) returns a function that accepts a tensor relation and filters each of the tuples in the tensor relation by applying boolFunc to the keys in the tuples.</p><p>(5) Transform is a function: &#120582;(transformFunc) returns a function that accepts a tensor relation and applies the kernel function transformFunc to the array in each tuple from the tensor relation.</p><p>For an example of the rekey, filter and transform operations, assume we have a kernel operation diag that diagonalizes a matrix block, a function isEq(&#10216;k0, k1&#10217;) &#8614; (k0 = k1) that accepts a key and returns true if entries in position 0 and 1 in the key are identical to one another, and a function getKey0(&#10216;k0, k1&#10217;) &#8614; &#10216;k0&#10217; that returns the first dimension of a key. We can use these functions along with filter, rekey, and transform to diagonalize a matrix A represented as a tensor relation R&#119860;, by first examining the keys to remove all pairs that do not contain entries along the diagonal, and then diagonalizing the resulting arrays:</p><p>&#120582;(diag)(ReKey(getKey0)(&#120590;(isEq)(R&#119860;))).</p><p>In addition, there are a number of operations that can be used to alter the organization of arrays within a tensor relation. This allows the manipulation of how a tensor is represented as a tensor relation. For this purpose, we have tile and concat:</p><p>(6) Tile(tileDim, tileSize) returns a function that decomposes (or tiles) each array along dimension tileDim into arrays of the target tileSize (by applying the arrayTileOp function on the array). As a result, a new key dimension is created that effectively counts which tile the tuple holds along the tiling dimension. For example, consider the matrix B,</p><p>B = [1 2 5 6 9 10 13 14; 3 4 7 8 11 12 15 16],</p><p>partitioned by columns and stored in the tensor relation</p><p>R&#119861; = {(&#10216;0, 0&#10217;, [1 2 5 6; 3 4 7 8]), (&#10216;0, 1&#10217;, [9 10 13 14; 11 12 15 16])}.</p><p>If we make the call Tile(1, 2)(R&#119861;), we will decompose each array along dimension 1, creating one new array for each two columns. In addition, a new key dimension is created that effectively counts which tile the pair holds along the tiling dimension, yielding four pairs, one per 2 &#215; 2 tile; for example, the tiles [1 2; 3 4] and [5 6; 7 8] now reside in separate pairs, distinguished by the new key dimension.</p><p>We may sometimes find ourselves in a situation where it is necessary to manipulate the key in each pair in a tensor relation so that the key is consistent with the desired interpretation. 
For example, the tensor relation R&#119861; defined above can represent a matrix with eight columns and two rows, so Tile(1, 2)(R&#119861;) is inconsistent with this, logically representing a matrix having four columns and four rows.</p><p>For this purpose, we can leverage the ReKey operator as defined before. For example, we can rekey the output of Tile(1, 2)(R&#119861;) so that, logically, it corresponds to a two-by-eight matrix.</p><p>Finally, we have the ability to undo a tiling.</p><p>(<ref type="formula">7</ref>) Concat(keyDim, arrayDim) is an inverse to tile, which first groups all pairs in the relation using all of the key dimensions other than keyDim, then concatenates all of the arrays in each group along arrayDim, with the concatenation ordering provided by keyDim.</p><p>A call to Concat(1, 1)(Tile(1, 2)(R&#119861;)) first groups all pairs in Tile(1, 2)(R&#119861;) using all of the key dimensions other than key dimension 1, and then concatenates the arrays in each group along array dimension 1, with the ordering provided by key dimension 1. Hence, this computation simply results in the recovery of R&#119861;.</p></div>
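The join-then-aggregate matrix multiply of Section 3.1 can be sketched over a dict-of-blocks representation (an illustrative sketch; `join` and `aggregate` are simplified stand-ins for the TRA operators, with NumPy calls as the kernels):

```python
import numpy as np

def join(R_l, R_r, join_l, join_r, proj_op):
    """TRA-style join on one key dimension per side: match tuples where the
    left key at position join_l equals the right key at position join_r.
    The output key is the left key with the right key appended, minus the
    redundant joined-upon right dimension."""
    out = {}
    for kl, al in R_l.items():
        for kr, ar in R_r.items():
            if kl[join_l] == kr[join_r]:
                key = kl + tuple(x for i, x in enumerate(kr) if i != join_r)
                out[key] = proj_op(al, ar)
    return out

def aggregate(R, group_by, agg_op):
    """TRA-style aggregation: group on the listed key dimensions and
    reduce each group's arrays with agg_op."""
    out = {}
    for k, a in R.items():
        g = tuple(k[i] for i in group_by)
        out[g] = agg_op(out[g], a) if g in out else a
    return out

A = np.arange(16, dtype=float).reshape(4, 4)
R_A = {(i, j): A[2*i:2*i+2, 2*j:2*j+2] for i in range(2) for j in range(2)}

# A x A: join left key 1 with right key 0, multiply chunks with matMul,
# then sum out the middle index by grouping on key dimensions <0, 2>.
R_C = aggregate(join(R_A, R_A, 1, 0, np.matmul), (0, 2), np.add)

C = np.block([[R_C[(0, 0)], R_C[(0, 1)]], [R_C[(1, 0)], R_C[(1, 1)]]])
assert np.allclose(C, A @ A)
```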
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Integrity Constraints and Closedness</head><p>There are two important integrity constraints that each tensor relation must follow: uniqueness of keys, and a lack of "holes" in the tensor relation. The primary reason for defining these constraints is to facilitate easy, cost-based optimization. With such constraints, cardinality estimation, one of the most vexing problems in relational optimization, goes away (see Section 5.3). Further, neither is particularly burdensome when expressing computations using the TRA. In fact, if the interpretation of a tensor relation of type &#119877;(&#119896;,&#119903;,b) is that it represents an &#119903;-dimensional tensor decomposed into chunks, these constraints are quite natural:</p><p>&#8226; Uniqueness: every key should be unique in a tensor relation.</p><p>&#8226; Continuity: there are no "holes". Given a tensor relation R of key-arity &#119896;, define the frontier of R as Front(R). f = Front(R) is a &#119896;-dimensional vector that bounds all keys in R. That is, for each key vector k in R, k &lt; f. Further, the frontier is the "smallest" vector bounding R, in that for any other vector f&#8242; bounding R, f &#8804; f&#8242;. Continuity requires that for every vector k &lt; f, some tuple in R have the key k.</p><p>It is easy to show that for the majority of TRA operations (the exceptions being the rekey and filter operations) tensor relations are closed. That is, if the input(s) are tensor relation(s) that obey uniqueness and continuity, then the output must be a tensor relation that obeys these constraints. Filtering a tensor relation or altering the keys can obviously violate the constraints: the former can lead to holes in the resulting relation, and the latter can result in repeated key values. 
Analyzing a TRA expression to automatically detect whether it can violate these constraints is left as future work; we conjecture that if the filtering predicate (or rekeying computation) is limited to simple arithmetic expressions, it may be possible to check for closedness using an SMT solver <ref type="bibr">[19]</ref>.</p></div>
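The frontier and both constraints can be checked mechanically for a materialized relation; a minimal sketch (illustrative helper names):

```python
import numpy as np
from itertools import product

def frontier(keys):
    """Front(R): the componentwise-least vector f with k < f for every key k."""
    k_arity = len(keys[0])
    return tuple(max(k[d] for k in keys) + 1 for d in range(k_arity))

def check_constraints(pairs):
    """Check Uniqueness and Continuity for a tensor relation given as a
    list of (key, array) tuples; returns (unique, frontier, continuous)."""
    keys = [k for k, _ in pairs]
    unique = len(keys) == len(set(keys))
    f = frontier(keys)
    key_set = set(keys)
    # Continuity: every key vector below the frontier must be present.
    continuous = all(k in key_set for k in product(*map(range, f)))
    return unique, f, continuous

R = [((i, j), np.zeros((2, 2))) for i in range(2) for j in range(3)]
assert check_constraints(R) == (True, (2, 3), True)

R_holey = [p for p in R if p[0] != (1, 1)]   # remove one tuple: a "hole"
assert check_constraints(R_holey)[2] is False
```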
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">IMPLEMENTATION ALGEBRA</head><p>We now describe the TRA's implementation algebra (IA), which is suitable for execution in a parallel/distributed environment.</p><p>In the IA, we extend each (key, array) tuple in a tensor relation with an additional site attribute, so that a physical tensor relation R will consist of triples:</p><p>(key, array, site).</p><p>The site attribute takes a value in {1, 2, ..., &#119904;} where &#119904; is the number of computation sites. Conceptually, the site value indicates the location where the tuple is stored; this could be a machine in a distributed cluster, or a compute unit like a GPU.</p><p>Each physical tensor relation can map a particular (key, array) pair to one or more sites. There are a few especially important mappings, recognized by the predicates All() and Part&#119863;():</p><p>(1) If All(R) = true, it indicates that if we project away the array attribute, the resulting set will take the value {(k, site) : k &lt; Front(R), site &#8712; {1, ..., &#119904;}}, where Front(R) is the frontier of R (the frontier of a physical tensor relation is defined as in a "regular" tensor relation). In other words, this means that each possible (key, array) tuple in R appears at all sites.</p><p>(2) If Part&#119863;(R) = true for some set &#119863; &#8838; {1...&#119896;}, it indicates that: (i) for a given key value, there is only one tuple in R, and (ii) two tuples with the same key values for all dimensions in &#119863; must be found on the same site. In other words, R is partitioned according to the key dimensions in the set &#119863;.</p><p>We are ready to describe the IA. 
Let R(&#119896;,&#119903;,b,&#119904;) specify the set of all valid physical tensor relations with key-arity &#119896;, storing arrays of type &#119879;(&#119903;,b), and partitioned across &#119904; sites.</p><p>The first two operations are concerned with manipulating the assignment of tuples in a physical relation to sites, while the latter four operations operate over the key and array attributes.</p><p>(1) Broadcast is defined as Bcast : R(&#119896;,&#119903;,b,&#119904;) &#8594; R(&#119896;,&#119903;,b,&#119904;). Given a physical tensor relation, Bcast simply ensures that each tuple takes each site value, so that (i) the set of (key, array) pairs is unchanged after Bcast, but (ii) in any physical relation R output from a broadcast, All(R) = true.</p><p>(2) Shuffle is defined similarly: Shuf(partDims) is a function that accepts a set of key dimensions, and returns a function that accepts a physical tensor relation and repartitions it, so that (i) the set of (key, array) pairs is unchanged after Shuf, but (ii) in any physical relation R output from a shuffle, Part partDims(R) = true.</p><p>(3) Local join is an extension of the TRA's join operation. Similar to the TRA join (&#8904;), &#8904;&#119871;(joinKeysL, joinKeysR, projOp) takes as input a set of key dimensions to join on from the left and from the right, as well as a kernel operation to run over all (leftArray, rightArray) pairs that are created during the join. The key difference is that the local join combines only pairs from the left and right inputs that have the same site value. If two tuples successfully join, the corresponding output tuple will have the same site value as those input tuples.</p><p>(4) Local aggregation is an extension of TRA aggregation. Like TRA aggregation (&#931;), &#931;&#119871;(groupByKeys, aggOp) takes as input a list of key dimensions groupByKeys to aggregate over, as well as a kernel function aggOp. 
However, it returns a function that takes as input a physical tensor relation, groups the arrays in the relation based upon the indicated key values and the site value, and applies aggOp to the arrays in each group. Each output tuple in the resulting physical tensor relation will take its site value from the site value of the set of input tuples that were aggregated to produce it.</p><p>(5) IA has a filter, &#120590;&#119871;, analogous to the TRA's filter; the only difference is that each accepted input tuple's site value is carried through the filter.</p><p>(<ref type="formula">6</ref>) Map provides two functionalities: &#120582;&#119871;(keyMapFunc, arrayMapFunc) is a multi-map. It returns a function that applies keyMapFunc to each key value in the input and applies arrayMapFunc to each array value in the input. Both keyMapFunc and arrayMapFunc return &#119898; output values per input tuple; the site value is simply copied from input to output. We subsequently call &#119898; the arity of keyMapFunc/arrayMapFunc. In most cases the arity of these functions will be one, but in some cases (such as replication-based matrix multiply, see Section 4.2.2), the arity will be greater.</p></div>
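The site-aware operations can be simulated on one machine to make their semantics concrete (an illustrative simulation, not the paper's runtime; the hash-based placement in `shuffle` is one assumed policy):

```python
import numpy as np

# A physical tensor relation is modeled as a list of (key, array, site) triples.

def bcast(R, num_sites):
    """Bcast: every (key, array) pair appears at every site."""
    pairs = {k: a for k, a, _ in R}          # keys are unique by constraint
    return [(k, a, s) for k, a in pairs.items() for s in range(num_sites)]

def shuffle(R, part_dims, num_sites):
    """Shuf(partDims): repartition so that tuples agreeing on the listed
    key dimensions land on the same site (here, via hashing)."""
    return [(k, a, hash(tuple(k[d] for d in part_dims)) % num_sites)
            for k, a, _ in R]

def local_join(R_l, R_r, jl, jr, proj_op):
    """Local join: combine only tuple pairs that share a site value."""
    out = []
    for kl, al, sl in R_l:
        for kr, ar, sr in R_r:
            if sl == sr and kl[jl] == kr[jr]:
                key = kl + tuple(x for i, x in enumerate(kr) if i != jr)
                out.append((key, proj_op(al, ar), sl))
    return out

# Broadcast-based multiply of a 1x2 by a 2x1 block matrix on 2 sites:
R_X = shuffle([((0, j), np.full((2, 2), j + 1.0), 0) for j in range(2)], (1,), 2)
R_Y = bcast([((j, 0), np.eye(2), 0) for j in range(2)], 2)
J = local_join(R_X, R_Y, 1, 0, np.matmul)
assert len(J) == 2   # each X block meets its matching (broadcast) Y block once
```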
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">COMPILATION AND OPTIMIZATION</head><p>Distributing tensor-based computations so that they run efficiently requires an optimization framework. We consider three core questions related to actually distributing a computation specified in the TRA: (1) How is that TRA computation compiled into an equivalent statement in IA? (2) What is a set of equivalence rules that allow computations in IA to be re-written, so as to produce different implementations that are known to produce the same results, but may run more efficiently? And (3) how do we cost those different, equivalent implementations, so that a search strategy can be used to choose the most efficient one?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Compiling the TRA</head><p>A complete set of rules mapping from TRA operations to IA operations is listed in Table <ref type="table">1</ref>. Note that though there can be multiple IA computations for a given TRA computation, the compiler will generate one such IA computation as the initial plan; an optimizer is then responsible for running a search algorithm to produce the best physical plan, represented in IA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Equivalence Rules</head><p>Once a (possibly inefficient) computation in IA is produced, it can be optimized via the application of a set of equivalence rules. An equivalence rule for IA expressions holds if, for any input physical tensor relations, the two expressions produce equivalent outputs: two physical tensor relations are said to be equivalent if they contain the same set of (key, array) pairs after projecting away the site attribute.</p><p>There are two classes of equivalence rules we consider: simple equivalence rules, which are often extensions of classic relational equivalence rules (e.g., the commutativity of selections), and domain-specific equivalence rules, which are more complex transformations that always hold and tend to be useful for mathematical computations such as matrix multiplication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Simple Equivalence Rules.</head><p>In Table <ref type="table">2</ref>, we give an extensive list of two types of simple equivalence rules: (i) those based on kernel function composition, and (ii) equivalence rules based on optimization of re-partitions.</p><p>Kernel function composition targets the order or location of the application of kernel functions in order to reduce the computation load and memory consumption. This is closely related to the idea of ML operator fusion, which has been explored in systems such as TVM <ref type="bibr">[16]</ref> (though TVM does not consider the sort of distributed computations considered here). Re-partition rules formalize notions of distributed query optimization over tensor relations, and are primarily designed to reduce communication.</p><p>Such rules are surprisingly effective for optimizing distributed tensor manipulations. Consider the example of extracting the diagonal elements of matrix X plus matrix Y: diag(X + Y), where X and Y are stored in physical tensor relations R&#119883; and R&#119884;. This computation can be represented by a TRA expression, where isEq(&#10216;k0, k1&#10217;) &#8614; (k0 = k1), merge(&#10216;k0, k1&#10217;) &#8614; &#10216;k0&#10217;, matAdd is an element-wise sum of two arrays, and diag diagonalizes the array. 
Then diag(X + Y) can be written as a join of R&#119883; and R&#119884; on both key dimensions applying matAdd, followed by a filter on isEq, a rekey with merge, and a transform with diag. This TRA expression can be translated into an IA expression, to which we can apply a sequence of equivalence rules.</p><p>The transformation significantly reduces both the communication overhead and the computation load. By applying R1-6, the isEq filters are pushed down; this not only reduces the number of input tuple pairs over which the join must execute the matAdd function, but also enables a reduction of communication overhead when the filter operation is commuted with the broadcast operation by R2-2. Lastly, R1-7 leverages the property that the kernel functions diag and matAdd are distributive; as a result, after kernel function composition, addition is applied only to the diagonal elements of the paired blocks.</p></div>
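The effect of the filter-pushdown rewrite can be checked concretely (an illustrative sketch with assumed helper names, not the paper's rule engine): filtering on isEq before the join yields the same diagonal blocks while invoking the matAdd kernel on far fewer pairs.

```python
import numpy as np

def diag_of_sum(R_X, R_Y, push_down):
    """Compute diag(X + Y) over block-partitioned relations (dicts from
    2-d keys to blocks). With push_down=True the isEq filter runs before
    the join, so matAdd touches only diagonal blocks."""
    is_eq = lambda k: k[0] == k[1]
    adds = 0
    left = {k: v for k, v in R_X.items() if not push_down or is_eq(k)}
    right = {k: v for k, v in R_Y.items() if not push_down or is_eq(k)}
    joined = {}
    for k in left:
        if k in right:
            adds += 1
            joined[k] = left[k] + right[k]                # matAdd kernel
    # filter isEq (no-op if pushed down), rekey with merge, transform diag
    out = {(k[0],): np.diag(np.diag(v)) for k, v in joined.items() if is_eq(k)}
    return out, adds

blocks = lambda: {(i, j): np.random.rand(2, 2) for i in range(3) for j in range(3)}
R_X, R_Y = blocks(), blocks()
naive, n1 = diag_of_sum(R_X, R_Y, push_down=False)
optimized, n2 = diag_of_sum(R_X, R_Y, push_down=True)
assert n1 == 9 and n2 == 3                 # 3x fewer matAdd invocations
assert all(np.allclose(naive[k], optimized[k]) for k in naive)
```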
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">Domain-Specific Equivalence Rules.</head><p>Such rules encode specific knowledge from parallel and distributed computing algorithms. Adding such rules to a system gives IA a set of common implementation strategies at its disposal, which it can choose among in a cost-based manner.</p><p>We do not attempt to produce an exhaustive list of such rules, but rather consider one example in detail: distributed matrix multiplication over tensor relations R&#119883; and R&#119884;.</p><p>For physical tensor relations R&#119883; and R&#119884;, using the rules of Table <ref type="table">1</ref>, this would be compiled into a simple, broadcast-based matrix multiply. Applying simple equivalence rules brings us to cross product-based matrix multiplication, which partitions R&#119883; on columns and R&#119884; on rows.</p><p>However, more complicated schemes are possible, which are expressible in IA but not derivable using the simple equivalence rules. For example, replication-based matrix multiplication can be viewed as a relational version of the 3D parallel matrix multiplication <ref type="bibr">[9]</ref>. The algorithm first replicates matrix X's and matrix Y's blocks multiple times, viewing the result as a 3-D array, and shuffles them using the index of the corresponding voxel as a key; then each site joins the tuples with the same keys and performs local multiplications, aggregating to obtain the final results. If xDups is defined as Front(R&#119884;)[1] and yDups as Front(R&#119883;)[0], the replication stage can be implemented in IA using the multi-map &#120582;&#119871;(keyTileOp(tileDim, tileSize), arrayTileOp(tileDim, tileSize)), followed by a shuffle on the voxel key, a local join, and a local aggregation.</p><p>The equivalence of these three implementations is an example of a set of domain-specific equivalence rules.</p></div>
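The replicate-shuffle-join-aggregate scheme can be sketched as follows (a single-machine sketch of the 3D-style algorithm; `front` and `replication_matmul` are hypothetical names, and the xDups/yDups definitions follow the text above):

```python
import numpy as np

def front(R, d):
    """Front(R)[d]: one past the largest key value along dimension d."""
    return max(k[d] for k in R) + 1

def replication_matmul(R_X, R_Y):
    """Replication-based (3D-style) matmul sketch: replicate each X block
    (i, k) across every output column j, and each Y block (k, j) across
    every output row i, so keys become voxels (i, j, k). Blocks sharing a
    voxel are multiplied locally, then summed over k per output cell."""
    x_dups = front(R_Y, 1)      # xDups = Front(R_Y)[1] replicas of X blocks
    y_dups = front(R_X, 0)      # yDups = Front(R_X)[0] replicas of Y blocks
    vox_x = {(i, j, k): a for (i, k), a in R_X.items() for j in range(x_dups)}
    vox_y = {(i, j, k): a for (k, j), a in R_Y.items() for i in range(y_dups)}
    out = {}
    for v in vox_x:                           # local join on the voxel key
        prod = vox_x[v] @ vox_y[v]
        ij = (v[0], v[1])
        out[ij] = out[ij] + prod if ij in out else prod   # aggregate over k
    return out

X, Y = np.random.rand(4, 6), np.random.rand(6, 4)
R_X = {(i, k): X[2*i:2*i+2, 2*k:2*k+2] for i in range(2) for k in range(3)}
R_Y = {(k, j): Y[2*k:2*k+2, 2*j:2*j+2] for k in range(3) for j in range(2)}
C = replication_matmul(R_X, R_Y)
assert np.allclose(np.block([[C[(0, 0)], C[(0, 1)]],
                             [C[(1, 0)], C[(1, 1)]]]), X @ Y)
```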
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Cost Model</head><p>One of the beneficial aspects of the TRA is that cost-based optimization is much easier than for classical relational algebra: if the uniqueness and continuity constraints hold, tuple counts need not be estimated and can be computed exactly.</p><p>In the simple cost model presented here, we use the number of floating point values that must be transferred between sites as the cost metric. There are two reasons for this decision.</p><p>First, the number of floating point operations in a distributed ML computation is fixed. For example, all of the classical distributed matrix multiply algorithms (2.5D <ref type="bibr">[40]</ref>, SUMMA <ref type="bibr">[42]</ref>, etc.) have the same floating point cost. While this is not a hard and fast rule - it is possible to push the filtering of tuples in a tensor relation past the application of a kernel function, which would change the number of floating-point operations - in many applications network transfer is the dominant cost, and so it is a reasonable cost metric.</p><p>Second, skew, which could slow down a computation with low network cost, is generally not an issue in a TRA computation, unlike in classical relational database systems. The TRA continuity and uniqueness constraints imply that joins and aggregations cannot encounter skew. Consider a join: for two tuples &#119905; 1 and &#119905; 2 in tensor relation R, when that relation is joined with another relation S, the number of tuples from S that join with &#119905; 1 and &#119905; 2 must be the same. This, along with the fact that the TRA requires all arrays in a tensor relation to be of the same type, implies that skew will be very rare. In a TRA implementation, the only source of delay where the entire computation is blocked on a machine is likely to be a machine that is, simply stated, slower to perform computations than the others in the cluster.
Such slowness may be due to hardware heterogeneity - a challenging issue for future work - or to unpredictable causes, such as other workloads running on the machine; these are eventualities that cannot be planned for and must be handled by the runtime.</p><p>To compute the network transfer cost for a plan in IA, we need to be able to compute the frontier of each physical relation R: f = Front(R). The reason is that, assuming the uniqueness and continuity constraints hold, we can compute the number of floating point numbers in R using f. If R is of type R (&#119896;,&#119903;,b,&#119904;) and f = Front(R), then the number of tuples in the corresponding tensor relation is &#119899; = &#8719; &#119894; f &#119894; , and the number of floating point numbers in the tensor relation is &#119899; &#215; &#8719; &#119894; b &#119894; . Once the frontier is known, it is used to compute the transfer cost for each Bcast and Shuf operation. The cost to broadcast a tensor relation of type R (&#119896;,&#119903;,b,&#119904;) having &#119891; floating point numbers is simply &#119891; &#215; &#119904;. The cost to shuffle a tensor relation of &#119891; floating point numbers is simply &#119891; .</p><p>Thus, the task of costing a physical TRA plan reduces to computing the type of each intermediate relation, as well as its frontier.</p><p>Computing the type is relatively easy: we work up from the leaves to the end result(s) of a physical plan, using the type signature of each of the physical operations (Section 4) to infer the type of each intermediate physical relation.</p><p>Computing the frontier in this way, working up from leaves to outputs, is also possible, but it requires a bit more thought. We now consider how the frontier of an output is computed for each of the various operations in the physical algebra:</p><p>(1) &#8904; &#119871; (joinKeysL, joinKeysR, projOp) (R &#119897; , R &#119903; ). 
For local join, assuming that R &#119897; and R &#119903; have an appropriate partitioning to sites, let f &#119897; and f &#119903; be the left and right input frontiers, of dimensionality &#119896; &#119897; and &#119896; &#119903; , respectively. Then the output frontier f is computed as follows. For &#119896; &lt; &#119896; &#119897; and &#119896; not in joinKeysL, f[&#119896;] = f &#119897; [&#119896;], as the frontier value for that dimension is inherited from the left. For &#119896; &lt; &#119896; &#119897; where &#119896; is in joinKeysL, f[&#119896;] is the minimum of the corresponding left and right frontier values, as the frontier value for that dimension results from the join of the two relations. And finally, for all other &#119896;, f[&#119896;] is inherited from the corresponding dimension in the right frontier. (2) &#931; &#119871; (groupByKeys, aggOp) (R). For local aggregation, assuming an appropriate partitioning, let f &#119894; denote the input frontier, and let &#119899; be the number of key dimensions in groupByKeys. In this case, for &#119896; &#8804; &#119899;, the output frontier value f[&#119896;] is inherited from the input dimension that the &#119896;-th entry of groupByKeys refers to. (3) The local filter operation. This performs a filter on the physical tensor relation R. For an &#119899;-dimensional input frontier f &#119894; , for any &#119896; &#8804; &#119899;, the &#119896;-th dimension in the output frontier is, by definition, inherited from the largest key value in that dimension accepted by boolFunc. In many cases, especially if boolFunc consists of simple arithmetic expressions and comparisons, symbolic methods can be used to compute this. But in practice, it may simply be easier to use a brute-force approach, where each key value is fed into boolFunc to compute the required maximum. Since the size of a tensor relation is typically small - tens of thousands of tuples would be very large - this is a very practical approach. (4) &#120582; &#119871; (keyMapFunc, arrayMapFunc) (R). Similarly, for an &#119899;-dimensional input frontier f &#119894; , for any &#119896; &#8804; &#119899;, the &#119896;-th dimension in the output frontier is, by definition, the largest value in that dimension over all output keys produced by keyMapFunc. Again, a brute-force approach is appropriate for computing the frontier in this case.</p><p>For reference, the general equivalence rules referenced in this section and in Section 5.2 include the following (we omit the accompanying algebraic identities). Map operations can be merged if the output arity of the key and array mapping functions is one. R1-3: map and filter are commutative if keyMapFunc is an identity function (idOp); that is, for a physical relation R &#8712; R (&#119896;,&#119903;,b,&#119904;) , if &#8704; key &#8712; (Z * ) &#119896; , keyMapFunc(key) = key. The arrayMapFunc in a map can be composed with local aggregation if keyMapFunc is an identity function and the kernel function arrayMapFunc distributes over aggOp; that is, &#8704; array 1 , array 2 &#8712; &#119879; (&#119903;,b) , arrayMapFunc (aggOp (array 1 , array 2 )) = aggOp (arrayMapFunc (array 1 ) , arrayMapFunc (array 2 )). Similarly, the kernel function in a local map can be composed with a local join if the output arity of the key and array mapping functions is one and arrayMapFunc distributes over projOp; that is, &#8704; array 1 , array 2 &#8712; &#119879; (&#119903;,b) , arrayMapFunc (projOp (array 1 , array 2 )) = projOp (arrayMapFunc (array 1 ) , arrayMapFunc (array 2 )). There are also re-partition-based rules. R2-1: only the final broadcast/shuffle in a sequence of broadcast/shuffle operations is needed. The re-partition operations are commutative with the local filter operation. R2-5: an aggregation can be split into two phases if the physical relation is only partially partitioned; that is, for a physical relation R, if groupByKeys &#8834; partDims. A join &#8904; defined by the TRA can be implemented in several equivalent ways for physical relations R &#119897; and R &#119903; . R2-7: the local join can be pushed through a shuffle; for physical relations R &#119897; &#8712; R (&#119896; &#119897; ,&#119903; &#119897; ,b &#119897; ,&#119904;) and R &#119903; &#8712; R (&#119896; &#119903; ,&#119903; &#119903; ,b &#119903; ,&#119904;) , this holds if partDims &#8838; joinKeysL.</p></div>
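<p>Under the constraints above, costing is just arithmetic on the frontier. A minimal sketch of the cost computations described in this section (the function names are ours):</p>

```python
import math

def num_floats(frontier, block_shape):
    # n = prod_i f_i tuples, each holding prod_i b_i floats.
    return math.prod(frontier) * math.prod(block_shape)

def bcast_cost(frontier, block_shape, num_sites):
    # Broadcast replicates every float to each of the s sites: cost f x s.
    return num_floats(frontier, block_shape) * num_sites

def shuffle_cost(frontier, block_shape):
    # A shuffle moves each float (at most) once: cost f.
    return num_floats(frontier, block_shape)

# A 4 x 3 grid of 1000 x 1000 blocks on a 10-site cluster:
f, b = (4, 3), (1000, 1000)
assert num_floats(f, b) == 12_000_000
assert bcast_cost(f, b, 10) == 120_000_000
assert shuffle_cost(f, b) == 12_000_000
```
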
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EVALUATION</head><p>The goal of our paper is to design a computational abstraction that could be exported by the back-end of a machine learning system. Specifically, we have argued that: (1) such an abstraction should be expressive enough to capture a variety of computations; (2) the computations expressed using the abstraction should be competitive with hand-coded or special-purpose solutions; and (3) the abstraction should be amenable to automatic optimization.</p><p>To determine whether the TRA meets these goals, we have implemented a simple Python back-end that exports the TRA/IA interface, and we experimentally evaluate it across three different large-scale ML computations. To see whether a TRA-based back-end can provide suitable performance, we compare with a number of other options, including hand-coded MPI solutions, high-performance libraries such as ScaLAPACK, distributed data science tools such as Dask, and ML systems such as TensorFlow and PyTorch. To see whether it is amenable to automatic optimization, for each ML computation we apply a series of transformations to obtain multiple implementations in the IA, and evaluate whether the cost model is able to predict which is preferable. Benchmark Tasks. (i) distributed matrix multiplication; (ii) distributed nearest neighbor search in a Riemannian metric space; and (iii) distributed stochastic gradient descent (SGD) for a two-layer, feed-forward neural network (FFNN). TRA Implementation. We implement an execution engine for the IA in Python. While it may seem surprising that Python is appropriate for implementing a relational engine, even for very large ML problems the number of tuples in a TRA computation is small; most data are stored in the large arrays. Our Python execution engine makes heavy use of PyTorch to handle those arrays. 
PyTorch is used to actually execute the compute kernels on the various sites in a compute cluster, and our IA implementation uses PyTorch's optimized communication library to move the arrays stored in tensor relations between machines.</p></div>
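<p>A minimal sketch of such an engine's core (the dictionary encoding and function names are ours, and NumPy stands in for the PyTorch kernels): a tensor relation maps integer key tuples to equally-shaped arrays, and local join and aggregation pair and combine them with user-supplied kernels.</p>

```python
import numpy as np

def local_join(Rl, Rr, join_l, join_r, proj_op):
    # Pair tuples whose selected key dimensions match, apply the kernel,
    # and key each output by the left key plus the right's remaining dims.
    out = {}
    for kl, al in Rl.items():
        for kr, ar in Rr.items():
            if tuple(kl[d] for d in join_l) == tuple(kr[d] for d in join_r):
                rest = tuple(kr[d] for d in range(len(kr)) if d not in join_r)
                out[kl + rest] = proj_op(al, ar)
    return out

def local_agg(R, group_by, agg_op):
    # Group tuples on the listed key dimensions and fold with agg_op.
    out = {}
    for k, a in R.items():
        gk = tuple(k[d] for d in group_by)
        out[gk] = a if gk not in out else agg_op(out[gk], a)
    return out

# A 1 x 2 blocked matrix times a 2 x 1 blocked matrix:
R1 = {(0, 0): np.ones((2, 2)), (0, 1): np.ones((2, 2))}
R2 = {(0,): 2 * np.ones((2, 2)), (1,): 3 * np.ones((2, 2))}
J = local_join(R1, R2, join_l=[1], join_r=[0], proj_op=np.matmul)
S = local_agg(J, group_by=[0], agg_op=np.add)
assert np.allclose(S[(0,)], 10 * np.ones((2, 2)))
```

<p>Note how few tuples flow through the relational layer: all heavy lifting happens inside the kernels, which is why a Python engine suffices.</p>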
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Matrix Multiplication</head><p>Multiplication of A &#8712; R &#119868; &#215;&#119870; and B &#8712; R &#119870;&#215;&#119869; can be formalized:</p><p>where matrices A and B are stored in tensor relations R A and R B .</p><p>To test the effectiveness of IA optimization, as others have done <ref type="bibr">[22,</ref><ref type="bibr">23]</ref>, we consider three different multiplications: (i) general matrices (&#119868; = &#119870; = &#119869; = 4 &#215; 10 4 ); (ii) matrices with a common large dimension (&#119870; = 6.4 &#215; 10 5 , &#119868; = &#119869; = 10 4 ); and (iii) matrices with two large dimensions (&#119868; = &#119869; = 8 &#215; 10 4 , &#119870; = 10 4 ). Matrices are filled with random data drawn from the uniform distribution U(-1, 1).</p><p>As discussed, the above TRA program has three equivalent IA plans: broadcast-based matrix multiplication (BMM), cross-product-based matrix multiplication (CMM), and replication-based matrix multiplication (RMM). We compare these three IA implementations with Intel's version of ScaLAPACK <ref type="bibr">[17]</ref>, which realizes the classic SUMMA algorithm <ref type="bibr">[42]</ref>. We also compare with our own hand-coded version of the classical 2.5D matrix multiply algorithm <ref type="bibr">[40]</ref>, implemented on top of MPI <ref type="bibr">[12]</ref>; with Dask [1], a popular distributed analytics tool with a Python interface <ref type="bibr">[39]</ref>; and with PETSc [2], a popular high-performance distributed computing library <ref type="bibr">[10]</ref>. All methods are benchmarked on Amazon EC2 clusters with 5, 10 or 15 r5d.2xlarge instances (each with 8 vCPU and 64 GB RAM, connected by up to 10 Gb/s interconnect). Note that we have made a reasonable effort to tune the hyper-parameters of the alternative solutions (e.g., grid size, thread number, initial layout, etc.) and report the best results. Results are in Table <ref type="table">3</ref>. 
In Table <ref type="table">4</ref>, we report the IA cost (as computed in Section 5.3) predicted for a 10-node cluster.</p></div>
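<p>To illustrate how the Section 5.3 metric can separate such plans, here is a back-of-envelope comparison; the per-plan cost formulas below are our own simplification for illustration, not the values of Table 4. Broadcasting B costs |B| &#215; s, while a cross-product-style plan shuffles each input once and then aggregates s partially computed copies of the output.</p>

```python
def size(r, c):
    # Floats in a dense r x c matrix.
    return r * c

def bmm_cost(I, K, J, s):
    # Broadcast-based plan: replicate all of B to each of the s sites
    # (Section 5.3 broadcast cost: f x s).
    return size(K, J) * s

def cmm_cost(I, K, J, s):
    # Cross-product-based plan (our simplification): shuffle A and B once
    # each (cost f), then aggregate s partial copies of C.
    return size(I, K) + size(K, J) + size(I, J) * s

s = 10
large_k = (10**4, 64 * 10**4, 10**4)  # common large dimension K

# With a huge shared dimension K, avoiding the broadcast of B pays off:
assert bmm_cost(*large_k, s) > cmm_cost(*large_k, s)
```

<p>This matches the qualitative finding below that CMM wins in the common-large-dimension case.</p>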
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Nearest Neighbor Search</head><p>We use the TRA to implement a nearest neighbor search problem in a Riemannian metric space encoded by matrix A &#8712; R &#119863;&#215;&#119863; , where, given a query vector x &#119902; &#8712; R 1&#215;&#119863; and a candidate set X &#8712; R &#119873; &#215;&#119863; , the goal is to find the &#119894;-th row in the matrix that minimizes:</p><p>With X and A stored in tensor relations R X and R A , the corresponding TRA program can be encoded as: where matVecSub is matrix-vector subtraction, elemMul is element-wise matrix multiplication (the Hadamard product), and minIndex finds the index of the minimal element. We hand-compile this into an expression in the IA, and then use the various equivalence rules to produce two different implementations: Opt4Horizontal and Opt4Vertical. Opt4Horizontal broadcasts R x &#119902; and R A to each compute site and partitions R X by dimension 0; the computation of R diff , R proj , and R dist is then carried out by local operations.</p><p>Opt4Vertical first broadcasts R x &#119902; to each site and computes R diff , then partitions R diff by dimension 1 and R A by dimension 0, so that R proj is computed by a cross-product-based matrix multiplication. For the Opt4Horizontal IA implementation, R x &#119902; , R X and R A are initially partitioned by dimension 0. For the Opt4Vertical IA implementation, R x &#119902; and R A are initially partitioned by dimension 0, while R X is initially partitioned by dimension 1. We generate two data sets: (i) Large, with a large number of data points (&#119873; = 1.5&#215;10 5 or 1.5&#215;10 6 ) but a small feature space (&#119863; = 6&#215;10 3 ); and (ii) Wide, with a small number of data points (&#119873; = 6&#215;10 3 ) but a large feature space (&#119863; = 3&#215;10 4 or 10 5 ). 
We execute this computation on compute clusters with 4, 8 or 12 r5d.2xlarge instances.</p><p>We also implemented the same computation using <ref type="bibr">Dask [1]</ref>. As a baseline, we compare the execution time with a PyTorch implementation that runs on a single site equipped with the same total computing power as the TRA implementation: an r5d.8xlarge instance (with 32 vCPU, 256 GB RAM), an r5d.16xlarge instance (with 64 vCPU, 512 GB RAM), and an r5d.24xlarge instance (with 96 vCPU, 768 GB RAM), respectively. Since the single-site implementation has zero communication overhead, it should provide something of a lower bound on the time required to run the computation. The results and predicted costs are given in Table <ref type="table">5</ref> and Table <ref type="table">6</ref>.</p></div>
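<p>The TRA program above corresponds to the following single-site computation, sketched in NumPy with made-up sizes; matVecSub, the matrix multiply, elemMul, and minIndex map onto the commented lines.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 8                        # illustrative sizes
X  = rng.standard_normal((N, D))    # candidate set
A  = rng.standard_normal((D, D))
A  = A @ A.T                        # symmetric PSD metric matrix
xq = rng.standard_normal((1, D))    # query vector

diff = X - xq                  # matVecSub: R_diff
proj = diff @ A                # matrix multiply: R_proj
dist = (proj * diff).sum(1)    # elemMul (Hadamard) + row sum: R_dist
best = int(dist.argmin())      # minIndex

# Cross-check each distance against the quadratic form, row by row.
ref_vals = np.array([(X[i] - xq[0]) @ A @ (X[i] - xq[0]) for i in range(N)])
assert np.allclose(dist, ref_vals)
```

<p>Opt4Horizontal partitions the rows of X (the N axis) across sites; Opt4Vertical instead splits the D axis for the multiply by A, which pays off when D is large.</p>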
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Feed-Forward Neural Network</head><p>Lastly, we benchmark a training iteration of a two-layer FFNN for multi-label classification, computed over an input matrix.</p><p>Again, we compile the TRA program for FFNN learning by hand into the IA, and use the equivalence rules to produce two implementations. The first, called TRA-DP, resembles the classic data parallel implementation. The second, called TRA-MP, corresponds to an intra-operation model parallel plan.</p><p>We compare these two IA plans with the state-of-the-art data parallel implementations provided by PyTorch 1.7.1 <ref type="bibr">[33]</ref> and TensorFlow 2.4.1 <ref type="bibr">[5]</ref>. We also compare with the same computation written on top of <ref type="bibr">Dask [1]</ref>, and hand-coded using ScaLAPACK <ref type="bibr">[17]</ref>. Note that these two options do not fully support GPUs.</p><p>Two data sets are considered. First, the data from the Google speech recognition task <ref type="bibr">[44]</ref>, where a 1600-dimensional feature vector is extracted from audio wave-forms; the goal is to identify 10 keywords (&#119863; = 1600 and &#119871; = 10). For this task, we train a very wide hidden layer with a large number of neurons (&#119867; = 1 &#215; 10 5 , 1.5 &#215; 10 5 , or 2 &#215; 10 5 ); a batch size of 10 4 (&#119873; = 10 4 ) is used for mini-batch SGD. Second, we consider the AmazonCat-14K <ref type="bibr">[36,</ref><ref type="bibr">37]</ref> benchmark, an extreme multi-label (XML) classification dataset with a large number of features (&#119863; = 597540) and labels (&#119871; = 14588); here we train a relatively narrow network with &#119867; = 0.5 &#215; 10 3 , 1 &#215; 10 3 , 3 &#215; 10 3 , 5 &#215; 10 3 , or 7 &#215; 10 3 ; a batch size of 10 3 (&#119873; = 10 3 ) is used for mini-batch SGD. 
Each is executed on CPU clusters with 2, 5 or 10 r5dn.2xlarge instances (connected by up to 25 Gb/s interconnect) and on GPU clusters with 2, 5 or 10 p3.2xlarge instances (each with an NVIDIA Tesla V100 GPU, connected by 10 Gb/s interconnect). The results for Google speech are listed in Table <ref type="table">7</ref>, and for Amazon-XML in Table <ref type="table">8</ref>. Predicted costs are given in Table <ref type="table">9</ref>.</p></div>
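<p>The intuition behind TRA-MP can be checked directly. Because ReLU is elementwise, partitioning the hidden dimension gives each site a column panel of the first weight matrix and the matching row panel of the second, and summing per-site partial outputs reproduces the single-site forward pass. A sketch with toy sizes and names of our choosing:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, H, L, sites = 4, 6, 8, 3, 2   # toy sizes (illustrative)

X  = rng.standard_normal((N, D))    # mini-batch
W1 = rng.standard_normal((D, H))    # input-to-hidden weights
W2 = rng.standard_normal((H, L))    # hidden-to-output weights
relu = lambda z: np.maximum(z, 0.0)

# Single-site reference forward pass.
ref = relu(X @ W1) @ W2

# Model-parallel: split the hidden dimension across sites. Each site holds
# a column panel of W1 and the matching row panel of W2; X is broadcast,
# and the output is the aggregation (sum) of per-site partial results.
h = H // sites
partial = [relu(X @ W1[:, i*h:(i+1)*h]) @ W2[i*h:(i+1)*h, :]
           for i in range(sites)]
assert np.allclose(sum(partial), ref)
```

<p>This is why a model parallel plan avoids materializing the full H-wide hidden activation on any one site, which matters for the very wide hidden layers tested here.</p>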
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Discussion</head><p>First, the experiments do seem to show that the TRA provides an abstraction onto which a variety of ML computations can be mapped. Further, the TRA seems to provide good performance. On the matrix multiplication experiments, the best TRA-based implementations were at least competitive with ScaLAPACK as well as with our hand-coded MPI-based implementation (we observed 29s for the TRA-based CMM vs. 23s for hand-coded MPI in the "two general matrices" case, and 31s for the TRA-based BMM vs. 21s for hand-coded MPI in the "two large dims" case), even beating them both (23s for the TRA-based CMM vs. 35s for hand-coded MPI) in the "common large dim" case. On the FFNN experiments, the best TRA-based implementation for each task was about 1.5&#215; slower than the hand-constructed ScaLAPACK implementation for the Google data set, but considerably faster than ScaLAPACK for the more challenging Amazon data set. In general, it is fair to say that the best TRA-based implementation for each task was at least competitive with the ScaLAPACK and MPI-based codes. The fact that there is no significant performance hit in moving from a special-purpose tool to a general-purpose abstraction is encouraging.</p><p>It is also instructive to compare our TRA-based implementations with the other, more user-friendly tools tested, which in practice would be more reasonable alternatives to an ML system with a TRA-based back-end. Dask was not competitive, and was often one or two orders of magnitude slower. On the FFNN experiments, PyTorch generally performed better than TensorFlow on CPU clusters, while both systems performed almost identically on GPU clusters. For Google speech, the optimal partitioning scheme is identical to data parallelism, and the best TRA-based option is able to closely match PyTorch's speed. Further, while PyTorch failed on the larger Google computations in a 2-GPU cluster, the TRA implementation was able to run to completion. 
On the even larger extreme classification problem, the TRA-MP (model parallel) IA was much faster than PyTorch and much more scalable (TensorFlow fails in most cases, since it does not allow a parameter matrix to exceed 2 GB); PyTorch also cannot handle the huge matrices required to power this computation in some settings.</p><p>The final question we wanted to address was whether the TRA is amenable to automatic optimization. Note that in each case, there was one IA implementation that was suitable for the input data and one that was not; the difference between the two was often significant. In a system based upon the TRA, it would be crucial to automatically choose the suitable implementation. We found that in each case, the simple cost model from Section 5.3 would have chosen the correct implementation. For example, consider Table <ref type="table">9</ref>. In each case, the cost metric correctly assigns the lower cost to the appropriate IA computation: TRA-DP for the smaller Google problem, and TRA-MP for the larger extreme classification problem.</p><p>These results suggest that it should be possible to perform cost-based optimization over the IA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORK</head><p>Our focus has been on the proper implementation abstraction for ML systems. The TRA is "front-end agnostic." Still, there has been considerable interest in programming and compilation for such systems. FAQ, by Khamis et al. <ref type="bibr">[8]</ref>, considers how to compute Einstein-notation-like expressions over semi-rings. Effectively, FAQ re-writes such expressions so that they can easily be computed using the "OutsideIn" method: first determining the non-zero entries using a series of joins, followed by the computation of the values. Laue et al. <ref type="bibr">[32]</ref> propose a variant of the Ricci calculus for computing tensor-based derivatives using the Einstein notation. Tensor Comprehensions are an Einstein-like programming language and associated compiler that is able to produce efficient CUDA kernels <ref type="bibr">[43]</ref>; the tensor algebra compiler is a similar effort <ref type="bibr">[31]</ref>. Our efforts are complementary. One could imagine, for example, using FAQ-like algorithms along with a compiler for high-performance kernels to generate expressions for a TRA-based back-end.</p><p>Classic data-flow systems have been modified to support distributed machine learning. Both Spark <ref type="bibr">[45]</ref> and SystemML <ref type="bibr">[21]</ref> provide native libraries for deep learning. A set of deep learning frameworks can run on top of Spark, such as TensorFrames <ref type="bibr">[24]</ref>, Deeplearning4j <ref type="bibr">[41]</ref>, SparkNet <ref type="bibr">[38]</ref> and BigDL <ref type="bibr">[18]</ref>. Conceptually, these deep learning frameworks are related to the TRA in that they allow the distribution of ML computations. Consider TensorFrames: it allows the items in a Spark DataFrame to be operated on by a TensorFlow computation. 
One could view those TensorFlow computations as being similar to the kernels applied by the TRA, and the Spark operations used to manipulate the data as being similar to the joins, aggregations, and so on offered by the TRA. The key difference is that while these systems are each significant engineering efforts aimed at marrying different technologies (TensorFlow and Spark in the case of TensorFrames), the TRA is designed as a generic back-end. In fact, a TensorFrames-like programming model could easily be mapped onto the TRA, with mapRows, aggregate, etc., being mapped to the appropriate TRA operations, and the TensorFlow computations run as kernels.</p><p>Relational systems have long been proposed for ML. MLog <ref type="bibr">[34]</ref> is a declarative relational system managing data movement, data persistency, and training batch generation. Similar ideas have been applied in <ref type="bibr">[7]</ref> for feature extraction queries over multi-relation databases, and in <ref type="bibr">[28]</ref> for optimizing sparse tensor computations constructed from relational tables. Recently, relational systems have also been considered as a runtime engine (rather than merely an efficient data loader) for distributed ML. DB4ML <ref type="bibr">[27]</ref> proposes user-defined iterative transactions. Multi-dimensional recursion has been built on top of SimSQL <ref type="bibr">[15]</ref>, a distributed analytic database system, so that it can support neural network training <ref type="bibr">[26]</ref>.</p><p>The idea of moving past relations to arrays as a database data model is long-standing (consider, e.g., Baumann's work on Rasdaman <ref type="bibr">[13]</ref>). SciDB <ref type="bibr">[14]</ref> is a well-known system following this idea. LevelHeaded <ref type="bibr">[6]</ref> uses a special key-value structure to support linear operations. MATLANG <ref type="bibr">[11]</ref> introduces a language for matrix manipulation. 
TensorDB <ref type="bibr">[29,</ref><ref type="bibr">30]</ref> is a database system that can perform tensor manipulation. LARA <ref type="bibr">[25]</ref> proposes an algebra with tuple-wise operators, attribute-wise operators, and tuple extensions, then defines linear and relational algebra operations using these primitives. RMA <ref type="bibr">[20]</ref> attempts to bridge the gap between relations and matrices. While related, these systems attempt to implement tensor computations as algebraic expressions (e.g., a join followed by an aggregation) over relations of (key, value) pairs. This requires pushing a huge number of pairs through the system, which introduces significant overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION</head><p>We have introduced the tensor relational algebra (TRA), and suggested it as the interface that could be exported by the back-end of a machine learning system. We have shown, through extensive experimentation, that a computation expressed in the TRA, then transformed into the implementation algebra and optimized, is competitive with (and often faster than) other options, including HPC software such as ScaLAPACK and ML systems such as TensorFlow and PyTorch.</p><p>There are many avenues for future work. The TRA is not meant to be a user-facing programming language. Thus, a key question is: can a language such as Tensor Comprehensions or Einstein notation be compiled into the TRA? At a high level, this should not be too difficult, as these languages match indices in different tensors (which is easily implemented as a join) and then sum out dimensions (an aggregation). But there are many details to consider. The TRA uses arrays or "chunks" for speed. How can a tensor computation be automatically blocked or chunked? How can the compute kernels be generated automatically? Sparsity is also an important issue. A compiler could also decide to store a sparse tensor using arrays that do not have zero dimensions, but where those arrays are stored sparsely, with a high-performance kernel generated to handle the specific sparsity pattern.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Throughout the paper, we use the term front-end to refer to the programmer-facing API and compiler; in PyTorch, for example, this would be the part of the system that accepts Einstein notation and transforms it into a set of executable operations. We use the term back-end to refer to the sub-system that actually executes the computation.</p></note>
		</body>
		</text>
</TEI>
