<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>JoinBoost: Grow Trees over Normalized Data Using Only SQL</title></titleStmt>
			<publicationStmt>
				<publisher>VLDB Endowment</publisher>
				<date>07/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10559959</idno>
					<idno type="doi">10.14778/3611479.3611509</idno>
					<title level='j'>Proceedings of the VLDB Endowment</title>
<idno>2150-8097</idno>
<biblScope unit="volume">16</biblScope>
<biblScope unit="issue">11</biblScope>					

					<author>Zezhou Huang</author><author>Rathijit Sen</author><author>Jiaxiang Liu</author><author>Eugene Wu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Although dominant for tabular data, ML libraries that train tree models (e.g., LightGBM, XGBoost) require normalized databases to be denormalized into a single table, materialized, and exported. This process does not scale, is slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?</p> <p>We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the <italic>Y</italic> variable to the residual in the <italic>non-materialized join result</italic>. Although this view update problem is generally ambiguous, we identify <italic>addition-to-multiplication preserving</italic>, the key property of the variance semi-ring, to support <italic>rmse</italic>, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Tree-based models, ranging from decision trees and random forests to gradient boosting, are recursive space-partitioning algorithms for classification and regression <ref type="bibr">[48]</ref>. They have shown exceptional effectiveness for tabular datasets, outperforming popular neural network models <ref type="bibr">[31]</ref>. In fact, random forests and gradient boosting were rated the most popular models from 2019 to 2021 on Kaggle <ref type="bibr">[8]</ref>. In response, the ML community has developed optimized tree-training libraries like LightGBM <ref type="bibr">[42]</ref> and XGBoost <ref type="bibr">[25]</ref> that offer user-friendly APIs, superior single-node performance, and compatibility with distributed frameworks like Dask <ref type="bibr">[59]</ref> or Spark <ref type="bibr">[70]</ref>.</p><p>In practice, however, almost all tabular datasets are normalized and stored in a DBMS. Using ML libraries introduces performance, usability, and privacy drawbacks. First, the libraries expect a single (typically CSV) training dataset. Thus, a developer must denormalize the database into a "wide table", materialize and export it, and load it into the library. The join materialization cost is prohibitive for all but the simplest schemas and smallest databases: the 1.2GB IMDB dataset (Figure <ref type="figure">3</ref>) is well over 1TB when fully materialized due to N-to-N relationships. Second, managing the exported data as well as the separate execution frameworks adds considerable operational complexity and is error-prone <ref type="bibr">[64,</ref><ref type="bibr">66]</ref>. Third, exporting sensitive data can raise privacy concerns <ref type="bibr">[10]</ref>.</p><p>Ideally, we would "move computation to the data" and train tree-based models directly within the DBMS. 
This would let developers manage the entire data lifecycle (preparation, cleaning, analysis, and modeling) within a single DBMS, and benefit from DBMSes' excellent security, performance, and scalability. To be broadly useful, we believe an initial In-DB solution should meet three criteria: (C1) be easily portable to any DBMS by translating ML algorithms into vanilla SQL queries, (C2) offer training performance comparable with SOTA ML libraries (e.g., LightGBM), and (C3) scale to massive data warehouse sizes and complexity. Unfortunately, these criteria are often in tension. Common wisdom and prior results suggest that training tree models via SQL queries is portable but notoriously slow <ref type="bibr">[32,</ref><ref type="bibr">68,</ref><ref type="bibr">70]</ref>, as DBMSes do not employ model-specific optimizations. However, specialized optimizations <ref type="bibr">[25,</ref><ref type="bibr">42]</ref> or GPU acceleration <ref type="bibr">[19,</ref><ref type="bibr">33,</ref><ref type="bibr">54]</ref> achieve competitive performance at the expense of portability to general DBMSes.</p><p>To optimize training over normalized schemas common in DBMSes, recent works in factorized ML <ref type="bibr">[43,</ref><ref type="bibr">44,</ref><ref type="bibr">51,</ref><ref type="bibr">61]</ref> avoid materializing joins by treating ML as semi-ring aggregation queries over the join graph; the aggregations can then be pushed through the joins. This approach potentially gives DBMSes an edge over conventional ML libraries, which incur significant join materialization and export costs during training. Thus, it is intriguing to investigate the potential of training tree models in DBMSes using vanilla SQL, with factorized ML optimizations.</p><p>Nevertheless, the current applicability of factorized ML to tree-based models is restricted due to both algorithmic and system-related limitations. 
LMFAO <ref type="bibr">[61]</ref> describes a factorized algorithm that is limited to decision trees and does not support the widely used gradient boosting models. It rewrites the tree node split algorithm into a batch of group-by aggregations, and uses multi-query optimization to share work between the aggregations. Unfortunately, it does not support the residual updates needed for gradient boosting models. Even for single decision tree training, LMFAO fails to exploit considerable work-sharing between parent-child tree nodes. Finally, LMFAO lacks portability because its performance is tied to a compilation engine specially designed for its batch-wise query execution. In our experiments, despite using an off-the-shelf DBMS (DuckDB <ref type="bibr">[57]</ref>), we train decision trees 1.9&#215; faster than LMFAO.</p><p>To this end, we present JoinBoost, the first In-DB ML system designed for tree-based models (decision trees, random forests, and gradient boosting) that fulfills C1-3. Our main contribution lies in the system design and implementation: we study the feasibility of implementing JoinBoost as a Python library that translates ML algorithms into simple Select-Project-Join-Aggregation (SPJA) queries and employs factorized ML as pure query rewriting. This design ensures portability (C1) and scalability (C3), as the translated SQL queries can run on any single-node or cloud-based DBMS. Nonetheless, this design poses significant challenges. Algorithmically, previous factorized ML systems do not support gradient boosting. Furthermore, while earlier factorized ML systems have demonstrated good performance, their success is closely linked to specialized execution engines. 
It is unclear whether implementing factorized ML as pure query rewriting can enable existing DBMSes to deliver performance (C2) competitive with specialized ML libraries.</p><p>To support factorized gradient boosting, each iteration trains a decision tree on the residuals of the prior tree, requiring updates to Y in the denormalized join result J; however, J is not materialized in factorized ML, and the view update generally requires O(|J|) time <ref type="bibr">[1]</ref>. To address this, for snowflake schemas, we exploit the 1-1 mapping between J and the fact table F to update Y in F directly. For the more complex galaxy schemas, we found that although directly updating Y in J is challenging, tree training only relies on aggregates of Y (e.g., count, sum) that can be updated efficiently. Our technique identifies addition-to-multiplication preserving, the key property to efficiently update aggregates, by rewriting residual updates over Y in J into a join with an update relation U. However, each boosting iteration generates a new update relation, which may create cycles and does not scale with the number of iterations. To address this, we introduce the Clustered Predicate Tree (CPT), which restricts tree splits to attributes within the same cluster. This enables us to train gradient boosting on the IMDB dataset, which was previously prohibitive due to the large size of J (&gt;1TB). To assess the viability of implementing factorized ML with pure SQL, we conduct extensive experiments on DuckDB and a popular commercial DBMS. Although columnar DBMSes have the potential to compete with specialized ML libraries, we find residual updates to be the major bottleneck for gradient boosting in current columnar DBMSes: residual updates require sequential writes to all values in a column. However, this process is not efficient in existing DBMSes due to a combination of columnar compression, write-ahead logging (WAL), and concurrency control (CC); these are fundamental to DBMSes but unnecessary for gradient boosting. 
Disabling WAL, CC, and compression in existing DBMSes directly is challenging, as they are deeply integrated into the codebase. To match the performance of LightGBM, update performance must be similar to a parallelized write to an in-memory byte array.</p><p>To minimize DBMS overheads for residual updates (C2) without compromising portability (C1), we explore the approach of creating a new column of residual values and adding it as a projection <ref type="bibr">[67]</ref> on columnar DBMSes. We emulate this on DuckDB in two ways, with no or little modification to its internals. First, by leveraging the DuckDB Pandas API <ref type="bibr">[7]</ref>, we store the fact table in a Pandas dataframe, use DuckDB for joins and aggregations, and update the residual by creating a new Pandas dataframe column. This results in a &#8764;15&#215; improvement in updates, making gradient boosting competitive without modifying DuckDB. However, one drawback is a &#8764;1.6&#215; slowdown in aggregations due to the DuckDB-Pandas interop overhead. To explore the full potential, we further modify DuckDB internals to support column swaps between DuckDB tables. We update residuals by swapping the existing residual column with the newly created one, making JoinBoost 1.1&#215; faster than LightGBM. Lastly, we simulate column swap on a commercial DBMS, and see a potential 15&#215; improvement. Such modifications, being both feasible (&lt; 100 LOC on DuckDB) and effective, provide direction for other closed-source columnar DBMSes to support in-DB gradient boosting efficiently.</p><p>Finally, we apply optimizations to further improve JoinBoost performance. Algorithmically, we enhance prior batch optimization <ref type="bibr">[61]</ref> by sharing intermediate results (materialized as tables in DBMSes) across batches (tree nodes), leading to a 3&#215; improvement. 
System-wise, we leverage inter-query parallelism among SQL queries, accelerating gradient boosting training by 28% and random forest training by 35%. We conduct extensive experiments with JoinBoost on various DBMS backends (local and cloud), using datasets with varying feature counts, sizes (TPC-DS with SF 10&#8594;1000), and join graph complexity (galaxy schemas), against SOTA ML libraries (LightGBM, XGBoost, Sklearn) and In-DB systems (LMFAO, MADLib). On a single node, JoinBoost trains a gradient boosting model with 100 trees 1.1&#215; faster than LightGBM, and is 3&#215; faster for random forests. On multiple nodes, JoinBoost outperforms the Dask <ref type="bibr">[59]</ref> versions of LightGBM and XGBoost by &gt;9&#215;. JoinBoost easily scales in the # of features, join graph complexity, and dataset size (TPC-DS SF=1000 in Section 6.2), whereas LightGBM encounters memory limitations. Note: The paper is self-contained and references to appendices can be skipped, or can be found in the technical report <ref type="bibr">[35]</ref>.</p><p>ML libraries. Popular tree-based ML libraries include LightGBM <ref type="bibr">[42]</ref> and XGBoost <ref type="bibr">[25]</ref>, both of which are highly optimized and outperform others as per previous benchmarks <ref type="bibr">[18]</ref>. According to the Kaggle 2021 survey <ref type="bibr">[8]</ref>, among all commonly used ML libraries, XGBoost and LightGBM are ranked 4th and 6th in popularity, respectively (while all of the top 3 also support tree-based ML). However, none of them apply factorized ML to normalized datasets. In-DB ML systems. Most in-DB ML works <ref type="bibr">[28,</ref><ref type="bibr">32,</ref><ref type="bibr">68]</ref> focus on extending DBMSes (e.g., PostgreSQL for MADLib) to support linear algebra using UDFs and user-defined types. 
However, these are not needed for tree-based models, which only rely on simple aggregations.</p><p>Other work optimizes ML training by leveraging specialized features (distributed execution <ref type="bibr">[20,</ref><ref type="bibr">38,</ref><ref type="bibr">39,</ref><ref type="bibr">45,</ref><ref type="bibr">70]</ref> or GPU acceleration <ref type="bibr">[19,</ref><ref type="bibr">33,</ref><ref type="bibr">46,</ref><ref type="bibr">54]</ref>) of a specific DBMS. Although we focus on optimizing the single-node CPU setting via SQL rewrites, JoinBoost can run on any DBMS and benefit from its optimizer and executor. For instance, Section 6.2 scales decision tree training on TPC-DS SF=1000 using a cloud DBMS. We leave further DBMS-, data-, and model-specific optimizations to future work.</p><p>In practice, cloud vendors (Azure <ref type="bibr">[11]</ref>, Redshift <ref type="bibr">[14]</ref>, BigQuery <ref type="bibr">[47]</ref>, Snowflake <ref type="bibr">[13]</ref>) offer SQL syntax to train tree models. Under the covers, however, they still materialize and export join results to an external library (LightGBM or XGBoost). Thus, we do not consider them in-DB ML from a performance or portability standpoint. Factorized ML systems. Factorized ML optimizes model training over normalized databases. It translates ML models into aggregations over an appropriately designed semi-ring, and pushes the aggregations through joins to achieve asymptotically lower time complexity. These systems support many popular models (ridge regression <ref type="bibr">[61]</ref>, SVM <ref type="bibr">[43]</ref>, factorization machines <ref type="bibr">[61]</ref>), and approximate others (k-means <ref type="bibr">[27]</ref>, GLM <ref type="bibr">[37]</ref>). Of these, only LMFAO <ref type="bibr">[61]</ref> supports decision trees, albeit in a limited way (see Section 6.4). 
Further, most factorized ML works build specialized query optimizers and executors from scratch, which hinders portability (of the systems and performance wins) to existing DBMSes. JoinBoost extends factorized ML to tree models via vanilla SQL queries, and shows competitive performance with LightGBM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">BACKGROUND</head><p>We provide background on factorized tree-based models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Annotated Relations and Message Passing</head><p>This section provides an overview of annotated relations and message passing, which are fundamental to factorized query execution <ref type="bibr">[17,</ref><ref type="bibr">52]</ref>. Data Model. We use the traditional relational data model with the following notation: given a relation R, let uppercase A be an attribute, dom(A) be its domain, S_R = [A_1, ..., A_n] be R's schema, t ∈ R be a tuple of R, and t[A] be tuple t's value of attribute A. For clarity, we include the schema in square brackets after the relation name (e.g., R[A, B]).</p><p>Annotated Relations. The annotated relational model <ref type="bibr">[30,</ref><ref type="bibr">40,</ref><ref type="bibr">50]</ref> maps each t ∈ R to a commutative semi-ring (D, ⊕, ⊗, 0, 1), where D is a set, ⊕ and ⊗ are commutative binary operators closed over D, and 0/1 are the zero/unit elements. These annotations form the basis of provenance semi-rings <ref type="bibr">[30]</ref>, and are amenable to query optimizations based on algebraic manipulation. Different semi-ring definitions support different aggregation functions, ranging from standard statistical functions to ML models. For instance, the natural numbers semi-ring (N, +, ×, 0, 1) allows for integer addition and multiplication, and supports the COUNT aggregate. For an annotated relation R, let a(t) denote tuple t's annotation. Semi-ring Aggregation Query. Semi-ring aggregation queries can now be re-expressed over annotated relations by translating group-by (general projection) and join operations respectively into + and × operations over the semi-ring annotations:</p><p>(1) Each group-by result annotation in γ_A(R) sums all the annotations in its input group. (2) In R ⋈ S, the annotation of each join result tuple is the product of the annotations of the corresponding tuples in R and S. Aggregation Pushdown. 
The key optimization in factorized query execution <ref type="bibr">[17,</ref><ref type="bibr">62]</ref> is to distribute aggregations (additions) through joins (multiplications). Consider γ(R[A, B] ⋈ S[B, C] ⋈ T[C, D]).</p><p>The naive execution first materializes the join and then computes the aggregation. This costs O(n³), where n is each relation's cardinality. An alternative evaluation applies aggregation (addition) to R before joining (multiplication) with S, aggregates again before joining with T, and then applies a final aggregation. The largest intermediate result, and thus the join complexity, is now O(n).</p><p>Message Passing. Given a join graph, the above optimization can be viewed as message passing <ref type="bibr">[55]</ref>. While message passing supports general SPJA queries <ref type="bibr">[36]</ref>, for tree models it is sufficient to restrict ourselves to SPJA queries with zero (γ) or one (γ_A) group-by attribute. Message passing operates over a tree that spans the join graph. For the root, we can pick any relation (taking cost models into account) that contains the grouping attribute. Then we direct all the edges of the join graph toward the root to form a tree. Starting from the leaves, we send messages along their paths to the root. Each message is computed as follows: (1) Join the relation with all incoming messages from children relations, if any. This blocks until all children have emitted messages. Then (2) let A be the attributes in the current relation that are also contained in the remaining relations on the path to the root. Compute γ_A over the previous join result, and emit the result to the parent relation.</p></div>
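The pushdown can be made concrete with plain SQL. The sketch below (relation names follow the running example; column and alias names are ours) computes a COUNT over a three-way join both naively and factorized, using the natural numbers semi-ring where group-bys sum and joins multiply the count annotations:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE R(a INT, b INT);
CREATE TABLE S(b INT, c INT);
CREATE TABLE T(c INT, d INT);
INSERT INTO R VALUES (1,1),(2,1),(3,2);
INSERT INTO S VALUES (1,1),(1,2),(2,1);
INSERT INTO T VALUES (1,10),(2,20);
""")

# Naive: materialize R ⋈ S ⋈ T, then aggregate (large intermediate result).
naive = con.execute(
    "SELECT COUNT(*) FROM R JOIN S USING (b) JOIN T USING (c)"
).fetchone()[0]

# Factorized: push the COUNT annotation through each join, so every
# intermediate stays one row per join-key value.
factorized = con.execute("""
SELECT SUM(cnt_rs * cnt_t) FROM (
  SELECT c, SUM(cnt_r) AS cnt_rs
  FROM (SELECT b, COUNT(*) AS cnt_r FROM R GROUP BY b) mr
  JOIN S USING (b) GROUP BY c
) mrs
JOIN (SELECT c, COUNT(*) AS cnt_t FROM T GROUP BY c) mt USING (c)
""").fetchone()[0]
```

Both plans return the same count; only the intermediate sizes differ.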
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Consider again γ(R[A, B] ⋈ S[B, C] ⋈ T[C, D]) along the join graph R -- S -- T. If we choose T as the root, then the directed join graph is R → S → T, and the messages are m1 = γ_B(R) from R to S and m2 = γ_C(S ⋈ m1) from S to T.</p><p>Once the root receives and joins with all messages, it performs absorption, which simply applies the final group-by: γ(T ⋈ m2). In some cases, the aggregate result is already part of the semi-ring (e.g., COUNT and the natural numbers semi-ring); in other cases, such as tree-based models, the semi-ring decomposes the training metric into its constituent statistics, so we combine them in the final annotation to restore the metric (see next section).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Tree-based Models</head><p>In this section, we describe the traditional tree-based models. The algorithms are based on CART <ref type="bibr">[22]</ref>, and its extensions (e.g., bagging and boosting) follow standard ML textbooks <ref type="bibr">[48]</ref>. Decision Tree. A decision tree maps a set of features X to a (predicted) target variable Y. Internally, it maintains a tree structure of selection predicates, with each edge containing a predicate and selecting data based on the conjunction of the predicates along the path from the root. Each leaf node is associated with a prediction value p ∈ dom(Y). The selection predicates of the leaves are mutually exclusive and collectively exhaustive over dom(X). For x ∈ dom(X), the decision tree predicts by traversing the edges whose predicates x satisfies until a leaf is reached, then outputs that leaf's prediction p.</p><p>Training a decision tree over relation R with features X and target variable Y, where X ∪ {Y} ⊆ S_R, involves recursively splitting R to minimize a criterion c(•). For regression, the popular criterion is Variance for root mean square error (rmse), while Gini impurity, entropy, and chi-square are common for classification. Numerical attribute splits use inequality predicates (σ: A &gt; v, ¬σ: A ≤ v), and categorical attribute splits use equality (σ: A = v, ¬σ: A ≠ v) or set-based (σ: A ∈ V, ¬σ: A ∉ V) predicates. Given the criterion c(•) and split σ, the criterion reduction after the split is c(R) − (c(σ(R)) + c(¬σ(R))). Tree growth can be depth-wise or best-first <ref type="bibr">[65]</ref>. Depth-wise growth splits the tree node with the least depth, and best-first growth greedily splits the tree node with the largest criterion reduction. Finally, each leaf's prediction is the average of Y for regression or the mode for classification.</p><p>Algorithm 1 presents the training algorithm. 
Given relation R (with Y encoded in annotations), features X, and a maximum number of leaves, the algorithm iteratively calls GetBestSplit to find the best next split (L3,8,9). GetBestSplit iterates through all features, evaluates each feature's best split (based on criterion reduction), and returns the best split across all features (L11-16). For best-first growth, a priority queue (L2) sorts leaf nodes in descending order of criterion reduction. In each iteration, the algorithm splits the best leaf node (L7), creating two tree nodes that partition the parent node. It then finds the best split for each new node and pushes them to the queue (L8,9). Training ends when the number of leaves reaches the maximum (L5). Leaf node predictions are computed, and the tree is returned (L10). During training, evaluating the best split (L14) is the most computationally expensive step. Bagging and Boosting. Big, deep decision trees risk overfitting. Ensemble methods combine multiple smaller or shallower decision trees to create a more robust model. Popular tree-based ensemble models include random forests <ref type="bibr">[21]</ref> (bagging) and gradient boosting <ref type="bibr">[29]</ref> (boosting). Random forests train decision trees in parallel on samples of R and of the features, and aggregate their predictions (e.g., by averaging).</p><p>Algorithm 1: Decision tree training algorithm. L14 (underlined) is the most computationally expensive.</p></div>
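Algorithm 1's best-first loop can be sketched in plain Python (a simplified in-memory version; all function and variable names are ours, and the criterion is the sum of squared errors q − s²/c from the next section):

```python
import heapq
import itertools

def sse(rows):
    # Variance criterion as sum of squared errors: q - s^2/c.
    c = len(rows)
    if c == 0:
        return 0.0
    s = sum(r["y"] for r in rows)
    q = sum(r["y"] ** 2 for r in rows)
    return q - s * s / c

def get_best_split(rows, features):
    # Evaluate every (feature, threshold) split and return the one with the
    # largest criterion reduction (Algorithm 1, L11-16).
    best = (0.0, None, None)  # (reduction, feature, threshold)
    for f in features:
        for v in sorted({r[f] for r in rows}):
            left = [r for r in rows if r[f] <= v]
            right = [r for r in rows if r[f] > v]
            if not left or not right:
                continue
            red = sse(rows) - (sse(left) + sse(right))
            if red > best[0]:
                best = (red, f, v)
    return best

def decision_tree(rows, features, max_leaves):
    # Best-first growth: a max-heap (negated reduction) of splittable leaves.
    tie = itertools.count()  # tie-breaker so row lists are never compared
    heap = [(-get_best_split(rows, features)[0], next(tie), rows)]
    leaves = []
    while heap and len(heap) + len(leaves) < max_leaves:
        _, _, node = heapq.heappop(heap)  # leaf with the largest reduction
        red, f, v = get_best_split(node, features)
        if f is None:  # no useful split: freeze this leaf
            leaves.append(node)
            continue
        for child in ([r for r in node if r[f] <= v],
                      [r for r in node if r[f] > v]):
            heapq.heappush(heap, (-get_best_split(child, features)[0],
                                  next(tie), child))
    leaves += [n for _, _, n in heap]
    # Leaf prediction: the mean of y (regression).
    return [sum(r["y"] for r in n) / len(n) for n in leaves]

rows = [{"x": i, "y": 0.0 if i < 5 else 10.0} for i in range(10)]
preds = decision_tree(rows, ["x"], 2)  # one split separates the two groups
```

The factorized version in Section 3.3 keeps this control flow but replaces the in-memory scans inside get_best_split with semi-ring aggregation queries.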
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Factorized Decision Tree</head><p>We now introduce factorized decision trees over joins. The training process follows Algorithm 1, but we optimize the computation of the criterion reduction (L14), the most computationally intensive part, by avoiding join materialization. We consider a database with a set of relations R = {R_1, R_2, ..., R_n}, and aim to train a decision tree over J = R_1 ⋈ R_2 ⋈ ... ⋈ R_n with a set of features X and target variable Y. Let R_Y be the relation that contains Y (or pick one if Y appears in many relations as a join key). For simplicity, we will assume natural joins, set semantics, and the Variance split criterion; Appendix B describes extensions to theta and outer joins, bag semantics, and formulas for classification semi-rings. Factorized learning avoids materializing J by expressing the criterion as semi-ring aggregation queries and applying aggregation pushdown over the joins. We first describe semi-ring annotations for trees and then the computation of the criterion. Tree Semi-rings. We now illustrate how to use the variance semi-ring (Table <ref type="table">1</ref>) to compute the Variance criterion for regression. Each semi-ring defines a lift function λ(•) <ref type="bibr">[50]</ref> that annotates a base tuple with its appropriate semi-ring element. The variance semi-ring lifts t ∈ R_Y to (1, t[Y], t[Y]²), and tuples from the remaining relations to the 1 element (1, 0, 0). During message passing, the annotations are combined via ⊕ and ⊗ as defined in Table <ref type="table">1</ref>. The aggregated semi-ring annotation γ(J) then forms a 3-tuple (c, s, q) that denotes the count, the sum of Y, and the sum of squares of Y. The variance criterion can be derived from this aggregated semi-ring as q − s²/c. Thus, any filtered aggregation query over the join graph can be expressed using message passing and lightweight post-processing. Consider Figure <ref type="figure">1a</ref> with the variance semi-ring, join graph R -- S -- T, and target variable Y. 
To compute the variance over J, the naive solution is to materialize J (Figure <ref type="figure">1b</ref>), then compute the variance (=4) over Y. Instead, we compute (c, s, q) = γ(J) = (8, 16, 36) and q − s²/c = 36 − 16²/8 = 4.</p><p>To find the best split, we similarly use the variance semi-ring to compute the reduction in variance (Appendix A). Message Caching. Training a decision stump requires computing a set of aggregation queries grouped by each feature: {γ_A(J) | A ∈ X}. Executing each independently via message passing is wasteful due to reusable messages. Consider the example in Figure <ref type="figure">1</ref>:</p><p>Example 2. Let X = {B, D}, and suppose we first aggregate on B (Figure <ref type="figure">1c</ref>). We choose S as the message passing root because it contains B; we then pass messages m1 from R to S and m2 from T to S, and absorb the messages into S. We then aggregate on D (Figure <ref type="figure">1d</ref>). We choose T as the message passing root, pass m1 from R to S and m3 from S to T, and absorb into T. The two queries can reuse m1.</p><p>Recent work <ref type="bibr">[36]</ref> shows that a simple caching scheme that materializes all messages between relations in the join graph (in both directions) is effective in various analytical workloads. The factorized learning system LMFAO <ref type="bibr">[61]</ref> also optimizes batches of queries for splitting a single decision tree node. For decision trees, where each query groups by ≤1 feature, its optimizations are equivalent to this simple caching scheme. However, LMFAO optimizes a single decision tree node's batch of queries, missing work-sharing across tree nodes and not supporting residual updates for boosting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">FACTORIZED GRADIENT BOOSTING</head><p>This section delves into the algorithms for factorized gradient boosting, which iteratively trains the next tree to predict the residuals of the preceding trees. Recall from Section 3.2 that each leaf l in the decision tree is associated with a predicate l.σ and a prediction l.p. We aim to update, for each leaf l, the target variable of t ∈ l.σ(J) to the residual t[E] = t[Y] − l.p. The fundamental challenge is to train a decision tree over the updated Y in J without materializing J. One tempting approach is to treat this as a view update problem: we update Y in the view, translate that into updates over the base relations, re-lift the updated base relations, and train the next factorized decision tree. However, view updates are susceptible to "side-effects" <ref type="bibr">[1]</ref>: for t ∈ R_Y, y = t[Y] can be duplicated in J due to 1-N (or M-N) joins. If one duplicate is updated to a new value y′, updating t[Y] = y′ in R_Y would inadvertently cause the other duplicates to be updated as well (the "side-effects"). To address this, new tuples are created in the base relations for the new mappings. These "side-effects" are particularly problematic for residual updates, as the full Y column in J is updated, requiring O(|J|) new tuples. We address these issues for snowflake (a single fact table) and galaxy (multiple fact tables) schemas, arguably the most common database schema forms. Snowflake schemas exhibit a 1-to-1 relationship between the fact table and J, and we exploit this to efficiently update Y over the fact table. We then discuss the system challenges of performing full-column updates. These techniques do not apply to galaxy schemas due to the M-N relationships between the fact tables. 
While directly updating Y in J remains challenging, we identify the addition-to-multiplication preserving property, which allows us to directly update semi-ring aggregates derived from the updated Y, and use this to efficiently support the popular rmse criterion.</p></div>
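The side-effect problem can be illustrated concretely (a toy schema of our own): with a 1-N join, a single Y value in the base relation fans out to several join-result tuples, so distinct per-tuple residuals cannot be written back to the one base cell.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE RY(k INT, y REAL);      -- base relation holding Y
CREATE TABLE T(k INT, a TEXT);       -- joins 1-N with RY on k
INSERT INTO RY VALUES (1, 5.0);
INSERT INTO T VALUES (1, 'u'),(1, 'v');
""")

# The (virtual) join result J has two tuples sharing the single y = 5.0:
j = con.execute(
    "SELECT a, y FROM RY JOIN T USING (k) ORDER BY a"
).fetchall()

# If leaf predicates assign different predictions to a='u' and a='v', the
# two J-tuples need different residuals (5-1=4 and 5-2=3), but RY has only
# one y cell to write; updating it for one duplicate corrupts the other.
```

This is exactly why the galaxy-schema technique below updates aggregates instead of individual Y values.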
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Snowflake Schemas</head><p>Snowflake schemas have one fact table F and N-to-1 paths along the dimension tables; this means that F is 1-1 with J, and we can directly update Y in F. Let us first assume that Y is in F (i.e., R_Y = F). The main challenge is to translate a leaf l's predicate l.σ, which may refer to dimension attributes, into one over F. We do so by translating l.σ into semi-join predicates over F. Given a join path F - R_1 - ... - R_n and a predicate σ(R_n), we "move" the predicate one step to σ′(R_{n-1}), where σ′ filters R_{n-1} on the join keys K = S_{R_{n-1}} ∩ S_{R_n}. Note that σ does not have to be equality-based but can be of arbitrary type: σ′ checks membership in π_K(σ(R_n)) to identify the matching join keys for the semi-join. Repeating this walks the predicate down the path to F. If Y is in a dimension table (i.e., R_Y ≠ F), we join the relations along the path from R_Y to F and project all attributes in F along with Y. This reduces to the first case above. System Challenge: Since |F| = |J|, factorized gradient boosting does not offer an asymptotic advantage over specialized ML libraries. Theoretically, if the performance is comparable, it could still benefit from the parallelization, scaling, and administrative features of DBMSes. However, in practice, we find that bulk updates to Y in F are a major bottleneck (Section 5.3.2) for existing DBMSes. The next section dives into these system challenges, and explores system design, logical, and physical optimizations to address them.</p></div>
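The predicate "move" can be sketched in SQL (table, column, and value names are ours): a leaf predicate over a dimension attribute becomes a semi-join filter on the fact table's join key, so the residual update touches only F.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE F(pid INT, y REAL);      -- fact table, 1-1 with the join result
CREATE TABLE D(pid INT, color TEXT);  -- dimension table (N-to-1 from F)
INSERT INTO F VALUES (1, 1.0),(2, 2.0),(3, 3.0);
INSERT INTO D VALUES (1, 'red'),(2, 'blue'),(3, 'red');
""")

# Leaf predicate sigma: color = 'red' on D. Moved to F as a semi-join on the
# join key pid, the residual update Y := Y - l.p runs entirely on F.
leaf_prediction = 0.5
con.execute("""
UPDATE F SET y = y - ?
WHERE pid IN (SELECT pid FROM D WHERE color = 'red')
""", (leaf_prediction,))

residuals = dict(con.execute("SELECT pid, y FROM F"))
```

Only the two fact rows matching the semi-join are updated; the dimension table is read-only.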
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Galaxy Schemas</head><p>Galaxy schemas model M-N relationships between multiple fact tables. They are prevalent in enterprise settings <ref type="bibr">[60]</ref>; the recent "semantic layers" trend <ref type="bibr">[5,</ref><ref type="bibr">12,</ref><ref type="bibr">24,</ref><ref type="bibr">34]</ref> (pre-defined denormalized views) has made analysis and ML over galaxy schemas more accessible. However, galaxy schemas induce 1-N relationships between R_Y and J that cause side-effects <ref type="bibr">[1]</ref> during residual updates. We observe that individual Y values are not needed for training: the split criterion only refers to semi-ring aggregates over the updated Y values (Section 3.3), which can potentially be updated efficiently. For instance, for a leaf l with original aggregates c = Σ1 and s = ΣY over l.σ(J), we can directly update the sum of residuals as Σ(Y − l.p) = ΣY − Σl.p = s − c × l.p, without referencing individual Y values. Unfortunately, this approach is not efficient for arbitrary split criteria. To review, factorized ML maps Y (via λ(•)) from the real number semi-ring R(ℝ, +, ×, 0, 1) (for base relations) to another semi-ring S(S, ⊕, ⊗, 0′, 1′) (e.g., the variance semi-ring for rmse), and computes γ(J) with aggregation pushdown. To directly update the aggregates given leaf l, we want to find a function g: (S, ℝ) → S, such that g takes the original aggregates γ(Y) (without referencing individual Y values) and the additive inverse of the prediction, −l.p (footnote 3), as input, and outputs the updated aggregates: g(γ(Y), −l.p) = γ(Y − l.p). To maintain our asymptotic benefits, both g and the semi-ring operations/annotations should be of constant time/size. Such a g does not exist for all semi-rings. For example, mean absolute error (mae) relies on the count and the sum of signs: Σ1, Σsign(Y). The naive semi-ring would track γ(Y) = (Σ1, Σsign(Y)). However, g does not exist because, given the leaf l, Σsign(Y − l.p) cannot be determined solely from Σ1, Σsign(Y), and −l.p, as it depends on the individual values of Y − l.p. 
A naive solution uses a semi-ring that collects all values, but its annotation size would be linear in the data size and defeats the purpose of factorization. To this end, we identify a sufficient property for a constant-sized semi-ring S to construct g: Definition 1. Addition-to-Multiplication Preserving. Given two semi-rings R(R, +, &#215;, 0, 1), S(S, &#8853;, &#8855;, 0 &#8242;, 1 &#8242;), a lift function &#955;(&#8226;) : R&#8594;S is addition-to-multiplication preserving from R to S if &#955;(r1 + r2) = &#955;(r1) &#8855; &#955;(r2)</p><p>for any r1, r2 &#8712; R.</p><p>3 The additive inverse of l.p exists because R is also a ring.</p><p>Based on the property, we can construct g as g(a, -l.p) = a &#8855; &#955;(-l.p), where a denotes the leaf's original aggregate; distributivity of &#8855; over &#8853; guarantees that multiplying the aggregate once equals multiplying every lifted tuple. It is easy to verify that the variance semi-ring (and lift) in Table <ref type="table">1</ref> satisfies the property: &#955;(y1 + y2) = (1, y1 + y2, (y1 + y2)&#178;) = (1, y1, y1&#178;) &#8855; (1, y2, y2&#178;). In contrast, there is no known constant-sized semi-ring with this property for mae. A possible future direction is to use polynomial approximations <ref type="bibr">[37]</ref> of the semi-ring aggregates. To keep the text concrete, the subsequent text will refer to the rmse criterion.</p></div>
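To make the property concrete, here is a minimal Python sketch (our illustration, not JoinBoost code; all names are ours) of the variance semi-ring lift, sum, and product, checking both the addition-to-multiplication preserving property and the aggregate update g(a, -p) = a &#8855; &#955;(-p):

```python
# Minimal sketch of the variance semi-ring and Definition 1.
# lift(y) = (c, s, q) = (1, y, y^2); add is the semi-ring sum over tuples,
# mul is the semi-ring product.

def lift(y):
    return (1, y, y * y)

def add(a, b):  # component-wise sum: aggregates two annotations
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mul(a, b):  # variance semi-ring product
    c1, s1, q1 = a
    c2, s2, q2 = b
    return (c1 * c2, c1 * s2 + c2 * s1, c1 * q2 + 2 * s1 * s2 + c2 * q1)

def aggregate(ys):
    total = (0, 0, 0)  # additive identity
    for y in ys:
        total = add(total, lift(y))
    return total

# Addition-to-multiplication preserving: lift(y1 + y2) == mul(lift(y1), lift(y2))
assert lift(3 + 4) == mul(lift(3), lift(4))

# Residual update g(a, -p) = a (x) lift(-p): update a leaf's aggregates
# without touching any individual y value.
ys, p = [2.0, 5.0, 7.0], 1.5
assert mul(aggregate(ys), lift(-p)) == aggregate([y - p for y in ys])
```

The last assertion is exactly the constant-time update: the leaf's stored triple is multiplied once by the lifted negated prediction, instead of re-aggregating residuals tuple by tuple.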
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Update Relation.</head><p>For a decision tree and each of its leaves l, we apply the corresponding g, which multiplies the aggregates over the tuples satisfying l's predicate with &#955;(-l.p). We model this as an Update Relation U that joins with the join result. U is constructed as follows: let A be the set of attributes referenced by the predicate of any leaf l; U is the projection of the join result onto A, along with a column of the additive inverse of the leaf prediction (unique, as leaf predicates are non-overlapping). Naively, the next boosting iteration lifts the residual directly 4 , which requires the join result to be materialized. Instead, we rewrite the lifted residual as the semi-ring product of the lifted target and the lifted negated prediction: </p></div>
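As a concrete (hypothetical) illustration of an Update Relation, the sqlite3 sketch below stores variance semi-ring annotations (c, s, q) on a toy fact table and applies each leaf's g by joining with U, which holds one negated prediction per covered attribute value; all table and column names are our stand-ins, not JoinBoost's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Toy fact table annotated with variance semi-ring components (c, s, q);
# "a" is the only attribute referenced by the leaf predicates.
cur.execute("CREATE TABLE F (a INT, c INT, s REAL, q REAL)")
cur.executemany("INSERT INTO F VALUES (?,?,?,?)",
                [(1, 1, 2.0, 4.0), (1, 1, 5.0, 25.0), (2, 1, 7.0, 49.0)])
# Update relation U: the negated prediction -p for each attribute value a
# leaf covers (leaves here: a=1 -> p=1.5, a=2 -> p=6.0).
cur.execute("CREATE TABLE U (a INT, neg_p REAL)")
cur.executemany("INSERT INTO U VALUES (?,?)", [(1, -1.5), (2, -6.0)])
# Joining F with U applies g per tuple, i.e., the semi-ring product with
# lift(neg_p) = (1, neg_p, neg_p^2):
#   s' = s + c*neg_p,  q' = q + 2*s*neg_p + c*neg_p^2
cur.execute("""
    CREATE TABLE F_new AS
    SELECT F.a, c,
           s + c * neg_p                          AS s,
           q + 2 * s * neg_p + c * neg_p * neg_p  AS q
    FROM F JOIN U ON F.a = U.a
""")
rows = cur.execute("SELECT a, c, s, q FROM F_new ORDER BY a, s").fetchall()
# rows -> [(1, 1, 0.5, 0.25), (1, 1, 3.5, 12.25), (2, 1, 1.0, 1.0)]
```

Each updated (c, s, q) triple matches the lift of the residual y - p, so no residual column is ever materialized.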
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Algorithmic Challenge.</head><p>Unfortunately, U can introduce cycles in the join graph over time (e.g., Figure <ref type="figure">2c</ref> shows such a cycle); whereas Message Passing requires acyclic join graphs. Standard hypertree decomposition <ref type="bibr">[17,</ref><ref type="bibr">41]</ref> removes cycles by joining the relations in a cycle, materializing their join result, and replacing these relations in the join graph with the materialized result. However, as the number of trees in the model increases, the set of referenced attributes is likely to span the entire join graph, so the decomposed relation converges to the size of the full join result.</p><p>To address this, we propose Clustered Predicate Trees (CPT): each boosted tree restricts splits to features that can be pushed to the same fact table. To do so, we cluster relations such that, within each cluster, a single fact table maintains N-to-1 relationships with all other relations. Predicates in this cluster can be rewritten as semi-joins to the same fact table (Section 4.1) and won't create cycles <ref type="bibr">[17]</ref>. During training, while the root decision tree node can split on any feature, subsequent splits are confined to attributes within the same cluster. Although this might affect model accuracy, it allows efficient residual updates over join graphs, which would otherwise be untrainable using existing techniques.</p><p>4 When g is applied to a column, we apply it to each value; each resulting tuple has the same annotation as the materialized lifted residual, so the materialization of the join with U can be avoided.</p><p>Figure <ref type="figure">3</ref>: Clusters for the IMDB dataset. Each cluster is enclosed by dotted lines and its fact table is filled.</p><p>Example 4. Figure <ref type="figure">3</ref> shows the join graph of the IMDB dataset <ref type="bibr">[2]</ref>, which was previously prohibitive for training gradient boosting due to the large join size (&gt;1 ). 
The five clusters are enclosed by dotted lines and each cluster's fact table is highlighted. If the current tree initially splits on Person's age (in Person Info), the rest of the tree can only split on attributes in Person or Person Info.</p></div>
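The clustering step can be sketched as a small graph computation: treat each N-to-1 reference as a directed edge and let every candidate fact table claim the relations it can reach, keeping only maximal clusters. The relation names below mimic Figure 3, but the code is our illustrative sketch, not JoinBoost's implementation:

```python
# Hypothetical sketch: group relations into clusters where one fact table
# reaches every other cluster member via N-to-1 edges only, so predicates
# can be pushed to that fact table as semi-joins without creating cycles.
def build_clusters(relations, n_to_1):
    # n_to_1: directed edges (child, parent) meaning child is N-to-1 to parent
    clusters = {}
    for fact in relations:
        reach, frontier = {fact}, [fact]
        while frontier:
            r = frontier.pop()
            for (child, parent) in n_to_1:
                if child == r and parent not in reach:
                    reach.add(parent)
                    frontier.append(parent)
        clusters[fact] = reach
    # keep only maximal clusters: a relation whose reachable set is strictly
    # contained in another's is not a useful fact-table anchor
    return {f: c for f, c in clusters.items()
            if not any(c < c2 for c2 in clusters.values())}

relations = ["Cast_Info", "Person", "Person_Info", "Movie", "Movie_Info"]
n_to_1 = [("Cast_Info", "Person"), ("Cast_Info", "Movie"),
          ("Person_Info", "Person"), ("Movie_Info", "Movie")]
clusters = build_clusters(relations, n_to_1)
# cluster anchors -> {"Cast_Info", "Person_Info", "Movie_Info"}
```

On this toy schema, Person and Movie are absorbed into larger clusters, while Cast_Info, Person_Info, and Movie_Info each anchor a cluster, matching the "one fact table per cluster" rule above.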
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">JOINBOOST OVERVIEW AND OPTIMIZATIONS</head><p>We now present the system overview and optimizations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">JoinBoost Developer Interface</head><p>As described in Section 1, our design goals are portability, performance, and scalability. We evaluate performance and scalability in the experiments, so we focus on portability below. Portability: Our design deviates from prior factorized ML works <ref type="bibr">[25,</ref><ref type="bibr">42,</ref><ref type="bibr">44,</ref><ref type="bibr">61]</ref>, which build (fast) custom execution engines. In contrast, JoinBoost is implemented as a Python library that transparently generates SQL queries for the user's DBMS backend (Figure <ref type="figure">4</ref>). JoinBoost internally translates the ML algorithms into CREATE TABLE and SELECT SQL queries. Compared to in-DB systems like MADLib <ref type="bibr">[32]</ref>, JoinBoost generates pure SQL and does not require user-defined types or functions. This enables portability (criteria C2): JoinBoost runs on embedded databases, single-node databases, cloud data warehouses, and even Pandas and R dataframes <ref type="bibr">[7]</ref>.</p><p>The compiler fully supports decision trees, random forests, and gradient boosting with all learning parameters that LightGBM supports. Currently, regression with the rmse objective supports galaxy schemas with Clustered Predicate Trees; other objectives (e.g., mae, huber, ...) require snowflake schemas. We are actively extending capabilities to support pruning, dropout, and early stopping, which build on the techniques in the preceding sections.</p><p>One usability challenge is that JoinBoost end users are typically domain scientists who lack expertise in data access and schemas, which are commonly handled by data engineers. While addressing this gap is beyond our scope, we highlight the recent trend of the "semantic layer" <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">12,</ref><ref type="bibr">24]</ref> as a means to bridge the knowledge gap. 
In the "semantic layer", data engineers employ SQL-based recipes to transform and preprocess raw data into tables whose attributes are meaningful to domain scientists, and which JoinBoost can query using SQL. Safety: Training shall never modify user data. To achieve this, JoinBoost creates temporary tables in a specified namespace or with a unique prefix. By default, JoinBoost deletes these tables after training, but users can keep them for provenance or debugging.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Architecture Overview</head><p>The ML Compiler runs the ML logic (Algorithm 1) to train tree-based models in Python and translates the most computationally intensive step (L14, which identifies the best split across features for tree nodes) into SQL queries executed by the DBMS. At this stage, the compiler treats the join graph as a single "wide" table, with the SQL query operating on this table; factorization is applied in a later step. After training, it returns a reference to the trained model. The Semi-ring Library stores semi-ring definitions and translates math expressions in the compiler-generated queries (&#215;, +, &#955;) into SQL aggregation functions.</p><p>The lift &#955;(R) creates a copy of a base relation that contains an additional attribute for each component in the semi-ring (e.g., c, s, q in the variance semi-ring). This also ensures that any update in place will not modify user data. In addition to the variance (for regression) and the class count (for classification) semi-rings, JoinBoost implements semi-rings for a wide range of popular objectives; it supports rmse for snowflake and galaxy schemas, and mae, huber loss, fair loss, log loss, softmax and more for snowflake schemas (see Appendix B for the full list).</p><p>The Factorizer decomposes each aggregation query into message passing and absorption queries. It also materializes each message as a database table, and re-uses them when possible. After choosing a node split, the factorizer keeps messages that can be reused in descendant nodes (Section 5.5.1) and drops the rest.</p><p>Finally, the Connector takes our internal SQL representation and translates it into the appropriate SQL string or dataframe API calls. Although DBMSes are notorious for incompatible SQL variants, JoinBoost only uses a subset of SQL that is generally consistent across vendors. For instance, it generates standard non-nested SPJA queries with simple algebra expressions.</p></div>
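A sketch of how a semi-ring library might emit SQL for the variance semi-ring. The function names, column names (c, s, q), and query shapes are our assumptions; we verify the generated product query against sqlite3 from Python's standard library:

```python
import sqlite3

# Hypothetical sketch of compiling variance semi-ring operations to SQL.
# The semi-ring sum over a relation becomes SUM() aggregates; the product of
# two messages m1, m2 becomes algebra over their (c, s, q) columns.

def sum_to_sql(rel, group_by):
    return (f"SELECT {group_by}, SUM(c) AS c, SUM(s) AS s, SUM(q) AS q "
            f"FROM {rel} GROUP BY {group_by}")

def product_to_sql(m1, m2, join_key):
    return (f"SELECT {m1}.c * {m2}.c AS c, "
            f"{m1}.c * {m2}.s + {m2}.c * {m1}.s AS s, "
            f"{m1}.c * {m2}.q + 2 * {m1}.s * {m2}.s + {m2}.c * {m1}.q AS q "
            f"FROM {m1} JOIN {m2} ON {m1}.{join_key} = {m2}.{join_key}")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m1 (k INT, c INT, s REAL, q REAL)")
con.execute("CREATE TABLE m2 (k INT, c INT, s REAL, q REAL)")
con.execute("INSERT INTO m1 VALUES (1, 2, 3.0, 9.0)")
con.execute("INSERT INTO m2 VALUES (1, 1, 4.0, 16.0)")
row_sum = con.execute(sum_to_sql("m1", "k")).fetchone()          # (1, 2, 3.0, 9.0)
row_prod = con.execute(product_to_sql("m1", "m2", "k")).fetchone()
# (c1*c2, c1*s2 + c2*s1, c1*q2 + 2*s1*s2 + c2*q1) = (2, 11.0, 65.0)
```

Both queries are plain SPJA SQL with simple algebra, which is what keeps the generated code portable across DBMS vendors.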
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Residual Updates Logical Optimization</head><p>Although correct, the residual update technique in Section 4 is still expensive to implement naively. For simplicity, assume a single join attribute between the fact table F and the update relation U, and the variance semi-ring; the naive SQL query joins F with U to recreate the fact table with the updated semi-ring components c, s, q, while the remaining columns in F are copied over. Unfortunately, this is &gt;50&#215; slower than LightGBM's residual update procedure (see experiments below), because U can potentially be as large as the materialized join result.</p><p>To this end, we present an optimization for snowflake and galaxy schemas that completely avoids materializing U as well as the recreated fact table.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1">Semi-join Update Optimization.</head><p>We directly UPDATE the fact table's semi-ring annotations. Let us start with a snowflake schema. Each decision tree leaf logically corresponds to a separate join graph containing a set of messages (Figure <ref type="figure">6</ref>), with the leaf's predicate and its prediction l.p. Using the semi-join optimization (Section 4), we translate predicates over the leaf into semi-joins between the fact table and relevant incoming messages M. A message m &#8712; M is relevant if it is along a join path from a relation containing an attribute in the leaf predicate to the fact table. For each leaf node, we execute an UPDATE query whose WHERE clause semi-joins the fact table with each relevant incoming message on their join attribute. In some databases, updates in place can be very slow. Thus, an alternative is to create a new fact table with the updated semi-ring annotations. Let l_j be the j-th leaf in the decision tree, and m_ij, a_ij be the i-th message and its join attribute with F in l_j's join graph:</p><p>CREATE TABLE F_updated AS SELECT CASE WHEN F.a_ij IN (SELECT a_ij FROM m_ij) AND ... THEN s - l_j.p*c WHEN ... // Other leaves END AS s, ... // other semi-ring components ... // copy other attributes in F FROM F</p><p>The same ideas apply to galaxy schemas, where F corresponds to the fact table of the current tree's cluster. Further, we show in the technical report <ref type="bibr">[35]</ref> that c and q need not be materialized; thus only s is needed for the variance semi-ring.</p></div>
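The in-place variant can be tried end-to-end with sqlite3 (Python stdlib). The schema and message below are hypothetical stand-ins for a fact table F, one leaf's relevant message, and its prediction p = 1.5:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Fact table with join key "a", count c, and sum-of-residuals s.
con.execute("CREATE TABLE F (a INT, c INT, s REAL)")
con.executemany("INSERT INTO F VALUES (?,?,?)",
                [(1, 1, 2.0), (2, 1, 5.0), (3, 1, 7.0)])
# Relevant incoming message for the leaf: the join keys that satisfy
# its predicate (keys 1 and 2 fall into this leaf, key 3 does not).
con.execute("CREATE TABLE m1 (a INT)")
con.executemany("INSERT INTO m1 VALUES (?)", [(1,), (2,)])
# The leaf predicts p = 1.5; semi-join update of the s component in place:
con.execute("UPDATE F SET s = s - 1.5 * c WHERE a IN (SELECT a FROM m1)")
rows = con.execute("SELECT a, s FROM F ORDER BY a").fetchall()
# rows -> [(1, 0.5), (2, 3.5), (3, 7.0)]
```

Only tuples reachable from the leaf's message are touched; tuples outside the leaf (key 3) keep their annotation unchanged.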
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2">Pilot Study.</head><p>When should we perform in-place updates as compared to creating new tables? On which DBMSes? We now report a microbenchmark to understand the performance tradeoffs, and use it to motivate a new optimization. We used an Azure VM with 16 cores, 128 GB memory, and an SSD. Workloads. We create a synthetic fact table F(s, a, d_1, ..., d_k) with 100 rows to simulate residual updates in a decision tree with 8 leaves. s is the semi-ring column to update, a &#8712; [1, 10 ] is the join key, and d_1, ..., d_k are simply extra columns that would need to be duplicated in a new table. For the j-th leaf node, its prediction is a random float, and we construct its semi-join message to contain all a values in (1250 &#215; (j-1), 1250 &#215; j]. Methods. We evaluate three approaches. Naive materializes the Update Relation U, then re-creates the fact table by joining it with U, as discussed in Section 4. SET and CREATE use the update-in-place and create-table optimizations in the preceding subsubsection. CREATE-k denotes the number of extra columns k, where k &#8712; {0, 5, 10}; we set k = 0 for Naive, and k does not affect SET. DBMSes. We evaluate two systems. DBMS-X is a popular commercial RDBMS that supports both column-oriented (X-col) and row-oriented (X-row) storage and query processing. DBMS-X is disk-based only, and we set the isolation and recovery to the lowest level (read uncommitted and minimum logging). DuckDB <ref type="bibr">[57]</ref> is a popular embedded column-oriented OLAP DBMS and is highly performant <ref type="bibr">[9]</ref>. DuckDB has disk-based (D-disk) and memory-based (D-mem) modes. As a reference, we also use LightGBM to train 1 iteration of gradient boosting with the same training settings, and report the residual update time. Experiment Results. Figure <ref type="figure">5</ref> shows that Naive incurs high materialization and join costs. 
CREATE is &#8764;2&#215; faster for DBMS-X and &#8764;4&#215; faster for DuckDB, but its cost grows linearly with k. SET mainly depends on the DBMS: it is prohibitive for DBMS-X, but more efficient than CREATE when k &gt; 5 for DuckDB. All DBMS approaches take &gt;3 for updating residuals. In contrast, LightGBM stores the target variable in a C++ array and performs parallel writes; its residual update takes &#8764;0.2 . These poor results are due to four main factors.</p><p>&#8226; Compression: CREATE in X-col incurs high compression costs: the database is 1 in X-col as compared to 2.6 for DuckDB. This also penalizes SET due to decompression.</p><p>&#8226; Write-ahead Log (WAL): the WAL introduces costly disk writes.</p><p>&#8226; Concurrency Control (CC): In-memory DuckDB doesn't use a WAL, but incurs MVCC <ref type="bibr">[49]</ref> overheads, including versioning, and logging for undo and validation. &#8226; Implementation: DuckDB's update is currently single-threaded.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Residual Updates Physical Optimization</head><p>Logical rewrites are effective but still much slower than LightGBM's residual updates, even with existing CC mechanisms set to their lowest levels. We observe that JoinBoost does not need durability or concurrency control, since it writes to private tables, performs application-level concurrency control (Section 5.5), and can simply re-run upon failure. Compression is also unnecessary for the heavily updated columns. Unfortunately, WAL, CC, and compression are deeply integrated into DBMS designs, and are nontrivial to fully disable.</p><p>We note that columnar engines are well suited to avoid these costs by adding the new residual column as a projection <ref type="bibr">[67]</ref>, but this is not generally supported (e.g., by DuckDB or DBMS-X). We thus evaluate its benefits using three methods: one that uses DuckDB's existing APIs to swap columns in a Pandas dataframe, one that adds "column swapping" to DuckDB, and one that simulates it in DBMS-X.</p><p>The first solution, called DP (DuckDB + Pandas), uses the existing DuckDB Relational API <ref type="bibr">[7]</ref> to directly access Pandas dataframes <ref type="bibr">[53]</ref>, an in-memory data structure that stores columns in contiguous uncompressed C arrays. Internally, it uses a custom scan operator for Pandas's matrix format, and DuckDB's executor for the rest of the query. We store the fact table as a dataframe, and the remaining tables in DuckDB, and join them via the relational API during training. For the residual update, we submit a query over Pandas and native relations to compute the semi-ring annotations for the updated residuals, and store the result in a NumPy array. Then, using Pandas, we replace the old column in the fact table with the new NumPy array (a pointer swap). This is fast because it avoids WAL and CC overheads, and reduces residual updates to 0.72 , competitive with LightGBM (Figure <ref type="figure">5</ref>). 
The drawback is that the scan overhead slows down the join-aggregation query, and increases training time by &#8764;1.6&#215; compared to querying only native relations (Section 6.3).</p><p>The second approach, D-Swap, modifies DuckDB internals slightly (&lt;100 LOC) to support pointer-based column swapping between two DuckDB tables. DuckDB stores tables in row groups, with each group containing pointers to the column data. D-Swap iterates through the row groups of the two relations and swaps the column pointers. Such a column swap is a schema-level modification, so it is very fast and side-steps decompression, CC, and WAL overheads. It achieves residual update performance similar to DP (Figure <ref type="figure">5</ref>) without degrading aggregations (Section 6.3).</p><p>Finally, to extend beyond DuckDB, we also simulate column swapping's impact on DBMS-X in Section 6.3, and see a potential 15&#215; improvement. We believe that implementing such a column swap in closed-source DBMSes is both feasible and effective for residual updates. We encourage columnar DBMS developers to incorporate this operation for efficient In-DB gradient boosting.</p></div>
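The intuition behind D-Swap can be sketched in a few lines of Python: in a toy columnar layout, installing a freshly computed residual column is a per-row-group pointer assignment rather than a per-row write. This is our illustration of the idea, not DuckDB code:

```python
# Toy columnar table: column name -> list of "row groups", each row group
# a separate array. Swapping in a newly computed residual column is a
# pointer assignment per row group, independent of the number of rows
# (no decompression, WAL, or MVCC work).
class ColumnarTable:
    def __init__(self, columns, group_size=2):
        self.columns = {
            name: [vals[i:i + group_size] for i in range(0, len(vals), group_size)]
            for name, vals in columns.items()
        }

    def swap_column(self, name, new_groups):
        # O(#row groups), not O(#rows): only pointers change
        assert len(new_groups) == len(self.columns[name])
        self.columns[name] = new_groups

    def scan(self, name):
        return [v for group in self.columns[name] for v in group]

t = ColumnarTable({"s": [2.0, 5.0, 7.0, 1.0]})
new_s = [[0.5, 3.5], [5.5, -0.5]]  # residuals computed by a separate query
t.swap_column("s", new_s)
# t.scan("s") -> [0.5, 3.5, 5.5, -0.5]
```

Because the swap is a schema-level pointer operation, subsequent scans read the new column at full speed, which is why D-Swap matches DP's update cost without DP's scan penalty.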
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Optimizations and Features</head><p>We implement various optimizations and features in JoinBoost for performance and usability. Due to space limits, we only discuss the most critical ones here; the rest can be found in Appendix D.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.1">Message Sharing Among Nodes.</head><p>Previous factorized ML algorithms <ref type="bibr">[44,</ref><ref type="bibr">61]</ref> rely on batch optimization of aggregation queries, which suits models with closed-form solutions (e.g., ridge regression <ref type="bibr">[61]</ref>). However, tree-based model training is iterative: queries for child nodes depend on the parent node's split, making ahead-of-time batching impossible. Consequently, previous algorithms either miss work-sharing opportunities between tree nodes or batch all possible splits, which is impractical due to exponential growth in tree depth. Our key observation is that messages, as intermediate results, can also be shared among tree nodes. Consider the example: Example 6. Following Example 2, suppose the best root split is a predicate over a single relation. The split applies the predicate and its negation to that relation, creating two leaf nodes. Then, for each leaf node, we compute the batch of aggregation queries and identify the next split through Message Passing. As shown in Figure <ref type="figure">6</ref>, messages (dotted blue) along the join path unaffected by the predicate are the same for both leaf and root nodes. We only need to recompute the messages outgoing from the filtered relation in each tree leaf, and can skip relations whose attributes aren't used in this model.</p><p>In general, after a split on a relation, all messages along the paths toward that relation can be re-used. This is orthogonal to prior batch optimization work <ref type="bibr">[61]</ref>, as we can cache and reuse messages after batching for future nodes, further improving batch optimization by &gt;3&#215; (Section 6.4). Random forests train each tree over sampled data and features, and aggregate (e.g., average) their predictions during inference. Feature sampling can be easily implemented by using a random subset of X &#8242; &#8838;X features for Algorithm 1. 
The main challenge is efficiently sampling over the non-materialized join: naively sampling each relation (1) is not uniform and independent, and (2) may produce non-joinable and hence empty samples. To address these issues, we use ancestral sampling <ref type="bibr">[23,</ref><ref type="bibr">63,</ref><ref type="bibr">71]</ref> over the join. Minor Optimizations. First, we coalesce the messages for the COUNT query with those for the tree criterion. For instance, the c element in the variance semi-ring captures the COUNT statistics. Second, for snowflake schemas, where the fact table has N-to-1 relationships with the rest of the tables, we sample the fact table directly <ref type="bibr">[69]</ref>.</p></div>
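A minimal sketch of ancestral sampling over a two-table join (names and API are ours, not JoinBoost's): first draw a tuple from one relation weighted by its number of join partners, then draw a partner uniformly, so every sample is a genuine join-result tuple and each join-result tuple is equally likely:

```python
import random

def sample_join(R, S, key, n, rng):
    # R, S: lists of dicts; weight each R tuple by its number of join partners
    partners = {}
    for s in S:
        partners.setdefault(s[key], []).append(s)
    candidates = [r for r in R if r[key] in partners]  # drop dangling tuples
    weights = [len(partners[r[key]]) for r in candidates]
    samples = []
    for _ in range(n):
        r = rng.choices(candidates, weights=weights, k=1)[0]  # ancestral step 1
        s = rng.choice(partners[r[key]])                      # ancestral step 2
        samples.append({**r, **s})
    return samples

R = [{"k": 1, "x": 10}, {"k": 2, "x": 20}, {"k": 3, "x": 30}]
S = [{"k": 1, "y": 1}, {"k": 1, "y": 2}, {"k": 2, "y": 3}]
out = sample_join(R, S, "k", 100, random.Random(0))
# every sample is joinable; k=3 never appears since it has no join partner
```

Naive per-relation sampling would sometimes pick k=3 and produce an empty join; the weighting step is what restores uniformity over the (never materialized) join result.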
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.2">Sampling for Random Forests.</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.3">Inter-query Parallelism.</head><p>Parallelism is widely used in ML libraries like LightGBM, which implements parallelized sorting, aggregation, residual updates, and split candidate evaluations. For JoinBoost, most DBMSes offer intra-query parallelism, but there can be diminishing returns for individual queries or operations. Thus, JoinBoost also aggressively parallelizes across trees, leaf nodes, candidate splits, and messages. However, there are dependencies between these queries: a message's query relies on upstream messages, absorption depends on incoming messages, tree nodes depend on ancestor node queries, and gradient boosting trees rely on preceding trees. To handle this, JoinBoost employs a simple scheduler. Each query tracks its dependent queries; when a query finishes, it sets the ready bit for its dependents. Once all dependencies of a query are set, it is added to a FIFO run queue. Empirically, 4 threads work best for intra-query parallelism, while the rest are used for inter-query parallelism. This reduces gradient boosting training time by 28% and random forest training time by 35% (Appendix C).</p></div>
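The scheduler's bookkeeping can be sketched as follows (a single-threaded Python sketch of the dependency tracking only; JoinBoost's actual scheduler dispatches queries to DBMS worker threads):

```python
from collections import deque

def run_schedule(deps, execute):
    # deps: query -> set of queries it depends on; execute(q) runs one query.
    pending = {q: set(d) for q, d in deps.items()}
    dependents = {}
    for q, ds in deps.items():
        for d in ds:
            dependents.setdefault(d, set()).add(q)
    ready = deque(q for q, ds in pending.items() if not ds)  # FIFO run queue
    order = []
    while ready:
        q = ready.popleft()
        execute(q)
        order.append(q)
        for child in dependents.get(q, ()):  # mark this dependency satisfied
            pending[child].discard(q)
            if not pending[child]:
                ready.append(child)
    return order

# message queries must run before absorption; absorption before the split query
order = run_schedule(
    {"m1": set(), "m2": set(), "absorb": {"m1", "m2"}, "split": {"absorb"}},
    execute=lambda q: None)
```

The returned order always respects the dependency edges, so independent message queries (m1, m2 above) can safely be handed to different DBMS connections.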
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EXPERIMENTS</head><p>We focus on the fastest alternatives: ML libraries (XGBoost, LightGBM, Sklearn). We start with a single-node setting, and then evaluate scalability with respect to the number of features, types of joins, and database size on multiple nodes. Finally, we compare with factorized (LMFAO) and non-factorized (MADLib) in-DB ML techniques. Datasets. We primarily report results using the Favorita <ref type="bibr">[3]</ref> dataset used in prior factorized ML work <ref type="bibr">[61,</ref><ref type="bibr">62]</ref> (Figure <ref type="figure">7</ref>). Sales is the fact table (2.7 , 80M rows), and has N-to-1 relationships with the other dimensions (&lt;2 each). There are 13 features. We report TPC-DS results for scalability experiments and use IMDB for galaxy schema experiments; additional TPC-H/DS results are in Appendix C. Preprocessing. Although Favorita and TPC-H/DS are standard benchmarks in prior factorized learning evaluations <ref type="bibr">[61]</ref>, their features are non-predictive and lead to highly unbalanced trees. This artificially favors JoinBoost, since all but one leaf node in the decision tree would contain very few records and thus take negligible training time: the performance differences with other systems are fully dominated by join materialization. To ensure balanced trees and a fair comparison, we impute one feature attribute in each of the 5 dimension tables with random integers drawn from [1, 1000]. Then we impute the target variable as a sum of transformed features 5 . Finally, we dictionary-encode strings into 32-bit unsigned integers <ref type="bibr">[16,</ref><ref type="bibr">58]</ref> to avoid parsing errors in ML libraries like LightGBM. Models. We evaluate decision trees, random forests, and gradient boosting. JoinBoost is intended to complement other DBMS workloads, so all experiments start with data persistent on disk but not in memory. 
We assume by default that data are already persisted in the disk-based DBMSes (DBMS-X and disk-based DuckDB). We report the end-to-end training time for a decision tree of max depth 10, and vary the number of trees (iterations) for random forests and gradient boosting. For JoinBoost, the main cost comes from the DBMSes; the Python code introduces negligible (&lt;0.1 ) overhead. Methods. We evaluate JoinBoost with different DBMS backends. DBMS-X (X-col and X-row for column- and row-oriented storage and execution engines) and DuckDB-disk (D-disk) are disk-based and directly execute queries on the base DBMS, whereas memory-based DuckDB (D-mem) first loads the DBMS from disk. DP refers to the disk-based DuckDB backend using Pandas updates through DuckDB's relational API, and D-Swap refers to the modified memory-based DuckDB for efficient residual updates (Section 5.4). By default, we use D-Swap as the backend as it has the best performance (Section 6.3). ML libraries 6 (LightGBM, XGBoost, Sklearn) expect a single CSV as input, so they incur the cost to materialize and export the join result (&#8764;7 for Favorita), load the CSV, and train the model. DuckDB joins and exports the data faster than DBMS-X, so we report its numbers. Hardware: We use an Azure VM: 16 cores, 128 GB memory, and an SSD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Comparison With ML Libraries</head><p>We compare with SOTA ML libraries for training tree-based models (LightGBM <ref type="bibr">[42]</ref>, XGBoost <ref type="bibr">[25]</ref> and Sklearn <ref type="bibr">[56]</ref>). Sklearn implements both standard and histogram-based gradient boosting with algorithms similar to LightGBM, so we report both implementations. We set the number of bins to 1000 for LightGBM and XGBoost, and 255 for Sklearn (its limit). By default, we train gradient boosting and random forests with best-first growth, with a maximum of 8 leaves per tree. The gradient boosting learning rate is 0.1. The random forest data sampling rate (without replacement) is 10%, and the feature sampling rate is 80%. We report up to 100 iterations. The 0th iteration reports the join materialization, export, and load costs.</p><p>Figure <ref type="figure">8a</ref> shows random forest results. JoinBoost is &#8764;3&#215; faster than LightGBM by avoiding materialization and export costs (dotted black line), and loading costs; it also parallelizes across trees. In fact, JoinBoost finishes 100 iterations before the export is done. Sklearn also parallelizes across trees, but is so slow that we terminate it after 32 iterations. The final model error (rmse) is nearly identical (&#8764;2350) for JoinBoost, LightGBM and XGBoost.</p><p>Figure <ref type="figure">8b</ref> shows gradient boosting results. JoinBoost is &#8764;1.1&#215; faster than LightGBM, and is &#8764;1.2&#215; faster than XGBoost, by avoiding materialization and export costs. Figure <ref type="figure">8c</ref> shows the model rmse. JoinBoost and LightGBM have equivalent rmse, as both employ the same algorithm, while the final rmse is similar across all systems. Note that 5 Favorita applies: = ( ) + ( ) -10 -10 + 2 6 The Python versions of LightGBM and XGBoost have memory issues [4] so we use the CLI versions.   
the models begin to converge after &#8764;60 iterations; JoinBoost converges by the time LightGBM and XGBoost have loaded their data.</p><p>To interpret the JoinBoost cost, we zoom in on the first iteration of JoinBoost gradient boosting in Figure <ref type="figure">9</ref>, which displays the total number of queries for passing messages (orange) and finding the best split (blue), along with a histogram of the query execution time distribution. With a tree of 8 leaves and 15 nodes, there are 270 = 15 &#215; 18 (number of features) queries for feature splits, and 75 = 15 &#215; 5 (number of join edges) queries for message passing, as expected. All feature split queries are efficient, taking &lt;10 . The performance bottlenecks are the queries passing messages from the fact table, which require join-aggregation and result materialization, and take &gt;200 . This highlights the importance of aggressively reusing messages across nodes (Section 5.5.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Scalability</head><p>We now study scalability with respect to DB size and join complexity. Single-node Scalability. Favorita is a fixed dataset, so we use TPC-DS (145 features) to scale the database (SF &#8712; [10, 25]). Figure 10 shows that both systems scale linearly, but JoinBoost has a lower slope (&#8764;10&#215; lower at the 10th iteration, and &#8764;2&#215; lower at the 50th). LightGBM runs out of memory at SF = 25. Multi-node Scalability. We use the Dask versions [59] of LightGBM and XGBoost, and Dask-SQL for JoinBoost, to train gradient boosting on multiple machines for 10 iterations on TPC-DS SF &#8712; [30, 40]. We use 4 n1-highmem-16 GCP instances (16 vCPUs, 104 GB RAM each) with data replicated across all instances; at least 4 machines are needed for LightGBM to run on SF = 30 due to large join sizes and inefficient memory usage <ref type="bibr">[4]</ref>. Figure <ref type="figure">11 (a)</ref> shows that, on 4 machines, all systems scale linearly, but JoinBoost is &gt;9&#215; faster with a &#8764;5&#215; lower slope. For SF = 40, LightGBM runs out of memory even on 4 machines. In contrast, Figure <ref type="figure">11 (b)</ref> shows that for SF = 40, JoinBoost can train on a single machine, outperforms XGBoost (using 4 machines), and speeds up with more machines. Cloud-warehouse Scalability. We use a cloud data warehouse, DW-X, to train a decision tree with max depth 3 over TPC-DS SF = 1000, and study multi-machine scalability. Each machine has 74 cores, 300 GB of memory, and SSDs. We replicate dimension tables across the machines, and hash-partition the fact table. Figure <ref type="figure">12</ref> shows that 2 machines introduce a shuffle stage that slows training, and increasing to 4 (6) machines reduces training time by 10% (25%). Galaxy Schemas. Galaxy schemas have N-to-N relationships that are prohibitive to materialize. 
We use clustered predicate trees (Section 4.2.2) to train gradient boosting on the IMDB dataset (Figure <ref type="figure">3</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Cast_Info is &#8764;1 and the total DB is 1.2 . JoinBoost scales linearly with the number of iterations (Figure <ref type="figure">13</ref>); it trains one tree and updates residuals in each cluster's fact table within &#8764;5 . LightGBM cannot run because the join result is &gt;1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Effect of DBMSes</head><p>We now use Favorita to train one iteration of gradient boosting, and compare training and update costs across DBMSes (DBMS-X and DuckDB). To study the potential benefits of column swapping (Section 5.4) in commercial DBMSes, X-Swap* reports the theoretical cost using the time to create the updated column as a new table (since the column swap itself is then "free").</p><p>Figure <ref type="figure">14</ref> breaks down training and update costs. Decision trees and random forests only require training (blue bar), which is dominated by columnar execution: X-col and DuckDB take 3.2 -3.9 versus 14.5 for X-row. Gradient boosting introduces high update costs across all of the baseline DBMSes. We see that using Pandas to perform the update (DP) reduces residual updates by &#8764;15&#215; (17.8 &#8594;1.2 ), but slows training by 60% (3.2 &#8594;5.1 ) due to DuckDB-Pandas interop overhead. D-Swap implements column swapping inside DuckDB and improves training. We also see that adding column swapping to DBMS-X (X-Swap*) leads to respectable gradient boosting performance (&#8764;3&#215; of D-Swap) and can additionally benefit from DBMS-X's out-of-memory and multi-node scalability features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Comparison With In-DB ML Techniques</head><p>We first compare with the dominant factorized approach (LMFAO <ref type="bibr">[61]</ref>), followed by the non-factorized approach (MADLib). Factorized ML. We first compare with LMFAO <ref type="bibr">[61]</ref>, which supports decision trees (but not gradient boosting or random forests). We train a decision tree (max depth=10) with best-first growth; the trained tree is balanced and has 1024 leaves. Via correspondence, the LMFAO authors shared a version that compiles a program for the queries used to split the root tree node, and reuses it for growing the rest of the tree. We set the highest optimization level for LMFAO, and exclude the time for query compilation (&#8764;15 ) and data loading. To separate algorithmic from implementation differences, we implement two variations of JoinBoost. Naive materializes the join result without factorized ML. Batch implements LMFAO's core logical optimizations (Multi Root, Aggregate Push-down and Merge View) for decision trees; this corresponds to message passing with message re-use within, but not between, tree nodes. Figure <ref type="figure">15a</ref> reports the training time. Even with its custom engine and specialized optimizations, LMFAO is still &#8764;1.9&#215; slower than JoinBoost due to the lack of message caching (Section 5.5.1). By eliminating the implementation differences, Batch demonstrates that message caching improves training by &#8764;3&#215;. The improvement arises because half the messages across tree nodes are cached, and the cached messages tend to be much larger; the intuition is that the messages outgoing from the relation containing the split attribute will be smaller, since the split predicate has been applied. 
In theory, two-pass semi-join reduction <ref type="bibr">[17,</ref><ref type="bibr">52]</ref> could reduce message sizes, but the high overheads outweigh the benefits <ref type="bibr">[16]</ref>. Batch is &#8764;2&#215; faster than Naive due to factorization and shared work. Non-Factorized ML. MADLib <ref type="bibr">[32]</ref> is a PostgreSQL extension that supports ML using user-defined types and functions. MADLib doesn't apply factorized ML, so the join has to be materialized. MADLib times out after 1 hour when training a decision tree model (max depth=10) on the full Favorita, so we reduced the training data to 10k rows; for JoinBoost, we reduced the fact table to 10k rows. Figure <ref type="figure">15b</ref> shows that JoinBoost is &#8764;16&#215; faster than MADLib.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>JoinBoost is the first In-DB factorized ML system for tree models (decision trees, random forests, and gradient boosting) that uses only SQL. JoinBoost is comparable to or faster than the SOTA LightGBM ML library on in-memory datasets, and scales well beyond LightGBM's capabilities in terms of the number of features, database size, and join graph complexity. JoinBoost exposes a Python API that mimics LightGBM's API and is portable to any DBMS and dataframe library.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We assume no missing join keys, or use left outer joins to maintain 1-to-1 relationships.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>We treat left semi-joins as filters over the left relation so its annotations don't change.</p></note>
		</body>
		</text>
</TEI>
