
Title: Distributed Task-Based Training of Tree Models
Decision trees and tree ensembles are popular supervised learning models for tabular data. Two recent research trends on tree models stand out: (1) bigger and deeper models with many trees, and (2) scalable distributed training frameworks. However, existing implementations on distributed systems are IO-bound, leaving CPU cores underutilized. They also find the best node-splitting conditions only approximately, because of their row-based data partitioning scheme. In this paper, we target the exact training of tree models by effectively utilizing the available CPU cores. The resulting system, called TreeServer, adopts a column-based data partitioning scheme to minimize communication, and a node-centric task-based engine to fully exploit CPU parallelism. Experiments show that TreeServer is up to 10x faster than the corresponding models in Spark MLlib. We also showcase TreeServer's high training throughput by using it to build big "deep forest" models.
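The abstract's two key ingredients, column-based data partitioning and exact (rather than approximate) split finding, can be illustrated with a small sketch. The code below is not TreeServer itself: the split score, the task fan-out, and all names are simplified assumptions, showing only how an exact per-column split scan can be dispatched as independent tasks across CPU cores.

```python
# Hedged sketch (not the TreeServer implementation): exact best-split search
# when the training matrix is partitioned by columns, so each task scans whole
# feature columns and only tiny (feature, gain, threshold) summaries are
# exchanged instead of raw rows.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def best_split_for_column(args):
    """Exact best split for one feature column via a full sorted scan."""
    feature_id, column, labels = args
    order = np.argsort(column)
    x, y = column[order], labels[order]
    n = len(y)
    total = y.sum()
    left_sum = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in range(1, n):
        left_sum += y[i - 1]
        if x[i] == x[i - 1]:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Variance-reduction style score (larger is better).
        gain = left_sum**2 / i + right_sum**2 / (n - i) - total**2 / n
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i - 1] + x[i]) / 2.0
    return feature_id, best_gain, best_threshold


def exact_best_split(X, y, max_workers=4):
    """Fan out one task per feature column; gather tiny per-column summaries."""
    tasks = [(j, X[:, j], y) for j in range(X.shape[1])]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(best_split_for_column, tasks))
    return max(results, key=lambda r: r[1])  # (feature_id, gain, threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 8))
    y = (X[:, 3] > 0.2).astype(float)
    print(exact_best_split(X, y))
```

Because each task owns whole columns, only the per-column (feature, gain, threshold) triples cross worker boundaries, which is the communication saving the abstract attributes to column-based partitioning.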
Authors:
Award ID(s):
1755464
Publication Date:
NSF-PAR ID:
10331910
Journal Name:
Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE)
Sponsoring Org:
National Science Foundation
More Like this
  1. Due to developments in topographic techniques, clear satellite imagery, and various means of collecting information, geospatial datasets are growing in volume, complexity, and heterogeneity. Efficient execution of spatial computations and analytics on large spatial datasets requires parallel processing. To exploit fine-grained parallel processing in large-scale compute clusters, load-balanced partitioning is necessary for skewed datasets. In this work, we focus on the spatial join operation, where the inputs are two layers of geospatial data. Our partitioning method for spatial join uses an Adaptive Partitioning (ADP) technique based on Quadtree partitioning. Unlike existing partitioning techniques, ADP partitions the spatial join workload instead of partitioning the individual datasets separately, providing better load balancing. In our experimental evaluation, ADP partitions spatial data in a more balanced way than Quadtree partitioning and uniform grid partitioning. ADP uses an output-sensitive duplication avoidance technique that minimizes duplication of geometries that are not part of the spatial join output. In a distributed-memory environment, this technique can reduce data communication and storage requirements compared to traditional methods. To improve the performance of ADP, an MPI+Threads-based parallelization, ParADP, is presented. With ParADP, a pair of real-world datasets, one with 717 million polylines and another with 10 million polygons, is partitioned into 65,536 grid cells within 7 seconds. ParADP shows both good weak scaling and good strong scaling up to 4,032 CPU cores. (A toy sketch of this workload-based splitting appears after this list.)
  2. Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architecture acceleration for the equally important DNN training has not yet been fully studied. With data forward, error backward, and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates a diminishing specialization return, namely the “accelerator wall”, we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present ACCPAR, a principled and systematic method for determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown parallelism configurations. ACCPAR optimizes performance based on a cost model that takes into account both the computation and communication costs of a heterogeneous execution environment. Hence, our method avoids the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in ACCPAR allows computation to be distributed among accelerators of different performance in flexible ratios. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate ACCPAR on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. The average performance improvements of the state-of-the-art “one weird trick” (OWT), HYPAR, and ACCPAR, normalized to a baseline data-parallelism scheme in which each accelerator replicates the model and processes different input data in parallel, are 2.98×, 3.78×, and 6.30×, respectively. (A toy cost-model sketch appears after this list.)
  3. Deep learning approaches have been adopted in forestry research, including tree classification and inventory prediction. In this study, we proposed an application of a deep learning approach, the Temporal Convolution Network, to sequences of radial resistograph profiles to identify non-thrive trees and to predict wood density. Non-destructive resistance drilling measurements on the South and West orientations of 274 trees in a 41-year-old Douglas-fir stand in Marion County, Oregon, USA were used as input series. Non-thrive trees were defined based on their changes in social status since establishment. Wood density was derived by X-ray densitometry from cores obtained by increment borers. Data were split for cross-validation. Optimal models were fine-tuned with training and validation datasets, then run on test datasets for model evaluation metrics. Results confirmed that applying the Temporal Convolution Network to resistograph profiles enables non-thrive tree identification, with an area under the Receiver Operating Characteristic curve of 0.823. The Temporal Convolution Network for wood density prediction showed a slight improvement in accuracy (RMSE = 18.22) compared to traditional linear (RMSE = 20.15) and non-linear (RMSE = 20.33) regression methods. We suggest that machine learning algorithms can be a promising methodology for the analysis of sequential data from non-destructive devices. (A minimal dilated-convolution sketch appears after this list.)
  4. Tree cover is generally associated with cooler air temperatures in urban environments, but the roles of canopy configuration, spatial context, and time of day are not well understood. The ability to examine spatiotemporal relationships between trees and urban climate has been hindered by a lack of appropriate air temperature data and, perhaps, by overreliance on a single ‘tree canopy’ class, obscuring the mechanisms by which canopy cools. Here, we use >70,000 air temperature measurements collected by car throughout Washington, DC, USA in predawn (pd), afternoon (aft), and evening (eve) campaigns on a hot summer day. We subdivided tree canopy into ‘soft’ (over unpaved surfaces) and ‘hard’ (over paved surfaces) canopy classes and further partitioned soft canopy into distributed canopy (narrow edges) and clumped patches (edges with interior cores). At each level of subdivision, we predicted air temperature anomalies using generalized additive models for each time of day. We found that the all-inclusive ‘tree canopy’ class cooled linearly at every time (pd = 0.5 °C ± 0.3 °C, aft = 1.8 °C ± 0.6 °C, eve = 1.7 °C ± 0.4 °C), but this could be explained in the afternoon by the aggregate effects of predominant hard and soft canopy cooling at low and high canopy cover, respectively. Soft canopy cooled nonlinearly in the afternoon, with minimal effect until ∼40% cover, but strongly (and linearly) across all cover fractions in the evening (pd = 0.7 °C ± 1.1 °C, aft = 2.0 °C ± 0.7 °C, eve = 2.9 °C ± 0.6 °C). Patches cooled at all times of day despite uneven allocation throughout the city, whereas more distributed canopy cooled in predawn and evening due to increased shading. This latter finding is important for urban heat island mitigation planning, since it is easier to find planting spaces for distributed trees than for forest patches. (A small smooth-fit sketch appears after this list.)
  5. Mining frequent subtree patterns in a tree database (or forest) is useful in domains such as bioinformatics and mining semi-structured data. We consider the problem of mining embedded subtrees in a database of rooted, labeled, and ordered trees. We compare two existing serial mining algorithms, PrefixTreeSpan and TreeMiner, and adapt them for parallel execution using PrefixFPM, our general-purpose framework for frequent pattern mining designed to effectively utilize the CPU cores in a multicore machine. Our experiments show that TreeMiner is faster than its successor PrefixTreeSpan when a limited number of CPU cores are used, as its total mining workload is smaller; however, PrefixTreeSpan has a much higher speedup ratio and can beat TreeMiner when given enough CPU cores. (A toy sketch of per-prefix task parallelism appears below.)
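The adaptive, workload-driven splitting described in item 1 can be illustrated with a toy quadtree partitioner. This is not the ADP/ParADP implementation: it handles points rather than polylines and polygons, uses a hypothetical per-cell `budget`, and skips duplication avoidance entirely; it only shows how splitting on the combined size of both join layers balances the join workload rather than either dataset alone.

```python
# Hedged sketch of quadtree-style adaptive partitioning of a spatial join
# workload: a cell is quartered whenever the combined object count from both
# layers exceeds a budget, so dense regions get finer cells.
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One grid cell holding the objects of both join layers that fall in it."""
    x0: float
    y0: float
    x1: float
    y1: float
    layer_a: list = field(default_factory=list)
    layer_b: list = field(default_factory=list)

    def contains(self, p):
        return self.x0 <= p[0] < self.x1 and self.y0 <= p[1] < self.y1


def adaptive_partition(cell, budget=64, depth=0, max_depth=12):
    """Quarter a cell until its *combined* join workload fits the budget."""
    if len(cell.layer_a) + len(cell.layer_b) <= budget or depth >= max_depth:
        return [cell]
    mx, my = (cell.x0 + cell.x1) / 2, (cell.y0 + cell.y1) / 2
    quads = [Cell(cell.x0, cell.y0, mx, my), Cell(mx, cell.y0, cell.x1, my),
             Cell(cell.x0, my, mx, cell.y1), Cell(mx, my, cell.x1, cell.y1)]
    for name in ("layer_a", "layer_b"):
        for p in getattr(cell, name):
            # Points on the outer boundary fall back to the last quadrant.
            target = next((q for q in quads if q.contains(p)), quads[-1])
            getattr(target, name).append(p)
    cells = []
    for q in quads:
        cells.extend(adaptive_partition(q, budget, depth + 1, max_depth))
    return cells


if __name__ == "__main__":
    import random

    random.seed(1)
    # A skewed cluster in one corner plus a uniform background layer.
    a = [(random.random() * 0.2, random.random() * 0.2) for _ in range(5000)]
    b = [(random.random(), random.random()) for _ in range(5000)]
    cells = adaptive_partition(Cell(0.0, 0.0, 1.0, 1.0, a, b))
    print(len(cells), "cells; largest workload =",
          max(len(c.layer_a) + len(c.layer_b) for c in cells))
```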
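Item 2's central argument is that a partition should be chosen by a cost model covering both computation and communication, not communication alone. The sketch below is an illustrative stand-in, not ACCPAR's model: the roofline-style time estimate, the two-device search, and all numbers are assumptions.

```python
# Hedged sketch: pick a work-split ratio across two heterogeneous accelerators
# by minimizing an estimated step time that includes compute and communication.
def device_time(flops, peak_flops, bytes_exchanged, bandwidth):
    """Roofline-style estimate: compute time plus synchronization time."""
    return flops / peak_flops + bytes_exchanged / bandwidth


def best_batch_split(total_flops, sync_bytes, devices, steps=100):
    """Search split ratios of one step's work between two devices."""
    (peak1, bw1), (peak2, bw2) = devices  # (peak FLOP/s, link bandwidth B/s)
    best_time, best_ratio = float("inf"), None
    for i in range(1, steps):
        r = i / steps  # fraction of the work assigned to device 1
        t1 = device_time(r * total_flops, peak1, sync_bytes, bw1)
        t2 = device_time((1 - r) * total_flops, peak2, sync_bytes, bw2)
        step = max(t1, t2)  # devices run in parallel, then synchronize
        if step < best_time:
            best_time, best_ratio = step, r
    return best_ratio, best_time


if __name__ == "__main__":
    # A slower and a faster accelerator sharing one training step (made-up numbers).
    devices = [(180e12, 50e9), (420e12, 50e9)]
    ratio, t = best_batch_split(total_flops=2e15, sync_bytes=4e9, devices=devices)
    print(f"give device 1 a {ratio:.0%} share; estimated step time {t * 1e3:.1f} ms")
```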
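Item 3 applies a Temporal Convolution Network to resistograph series. The PyTorch sketch below shows the core ingredient, causal dilated 1-D convolutions feeding a scalar head, but the layer sizes, padding scheme, and head are assumptions rather than the authors' architecture.

```python
# Hedged sketch of a tiny temporal convolution network for sequence regression.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalBlock(nn.Module):
    """One causal, dilated 1-D convolution layer."""

    def __init__(self, channels_in, channels_out, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # causal: pad the past only
        self.conv = nn.Conv1d(channels_in, channels_out, kernel_size,
                              dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, length)
        x = F.pad(x, (self.left_pad, 0))       # no right padding -> no future leak
        return torch.relu(self.conv(x))


class TinyTCN(nn.Module):
    """Stack of dilated blocks plus a scalar head (e.g. predicted wood density)."""

    def __init__(self, channels_in=1, hidden=32, levels=4):
        super().__init__()
        blocks = [TemporalBlock(channels_in if i == 0 else hidden, hidden,
                                dilation=2 ** i) for i in range(levels)]
        self.tcn = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        features = self.tcn(x)                 # (batch, hidden, length)
        return self.head(features[:, :, -1])   # summary at the last time step


if __name__ == "__main__":
    profiles = torch.randn(8, 1, 512)          # 8 fake resistograph traces
    print(TinyTCN()(profiles).shape)           # torch.Size([8, 1])
```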
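Item 4 fits generalized additive models of air-temperature anomaly against canopy cover for each time of day. The sketch below substitutes a spline-basis linear model from scikit-learn as a GAM-flavoured stand-in on synthetic data; it is not the authors' model or data, only an illustration of fitting a smooth cover-versus-anomaly curve.

```python
# Hedged, GAM-flavoured sketch: smooth (spline) effect of canopy-cover fraction
# on an air-temperature anomaly, fit on synthetic "afternoon" data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
cover = rng.uniform(0, 1, size=500)                 # canopy-cover fraction
# Synthetic pattern: little effect until ~40% cover, then roughly linear cooling.
anomaly = -2.0 * np.clip(cover - 0.4, 0, None) + rng.normal(0, 0.3, 500)

smooth = make_pipeline(SplineTransformer(n_knots=6, degree=3), Ridge(alpha=1.0))
smooth.fit(cover.reshape(-1, 1), anomaly)

grid = np.linspace(0, 1, 11).reshape(-1, 1)
for c, t in zip(grid.ravel(), smooth.predict(grid)):
    print(f"cover={c:.1f}  predicted anomaly={t:+.2f} °C")
```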
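Item 5 parallelizes subtree mining with PrefixFPM, which divides the pattern search space by prefix so that each prefix can be mined as an independent CPU task. The toy below borrows only that task-decomposition idea: to stay short it mines frequent parent-to-child label pairs rather than general embedded subtrees, and the forest, helpers, and support threshold are made-up assumptions.

```python
# Hedged sketch of per-prefix task parallelism for pattern mining on a forest.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# A tiny forest: each tree is (label, [children]).
FOREST = [
    ("A", [("B", []), ("C", [("B", [])])]),
    ("A", [("C", [("B", [])])]),
    ("B", [("A", []), ("C", [])]),
]
MIN_SUPPORT = 2


def edges(node):
    """Yield (parent_label, child_label) pairs of one tree."""
    label, children = node
    for child in children:
        yield (label, child[0])
        yield from edges(child)


def mine_prefix(root_label):
    """One task: frequent edge patterns whose parent label equals the prefix."""
    support = Counter()
    for tree in FOREST:
        seen = {e for e in edges(tree) if e[0] == root_label}
        support.update(seen)  # count each pattern at most once per tree
    return [(e, c) for e, c in support.items() if c >= MIN_SUPPORT]


if __name__ == "__main__":
    # Frequent single labels become the prefixes that define the parallel tasks.
    label_support = Counter()
    for tree in FOREST:
        labels, stack = set(), [tree]
        while stack:
            lab, kids = stack.pop()
            labels.add(lab)
            stack.extend(kids)
        label_support.update(labels)
    prefixes = [l for l, c in label_support.items() if c >= MIN_SUPPORT]

    with ProcessPoolExecutor() as pool:
        for patterns in pool.map(mine_prefix, prefixes):
            for (parent, child), count in patterns:
                print(f"{parent} -> {child}  support={count}")
```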