skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A Better Decision Tree: The Max-Cut Decision Tree with Modified PCA Improves Accuracy and Running Time
Abstract Decision trees are a widely used method for classification, both alone and as the building blocks of multiple different ensemble learning methods. The Max Cut decision tree introduced here involves novel modifications to a standard, baseline variant of a classification decision tree, CART Gini. One modification involves an alternative splitting metric, Max Cut, based on maximizing the distance between all pairs of observations that belong to separate classes and separate sides of the threshold value. The other modification, Node Means PCA, selects the decision feature from a linear combination of the input features constructed using an adjustment to principal component analysis (PCA) locally at each node. Our experiments show that this node-based, localized PCA with the Max Cut splitting metric can dramatically improve classification accuracy while also significantly decreasing computational time compared to the CART Gini decision tree. These improvements are most significant for higher-dimensional datasets. For the example dataset CIFAR-100, the modifications enabled a 49% improvement in accuracy, relative to CART Gini, while providing a$$6.8 \times$$ 6.8 × speed up compared to the Scikit-Learn implementation of CART Gini. These introduced modifications are expected to dramatically advance the capabilities of decision trees for difficult classification tasks.  more » « less
Award ID(s):
1760102 2112533
PAR ID:
10368165
Author(s) / Creator(s):
;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
SN Computer Science
Volume:
3
Issue:
4
ISSN:
2661-8907
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Given graph$$G=(V,E)$$ G = ( V , E ) with vertex setVand edge setE, the maxk-cut problem seeks to partition the vertex setVinto at mostksubsets that maximize the weight (number) of edges with endpoints in different parts. This paper proposes a graph folding procedure (i.e., a procedure that reduces the number of the vertices and edges of graphG) for the weighted maxk-cut problem that may help reduce the problem’s dimensionality. While our theoretical results hold for any$$k \ge 2$$ k 2 , our computational results show the effectiveness of the proposed preprocessonlyfor$$k=2$$ k = 2 and on two sets of instances. Furthermore, we observe that the preprocess improves the performance of a MIP solver on a set of large-scale instances of the max cut problem. 
    more » « less
  2. Abstract In the spanning tree congestion problem, given a connected graphG, the objective is to compute a spanning treeTinGthat minimizes its maximum edge congestion, where the congestion of an edgeeofTis the number of edges inGfor which the unique path inTbetween their endpoints traversese. The problem is known to be$$\mathbb{N}\mathbb{P}$$ N P -hard, but its approximability is still poorly understood, and it is not even known whether the optimum solution can be efficiently approximated with ratioo(n). In the decision version of this problem, denoted$${\varvec{K}-\textsf {STC}}$$ K - STC , we need to determine ifGhas a spanning tree with congestion at mostK. It is known that$${\varvec{K}-\textsf {STC}}$$ K - STC is$$\mathbb{N}\mathbb{P}$$ N P -complete for$$K\ge 8$$ K 8 , and this implies a lower bound of 1.125 on the approximation ratio of minimizing congestion. On the other hand,$${\varvec{3}-\textsf {STC}}$$ 3 - STC can be solved in polynomial time, with the complexity status of this problem for$$K\in { \left\{ 4,5,6,7 \right\} }$$ K 4 , 5 , 6 , 7 remaining an open problem. We substantially improve the earlier hardness results by proving that$${\varvec{K}-\textsf {STC}}$$ K - STC is$$\mathbb{N}\mathbb{P}$$ N P -complete for$$K\ge 5$$ K 5 . This leaves only the case$$K=4$$ K = 4 open, and improves the lower bound on the approximation ratio to 1.2. Motivated by evidence that minimizing congestion is hard even for graphs of small constant radius, we also consider$${\varvec{K}-\textsf {STC}}$$ K - STC restricted to graphs of radius 2, and we prove that this variant is$$\mathbb{N}\mathbb{P}$$ N P -complete for all$$K\ge 6$$ K 6
    more » « less
  3. This paper proposes a new mixed-integer programming (MIP) formulation to optimize split rule selection in the decision tree induction process and develops an efficient search algorithm that is able to solve practical instances of the MIP model faster than commercial solvers. The formulation is novel for it directly maximizes the Gini reduction, an effective split selection criterion that has never been modeled in a mathematical program for its nonconvexity. The proposed approach differs from other optimal classification tree models in that it does not attempt to optimize the whole tree; therefore, the flexibility of the recursive partitioning scheme is retained, and the optimization model is more amenable. The approach is implemented in an open-source R package named bsnsing. Benchmarking experiments on 75 open data sets suggest that bsnsing trees are the most capable of discriminating new cases compared with trees trained by other decision tree codes including the rpart, C50, party, and tree packages in R. Compared with other optimal decision tree packages, including DL8.5, OSDT, GOSDT, and indirectly more, bsnsing stands out in its training speed, ease of use, and broader applicability without losing in prediction accuracy. History: Accepted by RamRamesh, Area Editor for Data Science & Machine Learning. Funding: This work was supported by the National Science Foundation Division of Civil, MechanicalandManufacturing Innovation [Grant 1944068]. Supplemental Material: Data are available at https://doi.org/10.1287/ijoc.2022.1225 . 
    more » « less
  4. A<sc>bstract</sc> We present a complete computation of superstring scattering amplitudes at tree level, for the case of Neveu-Schwarz insertions. Mathematically, this is to say that we determine explicitly the superstring measure on the moduli space$$ {\mathcal{M}}_{0,n,0} $$ M 0 , n , 0 of super Riemann surfaces of genus zero withn≥ 3 Neveu-Schwarz punctures. While, of course, an expression for the measure was previously known, we do this from first principles, using the canonically defined super Mumford isomorphism [1]. We thus determine the scattering amplitudes, explicitly in the global coordinates on$$ {\mathcal{M}}_{0,n,0} $$ M 0 , n , 0 , without the need for picture changing operators or ghosts, and are also able to determine canonically the value of the coupling constant. Our computation should be viewed as a step towards performing similar analysis on$$ {\mathcal{M}}_{0,0,n} $$ M 0 , 0 , n , to derive explicit tree-level scattering amplitudes with Ramond insertions. 
    more » « less
  5. Abstract We investigate the geometry of the space of immersed closed curves equipped with reparametrization-invariant Riemannian metrics; the metrics we consider are Sobolev metrics of possible fractional-order$$q\in [0,\infty )$$ q [ 0 , ) . We establish the critical Sobolev index on the metric for several key geometric properties. Our first main result shows that the Riemannian metric induces a metric space structure if and only if$$q>1/2$$ q > 1 / 2 . Our second main result shows that the metric is geodesically complete (i.e., the geodesic equation is globally well posed) if$$q>3/2$$ q > 3 / 2 , whereas if$$q<3/2$$ q < 3 / 2 then finite-time blowup may occur. The geodesic completeness for$$q>3/2$$ q > 3 / 2 is obtained by proving metric completeness of the space of$$H^q$$ H q -immersed curves with the distance induced by the Riemannian metric. 
    more » « less