Title: Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success
Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.
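The randomized forward selection analogue described in the abstract can be sketched along the following lines: at each step, only a random subset of candidate features (of size mtry) is eligible to enter the model, mirroring the per-split feature subsampling in random forests. This is an illustrative reconstruction, not the authors' exact procedure; the function name, the correlation-based scoring rule, and all parameter values are assumptions.

```python
import numpy as np

def randomized_forward_selection(X, y, n_steps, mtry, rng):
    """Greedy forward selection where each step considers only a random
    subset of mtry candidate features -- a linear-model analogue of the
    mtry randomization in random forests (illustrative sketch)."""
    n, p = X.shape
    selected = []
    residual = y.copy()
    for _ in range(n_steps):
        remaining = [j for j in range(p) if j not in selected]
        if not remaining:
            break
        # randomization step: only a random subset of features is eligible
        candidates = rng.choice(remaining, size=min(mtry, len(remaining)),
                                replace=False)
        # pick the candidate most correlated with the current residual
        scores = [abs(X[:, j] @ residual) for j in candidates]
        selected.append(int(candidates[int(np.argmax(scores))]))
        # refit least squares on the selected set and update the residual
        Xs = X[:, selected]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = y - Xs @ coef
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)
print(randomized_forward_selection(X, y, n_steps=3, mtry=3, rng=rng))
```

Setting mtry = p recovers ordinary greedy forward selection; smaller mtry injects more randomness, which is the regularization mechanism the paper studies.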
Award ID(s):
2015400
PAR ID:
10226126
Date Published:
Journal Name:
Journal of Machine Learning Research
Volume:
21
Issue:
171
ISSN:
1533-7928
Page Range / eLocation ID:
1-36
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. In this work, we show that a large class of random forests with general split directions also achieve minimax rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, as well as random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory. 
  2. ABSTRACT

    Recent advances in data complexity and availability present both challenges and opportunities for automated data exploration. Tree‐based methods, known for their interpretability, are widely used for building regression and classification models. However, they often lag behind the best supervised learning approaches in terms of prediction accuracy. To address this limitation, ensemble methods, such as random forests, combine multiple trees to improve prediction accuracy, though at the cost of interpretability. While tree‐based methods have seen extensive use in various fields, their application in the context of complex survey data has been relatively limited. This article provides an overview of the state‐of‐the‐art tree‐based approaches for analyzing complex survey data. It distinguishes methods explicitly designed for survey contexts from those adapted from other domains. The discussion covers applications in model‐assisted approaches, disclosure limitation, and small area estimation, as well as other recent methodological developments tailored to survey data. Additionally, the article explores aggregated tree models that sacrifice interpretability for improved prediction accuracy. These models, such as Bagging, Random Forests, and Boosting, are explained, along with the concept of out‐of‐bag error for model evaluation. Finally, this article provides the history and development of tree models, from their origins in regression trees to more recent Bayesian approaches, and aggregated tree models. This overview sheds light on the potential utility of tree‐based methods in survey methodology and provides insights into future research directions in this evolving field.
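The out-of-bag (OOB) error mentioned above can be illustrated with a minimal bagged ensemble: each observation is scored only by the base learners whose bootstrap sample happened to exclude it, giving an honest error estimate without a separate test set. The sketch below uses one-dimensional regression stumps as base learners purely for self-containment; the stump fitter and all names are illustrative, not taken from the article.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a 1-D regression stump: the split minimizing squared error."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, xs[0], ys.mean(), ys.mean())
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    thr, lo, hi = stump
    return np.where(x <= thr, lo, hi)

def bagged_stumps_oob_mse(x, y, n_trees, rng):
    """Bagging with out-of-bag error: each point is predicted only by
    stumps whose bootstrap sample did not contain it."""
    n = len(x)
    oob_sum, oob_cnt = np.zeros(n), np.zeros(n)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        stump = fit_stump(x[idx], y[idx])
        oob = np.setdiff1d(np.arange(n), idx)     # points left out
        oob_sum[oob] += predict_stump(stump, x[oob])
        oob_cnt[oob] += 1
    mask = oob_cnt > 0
    return float(np.mean((y[mask] - oob_sum[mask] / oob_cnt[mask]) ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
y = np.where(x > 0, 1.0, -1.0) + 0.3 * rng.standard_normal(300)
print(round(bagged_stumps_oob_mse(x, y, n_trees=50, rng=rng), 3))
```

Because each bootstrap sample leaves out roughly a third of the observations, nearly every point accumulates several OOB predictions, and the resulting mean squared error behaves like a built-in cross-validation estimate.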

     
  3. Guyon, Isabelle (Ed.)
    As the size, complexity, and availability of data continue to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specification. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among variables. Here, motivated by recent insights into random forest behavior, we introduce the simple idea of augmented bagging (AugBagg), a procedure that operates in an identical fashion to classical bagging and random forests but on a larger, augmented space containing additional randomly generated noise features. Surprisingly, we demonstrate that this simple act of including extra noise variables in the model can lead to dramatic improvements in out-of-sample predictive accuracy, sometimes outperforming even an optimally tuned traditional random forest. As a result, intuitive notions of variable importance based on improved model accuracy may be deeply flawed, as even purely random noise can routinely register as statistically significant. Numerous demonstrations on both real and synthetic data are provided along with a proposed solution.
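A minimal sketch of the augmentation idea: append purely random noise columns to the design matrix, then fit an ordinary bagged ensemble on the augmented space. The depth-1 base learner, the median-split rule, and all names here are simplifying assumptions for illustration, not the AugBagg implementation.

```python
import numpy as np

def augment_with_noise(X, n_noise, rng):
    """The augmentation step (sketch): append random noise features
    before fitting a bagged tree ensemble on the enlarged space."""
    return np.hstack([X, rng.standard_normal((X.shape[0], n_noise))])

def best_stump(X, y):
    """Depth-1 tree: best single-feature median split by squared error."""
    best = (np.inf, 0, 0.0, y.mean(), y.mean())
    for j in range(X.shape[1]):
        thr = np.median(X[:, j])                  # single candidate split
        left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, j, thr, left.mean(), right.mean())
    return best[1:]

def bagged_predict(X_train, y, X_test, n_trees, rng):
    """Classical bagging of the stump learner above."""
    preds = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        j, thr, lo, hi = best_stump(X_train[idx], y[idx])
        preds += np.where(X_test[:, j] <= thr, lo, hi)
    return preds / n_trees

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
y = np.sign(X[:, 0]) + rng.standard_normal(200)      # low-SNR signal
X_aug = augment_with_noise(X, n_noise=20, rng=rng)   # 20 pure-noise columns
yhat = bagged_predict(X_aug, y, X_aug, n_trees=30, rng=rng)
print(round(float(np.mean((y - yhat) ** 2)), 3))
```

The noise columns give individual trees extra chances to split on pure noise, which decorrelates the ensemble; this is the same mechanism by which small mtry regularizes a random forest.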
  4. Chiappa, Silvia ; Calandra, Roberto (Ed.)
    Random forests are a powerful non-parametric regression method but are severely limited in the presence of randomly censored observations; naively applied, they can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop a regression adjustment for randomly censored regression quantile models. The adjustment is based on a new estimating equation that adapts to censoring and reduces to the usual quantile score whenever the data exhibit no censoring. The proposed procedure, named censored quantile regression forest, allows us to estimate quantiles of time-to-event outcomes without any parametric modeling assumption. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.
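The quantile score that the estimating equation reduces to in the absence of censoring is the standard pinball loss. A short sketch (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) score at level tau: the loss whose minimizer
    is the tau-th conditional quantile."""
    u = y - q_pred
    return float(np.mean(np.maximum(tau * u, (tau - 1) * u)))

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# the empirical median (tau = 0.5) minimizes the pinball loss
losses = [round(pinball_loss(y, c, 0.5), 3) for c in (2.0, 3.0, 4.0)]
print(losses)  # [0.7, 0.6, 0.7] -- minimized at the median, 3.0
```

For tau = 0.5 the pinball loss is half the absolute error; asymmetric tau targets other quantiles, which is what lets the forest estimate an entire conditional distribution of the time-to-event.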
    Epidemiological models must be calibrated to ground truth for downstream tasks such as producing forward projections or running what-if scenarios. The meaning of calibration changes in the case of a stochastic model, since the output of such a model is generally described via an ensemble or a distribution. Each member of the ensemble is usually mapped to a random number seed (explicitly or implicitly). With the goal of finding not only the input parameter settings but also the random seeds that are consistent with the ground truth, we propose a class of Gaussian process (GP) surrogates along with an optimization strategy based on Thompson sampling. This Trajectory Oriented Optimization (TOO) approach produces actual trajectories close to the empirical observations, instead of a set of parameter settings where only the mean simulation behavior matches the ground truth.
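A minimal numpy sketch of GP-based Thompson sampling on a toy one-dimensional calibration objective: fit a GP to the observed discrepancies, draw one function from the posterior, and propose the minimizer of that draw. The kernel, grid, objective, and all names are illustrative assumptions; the paper's TOO procedure additionally searches over random seeds, which this sketch omits.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-4):
    """GP posterior mean and covariance on a candidate grid."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(x_grid, x_grid) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

def thompson_step(x_obs, y_obs, x_grid, rng):
    """One Thompson-sampling step: sample a function from the GP
    posterior and propose the grid point minimizing that draw."""
    mean, cov = gp_posterior(x_obs, y_obs, x_grid)
    draw = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(x_grid)))
    return x_grid[int(np.argmin(draw))]

# toy calibration objective: discrepancy between simulator and ground truth
f = lambda x: (x - 0.6) ** 2
rng = np.random.default_rng(3)
x_obs = np.array([0.1, 0.5, 0.9])
x_grid = np.linspace(0.0, 1.0, 101)
for _ in range(10):                       # sequential design loop
    x_next = thompson_step(x_obs, f(x_obs), x_grid, rng)
    x_obs = np.append(x_obs, x_next)
print(round(float(x_obs[np.argmin(f(x_obs))]), 2))
```

Because each proposal comes from a random posterior draw rather than the posterior mean, the strategy naturally balances exploring uncertain regions against exploiting the current best fit.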