Title: Trees, forests, chickens, and eggs: when and why to prune trees in a random forest
Abstract: Due to their long‐standing reputation as excellent off‐the‐shelf predictors, random forests (RFs) remain a go‐to model for applied statisticians and data scientists. Despite their widespread use, however, until recently little was known about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged: one based on interpolation and the other based on regularization. This work argues in favor of the latter by using the regularization framework to reexamine the decades‐old question of whether individual trees in an ensemble ought to be pruned. Although default constructions of RFs in most popular software packages use near full depth trees, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal‐to‐noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of “double descent” in RFs by drawing parallels to U‐statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
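The claim that tree depth acts as regularization can be illustrated with a small simulation. The following is a minimal sketch and not the authors' experimental setup: it bags one-dimensional regression trees (so there is no feature subsampling), and the sine signal, the noise levels, the depth limit of 2, and the names `fit_tree`, `fit_forest`, and `forest_mse` are all assumptions chosen for illustration.

```python
import math
import random


def fit_tree(points, max_depth):
    """Fit a 1-D regression tree to (x, y) pairs; a leaf is a float, a split is a tuple."""
    mean = sum(y for _, y in points) / len(points)
    if max_depth == 0 or len(points) < 2:
        return mean
    pts = sorted(points)
    n = len(pts)
    total = sum(y for _, y in pts)
    best_score, best_i = None, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pts[i - 1][1]
        if pts[i][0] == pts[i - 1][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximizing this quantity is equivalent to minimizing the squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        return mean
    threshold = (pts[best_i - 1][0] + pts[best_i][0]) / 2
    return (threshold,
            fit_tree(pts[:best_i], max_depth - 1),
            fit_tree(pts[best_i:], max_depth - 1))


def predict_tree(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def fit_forest(points, max_depth, n_trees, rng):
    """Bagging: each tree is fit to a bootstrap resample of the training data."""
    n = len(points)
    return [fit_tree([points[rng.randrange(n)] for _ in range(n)], max_depth)
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of the individual trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


def signal(x):
    """Hypothetical true regression function used only for this illustration."""
    return math.sin(2 * math.pi * x)


def forest_mse(max_depth, noise_sd, seed=0, n_train=100, n_trees=25):
    """Mean squared error of a bagged forest against the noise-free signal."""
    rng = random.Random(seed)
    train = []
    for _ in range(n_train):
        x = rng.random()
        train.append((x, signal(x) + rng.gauss(0.0, noise_sd)))
    forest = fit_forest(train, max_depth, n_trees, rng)
    grid = [(i + 0.5) / 200 for i in range(200)]
    return sum((predict_forest(forest, x) - signal(x)) ** 2 for x in grid) / len(grid)


if __name__ == "__main__":
    for noise_sd in (3.0, 0.05):  # low and high signal-to-noise ratio (assumed values)
        shallow = forest_mse(max_depth=2, noise_sd=noise_sd)
        deep = forest_mse(max_depth=math.inf, noise_sd=noise_sd)
        print(f"noise_sd={noise_sd}: shallow MSE={shallow:.4f}, full-depth MSE={deep:.4f}")
```

In this toy setting the depth limit plays the role of a tuning parameter: when noise dominates, the shallow forest averages over many observations per leaf, whereas full-depth trees track individual noisy responses.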
Award ID(s): 2015400
PAR ID: 10370278
Author(s) / Creator(s):
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name: Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume: 16
Issue: 1
ISSN: 1932-1864
Page Range / eLocation ID: p. 45-64
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.
  2. Morin, P; Suri, S (Ed.)
    We define simple variants of zip trees, called zip-zip trees, which provide several advantages over zip trees, including overcoming a bias that favors smaller keys over larger ones. We analyze zip-zip trees theoretically and empirically, showing, e.g., that the expected depth of a node in an n-node zip-zip tree is at most 1.3863 log n - 1 + o(1), which matches the expected depth of treaps and binary search trees built by uniformly random insertions. Unlike these other data structures, however, zip-zip trees achieve their bounds using only O(log log n) bits of metadata per node, w.h.p., as compared to the O(log n) bits per node required by treaps. In fact, we even describe a “just-in-time” zip-zip tree variant, which needs just an expected O(1) number of bits of metadata per node. Moreover, we can define zip-zip trees to be strongly history independent, whereas treaps are generally only weakly history independent. We also introduce biased zip-zip trees, which have an explicit bias based on key weights, so the expected depth of a key, k, with weight, w, is O(log W/w), where W is the weight of all keys in the weighted zip-zip tree. Finally, we show that one can easily make zip-zip trees partially persistent with only O(n) space overhead w.h.p.
  3. Strand-based hair simulations have recently become increasingly popular for a range of real-time applications. However, accurately simulating the full number of hair strands remains challenging. A commonly employed technique involves simulating a subset of guide hairs to capture the overall behavior of the hairstyle. Details are then enriched by interpolation using linear skinning. Hair interpolation enables fast real-time simulations but frequently leads to various artifacts during runtime. As the skinning weights are often pre-computed, substantial variations between the initial and deformed shapes of the hair can cause severe deviations in fine hair geometry. Straight hairs may become kinked, and curly hairs may become zigzags. This work introduces a novel physical-driven hair interpolation scheme that utilizes existing simulated guide hair data. Instead of directly operating on positions, we interpolate the internal forces from the guide hairs before efficiently reconstructing the rendered hairs based on their material model. We formulate our problem as a constraint satisfaction problem for which we present an efficient solution. Further practical considerations are addressed using regularization terms that regulate penetration avoidance and drift correction. We have tested various hairstyles to illustrate that our approach can generate visually plausible rendered hairs with only a few guide hairs and minimal computational overhead, amounting to only about 20% of conventional linear hair interpolation. This efficiency underscores the practical viability of our method for real-time applications. 
  4. Motivation: As genome-wide reconstruction of phylogenetic trees becomes more widespread, limitations of available data are being appreciated more than ever before. One issue is that phylogenomic datasets are riddled with missing data, and gene trees, in particular, almost always lack representatives from some species otherwise available in the dataset. Since many downstream applications of gene trees require or can benefit from access to complete gene trees, it will be beneficial to algorithmically complete gene trees. Also, gene trees are often unrooted, and rooting them is useful for downstream applications. While completing and rooting a gene tree with respect to a given species tree has been studied, those problems are not studied in depth when we lack such a reference species tree. Results: We study completion of gene trees without a need for a reference species tree. We formulate an optimization problem to complete the gene trees while minimizing their quartet distance to the given set of gene trees. We extend a seminal algorithm by Brodal et al. to solve this problem in quasi-linear time. In simulated studies and on a large empirical dataset, we show that completion of gene trees using other gene trees is relatively accurate and, unlike the case where a species tree is available, is unbiased. Availability and implementation: Our method, tripVote, is available at https://github.com/uym2/tripVote. Supplementary information: Supplementary data are available at Bioinformatics online.
  5. Summary Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix. 