Title: Trees, forests, chickens, and eggs: when and why to prune trees in a random forest
Abstract

Due to their long‐standing reputation as excellent off‐the‐shelf predictors, random forests (RFs) remain a go‐to model for applied statisticians and data scientists. Despite their widespread use, however, little was known until recently about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged: one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades‐old question of whether individual trees in an ensemble ought to be pruned. Although default constructions of RFs in most popular software packages use near‐full‐depth trees, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal‐to‐noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of “double descent” in RFs by drawing parallels to U‐statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.

 
Award ID(s):
2015400
NSF-PAR ID:
10370278
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume:
16
Issue:
1
ISSN:
1932-1864
Page Range / eLocation ID:
p. 45-64
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.
  2. Premise

    Joshua trees (Yucca brevifolia and Y. jaegeriana) and their yucca moth pollinators (Tegeticula synthetica and T. antithetica) are a model system for studies of plant–pollinator coevolution, and they are thought to be one of the only cases in which there is compelling evidence for cospeciation driven by coevolution. Previous work attempted to evaluate whether divergence between the plants and their pollinators was contemporaneous. That work concluded that the trees diverged more than 5 million years ago, well before the pollinators. However, clear inferences were hampered by a lack of data from the nuclear genome and low genetic variation in chloroplast genes. As a result, divergence times in the trees could not be confidently estimated.

    Methods

    We present an analysis of whole chloroplast genome sequence data and RADseq data from >5000 loci in the nuclear genome. We developed a molecular clock for the Asparagales and the Agavoideae using multiple fossil calibration points. Using Bayesian inference, we produced new estimates for the age of the genus Yucca and for Joshua trees. We calculated summary statistics describing genetic variation and used coalescent‐based methods to estimate population genetic parameters.

    Results

    We find that the Joshua trees are moderately genetically differentiated, but that they diverged quite recently (~100–200 kya), and much more recently than their pollinators.

    Conclusions

    The results argue against the notion that coevolution directly contributed to speciation in this system, suggesting instead that coevolution with pollinators may have reinforced reproductive isolation following initial divergence in allopatry.

     
  3. Abstract

    In savannas, partitioning of below‐ground resources by depth could facilitate tree–grass coexistence and shape vegetation responses to changing rainfall patterns. However, most studies assessing tree versus grass root‐niche partitioning have focused on one or two sites, limiting generalization about how rainfall and soil conditions influence the degree of rooting overlap across environmental gradients.

    We used two complementary stable isotope techniques to quantify variation (a) in water uptake depths and (b) in fine‐root biomass distributions among dominant trees and grasses at eight semi‐arid savanna sites in Kruger National Park, South Africa. Sites were located on contrasting soil textures (clayey basaltic soils vs. sandy granitic soils) and paired along a gradient of mean annual rainfall.

    Soil texture predicted variation in mean water uptake depths and fine‐root allocation. While grasses maintained roots close to the surface and consistently used shallow water, trees on sandy soils distributed roots more evenly across soil depths and used deeper soil water, resulting in greater divergence between tree and grass rooting on sandy soils. Mean annual rainfall predicted some variation among sites in tree water uptake depth, but had a weaker influence on fine‐root allocation.

    Synthesis. Savanna trees overlapped more with shallow‐rooted grasses on clayey soils and were more distinct in their use of deeper soil layers on sandy soils, consistent with expected differences in infiltration and percolation. These differences, which could allow trees to escape grass competition more effectively on sandy soils, may explain observed differences in tree densities and rates of woody encroachment with soil texture. Differences in the degree of root‐niche separation could also drive heterogeneous responses of savanna vegetation to predicted shifts in the frequency and intensity of rainfall.

     
  4. Summary

    We have developed a linear 3-D gravity inversion method capable of modelling complex geological regions such as subduction margins. Our procedure inverts satellite gravity to determine the best-fitting differential densities of spatially discretized subsurface prisms in a least-squares sense. We use a Bayesian approach to incorporate both data error and prior constraints based on seismic reflection and refraction data. Based on these data, Gaussian priors are applied to the appropriate model parameters as absolute equality constraints. To stabilize the inversion and provide relative equality constraints on the parameters, we utilize a combination of first and second order Tikhonov regularization, which enforces smoothness in the horizontal direction between seismically constrained regions, while allowing for sharper contacts in the vertical. We apply this method to the nascent Puysegur Trench, south of New Zealand, where oceanic lithosphere of the Australian Plate has underthrust Puysegur Ridge and Solander Basin on the Pacific Plate since the Miocene. These models provide insight into the density contrasts, Moho depth, and crustal thickness in the region. The final model has a mean standard deviation on the model parameters of about 17 kg m⁻³, and a mean absolute error on the predicted gravity of about 3.9 mGal, demonstrating the success of this method for even complex density distributions like those present at subduction zones. The posterior density distribution versus seismic velocity is diagnostic of compositional and structural changes and shows a thin sliver of oceanic crust emplaced between the nascent thrust and the strike-slip Puysegur Fault. However, the northern end of the Puysegur Ridge, at the Snares Zone, is predominantly buoyant continental crust, despite its subsidence with respect to the rest of the ridge. These features highlight the mechanical changes unfolding during subduction initiation.
  5. Abstract

    Mean annual air temperatures are projected to increase, while the winter snowpack is expected to shrink in depth and duration for many mid‐ and high‐latitude temperate forest ecosystems over the next several decades. Together, these changes will lead to warmer growing season soil temperatures and an increased frequency of soil freeze–thaw cycles (FTCs) in winter. We took advantage of the Climate Change Across Seasons Experiment (CCASE) at the Hubbard Brook Experimental Forest in the White Mountains of New Hampshire, USA, to determine how these changes in soil temperature affect foliar nitrogen (N) and carbon metabolism of red maple (Acer rubrum) trees in 2015 and 2017. Earlier work from this study revealed a similar increase in foliar N concentrations with growing season soil warming, with or without the occurrence of soil FTCs in winter. However, these changes in soil warming could differentially affect the availability of cellular nutrients, concentrations of primary and secondary metabolites, and the rates of photosynthesis that are all responsive to climate change. We found that foliar concentrations of phosphorus (P), potassium (K), N, spermine (a polyamine), amino acids (alanine, histidine, and phenylalanine), chlorophyll, carotenoids, sucrose, and rates of photosynthesis increased with growing season soil warming. Despite similar concentrations of foliar N with soil warming with and without soil FTCs in winter, winter soil FTCs affected other foliar metabolic responses. The combination of growing season soil warming and winter soil FTCs led to increased concentrations of two polyamines (putrescine and spermine) and amino acids (alanine, proline, aspartic acid, γ‐aminobutyric acid, valine, leucine, and isoleucine). Treatment‐specific metabolic changes indicated that while responses to growing season warming were more connected to their role as growth modulators, soil warming + FTC treatment‐related effects revealed their dual role in growth and stress tolerance. Together, the results of this study demonstrate that growing season soil warming has multiple positive effects on foliar N and cellular metabolism in trees and that some of these foliar responses are further modified by the addition of stress from winter soil FTCs.

     