Scalable and Efficient Hypothesis Testing with Random Forests

Coleman, Tim; Peng, Wei; Mentch, Lucas

Citation Details

Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the door to traditional inference procedures, all formal methods suggested thus far place severe restrictions on the testing framework and their computational overhead often precludes their practical scientific use. Here we propose a hypothesis test to formally assess feature significance, which uses permutation tests to circumvent computationally infeasible estimates of nuisance parameters. This test is intended to be analogous to the F-test for linear regression. We establish asymptotic validity of the test via exchangeability arguments and show that the test maintains high power with orders of magnitude fewer computations. Importantly, the procedure scales easily to big data settings where large training and testing sets may be employed, conducting statistically valid inference without the need to construct additional models. Simulations and applications to ecological data, where random forests have recently shown promise, are provided.

Award ID(s): 2015400

PAR ID: 10422081

Author(s) / Creator(s): Coleman, Tim; Peng, Wei; Mentch, Lucas

Editor(s): Allen, Genevra

Date Published: 2022-06-01

Journal Name: Journal of Machine Learning Research

Volume: 23

Issue: 170

ISSN: 1533-7928

Page Range / eLocation ID: 1-35

Format(s): Medium: X

Sponsoring Org: National Science Foundation
