skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Scalable and Efficient Hypothesis Testing with Random Forests
Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the door to traditional inference procedures, all formal methods suggested thus far place severe restrictions on the testing framework and their computational overhead often precludes their practical scientific use. Here we propose a hypothesis test to formally assess feature significance, which uses permutation tests to circumvent computationally infeasible estimates of nuisance parameters. This test is intended to be analogous to the F-test for linear regression. We establish asymptotic validity of the test via exchangeability arguments and show that the test maintains high power with orders of magnitude fewer computations. Importantly, the procedure scales easily to big data settings where large training and testing sets may be employed, conducting statistically valid inference without the need to construct additional models. Simulations and applications to ecological data, where random forests have recently shown promise, are provided.  more » « less
Award ID(s):
2015400
PAR ID:
10422081
Author(s) / Creator(s):
; ;
Editor(s):
Allen, Genevra
Date Published:
Journal Name:
Journal of machine learning research
Volume:
23
Issue:
170
ISSN:
1533-7928
Page Range / eLocation ID:
1-35
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts. 
    more » « less
  2. Abstract Hardy–Weinberg proportions (HWP) are often explored to evaluate the assumption of random mating. However, in autopolyploids, organisms with more than two sets of homologous chromosomes, HWP and random mating are different hypotheses that require different statistical testing approaches. Currently, the only available methods to test for random mating in autopolyploids (i) heavily rely on asymptotic approximations and (ii) assume genotypes are known, ignoring genotype uncertainty. Furthermore, these approaches are all frequentist, and so do not carry the benefits of Bayesian analysis, including ease of interpretability, incorporation of prior information, and consistency under the null. Here, we present Bayesian approaches to test for random mating, bringing the benefits of Bayesian analysis to this problem. Our Bayesian methods also (i) do not rely on asymptotic approximations, being appropriate for small sample sizes, and (ii) optionally account for genotype uncertainty via genotype likelihoods. We validate our methods in simulations and demonstrate on two real datasets how testing for random mating is more useful for detecting genotyping errors than testing for HWP (in a natural population) and testing for Mendelian segregation (in an experimental S1 population). Our methods are implemented in Version 2.0.2 of thehwepR package on the Comprehensive R Archive Networkhttps://cran.r‐project.org/package=hwep. 
    more » « less
  3. Abstract Fusion learning methods, developed for the purpose of analyzing datasets from many different sources, have become a popular research topic in recent years. Individualized inference approaches through fusion learning extend fusion learning approaches to individualized inference problems over a heterogeneous population, where similar individuals are fused together to enhance the inference over the target individual. Both classical fusion learning and individualized inference approaches through fusion learning are established based on weighted aggregation of individual information, but the weight used in the latter is localized to thetargetindividual. This article provides a review on two individualized inference methods through fusion learning,iFusion andiGroup, that are developed under different asymptotic settings. Both procedures guarantee optimal asymptotic theoretical performance and computational scalability. This article is categorized under:Statistical Learning and Exploratory Methods of the Data Sciences > Manifold LearningStatistical Learning and Exploratory Methods of the Data Sciences > Modeling MethodsStatistical and Graphical Methods of Data Analysis > Nonparametric MethodsData: Types and Structure > Massive Data 
    more » « less
  4. Metamorphic testing is a well known approach to tackle the oracle problem in software testing. This technique requires the use of source test cases that serve as seeds for the generation of follow-up test cases. Systematic design of test cases is crucial for the test quality. Thus, source test case generation strategy can make a big impact on the fault detection effectiveness of metamorphic testing. Most of the previous studies on metamorphic testing have used either random test data or existing test cases as source test cases. There has been limited research done on systematic source test case generation for metamorphic testing. This paper provides a comprehensive evaluation on the impact of source test case generation techniques on the fault finding effectiveness of metamorphic testing. We evaluated the effectiveness of line coverage, branch coverage, weak mutation and random test generation strategies for source test case generation. The experiments are conducted with 77 methods from 4 open source code repositories. Our results show that by systematically creating source test cases, we can significantly increase the fault finding effectiveness of metamorphic testing. Further, in this paper we introduce a simple metamorphic testing tool called "METtester" that we use to conduct metamorphic testing on these methods. 
    more » « less
  5. null (Ed.)
    Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided. 
    more » « less