Title: Unbiased Measurement of Feature Importance in Tree-Based Methods
We propose a modification that corrects for bias in split-improvement variable importance measures in Random Forests and other tree-based methods. These measures have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out-of-sample data, this bias can be corrected, yielding better summaries and screening tools.
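The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator: it uses a single regression split and a plain held-out squared-error reduction as a stand-in for "split-improvement as measured on out of sample data". All function names (`best_split`, `held_out_gain`, `simulate`) and the parameters `n`, `reps`, and `seed` are hypothetical choices made for this illustration. The response is pure noise, so neither feature carries any signal.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)


def best_split(x, y):
    """Best split x <= t on training data: (threshold, in-sample SSE reduction)."""
    parent = sse(y, mean(y))
    best_t, best_gain = None, 0.0
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between adjacent distinct feature values,
    # so a feature with more distinct values offers more potential splits.
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = parent - sse(left, mean(left)) - sse(right, mean(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


def held_out_gain(x_tr, y_tr, x_te, y_te, t):
    """SSE reduction on held-out data, using node means fitted on training data."""
    if t is None:  # no valid split was found on the training data
        return 0.0
    parent = mean(y_tr)
    left = mean([yi for xi, yi in zip(x_tr, y_tr) if xi <= t])
    right = mean([yi for xi, yi in zip(x_tr, y_tr) if xi > t])
    gain = 0.0
    for xi, yi in zip(x_te, y_te):
        child = left if xi <= t else right
        gain += (yi - parent) ** 2 - (yi - child) ** 2
    return gain


def simulate(n=40, reps=300, seed=0):
    """Average both measures over many noise-only data sets (assumed setup)."""
    rng = random.Random(seed)
    makers = {
        "binary": lambda: float(rng.randint(0, 1)),  # one potential split
        "continuous": rng.random,                    # up to n - 1 potential splits
    }
    totals = {name: {"in_sample": 0.0, "held_out": 0.0} for name in makers}
    for _ in range(reps):
        y_tr = [rng.gauss(0, 1) for _ in range(n)]
        y_te = [rng.gauss(0, 1) for _ in range(n)]
        for name, make in makers.items():
            x_tr = [make() for _ in range(n)]
            x_te = [make() for _ in range(n)]
            t, gain = best_split(x_tr, y_tr)
            totals[name]["in_sample"] += gain
            totals[name]["held_out"] += held_out_gain(x_tr, y_tr, x_te, y_te, t)
    return {name: {k: v / reps for k, v in d.items()} for name, d in totals.items()}


if __name__ == "__main__":
    for name, measures in simulate().items():
        print(name, measures)
```

In this setup the in-sample improvement is never negative and tends to be larger for the continuous feature, simply because the search runs over more candidate thresholds. The held-out version does not share that built-in reward. How the paper combines in-sample and out-of-sample quantities is not specified in this record, so the held-out measure here should be read only as an illustration of the idea.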
Award ID(s):
1712554
PAR ID:
10298437
Journal Name:
ACM transactions on knowledge discovery from data
Volume:
15
Issue:
2
ISSN:
1556-4681
Page Range / eLocation ID:
1-21
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a modification that corrects for bias in split-improvement variable importance measures in Random Forests and other tree-based methods. These measures have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out-of-sample data, this bias can be corrected, yielding better summaries and screening tools.
  2. In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
  3. Incremental learning is a challenging task in the field of machine learning, and it is a key step towards autonomous learning and adaptation. With the increasing attention on neuromorphic computing, there is an urgent need to investigate incremental learning techniques that can work in this paradigm to maintain energy efficiency while benefiting from flexibility and adaptability. In this paper, we present SEMINAR (sensitivity modulated importance networking and rehearsal), an incremental learning algorithm designed specifically for EMSTDP (Error Modulated Synaptic-Timing Dependent Plasticity), which performs supervised learning for multi-layer spiking neural networks (SNN) implemented on neuromorphic hardware, such as Loihi. SEMINAR uses critical synapse selection, differential learning rate and a replay buffer to enable the model to retain past knowledge while maintaining flexibility to learn new tasks. Our experimental results show that, when combined with the EMSTDP, SEMINAR outperforms different baseline incremental learning algorithms and gives more than 4% improvement on several widely used datasets such as Split-MNIST, Split-Fashion MNIST, Split-NMNIST and MSTAR. 
  4. Greene, Casey S. (Ed.)
    ABSTRACT UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date. IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected. 
  5. We calibrate and validate different methods of rest-frame color-color selection to identify galaxies in active star-forming and quiescent stages of their evolution. Our method is similar to the widely used UVJ color-color diagram, which is an effective way to distinguish between quiescent and star-forming galaxies using their rest-frame U-V and V-J colors. UVJ colors suffer known systematics, and at z > 4 the method must be extrapolated because the rest-frame J-band moves beyond the coverage of the deepest bandpasses (typically IRAC 4.5 µm). This leads to biases: for example, spectroscopic campaigns have shown that UVJ-quiescent samples include ~10-30% contamination from galaxies with significant amounts of star formation. Alternative selection methods will be important not just to mitigate these biases, but also in the JWST era, where NIRCam coverage is also limited to ~5 µm. In this poster, we present calibrations of alternative rest-frame filter combinations that are applicable for galaxies at redshifts z = 4 - 6. We apply our method to a stellar mass-limited sample of galaxies at 4 < z < 6 from the FLAMINGOS-2 Extragalactic Near-Infrared K-Split (FENIKS) survey. FENIKS is a deep (23.1 - 24.5 AB mag) survey employing two novel filters which split the Ks band (λc = 2.2 µm) into K-blue and K-red filters (λc = 1.9 and 2.3 µm, respectively), allowing for finer sampling of the Balmer/4000 Å break of galaxies with evolved populations. We quantify the improvement in the selection of quiescent and star-forming galaxies using the alternative color-color selection methods. Furthermore, we investigate correlations between galaxy properties and their rest-frame colors, in particular examining purity and completeness of these selection methods. Finally, we explore the above for a wide range of synthetic filter combinations to inform accurate selections of various galaxy populations and rule out unphysical areas of parameter space for these populations.