Title: Unbiased Measurement of Feature Importance in Tree-Based Methods
We propose a modification that corrects for split-improvement variable importance measures in Random Forests and other tree-based methods. These methods have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out-of-sample data, this bias can be corrected, yielding better summaries and screening tools.
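The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the estimator proposed by the authors; it is a minimal, assumed setup in which the response is pure noise, one feature is binary (a single potential split) and one is continuous (many potential splits). For each feature it picks the best single split on training data, then reports the squared-error improvement of that split both in-sample and on held-out data. All function names and parameter values (`split_improvement`, `best_threshold`, `simulate`, `n=50`, `reps=200`) are hypothetical choices made for illustration.

```python
import random
from statistics import mean


def split_improvement(x_fit, y_fit, x_eval, y_eval, threshold):
    """Per-sample drop in squared error on the evaluation data when the
    parent mean and the two child means are fitted on the fitting data."""
    parent = mean(y_fit)
    left_mean = mean(y for x, y in zip(x_fit, y_fit) if x <= threshold)
    right_mean = mean(y for x, y in zip(x_fit, y_fit) if x > threshold)
    before = sum((y - parent) ** 2 for y in y_eval)
    after = sum(
        (y - (left_mean if x <= threshold else right_mean)) ** 2
        for x, y in zip(x_eval, y_eval)
    )
    return (before - after) / len(y_eval)


def best_threshold(x, y):
    """Threshold with the largest in-sample improvement."""
    # Dropping the largest value keeps both children non-empty.
    candidates = sorted(set(x))[:-1]
    return max(candidates, key=lambda t: split_improvement(x, y, x, y, t))


def simulate(n=50, reps=200, seed=0):
    """Average in-sample and out-of-sample improvement for two noise features."""
    rng = random.Random(seed)
    totals = {"binary": [0.0, 0.0], "continuous": [0.0, 0.0]}
    for _ in range(reps):
        # The response is independent of both features by construction.
        y_train = [rng.gauss(0, 1) for _ in range(n)]
        y_test = [rng.gauss(0, 1) for _ in range(n)]
        features = {
            "binary": ([i % 2 for i in range(n)], [i % 2 for i in range(n)]),
            "continuous": (
                [rng.random() for _ in range(n)],
                [rng.random() for _ in range(n)],
            ),
        }
        for name, (x_train, x_test) in features.items():
            t = best_threshold(x_train, y_train)
            totals[name][0] += split_improvement(x_train, y_train, x_train, y_train, t)
            totals[name][1] += split_improvement(x_train, y_train, x_test, y_test, t)
    return {name: (a / reps, b / reps) for name, (a, b) in totals.items()}


results = simulate()
for name, (in_sample, out_of_sample) in results.items():
    print(f"{name:>10}: in-sample {in_sample:.4f}, out-of-sample {out_of_sample:.4f}")
```

In this setup, neither feature carries any signal, yet the in-sample improvement should come out larger for the continuous feature simply because it offers more candidate splits. The held-out improvement of the same split is no longer inflated by the search over thresholds, which is the intuition behind measuring split improvement on out-of-sample data.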
Zhou, Zhengze; Hooker, Giles
(ACM Transactions on Knowledge Discovery from Data)
Joseph, V. Roshan; Vakayil, Akhil
(Technometrics)
In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
Mei, Zaidao; Wang, Boyu; Rider, Daniel; Qiu, Qinru
(International Conference on Neuromorphic Systems)
Incremental learning is a challenging task in the field of machine learning, and it is a key step towards autonomous learning and adaptation. With the increasing attention on neuromorphic computing, there is an urgent need to investigate incremental learning techniques that can work in this paradigm to maintain energy efficiency while benefiting from flexibility and adaptability. In this paper, we present SEMINAR (sensitivity modulated importance networking and rehearsal), an incremental learning algorithm designed specifically for EMSTDP (Error Modulated Synaptic-Timing Dependent Plasticity), which performs supervised learning for multi-layer spiking neural networks (SNN) implemented on neuromorphic hardware, such as Loihi. SEMINAR uses critical synapse selection, differential learning rate and a replay buffer to enable the model to retain past knowledge while maintaining flexibility to learn new tasks. Our experimental results show that, when combined with the EMSTDP, SEMINAR outperforms different baseline incremental learning algorithms and gives more than 4% improvement on several widely used datasets such as Split-MNIST, Split-Fashion MNIST, Split-NMNIST and MSTAR.
ABSTRACT UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date. IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected.
Antwi-Danso, J.; Papovich
(Bulletin of the American Astronomical Society)
We calibrate and validate different methods of rest-frame color-color selection to identify galaxies in active star-forming and quiescent stages of their evolution. Our method is similar to the widely used UVJ color-color diagram, which is an effective way to distinguish between quiescent and star-forming galaxies using their rest-frame U-V and V-J colors. UVJ colors suffer known systematics, and at z > 4 the method must be extrapolated because the rest-frame J-band moves beyond the coverage of the deepest bandpasses (typically IRAC 4.5 µm). This leads to biases: for example, spectroscopic campaigns have shown that UVJ-quiescent samples include ~10-30% contamination from galaxies with significant amounts of star formation. Alternative selection methods will be important not just to mitigate these biases, but also in the JWST era, where NIRCam coverage is also limited to ~5 µm. In this poster, we present calibrations of alternative rest-frame filter combinations that are applicable for galaxies at redshifts z = 4 - 6. We apply our method to a stellar mass-limited sample of galaxies at 4 < z < 6 from the FLAMINGOS-2 Extragalactic Near-Infrared K-Split (FENIKS) survey. FENIKS is a deep (23.1 - 24.5 AB mag) survey employing two novel filters which split the Ks band (λc = 2.2 µm) into K-blue and K-red filters (λc = 1.9 and 2.3 µm, respectively), allowing for finer sampling of the Balmer/4000 Å break of galaxies with evolved populations. We quantify the improvement in the selection of quiescent and star-forming galaxies using the alternative color-color selection methods. Furthermore, we investigate correlations between galaxy properties and their rest-frame colors, in particular examining purity and completeness of these selection methods. Finally, we explore the above for a wide range of synthetic filter combinations to inform accurate selections of various galaxy populations and rule out unphysical areas of parameter space for these populations.
Zhou, Zhengze, and Hooker, Giles. Unbiased Measurement of Feature Importance in Tree-Based Methods. Retrieved from https://par.nsf.gov/biblio/10298502. ACM Transactions on Knowledge Discovery from Data 15.2 Web. doi:10.1145/3429445.
@article{osti_10298502,
place = {Country unknown/Code not available},
title = {Unbiased Measurement of Feature Importance in Tree-Based Methods},
url = {https://par.nsf.gov/biblio/10298502},
DOI = {10.1145/3429445},
abstractNote = {We propose a modification that corrects for split-improvement variable importance measures in Random Forests and other tree-based methods. These methods have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out of sample data, this bias can be corrected yielding better summaries and screening tools.},
journal = {ACM Transactions on Knowledge Discovery from Data},
volume = {15},
number = {2},
author = {Zhou, Zhengze and Hooker, Giles},
editor = {null}
}