SPlit: An Optimal Method for Data Splitting

Joseph, V. Roshan; Vakayil, Akhil

doi:10.1080/00401706.2021.1921037

Citation Details

In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
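The subsampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the representative points here are plain k-means centers, used only as a stand-in for support points, and the function names (`kmeans_centers`, `sequential_nearest_neighbor`, `split_sketch`) are hypothetical. The sketch assumes purely numeric rows and does not cover the categorical extension. Only the sequential nearest neighbor step, where each representative point claims its nearest unused data row for the testing set, follows the description above.

```python
import math
import random


def kmeans_centers(data, k, iters=20, seed=0):
    """Stand-in for support points: k-means centers via Lloyd's iterations.

    ASSUMPTION: the paper uses support points; k-means is substituted here
    only to keep the sketch short and dependency-free.
    """
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for row in data:
            j = min(range(k), key=lambda c: math.dist(row, centers[c]))
            groups[j].append(row)
        # Recompute each center as the mean of its group; keep the old
        # center if a group happens to be empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def sequential_nearest_neighbor(data, reps):
    """Match each representative point to its nearest unused data row."""
    remaining = set(range(len(data)))
    chosen = []
    for r in reps:
        nearest = min(remaining, key=lambda i: math.dist(data[i], r))
        chosen.append(nearest)
        remaining.remove(nearest)  # each row can be selected only once
    return chosen, sorted(remaining)


def split_sketch(data, n_test):
    """Return (test_indices, train_indices) for a list of numeric rows."""
    reps = kmeans_centers(data, n_test)
    return sequential_nearest_neighbor(data, reps)


# Toy usage on synthetic two-dimensional data.
rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
test_idx, train_idx = split_sketch(data, n_test=20)
```

Because a selected row is removed from the pool before the next representative point is processed, the testing set consists of distinct rows of the original dataset rather than synthetic points.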

Award ID(s): 1921873

PAR ID: 10293256

Author(s) / Creator(s): Joseph, V. Roshan; Vakayil, Akhil

Date Published: 2021-01-01

Journal Name: Technometrics

ISSN: 0040-1706

Page Range / eLocation ID: 1 to 11

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article: https://doi.org/10.1080/00401706.2021.1921037
