Title: SPlit: An Optimal Method for Data Splitting
In this article, we propose an optimal method, referred to as SPlit, for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
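The subsampling step described above can be illustrated with a minimal sketch. It assumes a set of representative points is already available and shows only one plausible reading of a sequential nearest neighbor assignment: each representative point, in turn, claims the closest data row that has not yet been claimed. The function name `sequential_nearest_neighbor`, the use of squared Euclidean distance, and the random placeholder points are assumptions made for illustration. This is not the authors' implementation, and it does not compute support points.

```python
import numpy as np


def sequential_nearest_neighbor(data, rep_points):
    """Match each representative point to a distinct row of `data`.

    Hypothetical helper for illustration only. Returns the indices of the
    selected rows, one per representative point, in the order visited.
    """
    data = np.asarray(data, dtype=float)
    rep_points = np.asarray(rep_points, dtype=float)
    if len(rep_points) > len(data):
        raise ValueError("more representative points than data rows")

    available = np.ones(len(data), dtype=bool)  # rows not yet selected
    chosen = []
    for point in rep_points:
        # Squared Euclidean distance from this point to every row.
        dist = np.sum((data - point) ** 2, axis=1)
        dist[~available] = np.inf  # exclude rows already taken
        idx = int(np.argmin(dist))
        chosen.append(idx)
        available[idx] = False
    return np.array(chosen)


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))

# Placeholder representative points (assumption): random draws stand in
# for support points, which would come from a separate optimization.
rep_points = rng.normal(size=(20, 2))

# Assumption: the matched rows form the testing set, the rest the training set.
test_idx = sequential_nearest_neighbor(data, rep_points)
train_idx = np.setdiff1d(np.arange(len(data)), test_idx)
```

Marking claimed rows as unavailable is what makes the procedure sequential: no row can be selected twice, so the resulting subset has exactly as many rows as there are representative points.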
Award ID(s):
1921873
PAR ID:
10293256
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Technometrics
ISSN:
0040-1706
Page Range / eLocation ID:
1 to 11
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.
  2. In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.
  3. We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons into multiple offspring. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a splitting gradient for optimally updating the offspring. Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an ∞-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent. Our method provides a new computationally efficient approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.
  4. This paper addresses the mitigation of spatial-wideband (SW) and the resulting beam-split (B-SP) effects in intelligent reflecting surface (IRS)-aided wideband systems. The SW effect occurs when the signal delay across the IRS aperture exceeds the system’s sampling duration, causing the user equipment’s (UE) channel angle to vary with frequency. This leads to the B-SP effect, wherein the IRS cannot coherently beamform to a given UE over the entire bandwidth, reducing array gain and throughput. We first show that partitioning a single IRS into multiple smaller IRSs and distributing them in the environment can naturally mitigate the SW effect (and hence the B-SP effect) by parallelizing the spatial delays and exploiting angle diversity benefits. Next, by determining the maximum number of elements at each smaller IRS to limit B-SP effects and analyzing the achievable sum-rate, we demonstrate that our approach ensures a minimum positive rate over the entire bandwidth of operation. However, distributed IRSs may introduce temporal delay spread (TDS) due to the differences in the path lengths through the IRSs and this may reduce the achievable flat channel gain. To minimize TDS and maintain the full array gain, we show that the optimal placement of the IRSs is on an ellipse with the base station (BS) and UE as the focal points. We also analyze the impact of the optimal IRS placement on TDS and throughput for a UE that is located within a hotspot served by the IRSs. Finally, we illustrate that distributed IRSs enhance angle diversity, which exponentially reduces the outage probability due to B-SP effects as the number of IRSs increases. Numerical results validate the efficacy and simplicity of our method compared to the existing solutions. 
  5. Microfluidic devices are defined by the application of fluid flow to micron-scale features. Inherent to most experiments involving microfluidic devices is the need to precisely and reproducibly control fluid flow at the microliter scale, often through multiple inlet ports on a single device. While the number of fluid channels per device varies, perfusing multiple inputs requires either the use of multiple flow controllers (often syringe or peristaltic pumps) or the ability to evenly divide fluid across outlets. Towards the latter approach, while a handful of commercial systems exist for splitting fluid flow, these set-ups require significant financial investment, multiple flow control and sensing components, and restrict the user to a predetermined perfusion control system. Simple in-line splitting devices, such as manifolds or T junctions, fail to achieve flow splitting at the low flow rates often used in microfluidic systems. To increase capabilities for flow-controlled experiments, we performed experimental analyses of the physical considerations governing even flow splitting under low flow, leading to the design of a microdevice (µ-Split) that can be directly inserted into existing microfluidic set-ups. The µ-Split allows for reproducible, even flow splitting from 10 µL/min to > 2.5 mL/min. By testing multiple device geometries in combination with multiphysics simulations, we identified the design and fabrication features underlying the splitting precision achieved by the µ-Split. Overall, this work provides a useful tool to simplify microfluidic experiments that require evenly divided flow streams, while minimizing the overall device footprint.