
Title: Optimal ratio for data splitting
Abstract

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.
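The splitting rule stated in the abstract (training : testing = √p : 1, with p the number of parameters in the working linear model) translates directly into sample counts. A minimal sketch, not the authors' code; the function name and rounding choice are illustrative:

```python
import math

def optimal_split(n_samples: int, n_params: int) -> tuple[int, int]:
    """Train/test sizes under the sqrt(p):1 rule, where p is the number
    of parameters in a linear regression model that fits the data well."""
    ratio = math.sqrt(n_params)                 # training gets sqrt(p) parts
    n_train = round(n_samples * ratio / (ratio + 1.0))
    return n_train, n_samples - n_train

# 1000 samples with a 9-parameter model -> sqrt(9):1 = 3:1 split
print(optimal_split(1000, 9))  # -> (750, 250)
```

Note that the familiar 50/50 split is recovered only for a single-parameter model (√1 : 1); richer models shift more data toward training.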

 
Award ID(s): 1921646, 1921873
NSF-PAR ID: 10445061
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Journal Name: Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume: 15
Issue: 4
ISSN: 1932-1864
Page Range / eLocation ID: p. 531-538
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Background

    Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.

    Methods

    Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.

    Results

    Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under the ROC curve (AUC) of 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance.
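The AUC reported here has a simple rank-based definition: it is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A generic sketch of that computation (not the authors' code; toy labels and scores):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank identity:
    the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```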

    Conclusions

    We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and to further investigate the effect of covariates.

     
  2. ABSTRACT

    We present a machine learning (ML) approach for the prediction of galaxies’ dark matter halo masses which achieves an improved performance over conventional methods. We train three ML algorithms (XGBoost, random forests, and neural network) to predict halo masses using a set of synthetic galaxy catalogues that are built by populating dark matter haloes in N-body simulations with galaxies and that match both the clustering and the joint distributions of properties of galaxies in the Sloan Digital Sky Survey (SDSS). We explore the correlation of different galaxy- and group-related properties with halo mass, and extract the set of nine features that contribute the most to the prediction of halo mass. We find that mass predictions from the ML algorithms are more accurate than those from halo abundance matching (HAM) or dynamical mass estimates (DYN). Since the danger of this approach is that our training data might not accurately represent the real Universe, we explore the effect of testing the model on synthetic catalogues built with different assumptions than the ones used in the training phase. We test a variety of models with different ways of populating dark matter haloes, such as adding velocity bias for satellite galaxies. We determine that, though training and testing on different data can lead to systematic errors in predicted masses, the ML approach still yields substantially better masses than either HAM or DYN. Finally, we apply the trained model to a galaxy and group catalogue from the SDSS DR7 and present the resulting halo masses.
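The halo abundance matching (HAM) baseline the ML models are compared against is, at its simplest, rank-order matching: the brightest galaxy is assigned the most massive halo, and so on down both ranked lists. A toy numpy sketch under that simplification, with entirely synthetic data (all names and distributions here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
luminosity = rng.lognormal(0.0, 1.0, size=1000)   # toy galaxy luminosities
halo_mass = rng.lognormal(12.0, 0.5, size=1000)   # toy halo masses (arbitrary units)

# Rank-order abundance matching: the i-th brightest galaxy
# is assigned the i-th most massive halo.
order = np.argsort(-luminosity)          # galaxies from brightest to faintest
assigned = np.empty_like(halo_mass)
assigned[order] = np.sort(halo_mass)[::-1]
```

Real HAM implementations add scatter and use abundance-matched mass functions rather than a fixed halo sample, but the monotonic rank matching above is the core idea the ML predictions improve upon.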

     
  3. Abstract

    Although infrequent, large (Mw 7.5+) earthquakes can be extremely damaging and occur on subduction and intraplate faults worldwide. Earthquake early warning (EEW) systems aim to provide advanced warning before strong shaking and tsunami onsets. These systems estimate earthquake magnitude using the early metrics of waveforms, relying on empirical scaling relationships of abundant past events. However, both the rarity and complexity of great events make it challenging to characterize them, and EEW algorithms often underpredict magnitude and the resulting hazards. Here, we propose a model, M‐LARGE, that leverages deep learning to characterize crustal deformation patterns of large earthquakes for a specific region in real‐time. We demonstrate the algorithm in the Chilean Subduction Zone by training it with more than six million different simulated rupture scenarios recorded on the Chilean GNSS network. M‐LARGE performs reliable magnitude estimation on the testing data set with an accuracy of 99%. Furthermore, the model successfully predicts the magnitude of five real Chilean earthquakes that occurred in the last 11 years. These events were damaging, large enough to be recorded by modern high-rate GNSS instruments, and provide valuable ground truth. M‐LARGE tracks the evolution of the source process and can make faster and more accurate magnitude estimates, significantly outperforming other similar EEW algorithms. This is the first demonstration of our approach. Future work toward generalization will include the addition of more training and testing data, interfacing with existing EEW methods, and applying the method to different tectonic settings to explore performance in these regions.
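For context on the quantity M-LARGE estimates: moment magnitude Mw is defined from the seismic moment M0 by the standard Hanks-Kanamori relation. A minimal sketch of that definition (background material, not part of the paper's method):

```python
import math

def moment_magnitude(m0: float) -> float:
    """Moment magnitude Mw from seismic moment M0 in newton-meters,
    using the standard relation Mw = (2/3) * (log10(M0) - 9.1)."""
    return (2.0 / 3.0) * (math.log10(m0) - 9.1)

# A seismic moment of 10**22.6 N*m corresponds to roughly Mw 9.0,
# the scale of the largest subduction-zone events
print(moment_magnitude(10**22.6))
```

The logarithmic scale is why underprediction is so consequential: each unit of Mw corresponds to about a 32-fold increase in released moment.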

     
  4. Summary

    Spatiotemporal patterns of Spartina alterniflora belowground biomass (BGB) are important for evaluating salt marsh resiliency. To address this, we created the BERM (Belowground Ecosystem Resiliency Model), which estimates monthly BGB (30‐m spatial resolution) from freely available data such as Landsat‐8 and Daymet climate summaries.

    Our modeling framework relied on extreme gradient boosting, and used field observations from four Georgia salt marshes as ground‐truth data. Model predictors included estimated tidal inundation, elevation, leaf area index, foliar nitrogen, chlorophyll, surface temperature, phenology, and climate data. The final model included 33 variables, and the most important variables were elevation, vapor pressure from the previous four months, Normalized Difference Vegetation Index (NDVI) from the previous five months, and inundation.
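Predictors like "vapor pressure from the previous four months" imply a design matrix of lagged copies of each monthly series. A generic sketch of how such lagged features can be assembled (the helper name and toy data are illustrative, not from BERM):

```python
import numpy as np

def lagged_features(series, lags):
    """Design matrix whose columns are the input series shifted back by
    each lag; e.g. lags=(1, 2, 3, 4) gives the previous four months."""
    series = np.asarray(series, dtype=float)
    m = max(lags)
    # Row t of the result holds series[t-l] for each lag l, aligned so
    # that every row has all of its lags available.
    return np.column_stack([series[m - l : len(series) - l] for l in lags])

X = lagged_features([10, 11, 12, 13, 14, 15], lags=(1, 2))
print(X)  # each row holds the values at t-1 and t-2
```

The first max(lags) time steps are dropped because their lagged values would fall before the start of the series.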

    Root mean squared error for BGB from testing data was 313 g m−2 (11% of the field data range), and explained variance (R2) was 0.62–0.77. Testing data results were unbiased across BGB values and were positively correlated with ground‐truth data across all sites and years (r = 0.56–0.82 and 0.45–0.95, respectively).

    BERM can estimate BGB within Spartina alterniflora salt marshes where environmental parameters are within the training data range, and can be readily extended through a reproducible workflow. This provides a powerful approach for evaluating spatiotemporal BGB and associated ecosystem function.

     
  5. ABSTRACT

    In this paper, we introduce a novel data augmentation methodology based on Conditional Progressive Generative Adversarial Networks (CPGAN) to generate diverse black hole (BH) images, accounting for variations in spin and electron temperature prescriptions. These generated images are valuable resources for training deep learning algorithms to accurately estimate black hole parameters from observational data. Our model can generate BH images for any spin value within the range of [−1, 1], given an electron temperature distribution. To validate the effectiveness of our approach, we employ a convolutional neural network to predict the BH spin using both the GRMHD images and the images generated by our proposed model. Our results demonstrate a significant performance improvement when training is conducted with the augmented data set while testing is performed using GRMHD simulated data, as indicated by the high R2 score. Consequently, we propose that GANs can be employed as cost-effective models for black hole image generation and reliably augment training data sets for other parametrization algorithms.
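The R2 score used to validate the spin predictions is the standard coefficient of determination. A minimal self-contained sketch of that metric (generic, not the authors' evaluation code):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, where SS_res is
    the residual sum of squares and SS_tot the total variance around the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy spin values in [-1, 1]: near-perfect predictions give R2 close to 1
print(r2_score([0.2, 0.5, 0.9], [0.25, 0.45, 0.9]))
```

An R2 of 1 means perfect prediction; 0 means no better than always predicting the mean spin of the test set.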

     