

Title: Optimal ratio for data splitting
Abstract

It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.
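A minimal sketch of how this splitting rule could be applied in practice, assuming scikit-learn's train_test_split; the helper name, toy data, and the choice of p are illustrative only:

    import numpy as np
    from sklearn.model_selection import train_test_split

    def split_by_sqrt_p_rule(X, y, n_params, random_state=0):
        # Training fraction implied by a sqrt(p):1 training/testing ratio,
        # where n_params is the number of parameters in a linear regression
        # model that explains the data well.
        train_fraction = np.sqrt(n_params) / (np.sqrt(n_params) + 1.0)
        return train_test_split(X, y, train_size=train_fraction,
                                random_state=random_state)

    # Example: p = 9 gives sqrt(9):1 = 3:1, i.e. 75% training / 25% testing.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    y = X @ rng.normal(size=8) + rng.normal(size=200)
    X_train, X_test, y_train, y_test = split_by_sqrt_p_rule(X, y, n_params=9)
    print(len(X_train), len(X_test))   # roughly 150 and 50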

 
Award ID(s):
1921646 1921873
PAR ID:
10445061
Author(s) / Creator(s):
 
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume:
15
Issue:
4
ISSN:
1932-1864
Page Range / eLocation ID:
p. 531-538
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Person Re-IDentification (P-RID), as an instance-level recognition problem, remains challenging in the computer vision community. Many P-RID works aim to learn faithful and discriminative features/metrics from offline training data and apply them directly to unseen online testing data. However, their performance is largely limited by the severe data shift between training and testing data. We therefore propose an online joint multi-metric adaptation model that adapts offline-learned P-RID models to the online data by learning a series of metrics for all the sharing-subsets. Each sharing-subset is obtained from the proposed novel frequent sharing-subset mining module and contains a group of testing samples that share strong visual similarity with each other. Unlike existing online P-RID methods, our model takes both the sample-specific discriminant and the set-based visual similarity among testing samples into consideration, so that the adapted metrics can jointly refine the discriminant of all the given testing samples via a multi-kernel late fusion framework. Our model is generally applicable to any offline-learned P-RID baseline for online boosting; the resulting performance improvement is verified by extensive experiments on several widely used P-RID benchmarks (CUHK03, Market1501, DukeMTMC-reID, and MSMT17) with state-of-the-art P-RID baselines, and is further supported by in-depth theoretical analyses.
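    As a rough illustration of the sharing-subset idea only (not the paper's mining algorithm), the sketch below groups testing samples whose pairwise cosine similarity exceeds a threshold; the threshold, feature size, and grouping rule are all assumptions:

        import numpy as np
        from scipy.sparse import csr_matrix
        from scipy.sparse.csgraph import connected_components
        from sklearn.metrics.pairwise import cosine_similarity

        def naive_sharing_subsets(features, threshold=0.8):
            # Group testing samples that are strongly similar to each other:
            # samples linked by high cosine similarity fall in the same group.
            sim = cosine_similarity(features)
            adjacency = csr_matrix(sim > threshold)
            n_groups, labels = connected_components(adjacency, directed=False)
            return [np.where(labels == g)[0] for g in range(n_groups)]

        # Toy usage: six random feature vectors grouped into similarity subsets.
        rng = np.random.default_rng(0)
        print(naive_sharing_subsets(rng.normal(size=(6, 128))))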
  2. ABSTRACT

    In this paper, we introduce a novel data augmentation methodology based on Conditional Progressive Generative Adversarial Networks (CPGAN) to generate diverse black hole (BH) images, accounting for variations in spin and electron temperature prescriptions. These generated images are valuable resources for training deep learning algorithms to accurately estimate black hole parameters from observational data. Our model can generate BH images for any spin value within the range of [−1, 1], given an electron temperature distribution. To validate the effectiveness of our approach, we employ a convolutional neural network to predict the BH spin using both the GRMHD images and the images generated by our proposed model. Our results demonstrate a significant performance improvement when training is conducted with the augmented data set while testing is performed using GRMHD-simulated data, as indicated by the high R² score. Consequently, we propose that GANs can be employed as cost-effective models for black hole image generation and reliably augment training data sets for other parametrization algorithms.
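    A hedged sketch of the evaluation setup described above: fit a regressor on augmented training images, test on held-out simulated images, and report the R² score. The simple regressor, image size, and random stand-in arrays are illustrative assumptions, not the paper's CNN or data:

        import numpy as np
        from sklearn.neural_network import MLPRegressor
        from sklearn.metrics import r2_score

        rng = np.random.default_rng(1)
        # Stand-ins: GAN-augmented training images and GRMHD-simulated test
        # images, each flattened to one feature vector per image.
        X_train = rng.random((500, 32 * 32))
        y_train = rng.uniform(-1, 1, 500)      # spin labels in [-1, 1]
        X_test = rng.random((100, 32 * 32))
        y_test = rng.uniform(-1, 1, 100)

        # A small multilayer perceptron stands in for the convolutional network.
        model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300,
                             random_state=0).fit(X_train, y_train)
        print("R2 on simulated test data:", r2_score(y_test, model.predict(X_test)))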

     
  3. Abstract

    The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, e.g. 72–74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range, and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this ‘middle exclusion’ protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
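    A hedged sketch contrasting the two protocols described above: filtering out middle-range genes before splitting versus scoring on unfiltered genes. The synthetic data, threshold, and classifier are assumptions chosen only to make the contrast visible:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        # Stand-in data: 20 features per gene plus a cytoplasmic/nuclear log-ratio.
        X = rng.normal(size=(2000, 20))
        log_ratio = X[:, 0] + rng.normal(scale=2.0, size=2000)
        y = (log_ratio > 0).astype(int)            # 1 = cytoplasmic, 0 = nuclear

        # "Middle exclusion" protocol: keep only extreme-ratio genes, then split.
        extreme = np.abs(log_ratio) > 2.0
        X_tr, X_te, y_tr, y_te = train_test_split(X[extreme], y[extreme],
                                                  random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print("Accuracy on extreme-only test genes:", clf.score(X_te, y_te))

        # Benchmark-style check: the same model scored on the middle-range genes
        # that the filter removed (never seen during training); accuracy drops.
        middle = ~extreme
        print("Accuracy on middle-range genes:", clf.score(X[middle], y[middle]))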

     
  4. The investigation of brain health development is paramount, as a healthy brain underpins cognitive and physical well-being, and mitigates cognitive decline, neurodegenerative diseases, and mental health disorders. This study leverages the UK Biobank dataset containing static functional network connectivity (sFNC) data derived from resting-state functional magnetic resonance imaging (rs-fMRI) and assessment data. We introduce a novel approach to forecasting a brain health index (BHI) by deploying three distinct models, each capitalizing on different modalities for training and testing. The first model exclusively employs psychological assessment measures, while the second model harnesses both neuroimaging and assessment data for training but relies solely on assessment data during testing. The third model encompasses a holistic strategy, utilizing neuroimaging and assessment data for the training and testing phases. The proposed models employ a two-step approach for calculating the BHI. In the first step, the input data is subjected to dimensionality reduction using principal component analysis (PCA) to identify critical patterns and extract relevant features. The resultant concatenated feature vector is then utilized as input to a variational autoencoder (VAE). This network generates a low-dimensional representation of the input data that is used to calculate the BHI in new subjects without requiring imaging data. The results suggest that incorporating neuroimaging data into the BHI model, even when predicting from assessments alone, enhances its ability to accurately evaluate brain health. The VAE model exemplifies this improvement by reconstructing the sFNC matrix more accurately than the assessment data. Moreover, these BHI models also enable us to identify distinct behavioral and neural patterns. Hence, this approach lays the foundation for larger-scale efforts to monitor and enhance brain health, aiming to build resilient brain systems.
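    A hedged sketch of the two-step pipeline described above: PCA on each modality, concatenation, then a low-dimensional representation learned by an autoencoder-style model. The array shapes, component counts, and the plain autoencoder standing in for the VAE are all illustrative assumptions:

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        # Stand-in inputs per subject: flattened sFNC values and assessment scores.
        sfnc = rng.normal(size=(300, 1378))
        assessments = rng.normal(size=(300, 20))

        # Step 1: PCA on each modality, then concatenate the components.
        features = np.hstack([PCA(n_components=30).fit_transform(sfnc),
                              PCA(n_components=10).fit_transform(assessments)])

        # Step 2: learn a low-dimensional representation. A plain autoencoder
        # (the input reconstructed through a narrow hidden layer) stands in for
        # the VAE; a summary of that bottleneck would play the role of the BHI.
        autoencoder = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500,
                                   random_state=0).fit(features, features)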

     
  5. ABSTRACT

    We present a machine learning (ML) approach for the prediction of galaxies’ dark matter halo masses which achieves an improved performance over conventional methods. We train three ML algorithms (XGBoost, random forests, and neural network) to predict halo masses using a set of synthetic galaxy catalogues that are built by populating dark matter haloes in N-body simulations with galaxies and that match both the clustering and the joint distributions of properties of galaxies in the Sloan Digital Sky Survey (SDSS). We explore the correlation of different galaxy- and group-related properties with halo mass, and extract the set of nine features that contribute the most to the prediction of halo mass. We find that mass predictions from the ML algorithms are more accurate than those from halo abundance matching (HAM) or dynamical mass estimates (DYN). Since the danger of this approach is that our training data might not accurately represent the real Universe, we explore the effect of testing the model on synthetic catalogues built with different assumptions than the ones used in the training phase. We test a variety of models with different ways of populating dark matter haloes, such as adding velocity bias for satellite galaxies. We determine that, though training and testing on different data can lead to systematic errors in predicted masses, the ML approach still yields substantially better masses than either HAM or DYN. Finally, we apply the trained model to a galaxy and group catalogue from the SDSS DR7 and present the resulting halo masses.
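    A hedged sketch of the cross-catalogue test described above: train a tree-based regressor on one synthetic catalogue and evaluate it on a catalogue built with different assumptions. The nine mock features, the offset standing in for, e.g., satellite velocity bias, and the choice of RandomForestRegressor are illustrative only:

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_squared_error

        rng = np.random.default_rng(0)

        def mock_catalogue(n, shift=0.0):
            # Toy catalogue: nine galaxy/group features and a log halo mass.
            X = rng.normal(size=(n, 9))
            log_mass = (12.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + shift
                        + rng.normal(scale=0.2, size=n))
            return X, log_mass

        # Train on one set of population assumptions...
        X_train, y_train = mock_catalogue(5000)
        model = RandomForestRegressor(n_estimators=200,
                                      random_state=0).fit(X_train, y_train)

        # ...and test on a catalogue built with slightly different assumptions.
        X_test, y_test = mock_catalogue(1000, shift=0.1)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        print("RMSE (dex) on the shifted catalogue:", rmse)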

     