In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.
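The core idea — two sets in which every point has a statistically similar counterpart — can be illustrated with a naive greedy pairing. The sketch below is our own simplified illustration for the r = 2 case (an O(n²) loop), not the authors' optimized algorithm, whose speed comes from a far more efficient implementation.

```python
import numpy as np

def naive_twin_split(X, seed=0):
    """Greedy sketch of twinning for r = 2: repeatedly pair each
    unassigned point with its nearest unassigned neighbor and send
    one point of each pair to each twin set. Illustration only."""
    rng = np.random.default_rng(seed)
    n = len(X)
    unassigned = set(range(n))
    twin_a, twin_b = [], []
    current = rng.integers(n)                  # arbitrary starting point
    while len(unassigned) > 1:
        unassigned.discard(current)
        rest = np.array(sorted(unassigned))
        d = np.linalg.norm(X[rest] - X[current], axis=1)
        nearest = rest[np.argmin(d)]           # closest unassigned point
        unassigned.discard(nearest)
        twin_a.append(current)
        twin_b.append(nearest)
        if unassigned:                         # continue from the point
            rest = np.array(sorted(unassigned))  # nearest to the pair
            d = np.linalg.norm(X[rest] - X[nearest], axis=1)
            current = rest[np.argmin(d)]
    if unassigned:                             # odd n: one leftover point
        twin_b.append(unassigned.pop())
    return np.array(twin_a), np.array(twin_b)

X = np.random.default_rng(1).normal(size=(100, 3))
a, b = naive_twin_split(X)
print(len(a), len(b))  # 50 50
```

Because the two sets are built from nearest-neighbor pairs, their empirical distributions track each other closely, which is what makes twin sets useful as training/testing splits or cross-validation folds.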
Joseph, V. Roshan; Vakayil, Akhil
(Technometrics)
In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
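The "sequential nearest neighbor" step mentioned above — matching a set of target points to dataset rows without replacement — can be sketched as follows. In SPlit the targets would be support points of the data distribution; here they are random stand-ins, used only to show the matching mechanics, and the function name is ours.

```python
import numpy as np

def nearest_neighbor_subsample(X, targets):
    """Sequentially assign each target point the nearest
    still-unclaimed row of X (sampling without replacement)."""
    remaining = list(range(len(X)))
    chosen = []
    for t in targets:
        d = np.linalg.norm(X[remaining] - t, axis=1)
        chosen.append(remaining.pop(int(np.argmin(d))))
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
targets = rng.normal(size=(40, 2))   # stand-ins for support points
test_idx = nearest_neighbor_subsample(X, targets)
train_idx = sorted(set(range(200)) - set(test_idx))
print(len(test_idx), len(train_idx))  # 40 160
```

Popping each matched row guarantees the subsample consists of distinct dataset rows, one per target.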
Petti, Samantha; Eddy, Sean R.
(PLOS Computational Biology)
Kann, Maricel G
(Ed.)
Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
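One simple way to guarantee the cross-set identity bound described above is to single-linkage cluster the sequences at the identity threshold and assign whole clusters to train or test. The sketch below is a simplified stand-in for the paper's independent-set algorithms, not a reimplementation of them; the function and parameter names are ours.

```python
import numpy as np

def cluster_split(identity, p, test_frac=0.3):
    """Single-linkage clustering at identity threshold p, then whole
    clusters assigned to train or test, so any two sequences in
    different sets share < p identity."""
    n = len(identity)
    parent = list(range(n))
    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):                 # link pairs at/above threshold
        for j in range(i + 1, n):
            if identity[i][j] >= p:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    train, test = [], []
    for members in sorted(clusters.values(), key=len, reverse=True):
        (train if len(train) < (1 - test_frac) * n else test).extend(members)
    return train, test

# toy symmetric identity matrix: 0-1 similar, 2-3 similar, 4 alone
I = np.array([[1.0, .9, .1, .1, .2],
              [.9, 1.0, .2, .1, .1],
              [.1, .2, 1.0, .8, .1],
              [.1, .1, .8, 1.0, .2],
              [.2, .1, .1, .2, 1.0]])
train, test = cluster_split(I, p=0.5)
print(sorted(train), sorted(test))  # [0, 1, 2, 3] [4]
```

Because clusters never straddle the split, every train/test sequence pair falls below the threshold by construction; the trade-off is less control over the exact split ratio.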
Magnesium (Mg) alloys are promising lightweight structural materials whose limited strength and room-temperature ductility restrict their applications. Precise control of deformation-induced twinning through microstructural alloy design is being investigated to overcome these deficiencies. Motivated by the need to understand and control twin formation during deformation in Mg alloys, a series of magnesium-yttrium (Mg-Y) alloys are investigated using electron backscatter diffraction (EBSD). Analysis of EBSD maps produces a large dataset of microstructural information for >40,000 grains. To quantitatively determine how processing parameters and microstructural features are correlated with twin formation, interpretable machine learning (ML) is employed to statistically analyze the individual effects of microstructural features on twinning. An ML classifier is trained to predict the likelihood of twin formation, given inputs including grain microstructural information and synthesis and deformation conditions. Then, feature selection is used to score the relative importance of these inputs for twinning in Mg-Y alloys. It is determined that, using information only about grain size, grain orientation, and total applied strain, the ML model can predict the presence of twinning, and that other parameters do not significantly contribute to increasing the model's predictive accuracy. Herein, the utility of ML for gaining new fundamental insights into materials processing is illustrated.
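The workflow of training a classifier and then scoring feature relevance can be sketched in miniature. Everything below is fabricated for illustration — the synthetic features, coefficients, and the crude importance score (absolute standardized logistic-regression weight) are ours, not the paper's data, model, or feature-selection method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# synthetic stand-ins for per-grain features (all invented)
grain_size  = rng.lognormal(3.0, 0.5, n)
orientation = rng.uniform(0, 90, n)     # degrees from loading axis
strain      = rng.uniform(0, 0.1, n)    # total applied strain
noise_feat  = rng.normal(size=n)        # deliberately irrelevant feature

# twinning label made to depend on size, orientation, and strain only
logits = 0.05 * grain_size + 0.03 * orientation + 40 * strain - 5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)

X = np.column_stack([grain_size, orientation, strain, noise_feat])
Xs = (X - X.mean(0)) / X.std(0)         # standardize so weights are comparable
w, b = np.zeros(4), 0.0
for _ in range(500):                    # plain gradient descent on log-loss
    p = 1 / (1 + np.exp(-(Xs @ w + b)))
    w -= 0.1 * Xs.T @ (p - y) / n
    b -= 0.1 * (p - y).mean()

importance = np.abs(w)                  # |standardized weight| as a crude score
print(importance.round(2))              # the noise feature scores lowest
```

The irrelevant feature receives a near-zero weight while the three informative ones do not — the same qualitative conclusion the paper reaches for its microstructural inputs, albeit with interpretable-ML machinery rather than this toy.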
Self-supervised learning has shown great promise because of its ability to train deep learning (DL) magnetic resonance imaging (MRI) reconstruction methods without fully sampled data. Current self-supervised learning methods for physics-guided reconstruction networks split acquired undersampled data into two disjoint sets, where one is used for data consistency (DC) in the unrolled network, while the other is used to define the training loss. In this study, we propose an improved self-supervised learning strategy that more efficiently uses the acquired data to train a physics-guided reconstruction network without a database of fully sampled data. The proposed multi-mask self-supervised learning via data undersampling (SSDU) applies a holdout masking operation on the acquired measurements to split them into multiple pairs of disjoint sets for each training sample, using one set of each pair for the DC units and the other to define the loss, thereby more efficiently using the undersampled data. Multi-mask SSDU is applied to fully sampled 3D knee and prospectively undersampled 3D brain MRI datasets, for various acceleration rates and patterns, and compared with the parallel imaging method, CG-SENSE, and single-mask SSDU DL-MRI, as well as supervised DL-MRI when fully sampled data are available. The results on knee MRI show that the proposed multi-mask SSDU outperforms SSDU and performs as well as supervised DL-MRI. A clinical reader study further ranks the multi-mask SSDU higher than supervised DL-MRI in terms of signal-to-noise ratio and aliasing artifacts. Results on brain MRI show that multi-mask SSDU achieves better reconstruction quality compared with SSDU. The reader study demonstrates that multi-mask SSDU at R = 8 significantly improves reconstruction compared with single-mask SSDU at R = 8, as well as CG-SENSE at R = 2.
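The multi-mask holdout splitting itself is simple to sketch: for each of several masks, hold out a fraction of the acquired k-space locations for the loss and keep the rest for data consistency. The function, parameter names, and the holdout fraction below are our own illustrative choices, not the paper's settings.

```python
import numpy as np

def multi_mask_split(acquired_idx, n_masks, rho=0.4, seed=0):
    """Draw n_masks disjoint (DC, loss) index pairs from the acquired
    k-space locations; each pair partitions the acquired set."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_masks):
        loss_set = rng.choice(acquired_idx,
                              size=int(rho * len(acquired_idx)),
                              replace=False)
        dc_set = np.setdiff1d(acquired_idx, loss_set)
        pairs.append((dc_set, loss_set))
    return pairs

acquired = np.arange(1000)            # stand-in for acquired k-space locations
pairs = multi_mask_split(acquired, n_masks=4)
for dc, loss in pairs:
    print(len(dc), len(loss))         # 600 400, disjoint, for every mask
```

Each measurement thus contributes to the loss under some masks and to data consistency under others, which is the sense in which the acquired data are used more efficiently than with a single fixed split.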
Zhu, Yicheng; Wu, Xiaoyi; Tan, Sylvia; Sun, Cuiling; Saha, Sulagna; Su, Yun-Hsuan; Huang, Kevin
(2024 IEEE International Conference on Advanced Intelligent Mechatronics (AIM))
This paper proposes a modified method for training tool segmentation networks for endoscopic images by parsing training images into two disjoint sets: one for rectangular representations of endoscopic images and one for polar. Previous work [1], [2] demonstrated that certain endoscopic images may be better segmented by a U-Net network trained on the original rectangular representation of images alone, and others performed better with polar representations. This work extends that observation to the training images and seeks to intelligently decompose the aggregate training data into disjoint image sets — one ideal for training a network to segment original, rectangular endoscopic images and the other for training a polar segmentation network. The training set decomposition consists of three stages: (1) initial data split and models, (2) image reallocation and transition mechanisms with retraining, and (3) evaluation. In (2), two separate frameworks for parsing polar vs. rectangular training images were investigated, with three switching metrics utilized in both. Experiments comparatively evaluated the segmentation performance (via Sørensen-Dice coefficient) of the in-group and out-of-group images between the set-decomposed models. Results are encouraging, showing improved aggregate in-group Dice scores as well as image sets trending towards convergence.
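The reallocation step in stage (2) amounts to moving each image to the set whose model segments it better. The switching metric below — Dice difference with a small margin to discourage oscillation — is only one plausible choice (the paper investigates three, unspecified here), and all names are ours.

```python
def reallocate(images, dice_rect, dice_polar, margin=0.02):
    """Assign each image to the rectangular or polar training set
    according to which model scores the higher Dice on it, keeping
    an image in the rectangular set when the gap is within margin."""
    rect_set, polar_set = [], []
    for img in images:
        gap = dice_rect[img] - dice_polar[img]
        (rect_set if gap >= -margin else polar_set).append(img)
    return rect_set, polar_set

# toy per-image Dice scores from the two models
dice_rect  = {"a": 0.91, "b": 0.70, "c": 0.85}
dice_polar = {"a": 0.80, "b": 0.88, "c": 0.86}
r, p = reallocate(["a", "b", "c"], dice_rect, dice_polar)
print(r, p)  # ['a', 'c'] ['b']
```

Iterating reallocation with retraining is what drives the two image sets toward the convergence reported in the abstract.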
Vakayil, Akhil, and V. Roshan Joseph. "Data Twinning." Statistical Analysis and Data Mining: The ASA Data Science Journal. https://par.nsf.gov/biblio/10353689. doi:10.1002/sam.11574.
@article{osti_10353689,
title = {Data Twinning},
url = {https://par.nsf.gov/biblio/10353689},
DOI = {10.1002/sam.11574},
abstractNote = {In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.},
journal = {Statistical Analysis and Data Mining: The ASA Data Science Journal},
author = {Vakayil, Akhil and Joseph, V. Roshan},
}