Abstract. This study investigates the inability of two popular data splitting techniques, train/test split and k-fold cross-validation, to create training and validation data sets that achieve sufficient generality for supervised deep learning (DL) methods. This failure stems mainly from their limited ability to generate new data. In response, we turn to the bootstrap, a computer-based statistical resampling method that has been used efficiently to estimate the distribution of a sample estimator and to assess a model without knowledge of the underlying population. This paper couples cross-validation and the bootstrap to combine their respective advantages as data generation strategies and to achieve better generalization of a DL model. The contributions are: (i) an algorithm for better selection of training and validation data sets, (ii) an exploration of the bootstrap's potential for drawing statistical inference on key performance metrics (e.g., mean square error), and (iii) a method that can assess and improve the efficiency of a DL model. The proposed method is applied to semantic segmentation and demonstrated with a DL-based classification algorithm, PointNet, on aerial laser scanning point cloud data.
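The paper's exact coupling algorithm is given in the article itself; the sketch below is only a minimal illustration of the general idea, with a scikit-learn regressor standing in for a DL network: each cross-validation training fold is bootstrap-resampled so that every fold yields a distribution of the performance metric (e.g., MSE) rather than a single point estimate.

```python
# Minimal illustrative sketch (not the authors' exact algorithm):
# bootstrap-resample each cross-validation training fold so the metric
# becomes a distribution rather than a single number.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression  # stand-in for a DL model
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

n_boot = 50
mse_samples = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for _ in range(n_boot):
        # resample the training fold with replacement (the bootstrap step)
        boot = rng.choice(train_idx, size=len(train_idx), replace=True)
        model = LinearRegression().fit(X[boot], y[boot])
        mse_samples.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

mse_samples = np.asarray(mse_samples)
# the pooled bootstrap distribution supports inference on the metric itself
print(f"MSE mean: {mse_samples.mean():.4f}")
print(f"95% interval: {np.percentile(mse_samples, [2.5, 97.5]).round(4)}")
```

Percentiles of the pooled bootstrap distribution then give interval estimates for the metric without further distributional assumptions.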
- Award ID(s):
- 1940145
- NSF-PAR ID:
- 10522677
- Publisher / Repository:
- ISPRS Archives: https://isprs-archives.copernicus.org/articles/XLVIII-4-W3-2022/111/2022/isprs-archives-XLVIII-4-W3-2022-111-2022.pdf
- Date Published:
- 2022
- Journal Name:
- International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
- Volume:
- XLVIII-4/W3-2022
- ISSN:
- 2194-9034
- Page Range / eLocation ID:
- 111 to 118
- Subject(s) / Keyword(s):
- Lidar; laser scanning; machine learning; object detection; segmentation
- Format(s):
- PDF
- Size(s):
- 2 MB
- Sponsoring Org:
- National Science Foundation
More Like this
-
Neural Architecture Search (NAS) is a popular method for automatically designing optimized architectures for high-performance deep learning. In this approach, it is common to use bilevel optimization, where one optimizes the model weights over the training data (lower-level problem) and various hyperparameters, such as the configuration of the architecture, over the validation data (upper-level problem); a toy sketch of this setup appears after the list below. This paper explores the statistical aspects of such problems with train-validation splits. In practice, the lower-level problem is often overparameterized and can easily achieve zero loss. Thus, a priori it seems impossible to distinguish the right hyperparameters based on training loss alone, which motivates a better understanding of the role of the train-validation split. To this aim, this work establishes the following results:
• We show that refined properties of the validation loss, such as risk and hyper-gradients, are indicative of those of the true test loss. This reveals that the upper-level problem helps select the most generalizable model and prevent overfitting with a near-minimal validation sample size. Importantly, this is established for continuous spaces, which are highly relevant for popular differentiable search schemes.
• We establish generalization bounds for NAS problems with an emphasis on an activation search problem. When optimized with gradient descent, we show that the train-validation procedure returns the best (model, architecture) pair even if all architectures can perfectly fit the training data to achieve zero error.
• Finally, we highlight rigorous connections between NAS, multiple kernel learning, and low-rank matrix learning. The latter leads to novel algorithmic insights, where the solution of the upper problem can be accurately learned via efficient spectral methods to achieve near-minimal risk.
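As a hedged toy sketch of the bilevel train-validation setup (not the paper's algorithm), the code below fits overparameterized weights on the training split at the lower level and selects a hyperparameter by validation loss at the upper level; the ridge penalty `lam` stands in for an architecture knob, and all names and sizes are illustrative assumptions.

```python
# Hedged toy sketch of bilevel train-validation selection (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n, d = 90, 120                        # d > n_train: lower level is overparameterized
X = rng.normal(size=(n, d))
y = X[:, :10] @ rng.normal(size=10) + rng.normal(scale=0.3, size=n)

X_tr, y_tr = X[:60], y[:60]           # lower-level (training) split
X_val, y_val = X[60:], y[60:]         # upper-level (validation) split

def fit_lower(lam):
    """Lower level: ridge weights at hyperparameter lam (interpolates as lam -> 0)."""
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

# Upper level: pick the hyperparameter that minimizes the validation loss,
# since training loss is near zero for every choice and cannot discriminate.
lams = np.logspace(-4, 2, 25)
val_losses = [np.mean((X_val @ fit_lower(l) - y_val) ** 2) for l in lams]
print(f"selected lam: {lams[int(np.argmin(val_losses))]:.5f}")
```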
-
Abstract In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross-validation.
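To make the goal of "statistically similar twin sets" concrete, here is a hedged, naive sketch; it is not the Twinning or SPlit algorithm, only the underlying intuition of sending one member of each nearest-neighbor pair to each set.

```python
# Hedged illustration only: a naive nearest-neighbor pairing heuristic for
# splitting data into two statistically similar halves. This is NOT the
# Twinning or SPlit algorithm from the paper, just the basic intuition.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))

remaining = list(range(len(X)))
set_a, set_b = [], []
while len(remaining) >= 2:           # an odd leftover point is simply skipped
    tree = cKDTree(X[remaining])
    # query k=2: index 0 is the point itself, index 1 its nearest neighbor
    _, nn = tree.query(X[remaining[0]], k=2)
    i, j = remaining[0], remaining[nn[1]]
    set_a.append(i)                  # one twin per set, so the sets mirror each other
    set_b.append(j)
    remaining = [r for r in remaining if r not in (i, j)]

# the two halves should now have similar marginal statistics
print(np.abs(X[set_a].mean(axis=0) - X[set_b].mean(axis=0)).round(3))
```

Rebuilding the tree every iteration is deliberately simple and slow; the appeal of the actual Twinning algorithm, per the abstract, is doing this kind of partitioning orders of magnitude faster.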
-
Summary We derive non-parametric confidence intervals for the eigenvalues of the Hessian at modes of a density estimate. This provides information about the strength and shape of modes and can also be used as a significance test. We use a data-splitting approach in which potential modes are identified by using the first half of the data and inference is done with the second half of the data. To obtain valid confidence sets for the eigenvalues, we use a bootstrap based on an elementary symmetric polynomial transformation. This leads to valid bootstrap confidence sets regardless of any multiplicities in the eigenvalues. We also suggest a new method for bandwidth selection, namely choosing the bandwidth to maximize the number of significant modes. We show by example that this method works well. Even when the true distribution is singular, and hence does not have a density (in which case cross-validation chooses a zero bandwidth), our method chooses a reasonable bandwidth.
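The following hedged sketch illustrates the bootstrap-with-elementary-symmetric-polynomials idea in a simplified setting: the Hessian of a kernel density estimate at a mode is replaced by a toy sample covariance matrix, so the code shows only the transformation and percentile intervals, not the paper's full estimator.

```python
# Hedged, simplified sketch: bootstrap a symmetric matrix estimate, map each
# replicate's eigenvalues through the elementary symmetric polynomials
# (smooth even under eigenvalue multiplicities), and form percentile intervals.
import numpy as np
from itertools import combinations

def elementary_symmetric(eigs):
    """e_k(lambda_1..lambda_d) for k = 1..d."""
    d = len(eigs)
    return np.array([sum(np.prod(c) for c in combinations(eigs, k))
                     for k in range(1, d + 1)])

rng = np.random.default_rng(3)
data = rng.multivariate_normal(np.zeros(3), np.diag([3.0, 1.0, 1.0]), size=300)

boots = []
for _ in range(500):
    idx = rng.integers(0, len(data), len(data))   # resample with replacement
    H = np.cov(data[idx].T)                       # toy stand-in for the Hessian
    boots.append(elementary_symmetric(np.linalg.eigvalsh(H)))

lo, hi = np.percentile(np.asarray(boots), [2.5, 97.5], axis=0)
print("95% intervals for e_1..e_3:", list(zip(lo.round(2), hi.round(2))))
```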
-
Abstract Deep learning (DL) models trained on hydrologic observations can perform extraordinarily well, but they can inherit deficiencies of the training data, such as limited coverage of in situ data or low resolution/accuracy of satellite data. Here we propose a novel multiscale DL scheme learning simultaneously from satellite and in situ data to predict 9 km daily soil moisture (5 cm depth). Based on spatial cross-validation over sites in the conterminous United States, the multiscale scheme obtained a median correlation of 0.901 and a root-mean-square error of 0.034 m³/m³. It outperformed the Soil Moisture Active Passive satellite mission's 9 km product, DL models trained on in situ data alone, and land surface models. Our 9 km product showed better accuracy than previous 1 km satellite downscaling products, highlighting the limited impact of improving resolution. Not only is our product useful for planning against floods, droughts, and pests, but our scheme is also generically applicable to geoscientific domains with data on multiple scales, breaking the confines of individual data sets.
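As a hedged, schematic illustration of learning simultaneously from two data sources (not the paper's architecture or data), the sketch below evaluates a joint loss that sums one term per supervision source; all shapes, weights, and the tiny MLP are invented for the example.

```python
# Hedged schematic sketch of a multiscale joint loss (illustrative only):
# one model is supervised by coarse satellite retrievals and sparse in situ
# measurements at the same time, with one loss term per data source.
import numpy as np

rng = np.random.default_rng(4)
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 1))

def forward(x):
    # tiny illustrative MLP mapping predictors to a soil-moisture estimate
    return np.maximum(x @ W1, 0.0) @ W2

# two supervision sources at different scales (all data invented here)
x_sat, y_sat = rng.normal(size=(64, 8)), rng.uniform(0, 0.5, size=(64, 1))
x_situ, y_situ = rng.normal(size=(16, 8)), rng.uniform(0, 0.5, size=(16, 1))

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# joint loss: one term per data source, weighted by assumed reliability
loss = 0.5 * mse(forward(x_sat), y_sat) + 1.0 * mse(forward(x_situ), y_situ)
print(f"joint multiscale loss: {loss:.4f}")
```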