

Search for: All records

Creators/Authors contains: "Vakayil, Akhil"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. In this work, we propose a novel framework for large-scale Gaussian process (GP) modeling. In contrast to the purely global and purely local approximations proposed in the literature to address the computational bottleneck of exact GP modeling, we employ a combined global-local approach in building the approximation. Our framework uses a subset-of-data approach where the subset is the union of a set of global points, designed to capture the global trend in the data, and a set of local points specific to a given testing location, capturing the local trend around that location. The correlation function is likewise modeled as a combination of a global and a local kernel. The predictive performance of our framework, which we refer to as TwinGP, is comparable to state-of-the-art GP modeling methods, but at a fraction of their computational cost.
    Free, publicly-accessible full text available April 2, 2025
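    The global-local subset idea above can be sketched in a few lines. This is an illustrative stand-in, not the TwinGP method itself: the global points are drawn at random rather than via a space-filling design, a single squared-exponential kernel replaces the combined global-local kernel, and the function name and parameters (`subset_gp_predict`, `n_global`, `n_local`) are invented for the example.

    ```python
    import numpy as np

    def rbf(A, B, lengthscale):
        """Squared-exponential kernel between row sets A (m, d) and B (n, d)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def subset_gp_predict(X, y, x_star, n_global=20, n_local=10,
                          lengthscale=0.2, noise=1e-4, rng=None):
        rng = np.random.default_rng(rng)
        # Global points: a random subset standing in for a space-filling design.
        g_idx = rng.choice(len(X), size=min(n_global, len(X)), replace=False)
        # Local points: nearest neighbours of the testing location x_star.
        dist = np.linalg.norm(X - x_star, axis=1)
        l_idx = np.argsort(dist)[:n_local]
        # Fit an exact GP on the union of the two subsets only.
        idx = np.unique(np.concatenate([g_idx, l_idx]))
        Xs, ys = X[idx], y[idx]
        K = rbf(Xs, Xs, lengthscale) + noise * np.eye(len(Xs))
        k_star = rbf(x_star[None, :], Xs, lengthscale)
        alpha = np.linalg.solve(K, ys)
        return (k_star @ alpha).item()
    ```

    Because the GP is refit on a small subset per testing location, the cubic cost of exact GP inference applies only to `n_global + n_local` points rather than the full dataset.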
  2. In this article, we propose an optimal method, referred to as SPlit, for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to handle categorical variables so that SPlit can be applied to both regression and classification problems. Applying SPlit to real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
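    The splitting step can be illustrated with a toy sketch. Note the stand-ins: k-means centroids are used as a crude substitute for actual support points (computing support points requires its own optimization), while the sequential nearest-neighbor assignment mirrors the subsampling idea described in the abstract. All function names here are hypothetical.

    ```python
    import numpy as np

    def simple_kmeans(X, k, iters=20, rng=None):
        """A tiny Lloyd's k-means; centroids stand in for support points."""
        rng = np.random.default_rng(rng)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d = ((X[:, None] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(1)
            for j in range(k):
                pts = X[labels == j]
                if len(pts):
                    centers[j] = pts.mean(0)
        return centers

    def split_dataset(X, test_frac=0.2, rng=None):
        """Assign each representative point its nearest unused data row,
        sequentially, to form the testing set; the rest is training."""
        k = max(1, int(round(test_frac * len(X))))
        reps = simple_kmeans(X, k, rng=rng)
        available = np.ones(len(X), bool)
        test_idx = []
        for r in reps:
            d = np.linalg.norm(X - r, axis=1)
            d[~available] = np.inf  # skip rows already taken
            i = int(d.argmin())
            test_idx.append(i)
            available[i] = False
        return np.where(available)[0], np.array(test_idx)
    ```

    The sequential (without-replacement) assignment is what keeps the testing set spread across the data rather than clustered, in the spirit of the representative-point idea.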
  3. Abstract

    In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and k-fold cross validation.

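    A minimal sketch of the twin-set idea, assuming a greedy nearest-neighbour pairing: each pair of close points contributes one member to each twin, so the two halves cover the same regions of the data. This is a simplification for illustration, not the actual Twinning algorithm, and `twin_split` is a hypothetical name.

    ```python
    import numpy as np

    def twin_split(X):
        """Greedily pair each point with its nearest unassigned neighbour
        and send one member of each pair to each twin set."""
        n = len(X)
        unassigned = set(range(n))
        twin_a, twin_b = [], []
        while len(unassigned) >= 2:
            i = min(unassigned)  # deterministic seed point
            unassigned.remove(i)
            rest = np.array(sorted(unassigned))
            d = np.linalg.norm(X[rest] - X[i], axis=1)
            j = int(rest[d.argmin()])
            unassigned.remove(j)
            twin_a.append(i)
            twin_b.append(j)
        if unassigned:  # odd n: leftover point goes to the first twin
            twin_a.append(unassigned.pop())
        return np.array(twin_a), np.array(twin_b)
    ```

    Because each twin receives one member of every nearest-neighbour pair, summary statistics of the two halves stay close, which is the property that makes such splits useful for cross validation and divide-and-conquer fitting.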