Coresets for Data-efficient Training of Machine Learning Models

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large-scale optimization in machine learning. Despite sustained effort to make IG methods more data-efficient, it remains an open question how to select a training-data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient, by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near-)optimal solution with the same convergence rate as IG on the full dataset for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.
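The selection step the abstract describes can be sketched as greedy maximization of a facility-location submodular function over pairwise similarities, with each chosen element weighted by the number of points it covers best. The sketch below is a simplified illustration, not the paper's exact procedure: the `features` array, the use of Euclidean distance as a stand-in for per-example gradient differences, and the function name `craig_greedy` are all assumptions for demonstration.

```python
import numpy as np

def craig_greedy(features, k):
    """Hypothetical sketch of CRAIG-style weighted coreset selection.

    Greedily maximizes a facility-location submodular function over
    pairwise similarities, then weights each selected point by the
    number of points it covers best.
    """
    n = len(features)
    # Pairwise Euclidean distances as a proxy for gradient differences.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    sim = d.max() - d          # similarities; larger means closer
    selected = []
    best = np.zeros(n)         # current best coverage of each point
    for _ in range(k):
        # Marginal gain of adding each candidate column j.
        gains = np.maximum(sim - best[:, None], 0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf   # never re-select a point
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    # Weight = how many points each chosen element covers best.
    assign = np.argmax(sim[:, selected], axis=1)
    weights = np.bincount(assign, minlength=k)
    return selected, weights
```

Because facility location is monotone submodular, this greedy rule carries the standard (1 - 1/e) approximation guarantee; the weights then scale each selected example's gradient so the weighted sum tracks the full gradient.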
Publication Date:
NSF-PAR ID: 10198855
Journal Name: International Conference on Machine Learning (ICML)
Sponsoring Org: National Science Foundation