Tensor decomposition is an effective approach to compress over-parameterized neural networks and to enable their deployment on resource-constrained hardware platforms. However, directly applying tensor compression in the training process is challenging due to the difficulty of choosing a proper tensor rank. To address this challenge, this paper proposes a low-rank Bayesian tensorized neural network. Our Bayesian method performs automatic model compression via adaptive tensor rank determination. We also present approaches for posterior density calculation and maximum a posteriori (MAP) estimation for the end-to-end training of our tensorized neural network. We provide experimental validation on a two-layer fully connected neural network, a 6-layer CNN, and a 110-layer residual neural network, where our work produces 7.4X to 137X more compact neural networks directly from training while achieving high prediction accuracy.
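The compression idea behind a tensorized layer can be illustrated with the simplest case, a rank-r matrix factorization of a fully connected layer's weights. This is a minimal sketch (the dimensions, rank, and variable names are illustrative assumptions, not the paper's actual architecture); the forward pass never materializes the full weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully connected layer with a 1024 x 1024 weight matrix (illustrative sizes).
d_in, d_out, rank = 1024, 1024, 8

# Low-rank factorization W ~= U @ V; only the factors are stored and trained.
U = rng.standard_normal((d_in, rank))
V = rng.standard_normal((rank, d_out))

full_params = d_in * d_out                    # parameters of the dense layer
compressed_params = rank * (d_in + d_out)     # parameters of the factors
compression_ratio = full_params / compressed_params

x = rng.standard_normal((32, d_in))           # a batch of 32 inputs
y = (x @ U) @ V                               # forward pass without forming W

print(round(compression_ratio, 1))  # 64.0
```

Choosing `rank` is exactly the difficulty the abstract points to: too small loses accuracy, too large loses compression, which is what motivates determining it automatically from a Bayesian prior during training.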
This content will become publicly available on January 7, 2023
General-Purpose Bayesian Tensor Learning With Automatic Rank Determination and Uncertainty Quantification
A major challenge in many machine learning tasks is that the model's expressive power depends on model size. Low-rank tensor methods are an efficient tool for handling the curse of dimensionality in many large-scale machine learning models. The major challenges in training a tensor learning model include how to process high-volume data, how to determine the tensor rank automatically, and how to estimate the uncertainty of the results. While existing tensor learning focuses on a specific task, this paper proposes a generic Bayesian framework that can be employed to solve a broad class of tensor learning problems such as tensor completion, tensor regression, and tensorized neural networks. We develop a low-rank tensor prior for automatic rank determination in nonlinear problems. Our method is implemented with both stochastic gradient Hamiltonian Monte Carlo (SGHMC) and Stein Variational Gradient Descent (SVGD). We compare the automatic rank determination and uncertainty quantification of these two solvers. We demonstrate that our proposed method can determine the tensor rank automatically and can quantify the uncertainty of the obtained results. We validate our framework on tensor completion tasks and tensorized neural network training tasks.
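A common way to realize a rank-determining low-rank prior is column-wise Gaussian shrinkage on the factor matrices, in the spirit of automatic relevance determination: each rank-one component gets its own precision, and a large precision prunes that component. The sketch below is an assumed illustration of this style of prior over CP factors, not the paper's exact density:

```python
import numpy as np

def lowrank_log_prior(factors, lam):
    """ARD-style log prior over CP factor matrices (illustrative sketch).

    factors: list of mode-factor matrices, each of shape (dim_k, R)
    lam:     per-component precisions, shape (R,); a large lam[r]
             shrinks component r toward zero, effectively pruning rank.
    """
    lp = 0.0
    for F in factors:
        # Independent N(0, 1/lam[r]) on every entry of column r.
        lp += 0.5 * F.shape[0] * np.sum(np.log(lam))     # normalization term
        lp -= 0.5 * np.sum(lam * np.sum(F ** 2, axis=0)) # quadratic penalty
    return lp

rng = np.random.default_rng(1)
factors = [rng.standard_normal((10, 4)) for _ in range(3)]  # 3-way tensor, R = 4
lam = np.ones(4)
print(lowrank_log_prior(factors, lam) < 0)  # True: pure quadratic penalty at lam = 1
```

A gradient-based sampler such as SGHMC or SVGD can then treat this log prior plus the data log-likelihood as the target, letting the precisions drive components to zero during training.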
Award ID(s): 1817037
Publication Date:
NSF-PAR ID: 10345686
Journal Name: Frontiers in Artificial Intelligence
Volume: 4
ISSN: 2624-8212
Sponsoring Org: National Science Foundation
More Like this


Recently, decentralized optimization has attracted much attention in machine learning because it is more communication-efficient than the centralized fashion. Quantization is a promising method to reduce the communication cost by cutting down the budget of each single communication using gradient compression. To further improve communication efficiency, some quantized decentralized algorithms have recently been studied. However, quantized decentralized algorithms for nonconvex constrained machine learning problems are still limited. The Frank-Wolfe (a.k.a. conditional gradient or projection-free) method is very efficient for solving many constrained optimization tasks, such as training low-rank or sparsity-constrained models. In this paper, to fill the gap in decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for nonconvex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. Thus, we propose the DQSFW algorithm with the gradient tracking technique to guarantee that the method converges safely to a stationary point of the nonconvex problem. In our theoretical analysis, we prove that to reach a stationary point our DQSFW algorithm achieves the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but has much lower communication cost. Experiments on …
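The projection-free idea at the heart of Frank-Wolfe is that each step calls a linear minimization oracle over the constraint set instead of a projection. A minimal single-machine sketch over an l1 ball (the problem, radius, and step schedule are illustrative assumptions, not DQSFW's decentralized quantized setting):

```python
import numpy as np

def fw_step(x, grad, radius, t):
    """One Frank-Wolfe step over the l1 ball {x : ||x||_1 <= radius}.

    The linear minimization oracle for the l1 ball returns the signed
    vertex along the largest-magnitude gradient coordinate.
    """
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros_like(x)
    s[i] = -radius * np.sign(grad[i])
    gamma = 2.0 / (t + 2.0)          # classic diminishing step-size schedule
    return (1 - gamma) * x + gamma * s

# Minimize f(x) = 0.5 ||x - b||^2 subject to ||x||_1 <= 1 (note ||b||_1 < 1).
b = np.array([0.8, -0.1, 0.05])
x = np.zeros(3)
for t in range(200):
    x = fw_step(x, x - b, radius=1.0, t=t)   # grad f(x) = x - b
print(float(np.linalg.norm(x - b)) < 0.3)    # True: iterates approach b
```

Every iterate stays feasible by construction, which is why Frank-Wolfe suits constraints (low-rank, sparsity) whose projections are expensive; the decentralized quantized variant in the abstract additionally tracks gradients across agents to survive compression error.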

Obeid, I.; Selesnik, I.; Picone, J. (Ed.) The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors, ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment: performance metrics such as error rates should be identical, and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) the same job run on the same processor should produce the same results each time it is run; (2) a job run on a CPU and GPU should produce …
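The first notion of replicability above, the same job producing the same numbers on repeated runs, reduces in practice to seeding every source of randomness. A toy sketch (the "experiment" is a stand-in function, not the cluster's actual training pipeline):

```python
import random

import numpy as np

def run_experiment(seed):
    """Toy stand-in for a training run: seeds all RNGs, returns a 'metric'."""
    random.seed(seed)
    np.random.seed(seed)
    data = np.random.standard_normal(1000)   # pretend this drives training
    return round(float(data.mean()), 6)

# Same seed, same processor: the result should reproduce exactly.
print(run_experiment(42) == run_experiment(42))  # True
```

Real deep learning frameworks need more than this (framework-level seeds, deterministic GPU kernels, fixed data-loading order), and CPU-versus-GPU runs can still differ in floating-point rounding, which is why the abstract asks only that those calculations "match closely".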

Learning nonlinear functions from input-output data pairs is one of the most fundamental problems in machine learning. Recent work has formulated the problem of learning a general nonlinear multivariate function of discrete inputs as a tensor completion problem with smooth latent factors. We build upon this idea and utilize two ensemble learning techniques to enhance its prediction accuracy. Ensemble methods can be divided into two main groups, parallel and sequential. Bagging, also known as bootstrap aggregation, is a parallel ensemble method where multiple base models are trained in parallel on different subsets of the data chosen randomly with replacement from the original training data. The outputs of these models are usually combined into a single prediction by averaging. One of the most popular bagging techniques is random forests. Boosting is a sequential ensemble method where a sequence of base models is fit sequentially to modified versions of the data. Popular boosting algorithms include AdaBoost and Gradient Boosting. We develop two approaches based on these ensemble learning techniques for learning multivariate functions using the Canonical Polyadic Decomposition. We showcase the effectiveness of the proposed ensemble models on several regression tasks and report significant improvements compared to …
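The bagging recipe described above (bootstrap resamples, one base model each, average the predictions) can be sketched on a toy regression problem. For brevity the base model here is a one-feature least-squares fit standing in for a CPD-based learner; the data, model, and bound are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

def fit_base_model(Xb, yb):
    # Base learner: least-squares slope (stand-in for a CPD regression model).
    return float(Xb[:, 0] @ yb / (Xb[:, 0] @ Xb[:, 0]))

# Bagging: train each base model on a bootstrap resample, then average.
slopes = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    slopes.append(fit_base_model(X[idx], y[idx]))

bagged_slope = float(np.mean(slopes))
print(abs(bagged_slope - 2.0) < 0.1)  # True: the ensemble recovers the trend
```

Boosting differs only in the loop structure: instead of independent resamples, each new base model is fit to a reweighted or residual version of the data produced by the models before it.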

Deep Neural Networks (DNNs) must constantly cope with distribution changes in the input data when the task of interest or the data collection protocol changes. Retraining a network from scratch to combat this issue poses a significant cost. Meta-learning aims to deliver an adaptive model that is sensitive to these underlying distribution changes, but requires many tasks during the meta-training process. In this paper, we propose a tAsk-auGmented actIve meta-LEarning (AGILE) method to efficiently adapt DNNs to new tasks using a small number of training examples. AGILE combines a meta-learning algorithm with a novel task augmentation technique, which we use to generate an initial adaptive model. It then uses Bayesian dropout uncertainty estimates to actively select the most difficult samples when updating the model for a new task. This allows AGILE to learn with fewer tasks and a few informative samples, achieving high performance with a limited dataset. We perform our experiments on the brain cell classification task and compare the results to a plain meta-learning model trained from scratch. We show that the proposed task-augmented meta-learning framework can learn to classify new cell types after a single gradient step with a limited number of training samples. We …
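The active-selection step described above, using dropout uncertainty to pick the hardest samples, is commonly realized with Monte Carlo dropout: keep dropout active at prediction time, sample several forward passes, and rank inputs by the spread of their predictions. A minimal sketch with a single linear "layer" (the model, sizes, and selection count are illustrative assumptions, not AGILE's network):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(X, w, n_samples=50, p_drop=0.5):
    """Monte Carlo dropout: sample stochastic forward passes to estimate
    per-input predictive mean and uncertainty (illustrative sketch)."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(w.shape) > p_drop          # random dropout mask
        preds.append(X @ (w * mask) / (1 - p_drop))  # rescaled forward pass
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

w = rng.standard_normal(8)          # toy model weights
X = rng.standard_normal((20, 8))    # 20 candidate samples from the new task
_, std = mc_dropout_predict(X, w)

# Actively select the 5 most uncertain (hardest) samples to update on.
hardest = np.argsort(std)[-5:]
print(len(hardest))  # 5
```

The model is then updated only on these high-uncertainty samples, which is what lets the approach get by with a few informative examples per new task.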