This paper addresses the design of accelerators using systolic architectures to train convolutional neural networks using a novel gradient interleaving approach. Training the neural network involves computation and backpropagation of gradients of error with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients on the same configurable systolic array. This results in the reuse of the variables from one computation to the other and eliminates unnecessary memory accesses and energy consumption associated with these memory accesses. The proposed approach leads to 1.4–2.2× savings in terms of the number of cycles and 1.9× savings in terms of memory accesses in the fully-connected layer. Furthermore, the proposed method uses up to 25% fewer cycles and memory accesses, and 16% less energy than baseline implementations for state-of-the-art CNNs. Under iso-area comparisons, for Inception-v4, compared to weight-stationary (WS), Intergrad achieves 12% savings in energy, 17% savings in memory, and 4% savings in cycles. Savings for DenseNet-264 are 18%, 26%, and 27% with respect to energy, memory, and cycles, respectively. Thus, the proposed novel accelerator architecture reduces the latency and energy consumption for training deep neural networks.
A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks
This paper addresses the design of accelerators using systolic architectures for training neural networks using a novel gradient interleaving approach. Training the neural network involves backpropagation of error and computation of gradients with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients on the same configurable systolic array. This results in reuse of the variables from one computation to the other and eliminates unnecessary memory accesses. The proposed approach leads to 1.4–2.2× savings in terms of the number of cycles and 1.9× savings in terms of memory accesses. Thus, the proposed accelerator reduces latency and energy consumption.
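The variable reuse described in the abstract can be illustrated with a minimal software sketch for one fully-connected layer. This is not the authors' scheduler or hardware mapping: the function names, the list-based matrices, and the choice to fold the activation derivative into `delta_out` are all assumptions made for illustration. It shows only that the gradient with respect to the layer input (W^T times delta) and the gradient with respect to the weights (delta times the input, as an outer product) both consume the same `delta_out` values, so one fetch can serve both.

```python
# Illustrative sketch only. Assumes a fully-connected layer y = W a and that
# delta_out already includes the activation derivative (an assumption).

def grad_input(W, delta_out):
    """Gradient w.r.t. the layer input: W^T @ delta_out.
    W is held fixed while delta_out streams past it, loosely analogous
    to a weight-stationary dataflow."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * delta_out[i] for i in range(rows)) for j in range(cols)]

def grad_weights(delta_out, a_in):
    """Gradient w.r.t. the weights: outer product delta_out a_in^T.
    Each output entry is produced in place, loosely analogous to an
    output-stationary dataflow."""
    return [[d * a for a in a_in] for d in delta_out]

def interleaved_backward(W, a_in, delta_out):
    """Compute both gradients in one pass, reading each delta once."""
    rows, cols = len(W), len(W[0])
    g_in = [0.0] * cols
    g_W = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d = delta_out[i]  # fetched once, reused for both gradients
        for j in range(cols):
            g_in[j] += W[i][j] * d
            g_W[i][j] = d * a_in[j]
    return g_in, g_W
```

Computing the two gradients separately would read `delta_out` twice; the interleaved loop reads it once, which is the software analogue of the memory-access saving the abstract reports. Cycle counts and energy are not modeled here.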
 Award ID(s):
 1814759
 NSFPAR ID:
 10195386
 Date Published:
 Journal Name:
 2020 IEEE International Symposium on Circuits and Systems (ISCAS)
 Page Range / eLocation ID:
 1 to 5
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


The time required for training neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the tasks within a layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited the use of delayed gradients to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization. This paper presents novel optimization approaches where the gradient computations with respect to the weights and the activation functions are considered independently; therefore, these can be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling where the computation time of each layer is the same. This is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. Compared to PipeDream, LayerPipe achieves an average speedup of 25%, and upwards of 80% with 7 to 9 processors, with less communication overhead.

This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet-101, and WT103 + Transformer-XL models), the neural network’s weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.

The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem-specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.
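The trade of memory for variance can be illustrated with a small sketch. It is not the RAD framework itself: Bernoulli masking of the stored activations of a single linear layer is an assumed stand-in for whatever sampling the paper uses, and all function names are hypothetical. Each activation entry is kept with probability `keep_prob` and rescaled by `1 / keep_prob`; the code then enumerates every possible mask to check that the resulting weight-gradient estimator equals the exact gradient in expectation.

```python
from itertools import product

def sparsify(a, mask, keep_prob):
    """Keep entries where mask is 1, rescaled by 1/keep_prob; zero the rest."""
    return [x / keep_prob if m else 0.0 for x, m in zip(a, mask)]

def weight_grad(delta, a):
    """Weight gradient of a linear layer: outer product delta a^T."""
    return [[d * x for x in a] for d in delta]

def expected_grad(delta, a, keep_prob):
    """Exact expectation of the masked estimator over all Bernoulli masks."""
    n = len(a)
    total = [[0.0] * n for _ in delta]
    for mask in product([0, 1], repeat=n):
        # Probability of drawing this particular mask.
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m else (1.0 - keep_prob)
        g = weight_grad(delta, sparsify(a, mask, keep_prob))
        for i, row in enumerate(g):
            for j, v in enumerate(row):
                total[i][j] += prob * v
    return total
```

In practice only one random mask would be drawn per step, and only the kept entries would need to be stored between the forward and backward passes; the exhaustive enumeration here exists solely to verify unbiasedness on a tiny example.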

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-width such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
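The blended update described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the quantizer (sign scaled by the mean absolute value), the evaluation of the coarse gradient at the quantized weights, and the names `binarize`, `bcgd_step`, `rho`, and `lr` are all assumed here. The step forms a weighted average of the full-precision weights and their quantization, then applies the coarse gradient as a correction.

```python
def binarize(w):
    """Assumed quantizer: sign(w) scaled by the mean absolute value of w."""
    scale = sum(abs(x) for x in w) / len(w)
    return [scale if x >= 0 else -scale for x in w]

def bcgd_step(w_float, coarse_grad, lr, rho):
    """One blended step (sketch).

    w_float     -- full-precision weights (list of floats)
    coarse_grad -- callable returning a coarse gradient at given weights
    lr          -- step size
    rho         -- blending weight in [0, 1]; rho = 0 ignores the quantization
    """
    w_quant = binarize(w_float)
    g = coarse_grad(w_quant)  # assumption: evaluated at the quantized weights
    return [(1.0 - rho) * wf + rho * wq - lr * gi
            for wf, wq, gi in zip(w_float, w_quant, g)]
```

For example, with `w_float = [1.0, -3.0]` the assumed quantizer gives `[2.0, -2.0]`, and a step with `rho = 0.5` moves the full-precision weights halfway toward those values before the gradient correction is applied.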