This paper addresses design of accelerators using systolic architectures for training of neural networks using a novel gradient interleaving approach. Training the neural network involves backpropagation of error and computation of gradients with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients to the same configurable systolic array. This results in reuse of the variables from one computation to the other and eliminates unnecessary memory accesses. The proposed approach leads to 1.4–2.2× savings in terms of number of cycles and 1.9× savings in terms of memory accesses. Thus, the proposed accelerator reduces latency and energy consumption.
more »
« less
InterGrad: Energy-Efficient Training of Convolutional Neural Networks via Interleaved Gradient Scheduling
This paper addresses the design of accelerators using systolic architectures to train convolutional neural networks using a novel gradient interleaving approach. Training the neural network involves computation and backpropagation of gradients of error with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients on the same configurable systolic array. This results in the reuse of the variables from one computation to the other and eliminates unnecessary memory accesses and energy consumption associated with these memory accesses. The proposed approach leads to 1.4−2.2× savings in terms of the number of cycles and 1.9× savings in terms of memory accesses in the fully-connected layer. Furthermore, the proposed method uses up to 25% fewer cycles and memory accesses, and 16% less energy than baseline implementations for state-of-the-art CNNs. Under iso-area comparisons, for Inception-v4, compared to weight-stationary (WS), Intergrad achieves 12% savings in energy, 17% savings in memory, and 4% savings in cycles. Savings for Densenet-264 are 18% , 26% , and 27% with respect to energy, memory, and cycles, respectively. Thus, the proposed novel accelerator architecture reduces the latency and energy consumption for training deep neural networks.
more »
« less
- Award ID(s):
- 1954749
- NSF-PAR ID:
- 10398496
- Date Published:
- Journal Name:
- IEEE Transactions on Circuits and Systems I: Regular Papers
- ISSN:
- 1549-8328
- Page Range / eLocation ID:
- 1 to 14
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the tasks within the layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited the use of delayed gradient to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization. This paper presents novel optimization approaches where the gradient computations with respect to the weights and the activation functions are considered independently; therefore, these can be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling where the computation time of each layer is the same. This is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. LayerPipe achieves an average speedup of 25 % and upwards of 80% with 7 to 9 processors with less communication overhead when compared to PipeDream.more » « less
-
Recent algorithmic progression has brought competitive classification accuracy despite constraining neural networks to binary weights (+1/-1). These findings show remarkable optimization opportunities to eliminate the need for computationally-intensive multiplications, reducing memory access and storage. In this paper, we present ParaPIM architecture, which transforms current Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays to massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM's in-situ computing architecture can be leveraged to greatly reduce energy consumption dealing with convolutional layers, accelerate BWNNs inference, eliminate unnecessary off-chip accesses and provide ultra-high internal bandwidth. The device-to-architecture co-simulation results indicate ~4x higher energy efficiency and 7.3x speedup over recent processing-in-DRAM acceleration, or roughly 5x higher energy-efficiency and 20.5x speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.more » « less
-
null (Ed.)Spiking neural networks (SNNs) have emerged as a new generation of neural networks, presenting a brain-inspired event-driven model with advantages in spatiotemporal information processing. Due to the need for high power consumption of compute-intensive neural accelerators, adequate power delivery network (PDN) design is a key requirement to ensure power efficiency and integrity. However, PDN design for SNN accelerators has not been extensively studied despite its great potential benefit in energy efficiency. In this paper, we present the first study on dynamic heterogeneous voltage regulation (HVR) for spiking neural accelerators to maximize system energy efficiency while ensuring power integrity. We propose a novel sparse-workload-aware dynamic PDN control policy, which enables high energy efficiency of sparse spiking computation on a systolic array. By exploring sparse inputs and all-or-none nature of spiking computations for PDN control, we explore different types of PDNs to accelerate spiking convolutional neural networks (S-CNNs) trained with the dynamic vision sensor (DVS) gesture dataset. Furthermore, we demonstrate various power gating schemes to further optimize the proposed PDN architecture, which leads to a more than a three-fold reduction in total energy overhead for spiking neural computations on systolic array-based accelerators.more » « less
-
null (Ed.)With the growing performance and wide application of deep neural networks (DNNs), recent years have seen enormous efforts on DNN accelerator hardware design for platforms from mobile devices to data centers. The systolic array has been a popular architectural choice for many proposed DNN accelerators with hundreds to thousands of processing elements (PEs) for parallel computing. Systolic array-based DNN accelerators for datacenter applications have high power consumption and nonuniform workload distribution, which makes power delivery network (PDN) design challenging. Server-class multicore processors have benefited from distributed on-chip voltage regulation and heterogeneous voltage regulation (HVR) for improving energy efficiency while guaranteeing power delivery integrity. This paper presents the first work on HVR-based PDN architecture and control for systolic array-based DNN accelerators. We propose to employ a PDN architecture comprising heterogeneous on-chip and off-chip voltage regulators and multiple power domains. By analyzing patterns of typical DNN workloads via a modeling framework, we propose a DNN workload-aware dynamic PDN control policy to maximize system energy efficiency while ensuring power integrity. We demonstrate significant energy efficiency improvements brought by the proposed PDN architecture, dynamic control, and power gating, which lead to a more than five-fold reduction of leakage energy and PDN energy overhead for systolic array DNN accelerators.more » « less