Increasing processing requirements in the Artificial Intelligence (AI) realm has led to the emergence of domain-specific architectures for Deep Neural Network (DNN) applications. Tensor Processing Unit (TPU), a DNN accelerator by Google, has emerged as a front runner outclassing its contemporaries, CPUs and GPUs, in performance by 15×–30×. TPUs have been deployed in Google data centers to cater to the performance demands. However, a TPU’s performance enhancement is accompanied by a mammoth power consumption. In the pursuit of lowering the energy utilization, this paper proposes PREDITOR—a low-power TPU operating in the Near-Threshold Computing (NTC) realm. PREDITOR uses mathematical analysis to mitigate the undetectable timing errors by boosting the voltage of the selective multiplier-and-accumulator units at specific intervals to enhance the performance of the NTC TPU, thereby ensuring a high inference accuracy at low voltage. PREDITOR offers up to 3×–5× improved performance in comparison to the leading-edge error mitigation schemes with a minor loss in accuracy.
UPTPU: Improving Energy Efficiency of a Tensor Processing Unit through Underutilization Based Power-Gating
The AI boom is bringing a plethora of domain-specific architectures
for Neural Network computations. Google’s Tensor Processing Unit (TPU), a Deep
Neural Network (DNN) accelerator, has replaced the CPUs/GPUs in its data centers,
claiming more than 15X rate of inference. However, the unprecedented growth in
DNN workloads with the widespread use of AI services projects an increasing
energy consumption of TPU based data centers. In this work, we parametrize the
extreme hardware underutilization in TPU systolic array and propose UPTPU:
an intelligent, dataflow adaptive power-gating paradigm to provide a staggering
3.5X - 6.5X energy efficiency to TPU for different input batch sizes.
- Publication Date:
- NSF-PAR ID:
- 10347896
- Journal Name:
- 2021 58th ACM/IEEE Design Automation Conference (DAC)
- Page Range or eLocation-ID:
- 325 to 330
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
With the proliferation of low-cost sensors and the Internet of Things, the rate of producing data far exceeds the compute and storage capabilities of today’s infrastructure. Much of this data takes the form of time series, and in response, there has been increasing interest in the creation of time series archives in the last decade, along with the development and deployment of novel analysis methods to process the data. The general strategy has been to apply a plurality of similarity search mechanisms to various subsets and subsequences of time series data in order to identify repeated patterns and anomalies; however, the computational demands of these approaches renders them incompatible with today’s power-constrained embedded CPUs. To address this challenge, we present FA-LAMP, an FPGA-accelerated implementation of the Learned Approximate Matrix Profile (LAMP) algorithm, which predicts the correlation between streaming data sampled in real-time and a representative time series dataset used for training. FA-LAMP lends itself as a real-time solution for time series analysis problems such as classification. We present the implementation of FA-LAMP on both edge- and cloud-based prototypes. On the edge devices, FA-LAMP integrates accelerated computation as close as possible to IoT sensors, thereby eliminating the need to transmit andmore »
-
Deep neural network (DNN) accelerators as an example of domain-specific architecture have demonstrated great success in DNN inference. However, the architecture acceleration for equally important DNN training has not yet been fully studied. With data forward, error backward and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because the recent research demonstrates a diminishing specialization return, namely, “accelerator wall”, we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remaining challenging. We present ACCPAR, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown new parallelism configurations. ACCPAR optimizes the performance based on a cost model that takes into account both computation and communication costs of a heterogeneous execution environment. Hence, our method can avoid the drawbacks of existing approaches that use communication as a proxy of the performance. The enhanced flexibility of tensor partitioning in ACCPAR allows the flexible ratio of computations to be distributed amongmore »
-
With the growing performance and wide application of deep neural networks (DNNs), recent years have seen enormous efforts on DNN accelerator hardware design for platforms from mobile devices to data centers. The systolic array has been a popular architectural choice for many proposed DNN accelerators with hundreds to thousands of processing elements (PEs) for parallel computing. Systolic array-based DNN accelerators for datacenter applications have high power consumption and nonuniform workload distribution, which makes power delivery network (PDN) design challenging. Server-class multicore processors have benefited from distributed on-chip voltage regulation and heterogeneous voltage regulation (HVR) for improving energy efficiency while guaranteeing power delivery integrity. This paper presents the first work on HVR-based PDN architecture and control for systolic array-based DNN accelerators. We propose to employ a PDN architecture comprising heterogeneous on-chip and off-chip voltage regulators and multiple power domains. By analyzing patterns of typical DNN workloads via a modeling framework, we propose a DNN workload-aware dynamic PDN control policy to maximize system energy efficiency while ensuring power integrity. We demonstrate significant energy efficiency improvements brought by the proposed PDN architecture, dynamic control, and power gating, which lead to a more than five-fold reduction of leakage energy and PDN energy overhead formore »
-
The ever increasing size of deep neural network (DNN) models once implied that they were only limited to cloud data centers for runtime inference. Nonetheless, the recent plethora of DNN model compression techniques have successfully overcome this limit, turning into a reality that DNN-based inference can be run on numerous resource-constrained edge devices including mobile phones, drones, robots, medical devices, wearables, Internet of Things devices, among many others. Naturally, edge devices are highly heterogeneous in terms of hardware specification and usage scenarios. On the other hand, compressed DNN models are so diverse that they exhibit different tradeoffs in a multi-dimension space, and not a single model can achieve optimality in terms of all important metrics such as accuracy, latency and energy consumption. Consequently, how to automatically select a compressed DNN model for an edge device to run inference with optimal quality of experience (QoE) arises as a new challenge. The state-of-the-art approaches either choose a common model for all/most devices, which is optimal for a small fraction of edge devices at best, or apply device-specific DNN model compression, which is not scalable. In this paper, by leveraging the predictive power of machine learning and keeping end users in the loop,more »