The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of DNN accelerators. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflows, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centricmore »
MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings
The efficiency of an accelerator depends on three factors—mapping, deep
neural network (DNN) layers, and hardware—constructing extremely complicated design space of DNN accelerators. To demystify such complicated design space and guide the DNN accelerator design for better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO receives DNN model description and hardware resources information as a list, and mapping described in a data-centric representation we propose as inputs. The data centric representation consists of three directives that enable concise description of mappings in a compiler-friendly form. MAESTRO analyzes various forms of data reuse in an accelerator based on inputs quickly and generates more than 20 statistics including total latency, energy, throughput, etc., as outputs. MAESTRO’s fast analysis enables various optimization tools for DNN accelerators such as hardware design exploration tool we present as an example.
- Award ID(s):
- 1909900
- Publication Date:
- NSF-PAR ID:
- 10168818
- Journal Name:
- IEEE micro
- Volume:
- 40
- Issue:
- 3
- ISSN:
- 0272-1732
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Deep neural networks (DNNs) emerge as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant DNNs in ReRAM to reap the benefits of light-weight models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicingmore »
-
With the growing performance and wide application of deep neural networks (DNNs), recent years have seen enormous efforts on DNN accelerator hardware design for platforms from mobile devices to data centers. The systolic array has been a popular architectural choice for many proposed DNN accelerators with hundreds to thousands of processing elements (PEs) for parallel computing. Systolic array-based DNN accelerators for datacenter applications have high power consumption and nonuniform workload distribution, which makes power delivery network (PDN) design challenging. Server-class multicore processors have benefited from distributed on-chip voltage regulation and heterogeneous voltage regulation (HVR) for improving energy efficiency while guaranteeingmore »
-
Deep neural network (DNN) accelerators as an example of domain-specific architecture have demonstrated great success in DNN inference. However, the architecture acceleration for equally important DNN training has not yet been fully studied. With data forward, error backward and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because the recent research demonstrates a diminishing specialization return, namely, “accelerator wall”, we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remainingmore »
-
Deconvolution is a key component in contemporary neural networks, especially generative adversarial networks (GANs) and fully convolutional networks (FCNs). Due to extra operations of deconvolution compared to convolution, considerable degradation of performance as well as energy efficiency is incurred when implementing deconvolution on the existing resistive random access memory (ReRAM)-based processing-in-memory (PIM) accelerators. In this work, we propose a ReRAM-based accelerator design, RED, for providing high-performance and low-energy deconvolution. We analyze the deconvolution execution on the existing ReRAM-based PIMs and utilize its interior computation pattern for design optimization. RED includes two major contributions: pixel-wise mapping scheme and zero-skipping data flow.more »