skip to main content

Title: MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings
The efficiency of an accelerator depends on three factors—mapping, deep neural network (DNN) layers, and hardware—constructing extremely complicated design space of DNN accelerators. To demystify such complicated design space and guide the DNN accelerator design for better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO receives DNN model description and hardware resources information as a list, and mapping described in a data-centric representation we propose as inputs. The data centric representation consists of three directives that enable concise description of mappings in a compiler-friendly form. MAESTRO analyzes various forms of data reuse in an accelerator based on inputs quickly and generates more than 20 statistics including total latency, energy, throughput, etc., as outputs. MAESTRO’s fast analysis enables various optimization tools for DNN accelerators such as hardware design exploration tool we present as an example.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
IEEE micro
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of DNN accelerators. An accelerator micro architecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Patio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points. 
    more » « less
  2. Deep neural networks (DNNs) emerge as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant DNNs in ReRAM to reap the benefits of light-weight models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map block-circulant DNN models onto ReRAM crossbars with significant improved crossbar utilization. Moreover, two specific techniques, namely Input Slice Reusing and Input Tile Sharing are introduced to take advantage of the circulant calculation feature in block- circulant DNNs to reduce data access and buffer size. In REBOC, a DNN model is executed within an intra-layer processing pipeline and achieves respectively 96× and 8.86× power efficiency improvement compared to the state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks. Compared to ReRAM-based DNN accelerators, REBOC achieves averagely 4.1× speedup and 2.6× energy reduction. 
    more » « less
  3. A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules . The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel , and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers. 
    more » « less
  4. Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unifieddesign-mappingapproach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimalGNNsandmappingpairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide1.57× latency speedup and is3.38× more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average0.11%accuracy reduction from the baseline.

    more » « less
  5. null (Ed.)
    With the growing performance and wide application of deep neural networks (DNNs), recent years have seen enormous efforts on DNN accelerator hardware design for platforms from mobile devices to data centers. The systolic array has been a popular architectural choice for many proposed DNN accelerators with hundreds to thousands of processing elements (PEs) for parallel computing. Systolic array-based DNN accelerators for datacenter applications have high power consumption and nonuniform workload distribution, which makes power delivery network (PDN) design challenging. Server-class multicore processors have benefited from distributed on-chip voltage regulation and heterogeneous voltage regulation (HVR) for improving energy efficiency while guaranteeing power delivery integrity. This paper presents the first work on HVR-based PDN architecture and control for systolic array-based DNN accelerators. We propose to employ a PDN architecture comprising heterogeneous on-chip and off-chip voltage regulators and multiple power domains. By analyzing patterns of typical DNN workloads via a modeling framework, we propose a DNN workload-aware dynamic PDN control policy to maximize system energy efficiency while ensuring power integrity. We demonstrate significant energy efficiency improvements brought by the proposed PDN architecture, dynamic control, and power gating, which lead to a more than five-fold reduction of leakage energy and PDN energy overhead for systolic array DNN accelerators. 
    more » « less