

Title: The Accelerator Wall: Limits of Chip Specialization
Specializing chips using hardware accelerators has become the prime means to alleviate the gap between growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefit of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns and, ultimately, an accelerator wall. In this work, we tackle the question: what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using the datasheets of thousands of chips. We identify key concepts used in chip specialization and explore case studies to understand how specialization has progressed over time across different applications and chip platforms (e.g., GPUs, FPGAs, ASICs). Utilizing these insights, we build a model that projects forward to see what future gains can and cannot be enabled by chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them.
Award ID(s): 1823222
NSF-PAR ID: 10096256
Journal Name: Proceedings of the 25th IEEE International Symposium on High-Performance Computer Architecture (HPCA '19)
Page Range / eLocation ID: 1 to 14
Format(s): Medium: X
Sponsoring Org: National Science Foundation
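To make the abstract's decomposition concrete, here is a minimal sketch, assuming (as an illustration, not as the paper's actual model) that an accelerator's total speedup factors multiplicatively into a CMOS-driven gain and a chip specialization return (CSR). The function names and numbers are hypothetical.

```python
def cmos_gain(freq_ratio: float, transistor_ratio: float) -> float:
    """Gain attributable purely to process scaling (frequency x parallelism)."""
    return freq_ratio * transistor_ratio

def specialization_return(total_speedup: float, cmos: float) -> float:
    """CSR: the residual gain once CMOS scaling is factored out."""
    return total_speedup / cmos

# Hypothetical example: a 100x accelerator speedup achieved on a process node
# that delivered 2x frequency and 10x transistors over the baseline chip.
cmos = cmos_gain(freq_ratio=2.0, transistor_ratio=10.0)      # 20x from CMOS
csr = specialization_return(total_speedup=100.0, cmos=cmos)  # 5x from design

# As CMOS scaling stagnates (both ratios -> 1), cmos -> 1 and any future gain
# must come entirely from CSR: the "accelerator wall" the paper characterizes.
print(f"CMOS gain: {cmos:.0f}x, specialization return: {csr:.0f}x")
```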
More Like this
  1. The co-packaging of optics and electronics provides a potential path toward top-of-rack switch packages beyond 50 Tbps. In a co-packaged design, the scaling of bandwidth, cost, and energy is governed by the number of optical transceivers (TxRx) per package rather than by transistor shrink. Due to the large footprint of optical components relative to their electronic counterparts, vertically stacking optical TxRx chips will become a necessity in co-packaged optics designs. As a result, the development of efficient, dense, wide-alignment-tolerance chip-to-chip optical couplers will be an enabling technology for continued TxRx scaling. In this paper, we propose a novel scheme to vertically couple into standard 220 nm silicon-on-insulator waveguides from 220 nm silicon nitride on glass waveguides using overlapping, inverse double tapers. Simulation results using Lumerical’s 3D Finite Difference Time Domain solver are presented, demonstrating insertion loss below 0.13 dB for an inter-chip spacing of 1 µm; 1 dB vertical and lateral alignment tolerances of approximately 2.6 µm and ±2.8 µm, respectively; a 1 dB bandwidth greater than 300 nm; and 1 dB twist and tilt tolerances of approximately ±2.3 degrees and 0.4 degrees, respectively. These results demonstrate the potential of our coupler for use in co-packaged designs requiring high-performance, high-density, CMOS-compatible out-of-plane optical connections.
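As a rough illustration of how figures like these are typically derived, the following hedged sketch converts power transmission from an FDTD sweep into insertion loss and interpolates a 1 dB lateral alignment tolerance; the sweep values are placeholders, not the paper's simulation data.

```python
import numpy as np

def insertion_loss_db(transmission: np.ndarray) -> np.ndarray:
    """IL = -10*log10(P_out / P_in), with transmission = P_out / P_in."""
    return -10.0 * np.log10(transmission)

# Hypothetical lateral offset sweep (um) and simulated power transmission.
offset_um = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
transmission = np.array([0.97, 0.94, 0.88, 0.74, 0.55])

il = insertion_loss_db(transmission)   # loss in dB at each offset
aligned_loss = il[0]                   # loss in the best-aligned position
# 1 dB tolerance: the offset where loss rises 1 dB above the aligned case.
tol_um = np.interp(aligned_loss + 1.0, il, offset_um)
print(f"aligned IL: {aligned_loss:.2f} dB, 1 dB lateral tolerance: ±{tol_um:.1f} um")
```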

     
  2. The demise of Dennard scaling has ushered in an era of unprecedented and ever-increasing heterogeneity, in pursuit of increasing performance via specialization. While CMOS scaling is believed to be approaching its end, continued increases in the number of transistors available on a chip have made specialized hardware an attractive alternative to increasing core counts or cache sizes. GPUs are commonplace in many computing domains; FPGAs are arriving in the cloud; smart storage and networking hardware are commercially available. This paper argues for separating transport, the actual physical management of data, from the rest of the control plane by adding simple hardware specialized purely for this task, called TRANSPORTERS. TRANSPORTERS facilitate offloading accelerator scheduling, data movement, and inter-accelerator communication and coordination through a management protocol called TALK TO MY NEIGHBORS TRANSPORT (TMNT).
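As a hedged illustration of the transport-offload idea, the sketch below models a TRANSPORTER-style descriptor handoff; the field names and queueing logic are assumptions made for illustration, not the TMNT protocol as specified in the paper.

```python
from dataclasses import dataclass

@dataclass
class TransportDescriptor:
    src_accel: str      # producer accelerator
    dst_accel: str      # consumer accelerator
    buffer_addr: int    # physical address of the data buffer
    length: int         # bytes to move
    seq: int            # ordering tag for inter-accelerator coordination

def hand_off(desc: TransportDescriptor, queues: dict) -> None:
    """Forward a descriptor to the destination's transporter queue: the
    control plane exchanges small descriptors; transporters move the data."""
    queues[desc.dst_accel].append(desc)

# Hypothetical usage: a GPU produces a buffer that a smart NIC consumes.
queues = {"smart_nic": []}
hand_off(TransportDescriptor("gpu0", "smart_nic", 0x8000_0000, 4096, seq=1), queues)
```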
  3. Generative Adversarial Networks (GANs) have recently demonstrated a great opportunity for unsupervised learning, with the intention of mitigating the massive human effort spent on data labeling in supervised learning algorithms. A GAN combines a generative model and a discriminative model that oppose each other in an adversarial setting to refine their abilities. Existing nonvolatile-memory-based machine learning accelerators, however, cannot support the computational needs of GAN training. Specifically, the generator utilizes a new operator, transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators because it inserts massive numbers of zeros into its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions of transposed convolution to eliminate this underutilization. In addition, we present dedicated control units, a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design improves GAN training performance by 1.6x to 23x on average over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides a 1.15x to 2.1x performance improvement.
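To see why naive execution underutilizes resources, here is a minimal NumPy sketch of transposed convolution as zero insertion followed by an ordinary convolution; it illustrates the zero-heavy computation the abstract describes, not ZARA's deformation technique itself.

```python
import numpy as np

def zero_insert(x: np.ndarray, stride: int) -> np.ndarray:
    """Insert (stride - 1) zeros between neighboring input elements."""
    n = x.shape[0]
    up = np.zeros((n - 1) * stride + 1, dtype=x.dtype)
    up[::stride] = x
    return up

def naive_transposed_conv1d(x: np.ndarray, w: np.ndarray, stride: int = 2):
    up = zero_insert(x, stride)
    up = np.pad(up, w.shape[0] - 1)  # full padding on both sides
    # Ordinary convolution over the zero-stuffed input: most MACs see zeros.
    return np.array([np.dot(up[i:i + w.shape[0]], w[::-1])
                     for i in range(up.shape[0] - w.shape[0] + 1)])

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5])
print(naive_transposed_conv1d(x, w))  # [1.  0.5 2.  1.  3.  1.5]
```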
  4. Brain-computer interfaces (BCIs) enable direct communication with the brain, providing valuable information about brain function and enabling novel treatment of brain disorders. Our group has been building HALO, a flexible and ultra-low-power processing architecture for BCIs. HALO can process up to 46 Mbps of neural data, a significant increase over the interfacing bandwidth achievable by prior BCIs. HALO can also be programmed to support several applications, unlike most prior BCIs. Key to HALO's effectiveness is a hardware accelerator cluster in which each accelerator operates within its own clock domain. A configurable interconnect connects the accelerators to create dataflow pipelines that realize neural signal processing algorithms. We have taped out our design in a 12nm CMOS process. The resulting chip runs at 0.88 V with per-accelerator frequencies of 3-180 MHz, and consumes at most 5.0 mW per signal processing pipeline.
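A minimal sketch of the pipeline-composition idea follows, assuming a simple linear power proxy; the stage names and coefficients are illustrative, and only the frequency and power ranges come from the abstract.

```python
from dataclasses import dataclass

@dataclass
class Accel:
    name: str
    freq_mhz: float   # per-accelerator clock domain (3-180 MHz in HALO)
    mw_per_mhz: float # crude linear dynamic-power proxy (assumption)

def pipeline_power_mw(stages: list) -> float:
    """Sum per-stage power; HALO pipelines stay under ~5.0 mW."""
    return sum(a.freq_mhz * a.mw_per_mhz for a in stages)

# Hypothetical signal-processing pipeline: filter -> FFT -> threshold,
# each stage running in its own clock domain at a different frequency.
stages = [Accel("fir", 120.0, 0.015),
          Accel("fft", 60.0, 0.030),
          Accel("thresh", 3.0, 0.010)]
print(f"pipeline power: {pipeline_power_mw(stages):.2f} mW")  # ~3.63 mW
```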
Evaluations using electrophysiological data collected from a non-human primate confirm HALO's flexibility and superior performance per watt.
  5. As the computational intensity of chips increases, the mismatch between the shapes of computation layers and the available compute resources significantly limits chip utilization. Driven by this observation, prior works propose spatial accelerators or dataflow architectures to maximize throughput. However, using spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. We find that there is a latency-throughput tradeoff between these two execution models, and that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial-sequential architecture (SSR) and an SSR design automation framework that explores both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia A10G GPU and the 16nm AMD FPGAs ZCU102 and U250, respectively. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on the VCK190, our spatial-sequential hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how SSR can optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.
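The latency-throughput tradeoff can be sketched with a toy model; the numbers and the sublinear k**0.8 scaling factor are assumptions for illustration, not SSR's analytical model.

```python
def temporal(latency_ms: float):
    """One monolithic accelerator: one request in flight at a time."""
    return latency_ms, 1000.0 / latency_ms          # (latency, req/s)

def spatial(latency_ms: float, k: int):
    """k smaller accelerators: each request is slower, but k run at once."""
    per_req = latency_ms * k / (k ** 0.8)           # smaller accel => slower
    return per_req, k * 1000.0 / per_req

base = 5.0                                          # monolithic latency (ms)
print("temporal:", temporal(base))
for k in (2, 4, 8):                                 # spatial candidates
    print(f"spatial k={k}:", spatial(base, k))
# A design-automation pass like SSR's would search this space (including
# hybrids) for the best Pareto point under a deployment's latency or
# throughput constraint.
```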