skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: The Accelerator Wall: Limits of Chip Specialization
Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Much of the benefits of chip specialization stems from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the number of transistors available on a chip will limit the accelerator design optimization space, leading to diminishing specialization returns, ultimately hitting an accelerator wall. In this work, we tackle the question of what are the limits of future accelerators and chip specialization? We do this by characterizing how current accelerators depend on CMOS scaling, based on a physical modeling tool that we constructed using datasheets of thousands of chips. We identify key concepts used in chip specialization, and explore case studies to understand how specialization has progressed over time in different applications and chip platforms (e.g., GPUs, FPGAs, ASICs)1. Utilizing these insights, we build a model which projects forward to see what future gains can and cannot be enabled from chip specialization. A quantitative analysis of specialization returns and technological boundaries is critical to help researchers understand the limits of accelerators and develop methods to surmount them.  more » « less
Award ID(s):
1823222
PAR ID:
10096256
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the 25th IEEE International Symposium on High-Performance Computer Architecture (HPCA '19)
Page Range / eLocation ID:
1 to 14
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The demise of Dennard scaling has ushered in an era of un- precedented and ever-increasing heterogeneity, in pursuit of increasing performance via specialization. While CMOS scal- ing is believed to be approaching its end, continued increases in the number of transistors available on a chip have made specialized hardware an attractive alternative to increasing core counts or cache sizes. GPUs are commonplace in many computing domains , FPGAs are arriving in the cloud; smart storage, and networking hardware are commercially available. This paper argues for separating transport — the actual physical management of data, from the rest of the control plane by adding simple hardware specialized purely for this task, called TRANSPORTERS. TRANSPORTERS facilitate offloading accelerator scheduling, data movement, and inter- accelerator communication and co-ordination, through a management protocol called TALK TO MY NEIGHBORS TRANSPORT (TMNT). 
    more » « less
  2. Generative Adversarial Networks (GANs) recently demonstrated a great opportunity toward unsupervised learning with the intention to mitigate the massive human efforts on data labeling in supervised learning algorithms. GAN combines a generative model and a discriminative model to oppose each other in an adversarial situation to refine their abilities. Existing nonvolatile memory based machine learning accelerators, however, could not support the computational needs required by GAN training. Specifically, the generator utilizes a new operator, called transposed convolution, which introduces significant resource underutilization when executed on conventional neural network accelerators as it inserts massive zeros in its input before a convolution operation. In this work, we propose a novel computational deformation technique that synergistically optimizes the forward and backward functions in transposed convolution to eliminate the large resource underutilization. In addition, we present dedicated control units - a dataflow mapper and an operation scheduler, to support the proposed execution model with high parallelism and low energy consumption. ZARA is implemented with commodity ReRAM chips, and experimental results show that our design can improve GAN’s training performance by averagely 1.6x~23x over CMOS-based GAN accelerators. Compared to state-of-the-art ReRAM-based accelerator designs, ZARA also provides 1.15x~2.1x performance improvement. 
    more » « less
  3. The co-packaging of optics and electronics provides a potential path forward to achieving beyond 50 Tbps top of rack switch packages. In a co-packaged design, the scaling of bandwidth, cost, and energy is governed by the number of optical transceivers (TxRx) per package as opposed to transistor shrink. Due to the large footprint of optical components relative to their electronic counterparts, the vertical stacking of optical TxRx chips in a co-packaged optics design will become a necessity. As a result, development of efficient, dense, and wide alignment tolerance chip-to-chip optical couplers will be an enabling technology for continued TxRx scaling. In this paper, we propose a novel scheme to vertically couple into standard 220 nm silicon on insulator waveguides from 220 nm silicon nitride on glass waveguides using overlapping, inverse double tapers. Simulation results using Lumerical’s 3D Finite Difference Time Domain solver are presented, demonstrating insertion losses below -0.13 dB for an inter-chip spacing of 1µm; 1 dB vertical and lateral alignment tolerances of approximately 2.6µm and ± 2.8µm, respectively; a greater than 300 nm 1 dB bandwidth; and 1 dB twist and tilt tolerances of approximately ± 2.3 degrees and 0.4 degrees, respectively. These results demonstrate the potential of our coupler for use in co-packaged designs requiring high performance, high density, CMOS compatible out of plane optical connections. 
    more » « less
  4. Probabilistic computing is a computing scheme that offers a more efficient approach than conventional complementary metal-oxide–semiconductor (CMOS)-based logic in a variety of applications ranging from optimization to Bayesian inference, and invertible Boolean logic. The probabilistic bit (or p-bit, the base unit of probabilistic computing) is a naturally fluctuating entity that requires tunable stochasticity; by coupling low-barrier stochastic magnetic tunnel junctions (MTJs) with a transistor circuit, a compact implementation is achieved. In this work, by combining stochastic MTJs with 2D-MoS2field-effect transistors (FETs), we demonstrate an on-chip realization of a p-bit building block displaying voltage-controllable stochasticity. Supported by circuit simulations, we analyze the three transistor-one magnetic tunnel junction (3T-1MTJ) p-bit design, evaluating how the characteristics of each component influence the overall p-bit output. While the current approach has not reached the level of maturity required to compete with CMOS-compatible MTJ technology, the design rules presented in this work are valuable for future experimental implementations of scaled on-chip p-bit networks with reduced footprint. 
    more » « less
  5. Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, the mobile consumer market undoubtedly involving smartphones has a significant market share. Most modern smartphones comprise of advanced SoC architectures that are made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects with the goal of running a dozen or more critical software usecases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported. However, early-stage SoC design must often select accelerators before a single line of software is written. To help frame SoC thinking and guide early stage mobile SoC design, in this paper we contribute the Gables model that refines and retargets the Roofline model-designed originally for the performance and bandwidth limits of a multicore chip-to model each accelerator on a SoC, to apportion work concurrently among different accelerators (justified by our usecase analysis), and calculate a SoC performance upper bound. We evaluate the Gables model with an existing SoC and develop several extensions that allow Gables to inform early stage mobile SoC design. 
    more » « less