

Title: CANNON: Communication-Aware Sparse Neural Network Optimization
Sparse deep neural networks (DNNs) have the potential to deliver compelling performance and energy efficiency without significant accuracy loss. However, their benefits can quickly diminish if their training is oblivious to the target hardware. For example, even a few critical connections can incur significant overhead if they translate into long-distance communication on the target hardware. Therefore, hardware-aware sparse training is needed to leverage the full potential of sparse DNNs. To this end, we propose a novel and comprehensive communication-aware sparse DNN optimization framework for tile-based in-memory computing (IMC) architectures. The proposed technique, CANNON, first maps the DNN layers onto the tiles of the target architecture. Then, it replaces the fully connected and convolutional layers with communication-aware sparse connections. After that, CANNON optimizes the communication cost with minimal impact on the DNN accuracy. Extensive experimental evaluations with a wide range of DNNs and datasets on IMC-based machine learning accelerators show up to 3.0× lower communication energy, 3.1× lower communication latency, and 6.8× lower energy-delay product compared to state-of-the-art pruning approaches, with negligible impact on classification accuracy.
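The following is a minimal Python sketch of why the placement of connections matters. It is not CANNON's actual algorithm. It assumes tiles sit on a 2D-mesh network-on-chip with XY routing, so the hop count between two tiles is their Manhattan distance. It also assumes that each input activation is sent once to every distinct tile that holds a nonzero weight consuming it. The helper names `hops`, `comm_cost`, and `communication_aware_mask`, and the simple distance-threshold pruning rule, are hypothetical illustrations.

```python
import numpy as np

def hops(tile_a, tile_b):
    """Manhattan distance between (row, col) tiles on a 2D mesh (XY routing assumed)."""
    return abs(tile_a[0] - tile_b[0]) + abs(tile_a[1] - tile_b[1])

def comm_cost(weights, in_tiles, out_tiles):
    """Total hops needed to deliver each input activation to every distinct tile using it.

    weights:   (n_out, n_in) array; zero entries are pruned connections.
    in_tiles:  tile (row, col) where each input activation is produced.
    out_tiles: tile (row, col) holding the weights of each output neuron.
    """
    total = 0
    for i in range(weights.shape[1]):
        # Distinct destination tiles that hold at least one nonzero weight for input i.
        dests = {out_tiles[o] for o in np.nonzero(weights[:, i])[0]}
        total += sum(hops(in_tiles[i], d) for d in dests)
    return total

def communication_aware_mask(weights, in_tiles, out_tiles, max_hops):
    """Hypothetical rule: prune connections whose endpoints are more than max_hops apart."""
    mask = np.array([[hops(in_tiles[i], out_tiles[o]) <= max_hops
                      for i in range(weights.shape[1])]
                     for o in range(weights.shape[0])])
    return weights * mask

# Toy layer: 4 inputs produced on tile (0, 0); outputs split between a near and a far tile.
w = np.ones((4, 4))
in_tiles = [(0, 0)] * 4
out_tiles = [(0, 1), (0, 1), (2, 2), (2, 2)]

print(comm_cost(w, in_tiles, out_tiles))       # 20 (each input: 1 hop + 4 hops)
sparse = communication_aware_mask(w, in_tiles, out_tiles, max_hops=2)
print(comm_cost(sparse, in_tiles, out_tiles))  # 4 (only the 1-hop tile remains)
```

This sketch models only the cost side. Dropping long-distance connections blindly would hurt accuracy, so a hardware-aware training procedure is still needed to decide which sparse connections to keep.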
Award ID(s):
2007284
PAR ID:
10468127
Publisher / Repository:
IEEE
Date Published:
Journal Name:
IEEE Transactions on Emerging Topics in Computing
ISSN:
2376-4562
Page Range / eLocation ID:
1 to 13
Subject(s) / Keyword(s):
Hardware-aware pruning, communication-aware pruning, mapping, sparse neural networks
Sponsoring Org:
National Science Foundation
More Like this
  1. We present a novel deep neural network (DNN) training scheme and resistive RAM (RRAM) in-memory computing (IMC) hardware evaluation towards achieving high accuracy against RRAM device/array variations and enhanced robustness against adversarial input attacks. We present improved IMC inference accuracy results evaluated on state-of-the-art DNNs including ResNet-18, AlexNet, and VGG with binary, 2-bit, and 4-bit activation/weight precision for the CIFAR-10 dataset. These DNNs are evaluated with measured noise data obtained from three different RRAM-based IMC prototype chips. Across these various DNNs and IMC chip measurements, we show that our proposed hardware noise-aware DNN training consistently improves DNN inference accuracy for actual IMC hardware, up to 8% accuracy improvement for the CIFAR-10 dataset. We also analyze the impact of our proposed noise injection scheme on the adversarial robustness of ResNet-18 DNNs with 1-bit, 2-bit, and 4-bit activation/weight precision. Our results show up to 6% improvement in the robustness to black-box adversarial input attacks.
  2. Neural architecture search (NAS) is a promising technique to design efficient and high-performance deep neural networks (DNNs). As the performance requirements of ML applications grow continuously, hardware accelerators have started to play a central role in DNN design. This trend makes NAS even more complicated and time-consuming for most real applications. This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform. As the main theoretical contribution, we first propose the NN-Degree, an analytical metric to quantify the topological characteristics of DNNs with skip connections (e.g., DenseNets, ResNets, Wide-ResNets, and MobileNets). The newly proposed NN-Degree allows us to do training-free NAS within one second and build an accuracy predictor by training on as few as 25 samples out of a vast search space with more than 63 billion configurations. Second, by performing inference on the target hardware, we fine-tune and validate our analytical models to estimate the latency, area, and energy consumption of various DNN architectures while executing standard ML datasets. Third, we construct a hierarchical algorithm based on simplicial homology global optimization (SHGO) to optimize the model-architecture co-design process, while considering the area, latency, and energy consumption of the target hardware. We demonstrate that, compared to the state-of-the-art NAS approaches, our proposed hierarchical SHGO-based algorithm enables a speedup of more than four orders of magnitude (specifically, the execution time of the proposed algorithm is about 0.1 seconds). Finally, our experimental evaluations show that FLASH is easily transferable to different hardware architectures, thus enabling us to do NAS on a Raspberry Pi-3B processor in less than 3 seconds.
  3. Deep neural networks (DNNs) have experienced unprecedented success in a variety of cognitive tasks, prompting a move to deploy DNNs in edge devices. DNNs consist mostly of multiply-and-accumulate (MAC) operations and are both data and compute intensive. In-memory computing (IMC) methodologies have shown significant energy efficiency and throughput benefits for DNN workloads by reducing data movement and eliminating memory reads. Weight pruning in DNNs can further improve the energy/throughput of DNN hardware through reduced storage and compute. Unlike their ASIC counterparts, recent IMC works [1]–[3], [6] have not explored such sparse compression techniques to obtain storage benefits and compute skipping. A recent work [4] attempted to exploit this by compressing weights using a binary map and a custom compression format. This is sub-optimal because the implementation requires a complex routing mechanism (butterfly routing) and additional compute to decode compressed weights, and it has limited flexibility in supporting different sparse encodings. Fig. 1 illustrates our motivations, the challenges of implementing weight compression in digital IMC designs, and the need for a new methodology to enable sparse compute directly on compressed weights. In this work, we present a novel sparsity-integrated IMC (SP-IMC) macro in 28nm CMOS which, for the first time, utilizes three popular sparse compression formats, i.e., coordinate representation (COO), run length encoding (RL), and N:m sparsity [7], all along the matrix column direction with tunable precisions. SP-IMC stores and directly processes the sparse compressed weights in the macro, achieving higher storage density, fewer re-write operations to the macro, and higher overall energy efficiency.
  4. The record-breaking performance of deep neural networks (DNNs) comes with heavy parameter budgets, which necessitate external dynamic random access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes DNN deployment on resource-constrained devices nontrivial, calling for minimizing the movement of weights and data in order to improve energy efficiency. Driven by this critical bottleneck, we present SmartDeal, a hardware-friendly algorithm framework that trades higher-cost memory storage/access for lower-cost computation in order to aggressively boost storage and energy efficiency for both DNN inference and training. The core technique of SmartDeal is a novel DNN weight matrix decomposition framework with respective structural constraints on each matrix factor, carefully crafted to unleash the hardware-aware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large structurally sparse coefficient matrix whose nonzero elements are readily quantized to powers of 2. The resulting sparse and readily quantized DNNs enjoy greatly reduced energy consumption in data movement as well as weight storage, while incurring minimal overhead to recover the original weights thanks to the required sparse bit-operations and cost-favorable computations. Beyond inference, we take another leap to embrace energy-efficient training by introducing several customized techniques to address the unique roadblocks arising in training while preserving the SmartDeal structures. We also design a dedicated hardware accelerator to fully utilize the new weight structure and improve real energy efficiency and latency. We conduct experiments on both vision and language tasks, with nine models, four datasets, and three settings (inference-only, adaptation, and fine-tuning).
     Our extensive results show that 1) when applied to inference, SmartDeal achieves up to 2.44x improvement in energy efficiency as evaluated using real hardware implementations, and 2) when applied to training, SmartDeal can lead to 10.56x and 4.48x reduction in the storage and the training energy cost, respectively, with usually negligible accuracy loss, compared to state-of-the-art training baselines. Our source code is available at: https://github.com/VITA-Group/SmartDeal.
  5. Deep Neural Networks (DNNs) have shown significant advantages in a wide variety of domains. However, DNNs are becoming exponentially more computationally intensive and energy hungry, while at the same time there is a vast demand for running sophisticated DNN-based services on resource-constrained embedded devices. In this paper, we target energy-efficient inference on embedded DNN accelerators. To that end, we propose an automated framework to compress DNNs in a hardware-aware manner by jointly employing pruning and quantization. We explore, for the first time, per-layer fine- and coarse-grained pruning in the same DNN architecture, in addition to low bit-width mixed-precision quantization for weights and activations. Reinforcement Learning (RL) is used to explore the associated design space and identify the pruning-quantization configuration that minimizes energy consumption while keeping the prediction accuracy loss at acceptable levels. Using our novel composite RL agent, we are able to extract energy-efficient solutions without requiring retraining and/or fine-tuning. Our extensive experimental evaluation over widely used DNNs and the CIFAR-10/100 and ImageNet datasets demonstrates that our framework achieves 39% average energy reduction for 1.7% average accuracy loss and significantly outperforms state-of-the-art approaches.
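Item 3 above names three sparse compression formats applied along the matrix column direction: coordinate representation (COO), run length encoding (RL), and N:m sparsity. The minimal Python sketch below shows generic, textbook versions of each applied to a single weight column. It is an illustration only, not SP-IMC's hardware encoding. It ignores the tunable index and value bit precisions the macro supports, and the function names are hypothetical.

```python
import numpy as np

def coo_encode(col):
    """COO: one (row index, value) pair per nonzero entry of the column."""
    idx = np.nonzero(col)[0]
    return [(int(i), float(col[i])) for i in idx]

def run_length_encode(col):
    """Run length: one (zeros since previous nonzero, value) pair per nonzero entry.

    Trailing zeros are not stored; they are implied by the known column length.
    """
    pairs, zeros = [], 0
    for v in col:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, float(v)))
            zeros = 0
    return pairs

def nm_prune(col, n, m):
    """N:m sparsity: keep the n largest-magnitude entries in every group of m."""
    assert len(col) % m == 0, "column length must be a multiple of m"
    out = np.zeros_like(col)
    for start in range(0, len(col), m):
        group = col[start:start + m]
        keep = np.argsort(np.abs(group))[-n:]  # positions of the n largest magnitudes
        out[start + keep] = group[keep]
    return out

col = np.array([0.0, 0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.0])
print(coo_encode(col))         # [(1, 0.5), (4, -1.2), (6, 0.3)]
print(run_length_encode(col))  # [(1, 0.5), (2, -1.2), (1, 0.3)]
print(nm_prune(col, 1, 4))     # 1:4 pruning keeps 0.5 and -1.2, drops 0.3
```

COO stores absolute indices, so each index needs enough bits to address the whole column. RL stores only gap lengths, which tend to be shorter. N:m fixes the number of nonzeros per group, which gives the hardware a predictable amount of work per group.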