skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on June 30, 2026

Title: LiteNoC: Developing Low-Cost Network-on-Chips for Deep Neural Networks
With the rapid advance in Deep Neural Networks (DNNs), GPU’s role as a hardware accelerator becomes increasingly important. Due to the GPU’s significant power consumption, developing high- performance and power-efficient GPU systems is a critical challenge. DNN applications need to move a large amount of data between memory and the processing cores, which consumes a great amount of NoC power. However, prior proposed lossless data compressions cannot achieve optimal performance and energy efficiency because they did not take advantage of the error resilience of DNNs. In this work, we propose an NoC architecture that can reduce power consumption without compromising performance and accu- racy. Our technique takes advantage of the error resilience of DNNs as well as the data locality in the floating-point data representation of DNNs. Each data packet is reorganized by grouping data with similar bits such as in the exponents, and redundant bits are sent only once. We further compress the mantissa fields by appropri- ately selecting "proxy" values for data sharing the same exponent. Our evaluation results show that the proposed technique can ef- fectively reduce the amount of data transmitted and lead to better performance and power trade-offs while preserving accuracy.  more » « less
Award ID(s):
2051062 2046186 2008911
PAR ID:
10594870
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
ACM
Date Published:
Format(s):
Medium: X
Location:
New Orleans, LA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Network-on-Chips (NoCs) have emerged as the standard on-chip communication fabrics for multi/many core systems and system on chips. However, as the number of cores on chip increases, so does power consumption. Recent studies have shown that NoC power consumption can reach up to 40% of the overall chip power. Considerable research efforts have been deployed to significantly reduce NoC power consumption. In this paper, we build on approximate computing techniques and propose an approximate communication methodology called DEC-NoC for reducing NoC power consumption. The proposed DEC-NoC leverages applications' error tolerance and dynamically reduces the amount of error checking and correction in packet transmission, which results in a significant reduction in the number of retransmitted packets. The reduction in packet retransmission results in reduced power consumption. Our cycle accurate simulation using PARSEC benchmark suites shows that DEC-NoC achieves up to 56% latency reduction and up to 58% dynamic power reduction compared to NoC architectures with conventional error control techniques. 
    more » « less
  2. We propose CURE, a deep reinforcement learning (DRL)-based NoC design framework that simultaneously reduces network latency, improves energy-efficiency, and tolerates transient errors and permanent faults. CURE has several architectural innovations and a DRL-based hardware controller to manage design complexity and optimize trade-offs. First, in CURE, we propose reversible multi-function adaptive channels (RMCs) to reduce NoC power consumption and network latency. Second, we implement a new fault-secure adaptive error correction hardware in each router to enhance reliability for both transient errors and permanent faults. Third, we propose a router power-gating and bypass design that powers off NoC components to reduce power and extend chip lifespan. Further, for the complex dynamic interactions of these techniques, we propose using DRL to train a proactive control policy to provide improved fault-tolerance, reduced power consumption, and improved performance. Simulation using the PARSEC benchmark shows that CURE reduces end-to-end packet latency by 39%, improves energy efficiency by 92%, and lowers static and dynamic power consumption by 24% and 38%, respectively, over conventional solutions. Using mean-time-to-failure, we show that CURE is 7.7x more reliable than the conventional NoC design. 
    more » « less
  3. As technology scales, Network-on-Chips (NoCs), currently being used for on-chip communication in manycore architectures, face several problems including high network latency, excessive power consumption, and low reliability. Simultaneously addressing these problems is proving to be difficult due to the explosion of the design space and the complexity of handling many trade-offs. In this paper, we propose IntelliNoC, an intelligent NoC design framework which introduces architectural innovations and uses reinforcement learning to manage the design complexity and simultaneously optimize performance, energy-efficiency, and reliability in a holistic manner. IntelliNoC integrates three NoC architectural techniques: (1) multifunction adaptive channels (MFACs) to improve energy-efficiency; (2) adaptive error detection/correction and re-transmission control to enhance reliability; and (3) a stress-relaxing bypass feature which dynamically powers off NoC components to prevent overheating and fatigue. To handle the complex dynamic interactions induced by these techniques, we train a dynamic control policy using Q-learning, with the goal of providing improved fault-tolerance and performance while reducing power consumption and area overhead. Simulation using PARSEC benchmarks shows that our proposed IntelliNoC design improves energy-efficiency by 67% and mean-time-to-failure (MTTF) by 77%, and decreases end-to-end packet latency by 32% and area requirements by 25% over baseline NoC architecture. 
    more » « less
  4. Manycore GPU architectures have become the mainstay for accelerating graph computations. One of the primary bottlenecks to performance of graph computations on manycore architectures is the data movement. Since most of the accesses in graph processing are due to vertex neighborhood lookups, locality in graph data structures plays a key role in dictating the degree of data movement. Vertex reordering is a widely used technique to improve data locality within graph data structures. However, these reordering schemes alone are not sufficient as they need to be complemented with efficient task allocation on manycore GPU architectures to reduce latency due to local cache misses. Consequently, in this article, we introduce a software/hardware co-design framework for accelerating graph computations. Our approach couples an architecture-aware vertex reordering with a priority-based task allocation technique. As the task allocation aims to reduce on-chip latency and associated energy, the choice of Network-on-Chip (NoC) as the communication backbone in the manycore platform is an important parameter. By leveraging emerging three-dimensional (3D) integration technology, we propose design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follow a power-law distribution. The proposed 3D SWNoC-enabled software/hardware co-design framework achieves 11.1% to 22.9% performance improvement and 16.4% to 32.6% less energy consumption depending on the dataset and the graph application, when compared to the default order of dataset running on a conventional planar mesh architecture. 
    more » « less
  5. Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Training wide and deep neural networks require large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger models or to accelerate training. Our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis on the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, respectively, with little or no accuracy loss. 
    more » « less