Title: Adapt-NoC: A Flexible Network-on-Chip Design for Heterogeneous Manycore Architectures
The increased computational capability of heterogeneous manycore architectures facilitates the concurrent execution of many applications. This requires, among other things, a flexible, high-performance, and energy-efficient communication fabric capable of handling the variety of traffic patterns generated when multiple applications run at the same time. Such stringent requirements pose a major challenge for current Network-on-Chip (NoC) designs. In this paper, we propose Adapt-NoC, a flexible NoC architecture, along with a reinforcement learning (RL)-based control policy, that can provide efficient communication support for concurrent application execution. Adapt-NoC can dynamically allocate several disjoint regions of the NoC, called subNoCs, with different sizes and locations for the concurrently running applications. Each dynamically allocated subNoC can adopt a given topology, such as a mesh, cmesh, torus, or tree, thus tailoring the topology to satisfy the application's needs in terms of performance and power consumption. Moreover, we explore the use of RL to design an efficient control policy that optimizes the subNoC topology selection for a given application. As such, Adapt-NoC not only provides several topology choices for concurrently running applications, but also optimizes the selection of the most suitable topology for a given application with the aim of improving performance and energy efficiency. We evaluate Adapt-NoC using both GPU and CPU benchmark suites. Simulation results show that the proposed Adapt-NoC can achieve up to 34% latency reduction, 10% overall execution time reduction, and 53% NoC energy-efficiency improvement when compared to prior work.
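The abstract mentions an RL-based control policy that selects a subNoC topology per application but does not spell out its formulation here, so the following is only a minimal, hypothetical sketch of such a policy, framed as a contextual bandit over discretized traffic classes. The traffic classes, reward table, and learning constants are illustrative assumptions, not the authors' actual design.

```python
import random
from collections import defaultdict

TOPOLOGIES = ["mesh", "cmesh", "torus", "tree"]
TRAFFIC_CLASSES = ["all-to-all", "one-to-many", "neighbor", "uniform-random"]

def observe_reward(traffic_class, topology):
    """Stand-in for measured feedback (e.g., negative weighted latency/energy)."""
    preferred = {"all-to-all": "torus", "one-to-many": "tree",
                 "neighbor": "mesh", "uniform-random": "cmesh"}  # assumed pairing
    base = 1.0 if preferred[traffic_class] == topology else 0.2
    return base + random.uniform(-0.05, 0.05)

q = defaultdict(float)      # value estimate per (traffic class, topology) pair
alpha, epsilon = 0.1, 0.1   # learning rate, exploration probability

def select_topology(traffic_class):
    if random.random() < epsilon:
        return random.choice(TOPOLOGIES)                              # explore
    return max(TOPOLOGIES, key=lambda t: q[(traffic_class, t)])      # exploit

for _ in range(5000):                                # training episodes
    s = random.choice(TRAFFIC_CLASSES)               # an application's dominant traffic pattern
    a = select_topology(s)
    r = observe_reward(s, a)
    q[(s, a)] += alpha * (r - q[(s, a)])             # bandit-style value update

for s in TRAFFIC_CLASSES:
    print(s, "->", max(TOPOLOGIES, key=lambda t: q[(s, t)]))
```

In a full system, the reward would come from simulated or measured latency and energy of the application running on the chosen subNoC rather than the lookup table used here.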
Award ID(s): 1702980
PAR ID: 10229855
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: IEEE International Symposium on High-Performance Computer Architecture (HPCA)
Page Range / eLocation ID: 723 to 735
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
1. Heterogeneous manycore architectures are deployed to simultaneously run multiple, diverse applications. This requires various computing capabilities (CPUs, GPUs, and accelerators) and an efficient network-on-chip (NoC) architecture to concurrently handle diverse application communication behavior. However, supporting the concurrent communication requirements of diverse applications is challenging due to dynamic application mapping, the complexity of handling distinct communication patterns, and limited on-chip resources. In this paper, we propose Adapt-NoC, a versatile and flexible NoC architecture for chiplet-based manycore architectures, consisting of adaptable routers and links. Adapt-NoC can dynamically allocate disjoint regions of the NoC, called subNoCs, for concurrently running applications, each of which can be optimized for different communication behavior. The adaptable routers and links can provide various subNoC topologies, satisfying the different latency and bandwidth requirements of various traffic patterns (e.g., all-to-all, one-to-many). Full-system simulation shows that Adapt-NoC can achieve 31% latency reduction, 24% energy saving, and 10% execution time reduction on average when compared to prior designs. (A hypothetical sketch of subNoC region allocation appears after this list.)
2. The growth of the Internet of Things (IoT) has led to complex systems-on-chip (SoCs) being used in the edge devices of IoT applications. The increased complexity demands that designers consider several critical factors, such as dynamic requirement changes, long application life, mass production, and tight time-to-market deadlines. These requirements lead to more complex security concerns. SoC manufacturers outsource some of the intellectual property (IP) cores integrated on the SoC to untrusted third-party vendors. The untrusted IP cores can contain malicious implants, which can launch attacks using the resources provided by the on-chip interconnection network, commonly known as the network-on-chip (NoC). Existing efforts on securing the NoC have considered lightweight encryption, authentication, and other attack-detection mechanisms, such as for denial-of-service and buffer-overflow attacks. Unfortunately, these approaches focus on designing statically optimized security solutions. As a result, they are not suitable for many IoT systems with long application life and dynamic requirement changes. There is a critical need to design reconfigurable security architectures that can be dynamically tuned based on changing requirements. In this article, we propose a tier-based reconfigurable security architecture that can adapt to different use-case scenarios. We explore how to design an efficient reconfigurable architecture that can support three popular NoC security mechanisms (encryption, authentication, and denial-of-service attack detection and localization) and implement suitable dynamic reconfiguration techniques. We evaluate our proposed framework by running standard benchmarks with different tiers of security enabled and provide a comprehensive analysis of how different levels of security affect application performance, energy efficiency, and area overhead. (A hypothetical sketch of tier selection appears after this list.)
3. Autonomous mobile robots (AMRs) are widely used in industry to execute various on-board computer-vision applications, including autonomous guidance, security patrol, object detection, and face recognition. Most of the applications executed by an AMR involve analyzing camera images with trained machine-learning models. Many research studies on machine learning focus either on performance without considering energy efficiency or on techniques such as pruning and compression to make models more energy-efficient. However, most previous work does not study the root causes of energy inefficiency when executing these applications on AMRs. The computing stack on an AMR accounts for 33% of the total energy consumption and can thus strongly affect the robot's battery life. Because recharging an AMR may disrupt application execution, it is important to use the available energy efficiently to maximize battery life. In this paper, we first analyze the breakdown of power dissipation for the execution of computer-vision applications on AMRs and identify three main root causes of energy inefficiency: uncoordinated access to sensor data, performance-oriented model-inference execution, and uncoordinated execution of concurrent jobs. To address these three inefficiencies, we propose E2M, an energy-efficient middleware software stack for autonomous mobile robots. First, E2M regulates the access of different processes to sensor data, e.g., camera frames, so that the amount of data actually captured by concurrently executing jobs is minimized. Second, based on a predefined per-process performance metric (e.g., safety, accuracy) and desired target, E2M adjusts each process's execution period to find the best energy-performance trade-off. Third, E2M coordinates the execution of the concurrent processes to maximize the total contiguous sleep time of the computing hardware for maximum energy savings. We have implemented a prototype of E2M on a real-world AMR. Our experimental results show that, compared to several baselines, E2M leads to 24% energy savings for the computing platform, which translates into an extra 11.5% of battery time and 14 extra minutes of robot runtime, with a performance degradation lower than 7.9% for safety and 1.84% for accuracy. (A hypothetical sketch of the period-adjustment step appears after this list.)
4. Modern accelerators like GPUs increasingly execute independent operations concurrently to improve the device's compute utilization. However, effectively harnessing this concurrency on GPUs for important primitives such as general matrix multiplications (GEMMs) remains challenging. Although modern GPUs have significant hardware and software GEMM support, their kernel implementations and optimizations typically assume each kernel executes in isolation and can utilize all GPU resources. This approach is highly efficient when kernels do execute in isolation, but causes significant resource contention and slowdowns when kernels execute concurrently. Moreover, current approaches often only statically expose and control parallelism within an application, without considering runtime information such as varying input sizes and concurrent applications, often exacerbating contention. These issues limit the performance benefits of concurrently executing independent operations. Accordingly, we propose GOLDYLOC, which considers the global resources across all concurrent operations to identify performant GEMM kernels, which we call globally optimized kernels (GO-Kernels). GOLDYLOC also introduces a lightweight dynamic logic that considers the dynamic execution environment, namely the available parallelism and input sizes, to execute performant combinations of concurrent GEMMs on the GPU. Overall, GOLDYLOC improves the performance of concurrent GEMMs on a real GPU by up to 2× (18% geomean per workload) versus the default concurrency approach and provides up to 2.5× speedup (43% geomean per workload) over sequential execution. (A hypothetical sketch of concurrency-aware kernel selection appears after this list.)
5. Contemporary GPUs allow multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel execution (CKE) improves both resource utilization and computational throughput. Most prior work focuses on partitioning GPU resources at the cooperative thread array (CTA) level or the warp-scheduler level to improve CKE. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. The reason is that bandwidth over-subscription by bandwidth-intensive kernels greatly aggravates memory access latency, which is highly detrimental to latency-sensitive kernels. Even among bandwidth-intensive kernels, more intensive kernels may unfairly consume much higher bandwidth than less intensive ones. In this article, we first make the case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. We then propose a coordinated approach to CTA combination and bandwidth partitioning. Our approach dynamically classifies co-running kernels as latency-sensitive or bandwidth-intensive. Because both DRAM bandwidth and L2-to-L1 network-on-chip (NoC) bandwidth can be the critical resource, our approach partitions both bandwidth resources in a coordinated manner along with selecting proper CTA combinations. The key objective is to allocate more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth resources to NoC-/DRAM-intensive kernels. We achieve this using a variation of dominant resource fairness (DRF); a minimal, generic DRF sketch appears after this list. Compared with two state-of-the-art CKE optimization schemes, SMK [52] and WS [55], our approach improves the average harmonic speedup by 78% and 39%, respectively. Even compared to the best possible CTA combinations, which are obtained from an exhaustive search among all possible CTA combinations, our approach improves the harmonic speedup by up to 51% and by 11% on average.
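Item 1 (like the main abstract) allocates disjoint subNoC regions to concurrently running applications. The sketch below shows a first-fit rectangular allocator on an 8x8 router grid as one simple way such disjoint regions could be carved out; the grid size and first-fit policy are assumptions for illustration, not Adapt-NoC's actual allocator.

```python
# Minimal sketch: first-fit rectangular subNoC allocation on an 8x8 router grid.
GRID_W, GRID_H = 8, 8
occupied = [[False] * GRID_W for _ in range(GRID_H)]

def region_free(x, y, w, h):
    """True if the w x h region with top-left (x, y) is inside the grid and unused."""
    if x + w > GRID_W or y + h > GRID_H:
        return False
    return all(not occupied[j][i] for j in range(y, y + h) for i in range(x, x + w))

def allocate_subnoc(w, h):
    """Return the top-left corner of a free w x h router region, or None if none fits."""
    for y in range(GRID_H):
        for x in range(GRID_W):
            if region_free(x, y, w, h):
                for j in range(y, y + h):
                    for i in range(x, x + w):
                        occupied[j][i] = True
                return (x, y)
    return None

print(allocate_subnoc(4, 4))   # first application gets (0, 0)
print(allocate_subnoc(4, 4))   # second application gets a disjoint region, (4, 0)
```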
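Item 2 proposes tier-based reconfigurable NoC security. The sketch below illustrates one plausible way a controller might map a perceived threat level and performance headroom to a tier of enabled mechanisms; the tier table, thresholds, and function names are hypothetical, not the article's design.

```python
# Hypothetical mapping from security tier to enabled NoC security mechanisms.
TIERS = {
    0: set(),
    1: {"encryption"},
    2: {"encryption", "authentication"},
    3: {"encryption", "authentication", "dos_detection"},
}

def select_tier(threat_level, perf_headroom):
    """Escalate with the threat level, capped by the available performance headroom."""
    desired = min(3, max(0, threat_level))
    cap = 3 if perf_headroom > 0.30 else 2 if perf_headroom > 0.15 else 1
    return min(desired, cap)

tier = select_tier(threat_level=3, perf_headroom=0.20)
print(tier, sorted(TIERS[tier]))   # -> 2 ['authentication', 'encryption']
```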
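Item 3's second mechanism stretches a process's execution period until its performance metric would drop below a target. The following is a minimal sketch of that idea with a toy accuracy model; the function names, step size, and metric are assumptions, not E2M's implementation.

```python
def tune_period(period_s, metric_at, target, step=0.05, max_period_s=1.0):
    """Lengthen the execution period (saving energy) while the metric stays above target."""
    while period_s < max_period_s:
        candidate = period_s * (1 + step)
        if metric_at(candidate) < target:   # lengthening further would violate the target
            break
        period_s = candidate
    return period_s

# Toy metric: accuracy degrades as frames are processed less frequently.
accuracy = lambda period: 0.99 - 0.5 * period
print(round(tune_period(0.10, accuracy, target=0.90), 3))   # longest period meeting the target
```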
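Item 4's GOLDYLOC picks GEMM kernel configurations with awareness of how many GEMMs run concurrently. The lookup-table approach, size buckets, and tile numbers below are invented placeholders meant only to illustrate a concurrency-indexed kernel choice, not the paper's tuning methodology.

```python
# Hypothetical tuning table: preferred configuration per (size bucket, concurrency level).
GO_KERNELS = {
    ("small", 1): {"tile": (128, 128), "ctas_per_gemm": 64},
    ("small", 4): {"tile": (64, 64),   "ctas_per_gemm": 16},
    ("large", 1): {"tile": (256, 128), "ctas_per_gemm": 256},
    ("large", 4): {"tile": (128, 64),  "ctas_per_gemm": 64},
}

def pick_kernel(m, n, k, concurrent_gemms):
    """Choose a kernel configuration from the problem size and runtime concurrency."""
    size = "large" if m * n * k > (1 << 27) else "small"
    level = 4 if concurrent_gemms >= 4 else 1
    return GO_KERNELS[(size, level)]

print(pick_kernel(4096, 4096, 4096, concurrent_gemms=4))
```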
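Item 5 partitions CTA, NoC-bandwidth, and DRAM-bandwidth resources using a variation of dominant resource fairness (DRF). The sketch below shows plain DRF (repeatedly granting one demand unit to the task with the lowest dominant share) with made-up capacities and demand vectors; the paper's actual variation and resource accounting are not reproduced here.

```python
# Minimal dominant-resource-fairness (DRF) sketch over three resources.
CAPACITY = {"cta": 64, "noc_bw": 100.0, "dram_bw": 100.0}

# Per-task demand for one scheduling unit (hypothetical numbers).
DEMANDS = {
    "latency_sensitive":  {"cta": 4, "noc_bw": 2.0, "dram_bw": 1.0},
    "bandwidth_intensive": {"cta": 1, "noc_bw": 8.0, "dram_bw": 10.0},
}

allocated = {t: {r: 0.0 for r in CAPACITY} for t in DEMANDS}

def dominant_share(task):
    return max(allocated[task][r] / CAPACITY[r] for r in CAPACITY)

def fits(demand):
    used = {r: sum(allocated[t][r] for t in DEMANDS) for r in CAPACITY}
    return all(used[r] + demand[r] <= CAPACITY[r] for r in CAPACITY)

while True:
    progressed = False
    for task in sorted(DEMANDS, key=dominant_share):   # lowest dominant share goes first
        if fits(DEMANDS[task]):
            for r, d in DEMANDS[task].items():
                allocated[task][r] += d
            progressed = True
            break
    if not progressed:                                  # no task's demand fits anymore
        break

print(allocated)
```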