This content will become publicly available on June 12, 2025

Title: Designing Reconfigurable Interconnection Network of Heterogeneous Chiplets Using Kalman Filter
Abstract: Heterogeneous chiplets have been proposed for accelerating high-performance computing tasks. Integrated inside one package, CPU and GPU chiplets can share a common interconnection network, which can be implemented through the interposer. However, CPU and GPU applications generally have very different traffic patterns. Without effective management of the network resources, some chiplets can suffer significant performance degradation because the network bandwidth is taken away by communication-intensive applications. Therefore, techniques need to be developed to effectively manage the shared network resources. In a chiplet-based system, resource management needs not only to react in real time but also to be cost-efficient. In this work, we propose a reconfigurable network architecture that leverages a Kalman filter to make accurate predictions of the network resources needed by the applications and then adaptively changes the resource allocation. Using our design, the network bandwidth can be fairly allocated to avoid starvation or performance degradation. Our evaluation results show that the proposed reconfigurable interconnection network can dynamically react to changes in the traffic demand of the chiplets and improve system performance with low cost and design complexity.
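
The abstract describes the prediction and reallocation mechanism only at a high level. As a rough illustration of the idea, the following is a minimal sketch assuming a scalar Kalman filter per chiplet that tracks a measured injection rate, with the predictions driving a proportional re-division of the shared bandwidth; the ScalarKalman class, the noise parameters q and r, the 5% bandwidth floor, and the allocation rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one scalar Kalman filter per chiplet
# predicts its near-future traffic demand; bandwidth is then split in proportion
# to the predictions. Model: demand follows a random walk (x_k = x_{k-1} + w_k).

class ScalarKalman:
    def __init__(self, q=0.01, r=0.1):
        self.x = 0.0   # estimated demand (e.g., flits/cycle)
        self.p = 1.0   # estimate variance
        self.q = q     # process noise (how fast demand drifts)
        self.r = r     # measurement noise (sampling jitter)

    def update(self, z):
        # Predict step: random-walk model, variance grows by q.
        self.p += self.q
        # Update step: blend the prediction with the new measurement z.
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x                    # predicted demand

def allocate_bandwidth(filters, measurements, total_bw):
    """Re-divide total_bw in proportion to predicted demand, with a small
    floor share so no chiplet is starved."""
    preds = [f.update(z) for f, z in zip(filters, measurements)]
    floor = 0.05 * total_bw / len(filters)
    weights = [max(p, 1e-6) for p in preds]
    s = sum(weights)
    return [floor + (total_bw - floor * len(filters)) * w / s for w in weights]

# Example: 2 CPU chiplets with light traffic, 2 GPU chiplets with heavy traffic.
filters = [ScalarKalman() for _ in range(4)]
for epoch_measurements in [[0.1, 0.1, 0.8, 0.9], [0.1, 0.2, 0.9, 0.9]]:
    print(allocate_bandwidth(filters, epoch_measurements, total_bw=1.0))
```

In an actual reconfigurable interposer network, the computed shares would drive the reconfiguration logic at each monitoring epoch; the sketch simply prints them.
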
Award ID(s):
2046186 2008911 2051062
NSF-PAR ID:
10511372
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the 34th ACM Great Lakes Symposium on VLSI (GLSVLSI)
ISBN:
979-8-4007-0605-9
Subject(s) / Keyword(s):
Network-on-Chip, NoC, Chiplets, reconfigurable NoC
Format(s):
Medium: X
Location:
Tampa, FL, US
Sponsoring Org:
National Science Foundation
More Like this
  1. Edge Cloud (EC) is poised to support massive machine-type communication (mMTC) for 5G and IoT by providing compute and network resources at the edge. Yet the EC, being regional and smaller in scale, faces bandwidth and computational-throughput challenges, so resource management techniques are necessary to achieve efficient resource allocation. Software-Defined Networking (SDN)-enabled EC architectures are emerging as a potential solution, enabling dynamic bandwidth allocation and task scheduling for latency-sensitive and diverse mobile applications in the EC environment. This study proposes a novel Heuristic Reinforcement Learning (HRL) based flow-level dynamic bandwidth allocation framework and validates it through an end-to-end implementation using the OpenFlow meter feature. OpenFlow meters provide granular control and allow demand-based flow management to meet the diverse QoS requirements of IoT traffic. The proposed framework is evaluated by emulating an EC scenario based on the real NSF COSMOS testbed topology at The City College of New York. A heuristic reinforcement learning technique with linear annealing and a pruning principle are proposed and compared with a baseline approach (a toy sketch of the linear-annealing allocation idea appears after this list). The proposed strategy performs consistently in both Mininet and hardware OpenFlow switch environments. The performance evaluation considers key metrics associated with real-time applications: throughput, end-to-end delay, packet loss rate, and overall system cost for bandwidth allocation. Furthermore, the proposed linear annealing method achieves a faster convergence rate and better reward in terms of system cost, and the proposed pruning principle markedly reduces control traffic in the network.
  2. Resource disaggregation is a new architecture for data centers in which resources like memory and storage are decoupled from the CPU, managed independently, and connected through a high-speed network. Recent work has shown that although disaggregated data centers (DDCs) provide operational benefits, applications running on DDCs experience degraded performance due to extra network latency between the CPU and their working sets in main memory. DBMSs are an interesting case study for DDCs for two main reasons: (1) DBMSs normally process data-intensive workloads and require data movement between different resource components; and (2) disaggregation drastically changes the assumption that DBMSs can rely on their own internal resource management. We take the first step to thoroughly evaluate the query execution performance of production DBMSs in disaggregated data centers. We evaluate two popular open-source DBMSs (MonetDB and PostgreSQL) and test their performance with the TPC-H benchmark in a recently released operating system for resource disaggregation. We evaluate these DBMSs with various configurations and compare their performance with that of single-machine Linux with the same hardware resources. Our results confirm that significant performance degradation does occur, but, perhaps surprisingly, we also find settings in which the degradation is minor or where DDCs actually improve performance.
  3. Multi-Agent Reinforcement Learning (MARL) is a key technology in artificial intelligence applications such as robotics, surveillance, and energy systems. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a state-of-the-art MARL algorithm that has been widely adopted and is considered a popular baseline for novel MARL algorithms. However, existing implementations of MADDPG on CPU and CPU-GPU platforms do not exploit fine-grained parallelism between cooperative agents and handle inter-agent communication sequentially, leading to sub-optimal throughput in MADDPG training. In this work, we develop the first high-throughput MADDPG accelerator on a CPU-FPGA heterogeneous platform. Specifically, we develop dedicated hardware modules that enable parallel training of each agent's internal Deep Neural Networks (DNNs) and support low-latency inter-agent communication using an on-chip agent interconnection network. Our experimental results show that agent neural network training speed improves by factors of 3.6×–24.3× and 1.5×–29.5× compared with state-of-the-art CPU and CPU-GPU implementations, respectively. Our design achieves up to a 1.99× and 1.93× improvement in overall system throughput compared with CPU and CPU-GPU implementations, respectively.
  4. Chiplets are transforming computer system designs, allowing system designers to combine heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic chips into smaller, connected chiplets helps performance continue scaling, avoids die size limitations, improves yield, and reduces design and integration costs. However, chiplet-based designs introduce an additional level of hierarchy, which causes indirection and non-uniformity. This clashes with typical heterogeneous systems: unlike CPU-based multi-chiplet systems, heterogeneous systems do not have significant OS support or complex coherence protocols to mitigate the impact of this indirection. Thus, exploiting locality across application phases is harder in multi-chiplet heterogeneous systems. We propose CPElide, which utilizes information already available in heterogeneous systems' embedded microprocessor (the command processor) to track inter-chiplet data dependencies and perform implicit synchronization only when necessary, instead of conservatively as the state-of-the-art HMG does (a toy sketch of this dependency tracking appears after this list). Across 24 workloads, CPElide improves average performance (by 13% and 19%), energy (by 14% and 11%), and network traffic (by 14% and 17%) over current approaches and HMG, respectively.
  5. Network emulation allows unmodified code execution on lightweight containers, enabling accurate and scalable networked application testing. However, such testbeds cannot guarantee fidelity under high workloads, especially when many processes concurrently request resources (e.g., CPU, disk I/O, GPU, and network bandwidth) that exceed what the underlying physical machine can offer. A virtual time system enables the emulated hosts to maintain their own notion of virtual time; a container can stop advancing its time when it is not running (e.g., in an idle or suspended state). Existing virtual time systems focus on precise time management for CPU-intensive applications but are not designed to handle other operations, such as disk I/O, network I/O, and GPU computation. In this paper, we develop a lightweight virtual time system that integrates precise I/O time into container-based network emulation. We model and analyze the temporal error during I/O operations and develop a barrier-based time compensation mechanism in the Linux kernel (a simplified sketch of this idea appears after this list). We also design and implement a Dynamic Load Monitor (DLM) to mitigate the temporal error during I/O resource contention. The resulting system, VT-IO, enables accurate virtual time advancement with precise I/O time measurement and compensation. The experimental results demonstrate a significant improvement in temporal error with the introduction of DLM: the temporal error is reduced from 7.889 seconds to 0.074 seconds when the DLM is used in the virtual time system. Remarkably, this improvement is achieved with an overall overhead of only 1.36% of the total execution time.
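
For the flow-level bandwidth allocation described in item 1, the following is a toy sketch of one ingredient only: epsilon-greedy action selection with linear annealing, not the authors' HRL framework. The candidate meter rates, the AnnealedBandwidthAgent class, the bandit-style reward update, and the apply_meter stub are assumptions for illustration; installing a real OpenFlow meter would go through a controller-specific API that is omitted here.

```python
import random
from collections import defaultdict

# Illustrative sketch only: a tabular epsilon-greedy learner with linear
# annealing that picks one of a few meter rates (Mbps) per traffic class.
RATES = [1, 5, 10, 50]   # hypothetical candidate meter rates

class AnnealedBandwidthAgent:
    def __init__(self, eps_start=1.0, eps_end=0.05, anneal_steps=500, alpha=0.1):
        self.q = defaultdict(float)          # Q[(state, rate)]
        self.eps_start, self.eps_end = eps_start, eps_end
        self.anneal_steps, self.alpha = anneal_steps, alpha
        self.t = 0

    def epsilon(self):
        # Linear annealing: decay exploration from eps_start to eps_end.
        frac = min(1.0, self.t / self.anneal_steps)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def choose_rate(self, state):
        self.t += 1
        if random.random() < self.epsilon():
            return random.choice(RATES)          # explore
        return max(RATES, key=lambda r: self.q[(state, r)])  # exploit

    def learn(self, state, rate, reward):
        # One-step (bandit-style) update toward the observed reward.
        self.q[(state, rate)] += self.alpha * (reward - self.q[(state, rate)])

def apply_meter(flow_id, rate_mbps):
    # Stub: a real deployment would install/modify an OpenFlow meter via the
    # SDN controller's API (controller-specific, omitted here).
    print(f"flow {flow_id}: meter set to {rate_mbps} Mbps")

# Toy usage: the reward favors meeting a 10 Mbps demand without over-allocating.
agent = AnnealedBandwidthAgent()
for step in range(1000):
    rate = agent.choose_rate(state="video_flow")
    agent.learn("video_flow", rate, reward=-abs(rate - 10))
apply_meter("video_flow", agent.choose_rate("video_flow"))
```
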
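For the dependency tracking described in item 4, the sketch below shows, purely as an assumed simplification of what a command processor could do, a table that records which chiplet last wrote each coarse data region and performs implicit synchronization (flush/invalidate) only when a launching kernel reads a region last written by a different chiplet. The DependencyTracker class, the region granularity, and the print-based "sync" are illustrative, not CPElide's actual mechanism.

```python
# Hedged sketch (not CPElide itself): per data region, record which chiplet
# last wrote it; synchronize between kernels only when a launching kernel
# reads a region last written by a different chiplet.

class DependencyTracker:
    def __init__(self):
        self.last_writer = {}   # region id -> chiplet id

    def needs_sync(self, chiplet, reads):
        # Sync is needed only if some read region was last written elsewhere.
        return any(self.last_writer.get(r) not in (None, chiplet) for r in reads)

    def launch(self, chiplet, reads, writes):
        if self.needs_sync(chiplet, reads):
            print(f"chiplet {chiplet}: flush/invalidate before launch")
        else:
            print(f"chiplet {chiplet}: implicit synchronization elided")
        for r in writes:
            self.last_writer[r] = chiplet

# Example: kernels on the same chiplet sharing data need no inter-chiplet
# sync; a consumer on another chiplet does.
t = DependencyTracker()
t.launch(chiplet=0, reads=[], writes=["A"])
t.launch(chiplet=0, reads=["A"], writes=["B"])   # same chiplet: elided
t.launch(chiplet=1, reads=["B"], writes=["C"])   # cross-chiplet: sync required
```
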
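For the virtual time advancement in item 5, the following toy sketch captures only the flavor of the idea under assumed semantics: a per-container virtual clock advances by measured compute time, records time spent blocked on I/O separately, and credits it back at a barrier. The VirtualClock class, the io_debt bookkeeping, and the user-space timing are simplifications; the paper's mechanism lives in the Linux kernel and uses its own barrier-based compensation and the DLM.

```python
import time

# Toy sketch only (not the paper's kernel mechanism): a per-container virtual
# clock that advances by measured compute time and compensates separately for
# time spent blocked on I/O, so I/O-heavy containers do not fall behind.

class VirtualClock:
    def __init__(self):
        self.vt = 0.0        # virtual time (seconds)
        self.io_debt = 0.0   # measured I/O wait not yet credited

    def run_compute(self, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.vt += time.perf_counter() - start       # advance by compute time
        return result

    def run_io(self, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.io_debt += time.perf_counter() - start  # record I/O wait
        return result

    def barrier(self):
        # At a synchronization point, credit the accumulated I/O time so the
        # virtual clock reflects both compute and I/O progress.
        self.vt += self.io_debt
        self.io_debt = 0.0
        return self.vt

clock = VirtualClock()
clock.run_compute(sum, range(100_000))
clock.run_io(time.sleep, 0.05)   # stands in for a disk/network operation
print(f"virtual time after barrier: {clock.barrier():.3f} s")
```
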