Edge Cloud (EC) is poised to support massive machine-type communication (mMTC) for 5G and IoT by providing compute and network resources at the network edge. Yet, because the EC is regionally local and smaller in scale, it faces bandwidth and computational throughput constraints. Resource management techniques are therefore needed to allocate these resources efficiently. A Software-Defined Networking (SDN) enabled EC architecture is emerging as a potential solution, enabling dynamic bandwidth allocation and task scheduling for latency-sensitive and diverse mobile applications in the EC environment. This study proposes a novel Heuristic Reinforcement Learning (HRL) based flow-level dynamic bandwidth allocation framework and validates it through an end-to-end implementation using the OpenFlow meter feature. OpenFlow meters provide granular control and allow demand-based flow management to meet the diverse QoS requirements of IoT traffic. The proposed framework is evaluated by emulating an EC scenario based on the real NSF COSMOS testbed topology at The City College of New York. A heuristic reinforcement learning approach with a linear-annealing technique and a pruning principle is proposed and compared against a baseline approach. The proposed strategy performs consistently in environments based on both Mininet and hardware OpenFlow switches. The performance evaluation considers key metrics associated with real-time applications: throughput, end-to-end delay, packet loss rate, and overall system cost of bandwidth allocation. Furthermore, the proposed linear-annealing method achieves a faster convergence rate and a better reward in terms of system cost, and the proposed pruning principle markedly reduces control traffic in the network.
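For illustration, the sketch below shows one way an epsilon-greedy agent with linear annealing could pick per-flow meter rates against a simple cost signal. The rate levels, demand model, cost weights, and learning rate are illustrative assumptions rather than the paper's implementation; in the actual framework the chosen rate would be installed on the switch as an OpenFlow meter (e.g., via a meter-mod message).

```python
# Minimal sketch (not the paper's code): epsilon-greedy selection of a per-flow
# meter rate with linear annealing of the exploration probability.
import random

RATE_LEVELS_MBPS = [2, 5, 10, 20, 50]            # assumed discrete meter rates
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.05, 500

def system_cost(rate, demand):
    """Toy cost: unmet demand (a loss/delay proxy) plus over-provisioned bandwidth."""
    shortfall = max(demand - rate, 0.0)
    overshoot = max(rate - demand, 0.0)
    return 1.0 * shortfall + 0.2 * overshoot

cost_estimate = {r: 0.0 for r in RATE_LEVELS_MBPS}
installed_rate = None
for step in range(ANNEAL_STEPS):
    # Linear annealing: exploration probability decays from EPS_START to EPS_END.
    eps = max(EPS_END, EPS_START - (EPS_START - EPS_END) * step / ANNEAL_STEPS)
    demand = random.uniform(1, 40)               # assumed per-flow demand sample
    if random.random() < eps:
        rate = random.choice(RATE_LEVELS_MBPS)   # explore
    else:
        rate = min(cost_estimate, key=cost_estimate.get)  # exploit lowest cost
    cost_estimate[rate] += 0.1 * (system_cost(rate, demand) - cost_estimate[rate])
    # Pruning idea: only push an update when the selected rate changes,
    # which is what cuts down control traffic.
    if rate != installed_rate:
        installed_rate = rate                    # stand-in for an OpenFlow meter-mod

print({r: round(c, 2) for r, c in cost_estimate.items()})
```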
Designing Reconfigurable Interconnection Network of Heterogeneous Chiplets Using Kalman Filter
Heterogeneous chiplets have been proposed for accelerating high-performance computing tasks. Integrated inside one package, CPU and GPU chiplets can share a common interconnection network implemented through the interposer. However, CPU and GPU applications generally have very different traffic patterns. Without effective management of the network resource, some chiplets can suffer significant performance degradation because the network bandwidth is taken away by communication-intensive applications. Therefore, techniques need to be developed to manage the shared network resources effectively. In a chiplet-based system, resource management needs not only to react in real time but also to be cost-efficient. In this work, we propose a reconfigurable network architecture that leverages a Kalman filter to make accurate predictions of the network resources needed by the applications and then adaptively changes the resource allocation. Using our design, the network bandwidth can be allocated fairly to avoid starvation or performance degradation. Our evaluation results show that the proposed reconfigurable interconnection network can dynamically react to changes in the traffic demand of the chiplets and improve system performance with low cost and design complexity.
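For illustration, the following minimal Python sketch shows the general idea of tracking per-chiplet traffic demand with a scalar Kalman filter and reallocating shared link bandwidth in proportion to the predictions. The noise parameters, demand traces, and total bandwidth are assumptions; the paper's design operates in hardware on the interposer network, not in software.

```python
# Minimal sketch (illustrative noise parameters, demand traces, and bandwidth;
# not the paper's hardware design): a scalar Kalman filter per chiplet tracks
# traffic demand, and the shared link bandwidth is split in proportion to the
# predictions.
import random

class ScalarKalman:
    def __init__(self, process_var=0.5, meas_var=2.0):
        self.x, self.p = 0.0, 1.0            # state estimate and its variance
        self.q, self.r = process_var, meas_var

    def update(self, z):
        self.p += self.q                      # predict (random-walk demand model)
        k = self.p / (self.p + self.r)        # Kalman gain
        self.x += k * (z - self.x)            # correct with measurement z
        self.p *= 1.0 - k
        return self.x

TOTAL_BW = 100.0                              # assumed shared interposer bandwidth
filters = {"cpu": ScalarKalman(), "gpu": ScalarKalman()}
for t in range(50):
    measured = {"cpu": 10 + random.gauss(0, 2),               # assumed traffic
                "gpu": 40 + 20 * (t > 25) + random.gauss(0, 5)}
    predicted = {c: max(f.update(measured[c]), 0.1) for c, f in filters.items()}
    total = sum(predicted.values())
    shares = {c: TOTAL_BW * p / total for c, p in predicted.items()}
print("final bandwidth shares:", {c: round(s, 1) for c, s in shares.items()})
```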
- NSF-PAR ID:
- 10511372
- Publisher / Repository:
- ACM
- Date Published:
- Journal Name:
- Proceedings of the 34th ACM Great Lakes Symposium on VLSI (GLSVLSI)
- ISBN:
- 979-8-4007-0605-9
- Subject(s) / Keyword(s):
- Network-on-Chip, NoC, Chiplets, reconfigurable NoC
- Format(s):
- Medium: X
- Location:
- Tampa, FL, US
- Sponsoring Org:
- National Science Foundation
More Like this
-
Resource disaggregation is a new architecture for data centers in which resources like memory and storage are decoupled from the CPU, managed independently, and connected through a high-speed network. Recent work has shown that although disaggregated data centers (DDCs) provide operational benefits, applications running on DDCs experience degraded performance due to extra network latency between the CPU and their working sets in main memory. DBMSs are an interesting case study for DDCs for two main reasons: (1) DBMSs normally process data-intensive workloads and require data movement between different resource components; and (2) disaggregation drastically changes the assumption that DBMSs can rely on their own internal resource management. We take the first step to thoroughly evaluate the query execution performance of production DBMSs in disaggregated data centers. We evaluate two popular open-source DBMSs (MonetDB and PostgreSQL) and test their performance with the TPC-H benchmark in a recently released operating system for resource disaggregation. We evaluate these DBMSs with various configurations and compare their performance with that of single-machine Linux with the same hardware resources. Our results confirm that significant performance degradation does occur, but, perhaps surprisingly, we also find settings in which the degradation is minor or where DDCs actually improve performance. (A minimal sketch of this kind of query-latency measurement appears after this list.)
-
Multi-Agent Reinforcement Learning (MARL) is a key technology in artificial intelligence applications such as robotics, surveillance, and energy systems. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a state-of-the-art MARL algorithm that has been widely adopted and is considered a popular baseline for novel MARL algorithms. However, existing implementations of MADDPG on CPU and CPU-GPU platforms do not exploit fine-grained parallelism between cooperative agents and handle inter-agent communication sequentially, leading to sub-optimal throughput in MADDPG training. In this work, we develop the first high-throughput MADDPG accelerator on a CPU-FPGA heterogeneous platform. Specifically, we develop dedicated hardware modules that enable parallel training of each agent's internal Deep Neural Networks (DNNs) and support low-latency inter-agent communication using an on-chip agent interconnection network. Our experimental results show that the speed of agent neural network training improves by a factor of 3.6×–24.3× and 1.5×–29.5× compared with state-of-the-art CPU and CPU-GPU implementations. Our design achieves up to a 1.99× and 1.93× improvement in overall system throughput compared with CPU and CPU-GPU implementations, respectively.
-
Chiplets are transforming computer system designs, allowing system designers to combine heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic chips into smaller, connected chiplets helps performance continue scaling, avoids die size limitations, improves yield, and reduces design and integration costs. However, chiplet-based designs introduce an additional level of hierarchy, which causes indirection and non-uniformity. This clashes with typical heterogeneous systems: unlike CPU-based multi-chiplet systems, heterogeneous systems do not have significant OS support or complex coherence protocols to mitigate the impact of this indirection. Thus, exploiting locality across application phases is harder in multi-chiplet heterogeneous systems. We propose CPElide, which utilizes information already available in heterogeneous systems' embedded microprocessor (the command processor) to track inter-chiplet data dependencies and perform implicit synchronization only when necessary, instead of conservatively as in the state-of-the-art HMG. Across 24 workloads, CPElide improves average performance (13%, 19%), energy (14%, 11%), and network traffic (14%, 17%), respectively, over current approaches and HMG. (A minimal sketch of this dependency-tracking idea appears after this list.)
-
Network emulation allows unmodified code execution on lightweight containers to enable accurate and scalable testing of networked applications. However, such testbeds cannot guarantee fidelity under high workloads, especially when many processes concurrently request more resources (e.g., CPU, disk I/O, GPU, and network bandwidth) than the underlying physical machine can offer. A virtual time system enables the emulated hosts to maintain their own notion of virtual time; a container can stop advancing its time when not running (e.g., in an idle or suspended state). Existing virtual time systems focus on precise time management for CPU-intensive applications but are not designed to handle other operations, such as disk I/O, network I/O, and GPU computation. In this paper, we develop a lightweight virtual time system that integrates precise I/O time for container-based network emulation. We model and analyze the temporal error during I/O operations and develop a barrier-based time compensation mechanism in the Linux kernel. We also design and implement a Dynamic Load Monitor (DLM) to mitigate the temporal error during I/O resource contention. VT-IO enables accurate virtual time advancement with precise I/O time measurement and compensation. The experimental results demonstrate a significant improvement in temporal error with the introduction of the DLM: the temporal error is reduced from 7.889 seconds to 0.074 seconds when utilizing the DLM in the virtual time system. Remarkably, this improvement is achieved with an overall overhead of only 1.36% of the total execution time. (A minimal sketch of the virtual-clock idea appears after this list.)
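For the disaggregated-DBMS study above, the measurement it describes boils down to timing benchmark queries against a production DBMS. The sketch below times a TPC-H Q6-style query against PostgreSQL using psycopg2; the connection parameters and a pre-loaded TPC-H lineitem table are assumptions, and this is not the authors' benchmark harness.

```python
# Minimal sketch of this kind of measurement: time a TPC-H Q6-style query on
# PostgreSQL. The connection parameters and a pre-loaded TPC-H `lineitem`
# table are assumptions; this is not the authors' benchmark harness.
import time
import psycopg2  # pip install psycopg2-binary

TPCH_Q6 = """
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '1994-01-01'
  AND l_shipdate <  DATE '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;
"""

conn = psycopg2.connect(host="localhost", dbname="tpch", user="postgres")
with conn, conn.cursor() as cur:
    start = time.perf_counter()
    cur.execute(TPCH_Q6)
    revenue = cur.fetchone()[0]
    elapsed = time.perf_counter() - start
print(f"Q6 revenue={revenue}, latency={elapsed:.3f} s")
```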
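For the CPElide item above, the following sketch illustrates the dependency-tracking idea in miniature: a command-processor-style table records which chiplet last wrote each buffer, and synchronization is issued only when a kernel on a different chiplet reads that data. The buffer names and kernel schedule are hypothetical; the real mechanism lives in the GPU's command processor, not in Python.

```python
# Minimal sketch (hypothetical buffers and kernels, not CPElide itself): track
# which chiplet last wrote each buffer and issue synchronization only when a
# kernel on a different chiplet reads that buffer.
last_writer = {}  # buffer name -> chiplet that last produced it

def launch_kernel(name, chiplet, reads, writes):
    cross_chiplet = any(last_writer.get(b) not in (None, chiplet) for b in reads)
    if cross_chiplet:
        print(f"{name}: flush/invalidate before launch on chiplet {chiplet}")
    else:
        print(f"{name}: implicit synchronization elided on chiplet {chiplet}")
    for b in writes:
        last_writer[b] = chiplet

launch_kernel("k0", chiplet=0, reads=[],    writes=["A"])
launch_kernel("k1", chiplet=0, reads=["A"], writes=["B"])  # same chiplet: elide
launch_kernel("k2", chiplet=1, reads=["B"], writes=["C"])  # cross-chiplet: sync
```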
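For the VT-IO item above, the sketch below captures the virtual-clock idea under stated assumptions: an emulated host's clock advances only while it is scheduled, and measured physical I/O time is replaced with a modeled device time, a crude software stand-in for the barrier-based kernel mechanism the paper implements. The sleep duration and modeled I/O time are illustrative.

```python
# Minimal sketch (illustrative assumptions, not VT-IO itself): a per-container
# virtual clock that only advances while the container runs, with measured
# physical I/O time swapped out for a modeled device time afterwards.
import time

class VirtualClock:
    def __init__(self):
        self.virtual = 0.0
        self._resumed_at = None

    def resume(self):                  # container scheduled: clock starts moving
        self._resumed_at = time.perf_counter()

    def pause(self):                   # container descheduled: freeze virtual time
        self.virtual += time.perf_counter() - self._resumed_at
        self._resumed_at = None

    def compensate_io(self, measured_s, modeled_s):
        # Replace the physical I/O duration (already accumulated while running)
        # with the modeled device-service time.
        self.virtual += modeled_s - measured_s

clock = VirtualClock()
clock.resume()
t0 = time.perf_counter()
time.sleep(0.05)                       # stand-in for a slow physical disk read
io_elapsed = time.perf_counter() - t0
clock.pause()
clock.compensate_io(io_elapsed, modeled_s=0.01)   # assumed device model: 10 ms
print(f"virtual time advanced: {clock.virtual:.3f}s")
```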