The Dragonfly class of networks is considered a promising interconnect for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance variability due to their hierarchical architecture and resource-sharing design. Event-driven network simulators are indispensable tools for navigating complex system design. In this study, we quantitatively evaluate a variety of application communication interactions on a 3,456-node Dragonfly+ system using the CODES toolkit. This study looks at the impact of communication interference from a user's perspective. Specifically, for a given application submitted by a user, we examine how this application will behave with the existing workload running in the system under different job placement policies. Our simulation study considers hundreds of experiment configurations, including four target applications with representative communication patterns under a variety of network traffic conditions. Our study shows that intra-job interference can cause severe performance degradation for communication-intensive applications. Inter-job interference can generally be reduced for applications with one-to-one or one-to-many communication patterns through job isolation. Applications with a one-to-all communication pattern are resilient to network interference.
                            Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System
                        
                    
    
            Dragonfly networks are being widely adopted in high-performance computing systems. On these networks, however, interference caused by resource sharing can lead to significant network congestion and performance variability. We present a comparative analysis exploring the trade-off between localizing communication and balancing network traffic. We conduct trace-based simulations for applications with different communication patterns, using multiple job placement policies and routing mechanisms. We perform an in-depth performance analysis on representative applications individually and show that different applications have distinct preferences regarding localized communication and balanced network traffic. We further demonstrate the effect of external network interference by introducing background traffic and show that localized communication can help reduce the application performance variation caused by network sharing. 
- Award ID(s): 1717763
- PAR ID: 10097521
- Date Published:
- Journal Name: IPDPS .... [proceedings]
- ISSN: 2332-1237
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            In high-performance computing (HPC), modern supercomputers typically provide exclusive computing resources to user applications. Nevertheless, the interconnect network is a resource shared among co-running workloads for both inter-node communication and cross-node I/O access, leading to inevitable network interference. In this study, we develop MFNetSim, a multi-fidelity modeling framework that enables simultaneous simulation of multiple traffic types over the interconnect network, including inter-process communication and I/O traffic. By combining different levels of abstraction, MFNetSim can efficiently co-model the communication and I/O traffic occurring on HPC systems equipped with flash-based storage. We conduct simulation studies of hybrid workloads composed of traditional HPC applications and emerging ML applications on a 1,056-node Dragonfly system with various configurations. Our analysis provides various observations regarding how network interference affects communication and I/O traffic.
- 
            A proof-of-concept system that enables real-time geospatial spectrum sharing (RGSS) between 5G/6G networks and Earth Exploration Satellite Services (EESS) has been developed. A simple algorithm that pauses network transmissions when there is potential interference from 5G/6G transmitters provides 99.6% network availability in the 24 GHz NR2 band while protecting all currently working EESS radiometers operating in the 23.8 GHz band. A more sophisticated algorithm that modifies transmission power levels and (if necessary) network traffic (similar to the methodologies used by Citizens Broadband Radio Service) can reduce interference so that there is no adverse impact on network availability. In addition to preventing interference, RGSS provides other significant benefits to both the wireless and the weather/climate communities, including improving network performance and coverage, the ability to support changes in network architectures, network elements, endpoints, and new or more sensitive radiometers, and a simple mechanism to test and police compliance with out-of-band emission requirements. RGSS is also compatible with existing spectrum management systems.
- 
            Heterogeneous chiplets have been proposed for accelerating high-performance computing tasks. Integrated inside one package, CPU and GPU chiplets can share a common interconnection network that can be implemented through the interposer. However, CPU and GPU applications generally have very different traffic patterns. Without effective management of the network resource, some chiplets can suffer significant performance degradation because the network bandwidth is taken away by communication-intensive applications. Therefore, techniques need to be developed to effectively manage the shared network resources. In a chiplet-based system, resource management needs to not only react in real time but also be cost-efficient. In this work, we propose a reconfigurable network architecture that leverages a Kalman filter to make accurate predictions of the network resources needed by the applications and then adaptively changes the resource allocation. Using our design, the network bandwidth can be fairly allocated to avoid starvation or performance degradation. Our evaluation results show that the proposed reconfigurable interconnection network can dynamically react to changes in the traffic demand of the chiplets and improve system performance with low cost and design complexity.
- 
            Recently, wireless communication technologies, such as Wireless Local Area Networks (WLANs), have gained increasing popularity in industrial control systems (ICSs) due to their low cost and ease of deployment, but the communication delays associated with these technologies make them unsuitable for critical real-time and safety applications. To address concerns about the network-induced delays of wireless communication technologies and bring their advantages into modern ICSs, wireless network infrastructure based on the Parallel Redundancy Protocol (PRP) has been proposed. Although application-specific simulations and measurements have been conducted to show that wireless network infrastructure based on PRP can be a viable solution for critical applications with stringent delay performance constraints, little has been done to devise an analytical framework facilitating the adoption of wireless PRP infrastructure in miscellaneous ICSs. Leveraging deterministic network calculus (DNC) theory, we propose to analytically derive worst-case bounds on network-induced delays for critical ICS applications. We show that the problem of worst-case delay bounding for a wireless PRP network can be solved by performing network-calculus-based analysis on its non-feedforward traffic pattern. Closed-form expressions of worst-case delays are derived, which have not been found previously and allow ICS architects/designers to compute worst-case delay bounds for ICS tasks in their respective application domains of interest. Our analytical results not only provide insights into the impacts of network-induced delays on latency-critical tasks but also allow ICS architects/operators to assess whether proper wireless PRP network infrastructure can be adopted into their systems.
 An official website of the United States government