FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

Jamil, Hasibul; Alim, Abdul; Schares, Laurent; Maniotis, Pavlos; Schour, Liran; Sydney, Ali; Kayi, Abdullah; Kosar, Tevfik; Karacali, Bengi

doi:10.1109/ICC52391.2025.11160830

This content will become publicly available on June 8, 2026

The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. Unlike tools that introduce additional traffic, FlowTracer aids in debugging network inefficiencies by passively monitoring and correlating user workload flows. As a result, FlowTracer does not interfere with ongoing data transfers, enabling analysis with minimal overhead, which is an important factor when debugging and fine-tuning routing schemes in production systems. FlowTracer can provide detailed insights into traffic distribution and can help identify the root causes of performance degradation, such as hash collisions. With FlowTracer’s flow-level insights, system operators can optimize routing, reduce congestion, and improve the performance of distributed AI workloads. We use a RoCEv2-enabled cluster with a leaf-spine network and sixteen 400-Gbps nodes to demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP routing against a statically configured network. The example showcases a 30% reduction in imbalance, as measured by a new metric we introduce.
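To make the hash-collision problem described in the abstract concrete, the following is a minimal toy sketch, not FlowTracer itself and not the imbalance metric the paper introduces. It assumes a hypothetical set of 16 flows and 8 equal-cost paths, uses SHA-256 as a stand-in for a switch's ECMP hash, models the statically configured network as simple round-robin assignment, and reports a max-over-mean load ratio chosen purely for illustration. All function names and addresses are made up for the example; the only real-world detail is UDP destination port 4791, which RoCEv2 uses.

```python
import hashlib
from collections import Counter

NUM_PATHS = 8  # hypothetical number of equal-cost paths


def ecmp_path(flow, num_paths):
    """Pick a path by hashing the flow 5-tuple (toy stand-in for a switch hash)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def static_path(flow_index, num_paths):
    """Round-robin assignment, standing in for a statically configured network."""
    return flow_index % num_paths


def max_over_mean(path_counts, num_paths):
    """Illustrative ratio only: busiest path load / mean load (1.0 = balanced)."""
    mean = sum(path_counts.values()) / num_paths
    return max(path_counts.values()) / mean


# Hypothetical flows: (src IP, dst IP, src port, dst port, protocol).
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP")
    for i in range(16)
]

ecmp_counts = Counter(ecmp_path(flow, NUM_PATHS) for flow in flows)
static_counts = Counter(static_path(i, NUM_PATHS) for i in range(len(flows)))

ecmp_ratio = max_over_mean(ecmp_counts, NUM_PATHS)
static_ratio = max_over_mean(static_counts, NUM_PATHS)

print("flows per path (ECMP hash):", dict(sorted(ecmp_counts.items())))
print("flows per path (static):   ", dict(sorted(static_counts.items())))
print("max/mean ratio, ECMP:", ecmp_ratio, "static:", static_ratio)
```

With few long-lived flows, as is typical of collective communication in training jobs, a hash is likely to place several flows on one path while leaving others idle, whereas an explicit assignment can spread them evenly. A passive tool would obtain the per-path counts from observed traffic rather than from a simulated hash.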

Award ID(s): 2313061

PAR ID: 10658028

Author(s) / Creator(s): Jamil, Hasibul; Alim, Abdul; Schares, Laurent; Maniotis, Pavlos; Schour, Liran; Sydney, Ali; Kayi, Abdullah; Kosar, Tevfik; Karacali, Bengi

Publisher / Repository: IEEE International Conference on Communications (ICC 2025)

Date Published: 2025-06-08

Page Range / eLocation ID: 6988 to 6993

Format(s): Medium: X

Location: Montreal, CA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
https://doi.org/10.1109/ICC52391.2025.11160830
