Congestion-Aware Path Selection for Load Balancing in AI Clusters
Erfan Nosrati, Majid Ghaderi
https://arxiv.org/abs/2506.08132
Fast training of large machine learning models requires distributed training on AI clusters consisting of thousands of GPUs. The efficiency of distributed training crucially depends on the efficiency of the network interconnecting GPUs in the cluster. These networks are commonly built using RDMA following a Clos-like datacenter topology. To efficiently utilize the network bandwidth, load balancing is employed to distribute traffic across multiple redundant paths. While there exist numerous tec…