
OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

Warraich, Ertza; Shabtai, Omer; Manaa, Khalid; Vargaftik, Shay; Piasetzky, Yonatan; Kadosh, Matty; Suresh, Lalith; Shahbaz, Muhammad (April 2025, 22nd USENIX Symposium on Networked Systems Design and Implementation)

We present OptiReduce, a new collective-communication system for the cloud that provides bounded, predictable completion times for deep-learning jobs despite variability in computation (stragglers) and communication (congestion and gradient drops). OptiReduce exploits the inherent resiliency and stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients, striking an efficient balance between (tail) performance and the accuracy of the trained models. Building on this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve DDL jobs' tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that, in shared cloud environments (e.g., CloudLab), OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA) on average than Gloo and NCCL, respectively.
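To illustrate why a Hadamard Transform can soften the effect of gradient drops, the following is a minimal sketch and not OptiReduce's implementation. It assumes a randomized Hadamard encoding (random sign flips followed by an orthonormal Walsh-Hadamard transform), and the names `fwht`, `encode`, `decode`, and `drop_block` are hypothetical helpers introduced only for this example. The idea shown is that a lost block of encoded values turns into a small error spread over every coordinate, instead of wiping out whichever gradient entries happened to travel in that block.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The input length must be a power of two.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def encode(grad, signs):
    """Flip signs (assumed randomization step), then apply the transform."""
    return fwht([g * s for g, s in zip(grad, signs)])


def decode(encoded, signs):
    """Invert encode(): apply the transform again, then undo the sign flips."""
    return [v * s for v, s in zip(fwht(encoded), signs)]


def drop_block(vec, start, count):
    """Model a lost chunk by zeroing `count` entries beginning at `start`."""
    out = list(vec)
    for i in range(start, min(start + count, len(out))):
        out[i] = 0.0
    return out


def max_abs_error(a, b):
    """Largest per-coordinate deviation between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

As a usage example, take a 16-entry gradient whose only nonzero entry has magnitude 8 and drop the first four transmitted values. Sent as-is, the drop can erase that entry completely (error 8); sent in encoded form, no decoded coordinate is off by more than 8 · 4/16 = 2, because the loss is shared across all coordinates.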

Free, publicly-accessible full text available April 28, 2026
