OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud

Warraich, Ertza; Shabtai, Omer; Manaa, Khalid; Vargaftik, Shay; Piasetzky, Yonatan; Kadosh, Matty; Suresh, Lalith; Shahbaz, Muhammad

This content will become publicly available on April 28, 2026

We present OptiReduce, a new collective-communication system for the cloud that provides bounded, predictable completion times for deep-learning jobs despite variability in computation (stragglers) and communication (congestion and gradient drops). OptiReduce exploits the inherent resiliency and stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients, providing an efficient balance between (tail) performance and the resulting accuracy of the trained models. Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs’ tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, when operating in shared cloud environments (e.g., CloudLab) compared to Gloo and NCCL, respectively.
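
The abstract names the Hadamard Transform as one strategy for limiting the accuracy impact of gradient drops. The sketch below illustrates only the general idea behind that class of technique and is not taken from the paper: rotating a gradient with an orthonormal Walsh-Hadamard transform before transmission spreads each value across all coefficients, so one lost coefficient becomes a small error on every entry rather than the total loss of one entry. The function names (`fwht`, `encode`, `decode`), the random sign flips, the 8-element toy gradient, and the modeling of a drop as a zeroed value are all assumptions made for illustration.

```python
import math
import random


def fwht(values):
    """Orthonormal Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]


def encode(gradient, signs):
    """Hypothetical sender side: random sign flip, then rotate."""
    return fwht([g * s for g, s in zip(gradient, signs)])


def decode(coded, signs):
    """Hypothetical receiver side: rotate back, then undo the sign flip."""
    return [x * s for x, s in zip(fwht(coded), signs)]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
n = 8
gradient = [0.0] * n
gradient[3] = 8.0  # toy gradient with a single dominant entry
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

# Without any loss, encode followed by decode recovers the gradient.
roundtrip = decode(encode(gradient, signs), signs)

# Assumed loss model: the value at index 3 never arrives and is read as zero.
dropped_plain = list(gradient)
dropped_plain[3] = 0.0

coded = encode(gradient, signs)
coded[3] = 0.0
dropped_coded = decode(coded, signs)

print("worst-case error, no transform:  ", max_abs_error(dropped_plain, gradient))
print("worst-case error, with transform:", max_abs_error(dropped_coded, gradient))
```

In this toy case, the untransformed drop loses the dominant entry entirely (an error of 8.0 on that entry), while the transformed drop leaves an error of magnitude 1.0 on each of the eight entries. How OptiReduce itself applies the transform, and how it combines it with Transpose AllReduce, is described in the full paper and is not reproduced here.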

Award ID(s): 2521510, 2521196

PAR ID: 10608556

Author(s) / Creator(s): Warraich, Ertza; Shabtai, Omer; Manaa, Khalid; Vargaftik, Shay; Piasetzky, Yonatan; Kadosh, Matty; Suresh, Lalith; Shahbaz, Muhammad

Publisher / Repository: 22nd USENIX Symposium on Networked Systems Design and Implementation

Date Published: 2025-04-28

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
The DOI is not currently available.
