Title: DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems
Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems used to train such large models. However, hardware utilization in large-scale AI systems remains alarmingly low (5-20%). This low system utilization is the cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between the engineers who design those layers, often working in different industries. To address this challenge, in this work we design a cross-stack performance modeling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across the different layers of the stack. We validate CrossFlow’s accuracy against distributed training on real commercial hardware and present several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments across the computing stack.
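As a rough illustration of the kind of cross-stack analysis described above, the sketch below combines a toy technology-layer model, a hardware-layer configuration, and a simple data-parallel training-step estimate, then brute-forces a small design space. All class names, parameters, and constants are illustrative assumptions, not DeepFlow or CrossFlow interfaces.

# Minimal sketch of a cross-stack analytical model and design-space sweep.
# Layer models, parameter names, and numbers are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

@dataclass
class TechLayer:
    flops_per_core: float   # peak FLOP/s per core (technology layer)
    hbm_bw: float           # on-package memory bandwidth, bytes/s
    link_bw: float          # inter-accelerator link bandwidth, bytes/s

@dataclass
class HardwareLayer:
    cores: int              # cores per accelerator
    accelerators: int       # accelerators in the data-parallel group

def step_time(tech, hw, params, batch):
    """Crude per-step time: max of compute and memory terms plus all-reduce."""
    flops = 6 * params * batch                        # rough training FLOPs per step
    compute = flops / (tech.flops_per_core * hw.cores * hw.accelerators)
    memory = 4 * params / tech.hbm_bw                 # fp16 weight + gradient traffic
    allreduce = (4 * params * (hw.accelerators - 1)
                 / (hw.accelerators * tech.link_bw))  # ring all-reduce of fp16 grads
    return max(compute, memory) + allreduce

def sweep(tech, params=1e9, batch=1024):
    """Brute-force co-optimization over a small hardware design space."""
    best = None
    for cores, accels in product([64, 128, 256], [8, 16, 32, 64]):
        hw = HardwareLayer(cores, accels)
        t = step_time(tech, hw, params, batch)
        if best is None or t < best[0]:
            best = (t, hw)
    return best

tech = TechLayer(flops_per_core=1e11, hbm_bw=2e12, link_bw=5e10)
t, hw = sweep(tech)
print(f"best step time {t * 1e3:.2f} ms with {hw}")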
Award ID(s):
2231097
PAR ID:
10541688
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM Transactions on Design Automation of Electronic Systems
ISSN:
1084-4309
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round including a diverse set of some of the world’s largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system’s memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch-sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds.
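The extended roofline models mentioned at the end of this abstract bound attainable throughput by peak compute and by memory bandwidth times arithmetic intensity. A minimal sketch of that classic bound (the peak and bandwidth numbers are illustrative, not figures from the benchmark submissions):

# Minimal roofline sketch: attainable throughput is bounded by peak compute
# and by memory bandwidth times arithmetic intensity. Numbers are illustrative.
def roofline(peak_flops, mem_bw, intensity):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, mem_bw * intensity)

peak, bw = 100e12, 1.5e12      # assumed: 100 TFLOP/s peak, 1.5 TB/s memory bandwidth
for ai in [1, 10, 66.7, 200]:  # FLOPs per byte moved
    print(f"AI={ai:6.1f} -> {roofline(peak, bw, ai) / 1e12:6.1f} TFLOP/s attainable")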
  2. Abstract Diffractive optical neural networks have shown promising advantages over electronic circuits for accelerating modern machine learning (ML) algorithms. However, it is challenging to achieve fully programmable all‐optical implementation and rapid hardware deployment. Here, a large‐scale, cost‐effective, complex‐valued, and reconfigurable diffractive all‐optical neural networks system in the visible range is demonstrated based on cascaded transmissive twisted nematic liquid crystal spatial light modulators. The employment of categorical reparameterization technique creates a physics‐aware training framework for the fast and accurate deployment of computer‐trained models onto optical hardware. Such a full stack of hardware and software enables not only the experimental demonstration of classifying handwritten digits in standard datasets, but also theoretical analysis and experimental verification of physics‐aware adversarial attacks onto the system, which are generated from a complex‐valued gradient‐based algorithm. The detailed adversarial robustness comparison with conventional multiple layer perceptrons and convolutional neural networks features a distinct statistical adversarial property in diffractive optical neural networks. The developed full stack of software and hardware provides new opportunities of employing diffractive optics in a variety of ML tasks and in the research on optical adversarial ML. 
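The categorical reparameterization technique mentioned above is commonly realized with a Gumbel-softmax relaxation, which lets gradient-based training choose among discrete modulator settings. The sketch below shows that general idea in isolation; the number of phase levels, the temperature, and the array shape are assumptions, not the paper's actual training pipeline.

# Sketch of Gumbel-softmax categorical reparameterization: a differentiable
# relaxation for picking one of K discrete modulator levels per pixel.
# Levels, shapes, and temperature are illustrative assumptions.
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    """Soft one-hot sample over the last axis; differentiable w.r.t. logits."""
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))
    y = np.exp((logits + g) / tau)
    return y / y.sum(axis=-1, keepdims=True)

phase_levels = np.linspace(0, 2 * np.pi, 8, endpoint=False)  # 8 assumed SLM phase levels
logits = np.zeros((4, 4, 8))                # trainable logits for a 4x4 pixel patch
soft = gumbel_softmax(logits)               # soft assignment used during training
phase = soft @ phase_levels                 # expected phase per pixel
hard = phase_levels[soft.argmax(-1)]        # hard levels used at deployment
print(phase.shape, hard.shape)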
  3. The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of DNN accelerators. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
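As a rough flavor of what an analytical dataflow cost model estimates, the toy sketch below compares runtime and energy for a tiled matrix multiply under two dataflow choices. The hardware constants, reuse accounting, and energy-per-access figures are assumptions and do not reproduce MAESTRO's data-centric directives or cost equations.

# Toy analytical dataflow cost model: estimate cycles and energy for a tiled
# matrix multiply under two dataflow choices. All constants are assumptions.
PE_COUNT = 256          # processing elements
TILE = 64               # output tile width held on chip (assumed)
MAC_ENERGY = 1.0        # pJ per multiply-accumulate (assumed)
SRAM_ENERGY = 6.0       # pJ per on-chip buffer access (assumed)
DRAM_ENERGY = 200.0     # pJ per off-chip access (assumed)

def cost(M, N, K, dataflow):
    macs = M * N * K
    cycles = macs / PE_COUNT                          # ideal compute time
    if dataflow == "weight-stationary":
        # weights fetched once; activations re-fetched per output tile
        dram = K * N + M * K * (N // TILE) + M * N
        sram = macs
    else:  # "output-stationary"
        # partial sums stay on chip; both inputs re-fetched per output tile
        dram = M * N + (M * K + K * N) * (M // TILE)
        sram = 2 * macs
    energy = macs * MAC_ENERGY + sram * SRAM_ENERGY + dram * DRAM_ENERGY
    return cycles, energy / 1e6                       # cycles, microjoules

for df in ["weight-stationary", "output-stationary"]:
    print(df, cost(1024, 1024, 1024, df))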
  4. Abstract The incorporation of high-performance optoelectronic devices into photonic neuromorphic processors can substantially accelerate computationally intensive matrix multiplication operations in machine learning (ML) algorithms. However, the conventional designs of individual devices and systems are largely disconnected, and system optimization is limited to the manual exploration of a small design space. Here, a device-system end-to-end design methodology is reported to optimize a free-space optical general matrix multiplication (GEMM) hardware accelerator by engineering a spatially reconfigurable array made from chalcogenide phase change materials. With a highly parallelized integrated hardware emulator with experimental information, the design of the unit device to directly optimize GEMM calculation accuracy is achieved by exploring a large parameter space through learning algorithms, including a deep Q-learning neural network, Bayesian optimization, and their cascaded approach. The algorithm-generated physical quantities show a clear correlation between system performance metrics and device specifications. Furthermore, physics-aware training approaches are employed to deploy optimized hardware to the tasks of image classification, materials discovery, and a closed-loop design of optical ML accelerators. The demonstrated framework offers insights into the end-to-end and co-design of optoelectronic devices and systems with reduced human supervision and domain knowledge barriers.
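The device-system loop described above can be pictured as an optimizer that scores candidate device parameters by the system-level GEMM accuracy they yield. In the hedged sketch below, random search stands in for the paper's deep Q-learning and Bayesian optimization, and the surrogate accuracy model and parameter ranges are purely illustrative.

# Sketch of a device-system co-design loop: search device-level parameters to
# maximize a system-level GEMM accuracy metric. Random search stands in for
# DQN / Bayesian optimization; the surrogate model and ranges are assumptions.
import random

def gemm_accuracy(contrast, levels, crosstalk):
    """Hypothetical surrogate mapping device specs to matrix-multiply accuracy."""
    quant_err = 1.0 / levels
    return max(0.0, 1.0 - quant_err - crosstalk - 0.1 / contrast)

random.seed(0)
best = (0.0, None)
for _ in range(200):
    cand = dict(contrast=random.uniform(1, 20),       # phase-change cell contrast ratio
                levels=random.choice([4, 8, 16, 32]),  # programmable transmission levels
                crosstalk=random.uniform(0.0, 0.05))   # optical crosstalk fraction
    acc = gemm_accuracy(**cand)
    if acc > best[0]:
        best = (acc, cand)
print(best)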
  5. Reasoning about storage systems is challenging because these systems make persistence guarantees even if the system crashes at any point. To achieve these crash-safety guarantees, storage systems include recovery procedures to restore the system to a consistent state after a crash. Moreover, large-scale systems are structured as multiple stacked layers and can require recovery at multiple layers of abstraction. Formal verification can ensure that crash-safety guarantees hold regardless of when the system crashes. To make verification tractable, large-scale systems should be verified in a modular fashion, layer-by-layer in the software stack. Layered recovery makes modularity challenging because the system can crash in the middle of a high-level recovery procedure and must start over from the low-level recovery procedure. We present Argosy, a framework for machine-checked proofs of storage systems that supports layered recovery implementations with modular proofs. The framework is based on combinators for transition relations that are inspired by Kleene algebra, which provides a convenient formalism for specifying and reasoning about crashes and recovery. On top of this framework, we implement Crash Hoare Logic (CHL), the program logic used by FSCQ. Using the logic, we have verified an example of layered recovery featuring a write-ahead log on top of a disk, which itself runs by replicating over two unreliable disks. The metatheory of the framework, the soundness of the program logic, and these examples are all verified in the Coq theorem prover. 
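Argosy's combinators are formalized in Coq, but the flavor of composing transition relations with a Kleene-star recovery loop can be illustrated with a toy model, as in the sketch below; the states, relations, and crash model are illustrative assumptions, not the framework's actual definitions.

# Toy rendering of transition-relation combinators in the spirit of the
# Kleene-algebra approach described above. States and relations are assumed.
def seq(r1, r2):
    """Sequential composition of two relations (sets of (pre, post) pairs)."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

def star(r, states):
    """Reflexive-transitive closure (Kleene star) over a finite state set."""
    closure = {(s, s) for s in states}
    frontier = set(r)
    while frontier - closure:
        closure |= frontier
        frontier = seq(frontier, r)
    return closure

# A write that may crash midway, and a recovery relation that repairs the
# midway state back to a consistent one.
states = {"old", "mid", "new"}
crash_during_write = {("old", "old"), ("old", "mid"), ("old", "new")}
recover = {("old", "old"), ("new", "new"), ("mid", "old")}

# Crash-and-recover semantics: crash anywhere in the write, then recovery runs
# (possibly restarting after further crashes, hence the star) until it finishes.
observable = seq(crash_during_write, seq(star(recover, states), recover))
print(observable)   # every reachable post-state is a consistent one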