Many IoT applications have increasingly adopted machine learning (ML) techniques, such as classification and detection, to enhance automation and decision-making. With advances in hardware accelerators such as Nvidia’s Jetson embedded GPUs, the computational capabilities of end devices, particularly for ML inference workloads, have improved significantly in recent years. These advances open opportunities for distributing computation across the edge network, improving resource utilization and reducing request latency. Previous research has demonstrated promising results in collaborative inference, where processing units in the edge network, such as end devices and edge servers, jointly execute an inference request to minimize latency. This paper explores approaches for implementing collaborative inference on a single model in resource-constrained edge networks, including on-device, device-edge, and edge-edge collaboration. We present preliminary results from proof-of-concept experiments to support each case. We discuss dynamic factors that can affect the performance of these inference execution strategies, such as network variability, thermal constraints, and workload fluctuations. Finally, we outline potential directions for future research.
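To make the device-edge case concrete, the sketch below partitions a toy model at an assumed split point, runs the first half on the device, and ships the serialized activation to the edge half. The model, split point, and transport are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Minimal sketch of device-edge split inference: the device runs the first
# SPLIT layers locally and ships the intermediate activation to an edge
# server, which finishes the forward pass. Model, split point, and
# transport are illustrative assumptions only.
import io
import torch
import torch.nn as nn

model = nn.Sequential(             # stand-in for a real CNN backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)
SPLIT = 4                          # layers [0, SPLIT) run on the device
device_part, edge_part = model[:SPLIT], model[SPLIT:]

def device_stage(x: torch.Tensor) -> bytes:
    """Run the on-device half and serialize the activation for the network."""
    with torch.no_grad():
        activation = device_part(x)
    buf = io.BytesIO()
    torch.save(activation, buf)    # in practice: compress + send over a socket
    return buf.getvalue()

def edge_stage(payload: bytes) -> torch.Tensor:
    """Deserialize the activation on the edge server and finish inference."""
    activation = torch.load(io.BytesIO(payload))
    with torch.no_grad():
        return edge_part(activation)

logits = edge_stage(device_stage(torch.randn(1, 3, 32, 32)))
print(logits.shape)                # torch.Size([1, 10])
```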
Enhancing Resilience in Distributed ML Inference Pipelines for Edge Computing
As edge computing and sensing devices continue to proliferate, distributed machine learning (ML) inference pipelines are becoming popular for enabling low-latency, real-time decision-making at scale. However, the geographically dispersed and often resource-constrained nature of edge devices makes them susceptible to various failures, such as hardware malfunctions, network disruptions, and device overloading. These edge failures can significantly affect the performance and availability of inference pipelines and the sensing-to-decision-making loops they enable. In addition, the complexity of task dependencies amplifies the difficulty of maintaining performant and reliable ML operations. To address these challenges and minimize the impact of edge failures on inference pipelines, this paper presents several fault-tolerant approaches, including sensing redundancy, structural resilience, failover replication, and pipeline reconfiguration. For each approach, we explain the key techniques and highlight their effectiveness and tradeoffs. Finally, we discuss the challenges associated with these approaches and outline future directions.
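Of the approaches surveyed, failover replication lends itself to a compact sketch: a router tracks heartbeats from a primary replica and redirects inference traffic to a hot backup once the primary goes silent. The timeout, node names, and API below are illustrative assumptions rather than the paper's design.

```python
# Minimal sketch of failover replication for an inference stage: a monitor
# tracks heartbeats from a primary replica and fails over to a hot backup
# once the primary misses its deadline. Timeout and node names are
# illustrative assumptions, not settings from the paper.
import time

HEARTBEAT_TIMEOUT = 2.0            # seconds without a heartbeat => failed

class FailoverRouter:
    def __init__(self, primary: str, backup: str):
        now = time.monotonic()
        self.last_seen = {primary: now, backup: now}
        self.active = primary
        self.backup = backup

    def heartbeat(self, node: str) -> None:
        """Record a liveness signal from a replica."""
        self.last_seen[node] = time.monotonic()

    def route(self) -> str:
        """Return the replica that should serve the next inference request."""
        silent_for = time.monotonic() - self.last_seen[self.active]
        if silent_for > HEARTBEAT_TIMEOUT and self.active != self.backup:
            self.active = self.backup   # fail over to the hot standby
        return self.active

router = FailoverRouter(primary="edge-a", backup="edge-b")
router.heartbeat("edge-b")
time.sleep(0.1)                    # "edge-a" would stop heartbeating on a crash
print(router.route())              # still "edge-a" (within the timeout)
```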
- Award ID(s):
- 2325956
- PAR ID:
- 10591318
- Publisher / Repository:
- IEEE
- Date Published:
- ISBN:
- 979-8-3503-7423-0
- Page Range / eLocation ID:
- 1 to 6
- Format(s):
- Medium: X
- Location:
- Washington, DC, USA
- Sponsoring Org:
- National Science Foundation
More Like this
- There is an increasing emphasis on securing deep learning (DL) inference pipelines for mobile and IoT applications with privacy-sensitive data. Prior works have shown that privacy-sensitive data can be secured throughout deep learning inference on cloud-offloaded models through trusted execution environments (TEEs) such as Intel SGX. However, prior solutions do not address the fundamental challenges of securing resource-intensive inference tasks on low-power, low-memory devices (e.g., mobile and IoT devices) while achieving high performance. To tackle these challenges, we propose SecDeep, a low-power DL inference framework demonstrating that both security and performance of deep learning inference on edge devices are well within our reach. Leveraging TEEs with limited resources, SecDeep guarantees full confidentiality for input and intermediate data, as well as the integrity of the deep learning model and framework. By enabling and securing neural accelerators, SecDeep is the first of its kind to provide trusted and performant DL model inference on IoT and mobile devices. We implement and validate SecDeep by interfacing the ARM NN DL framework with ARM TrustZone. Our evaluation shows that we can securely run inference tasks 16× to 172× faster than non-accelerated approaches by leveraging edge-available accelerators.
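As a rough, framework-agnostic illustration of the confidentiality and integrity guarantees described above (and emphatically not SecDeep's ARM TrustZone implementation), one can authenticated-encrypt an intermediate activation before it leaves a trusted component:

```python
# Rough illustration only: authenticated encryption of an intermediate
# activation before it leaves a trusted component, in the spirit of the
# confidentiality/integrity guarantees described above. Plain Python with
# the `cryptography` package, NOT SecDeep's TrustZone code.
import io
import torch
from cryptography.fernet import Fernet   # AES-CBC + HMAC (authenticated)

key = Fernet.generate_key()               # would live inside the TEE
tee_cipher = Fernet(key)

def seal(activation: torch.Tensor) -> bytes:
    """Encrypt+MAC an activation so untrusted code can only relay it."""
    buf = io.BytesIO()
    torch.save(activation, buf)
    return tee_cipher.encrypt(buf.getvalue())

def unseal(token: bytes) -> torch.Tensor:
    """Verify integrity and decrypt; raises InvalidToken if tampered with."""
    return torch.load(io.BytesIO(tee_cipher.decrypt(token)))

x = torch.randn(1, 64)
assert torch.equal(unseal(seal(x)), x)    # round-trip preserves the data
```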
- With the explosion in Big Data, it is often forgotten that much of the data nowadays is generated at the edge. Specifically, a major source of data is users' endpoint devices like phones, smart watches, etc., that are connected to the internet, also known as the Internet of Things (IoT). This "edge of data" faces several new challenges related to hardware constraints, privacy-aware learning, and distributed learning (both training and inference). So what systems and machine learning algorithms can we use to generate or exploit data at the edge? Can network science help us solve machine learning (ML) problems? Can IoT devices help people who live with some form of disability, and many others, benefit from health monitoring? In this tutorial, we introduce the network science and ML techniques relevant to edge computing, discuss systems for ML (e.g., model compression, quantization, HW/SW co-design) and ML for systems design (e.g., run-time resource optimization, power management for training and inference on edge devices), and illustrate their impact on concrete IoT applications.
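Among the systems-for-ML techniques the tutorial lists, post-training quantization is the quickest to demonstrate. The sketch below applies PyTorch's dynamic quantization to a toy model; the model and layer choice are assumptions for illustration, not material from the tutorial itself.

```python
# Sketch of one "systems for ML" technique from the list above:
# post-training dynamic quantization, which stores weights in int8 and
# shrinks the model for edge deployment. Toy model and sizes are
# illustrative assumptions.
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize only Linear layers
)

def disk_size(m: nn.Module, path: str = "tmp.pt") -> int:
    """Serialize the model's weights and report their on-disk footprint."""
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size

print(disk_size(model), "bytes fp32 vs", disk_size(quantized), "bytes int8")
```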
- Deploying monocular depth estimation on resource-constrained edge devices is a significant challenge, particularly when attempting to perform both training and inference concurrently. Current lightweight, self-supervised approaches typically rely on complex frameworks that are hard to implement and deploy in real-world settings. To address this gap, we introduce the first framework for Lightweight Training and Inference (LITI) that combines ready-to-deploy models with streamlined code and fully functional, parallel training and inference pipelines. Our experiments show various models being deployed for inference, training, or both, leveraging inputs from a real-time RGB camera sensor. Thus, our framework enables training and inference on resource-constrained edge devices for complex applications such as depth estimation.
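A minimal sketch of the parallel training-and-inference loop such a framework enables might pair two threads around a shared model and a lock; the model, the fake camera input, and the step counts below are assumptions for illustration, not the LITI codebase.

```python
# Minimal sketch of concurrent on-device training and inference: one thread
# takes gradient steps while another serves predictions, serialized by a
# lock. Model, fake "camera" input, and step counts are illustrative
# assumptions, not code from LITI.
import threading
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                    # stand-in for a depth network
opt = torch.optim.SGD(model.parameters(), lr=0.01)
lock = threading.Lock()

def train_loop(steps: int = 100) -> None:
    for _ in range(steps):
        x, y = torch.randn(16, 8), torch.randn(16, 1)
        with lock:                         # keep weights consistent
            opt.zero_grad()
            nn.functional.mse_loss(model(x), y).backward()
            opt.step()

def infer_loop(frames: int = 100) -> None:
    for _ in range(frames):
        frame = torch.randn(1, 8)          # stand-in for a camera frame
        with lock, torch.no_grad():
            _ = model(frame)               # serve a prediction

threads = [threading.Thread(target=train_loop),
           threading.Thread(target=infer_loop)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("trained and served concurrently")
```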
- Decision forests, including random forests, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model into one user-defined function (UDF), there also exists a relation-centric representation that breaks down decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark to compare these two implementations to the aforementioned decoupled inference pipelines and existing in-database inference pipelines such as Spark SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
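The relation-centric idea, expressing tree traversal as set-oriented SQL instead of a per-row UDF call, can be illustrated by compiling a single hand-made decision tree into a CASE expression; the tree, column names, and table below are invented for illustration:

```python
# Sketch of the relation-centric representation: compile a (tiny, hand-made)
# decision tree into a SQL CASE expression that scans a table in one query
# instead of invoking a per-row UDF. Tree, columns, and table are invented.
def tree_to_sql(node) -> str:
    """node: ('leaf', value) or (feature_column, threshold, left, right)."""
    if node[0] == "leaf":
        return str(node[1])
    feature, threshold, left, right = node
    return (f"CASE WHEN {feature} <= {threshold} "
            f"THEN {tree_to_sql(left)} ELSE {tree_to_sql(right)} END")

tree = ("age", 30,
        ("income", 50000, ("leaf", 0), ("leaf", 1)),
        ("leaf", 1))

print(f"SELECT id, {tree_to_sql(tree)} AS prediction FROM customers;")
# SELECT id, CASE WHEN age <= 30 THEN CASE WHEN income <= 50000 THEN 0
# ELSE 1 END ELSE 1 END AS prediction FROM customers;
```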