By processing sensory data in the vicinity of its
generation, edge computing reduces latency, improves responsiveness,
and saves network bandwidth in data-intensive applications.
However, existing edge computing solutions operate under the
assumption that the edge infrastructure will comprise a set of
pre-deployed, custom-configured computing devices, connected
by a reliable local network. Although edge computing has great
potential to provision the necessary computational resources in
highly dynamic and volatile environments, including disaster
recovery scenes and wilderness expeditions, extant distributed
system architectures in this domain are not resilient against
partial failure caused by network disconnections. In this paper,
we present a novel edge computing system architecture that
delivers failure-resistant and efficient applications by dynamically
adapting to failures: if the edge server becomes unreachable,
device clusters execute the assigned tasks by communicating
peer-to-peer until the edge server becomes reachable again.
Our experimental results with the reference implementation show
high responsiveness and resilience in the face of partial failure.
These results indicate that the presented solution can integrate
the individual capacities of mobile devices into powerful edge
clouds, providing efficient and reliable services for end-users in
highly dynamic and volatile environments.
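The fallback behavior described in this abstract (use the edge server while it is reachable, otherwise execute through peers) can be illustrated with a minimal sketch. This is not the paper's reference implementation: `run_task`, `EdgeUnreachable`, and the idea of modeling the edge server and peers as plain callables are all assumptions made for illustration.

```python
from typing import Any, Callable, Sequence, Tuple


class EdgeUnreachable(Exception):
    """Hypothetical error raised when an executor cannot be reached."""


def run_task(
    task: Any,
    edge: Callable[[Any], Any],
    peers: Sequence[Callable[[Any], Any]],
) -> Tuple[str, Any]:
    """Run a task on the edge server, falling back to peer devices.

    Returns a (route, result) pair, where route is "edge" or "p2p".
    """
    # Always try the edge server first, so execution returns to it
    # automatically as soon as it becomes reachable again.
    try:
        return "edge", edge(task)
    except EdgeUnreachable:
        pass

    # Edge server is down: ask peers in order until one succeeds.
    for peer in peers:
        try:
            return "p2p", peer(task)
        except EdgeUnreachable:
            continue

    raise EdgeUnreachable("no edge server or peer could run the task")
```

Because every call retries the edge server before contacting peers, no separate recovery step is needed in this sketch; a real system would likely add timeouts and peer discovery, which are omitted here.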
Stop Rerouting!: Enabling ShareBackup for Failure Recovery in Data Center Networks
This paper introduces sharable backup as a novel solution to failure recovery in data center networks. It allows the entire network to share a small pool of backup devices. This proposal is grounded in three key observations. First, the traditional rerouting-based failure recovery is ineffective, because bandwidth loss from failures degrades application performance drastically. Therefore, failed devices should be replaced to restore bandwidth. Second, failures in data centers are rare but destructive [11], so it is desirable to seek cost-effective backup options. Third, the emergence of configurable data center network architectures promises feasibility of bringing backup devices online dynamically. We design the ShareBackup prototype architecture to realize this idea. Compared to rerouting-based solutions, ShareBackup provides more bandwidth with short path length at low cost.
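The core idea of a small backup pool shared by the whole network can be sketched as simple bookkeeping. The class below is a toy model under stated assumptions, not the ShareBackup design itself: `BackupPool`, `on_failure`, and `on_repair` are hypothetical names, and the sketch ignores how backups are physically switched into the topology.

```python
from typing import Dict, Iterable, List, Optional


class BackupPool:
    """Toy model of a pool of backup devices shared by all devices."""

    def __init__(self, backups: Iterable[str]) -> None:
        self.free: List[str] = list(backups)   # backups not currently in use
        self.assigned: Dict[str, str] = {}     # failed device -> backup

    def on_failure(self, device: str) -> Optional[str]:
        """Assign a free backup to a failed device, or None if the pool is empty."""
        if device in self.assigned:
            return self.assigned[device]       # already covered by a backup
        if not self.free:
            return None                        # pool exhausted
        backup = self.free.pop()
        self.assigned[device] = backup
        return backup

    def on_repair(self, device: str) -> None:
        """Return the backup to the pool once the failed device is repaired."""
        backup = self.assigned.pop(device, None)
        if backup is not None:
            self.free.append(backup)
```

With a pool of one backup, a second concurrent failure gets `None` until the first device is repaired, which illustrates why rare failures make a small shared pool cost-effective.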
- Award ID(s): 1718980
- PAR ID: 10058535
- Date Published:
- Journal Name: HotNets-XVI Proceedings of the 16th ACM Workshop on Hot Topics in Networks
- Page Range / eLocation ID: 171 to 177
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
With the rapid advancement of edge computing and network function virtualization, it is promising to provide flexible and low-latency network services at the edge. However, due to the vulnerability of edge services and the volatility of edge computing system states, i.e., service request rates, failure rates, and resource prices, it is challenging to minimize the online service cost while providing the availability guarantee. This paper considers the problem of online virtual network function backup under availability constraints (OVBAC) for cost minimization in edge environments. We formulate the problem based on the characteristics of the volatile system states derived from real-world data and show the hardness of the formulated problem. We use an online backup deployment scheme named Drift-Plus-Penalty (DPP) with provable near-optimal performance for the OVBAC problem. In particular, DPP needs to solve an integer programming problem at the beginning of each time slot. We propose a dynamic programming-based algorithm that can optimally solve the problem in pseudo-polynomial time. Extensive real-world data-driven simulations demonstrate that DPP significantly outperforms popular baselines used in practice.
-
Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that constitute the typical storage in data centers. Using six-year field data of 100,000 HDDs of different models from the same manufacturer from the Backblaze dataset and six-year field data of 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that their root failure causes differ from common expectations and that they remain difficult to discern. For the case of HDDs, we observe that young and old drives do not present many differences in their failures. Instead, failures may be distinguished by discriminating drives based on the time spent for head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that are shown to be surprisingly accurate, achieving high recall and low false positive rates. These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
-
Next-generation optical metro-access networks are expected to support end-to-end virtual network slices for critical 5G services. However, disasters affecting physical infrastructures upon which network slices are mapped can cause significant disruption in these services. Operators can deploy recovery units or trucks to restore services based on slice requirements. In this study, we investigate the problem of slice-aware service restoration in metro-access networks with specialized recovery trucks to restore services after a disaster failure. We model the problem based on the classical vehicle-routing problem to find optimal routes for recovery trucks to failure sites to provide temporary backup service until the network components are repaired. Our proposed slice-aware service-restoration approach is formulated as a mixed integer linear program with the objective to minimize the penalty of service disruption across different network slices. We compare our slice-aware approach with a slice-unaware approach and show that our proposed approach can achieve significant reduction in service-disruption penalty.
-
Accurate prediction of product failures and the need for repair services become critical for various reasons, including understanding the warranty performance of manufacturers, defining cost-efficient repair strategies, and compliance with safety standards. The purpose of this study is to use machine learning tools to analyze several parameters crucial for achieving a robust repair service system, including the number of repairs, the time of the next repair ticket or product failure, and the time to repair. A large dataset of over 530,000 repairs and maintenance of medical devices has been investigated by employing the Support Vector Machine (SVM) tool. SVM with four kernel functions is used to forecast the timing of the next failure or repair request in the system for two different products and two different failure types, namely random failure and physical damage. A frequency analysis is also conducted to explore the product quality level based on product failure and the time to repair it. In addition, the best probability distributions are fitted for the number of failures, the time between failures, and the time to repair. The results reveal the value of data analytics and machine learning tools in analyzing post-market product performance and the cost of repair and maintenance operations.