The networking industry is offering new services that leverage recent technological advances in connectivity, storage, and computing, such as mobile communications and edge computing. In this regard, extended reality, a term encompassing virtual reality, augmented reality, and mixed reality, can provide unprecedented user experiences and pioneering service opportunities, such as live concerts, sports, and other events; interactive gaming and entertainment; and immersive education, training, and demos. These services require high-bandwidth, low-latency, and reliable connections and are supported by the next-generation ultra-reliable low-latency communications envisioned for 6G mobile communication systems. In this work, we devise a novel scheme, called backup from different data centers with multicast and adaptive bandwidth provisioning, to admit reliable, low-latency, and high-bandwidth extended reality live streams in next-generation networks. We consider network services whose contents are non-cacheable and investigate how backup services can be offered by different data centers with multicast and adaptive bandwidth provisioning. The proposed service-provisioning scheme protects not only against link failures in the physical network but also against computing and storage failures in data centers. We develop scalable algorithms for the service-provisioning scheme and evaluate their performance on various complex network instances in a dynamic environment. Numerical results show that, compared to conventional service-provisioning schemes, such as those seeking backup services from the same data center, our scheme efficiently utilizes network resources, ensures higher reliability, and guarantees low latency, making it highly suitable for extended reality live streams.
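To make the core idea concrete, here is a minimal, illustrative sketch of admitting a stream by provisioning a primary path to one data center and a link-disjoint backup path to a *different* data center, with a smaller bandwidth reservation on the backup as one possible reading of adaptive bandwidth provisioning. The graph model, function names, and the single-path (non-multicast) simplification are assumptions made for illustration; the paper's actual scheme, including its multicast provisioning, is not reproduced here.

```python
# Hypothetical sketch only: primary path to one DC, link-disjoint backup path to
# a different DC, with a reduced bandwidth reservation on the backup.
import heapq

def shortest_feasible_path(graph, capacity, src, dst, bw, banned=frozenset()):
    """Dijkstra restricted to links with residual capacity >= bw, skipping banned links.

    graph: dict node -> list of (neighbor, cost); capacity: dict (u, v) -> residual bw.
    Returns a list of directed links [(u, v), ...] or None if no feasible path exists.
    """
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            path, node = [], dst
            while node != src:
                path.append((prev[node], node))
                node = prev[node]
            return list(reversed(path))
        if d > dist.get(u, float("inf")):
            continue
        for v, cost in graph.get(u, []):
            if (u, v) in banned or capacity.get((u, v), 0) < bw:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None

def admit_stream(graph, capacity, source, data_centers, primary_bw, backup_bw):
    """Admit a stream if a primary path to one DC and a link-disjoint backup path
    to a *different* DC can both be provisioned; otherwise return None (block)."""
    for primary_dc in data_centers:
        primary = shortest_feasible_path(graph, capacity, source, primary_dc, primary_bw)
        if primary is None:
            continue
        # Ban both directions of every primary link so the backup is link-disjoint.
        used = frozenset(primary) | frozenset((v, u) for u, v in primary)
        for backup_dc in data_centers:
            if backup_dc == primary_dc:
                continue                      # backup must come from a different DC
            backup = shortest_feasible_path(graph, capacity, source, backup_dc,
                                            backup_bw, banned=used)
            if backup is not None:
                return primary, backup
    return None
```

Reserving `backup_bw` below `primary_bw` reflects the trade-off the abstract hints at: a degraded but protected stream is admitted rather than blocking the request outright.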
Stop Rerouting!: Enabling ShareBackup for Failure Recovery in Data Center Networks
This paper introduces sharable backup as a novel solution to failure recovery in data center networks: the entire network shares a small pool of backup devices. This proposal is grounded in three key observations. First, traditional rerouting-based failure recovery is ineffective because the bandwidth loss from failures drastically degrades application performance; failed devices should therefore be replaced to restore bandwidth. Second, failures in data centers are rare but destructive [11], so it is desirable to seek cost-effective backup options. Third, the emergence of configurable data center network architectures makes it feasible to bring backup devices online dynamically. We design the ShareBackup prototype architecture to realize this idea. Compared to rerouting-based solutions, ShareBackup provides more bandwidth with shorter path lengths at low cost.
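The sharable-backup idea can be illustrated with a toy pool manager: a handful of backup switches serve the whole fabric, and a failed device is swapped for an idle backup instead of being routed around. The class and method names below are hypothetical, not ShareBackup's actual interfaces.

```python
# Toy model of a network-wide shared pool of backup switches.
class SharedBackupPool:
    def __init__(self, backup_switches):
        self.free = set(backup_switches)      # idle backup devices
        self.assigned = {}                    # failed switch -> backup replacing it

    def handle_failure(self, failed_switch):
        """Bring a backup online for the failed switch, if one is available."""
        if not self.free:
            return None                       # pool exhausted: fall back to rerouting
        backup = self.free.pop()
        self.assigned[failed_switch] = backup
        return backup

    def handle_repair(self, repaired_switch):
        """Return the backup to the pool once the failed switch is repaired."""
        backup = self.assigned.pop(repaired_switch, None)
        if backup is not None:
            self.free.add(backup)
        return backup

# Example: two backup devices shared by an entire fabric.
pool = SharedBackupPool({"b0", "b1"})
print(pool.handle_failure("sw17"))   # a backup replaces sw17
print(pool.handle_repair("sw17"))    # the backup goes back to the shared pool
```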
- Award ID(s): 1718980
- PAR ID: 10058535
- Date Published:
- Journal Name: HotNets-XVI Proceedings of the 16th ACM Workshop on Hot Topics in Networks
- Page Range / eLocation ID: 171 to 177
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- By processing sensory data in the vicinity of its generation, edge computing reduces latency, improves responsiveness, and saves network bandwidth in data-intensive applications. However, existing edge computing solutions operate under the assumption that the edge infrastructure comprises a set of pre-deployed, custom-configured computing devices connected by a reliable local network. Although edge computing has great potential to provision the necessary computational resources in highly dynamic and volatile environments, including disaster recovery scenes and wilderness expeditions, extant distributed system architectures in this domain are not resilient against partial failure caused by network disconnections. In this paper, we present a novel edge computing system architecture that delivers failure-resistant and efficient applications by dynamically adapting to failures: if the edge server becomes unreachable, device clusters execute the assigned tasks by communicating peer-to-peer until the edge server becomes reachable again. Our experimental results with the reference implementation show high responsiveness and resilience in the face of partial failure. These results indicate that the presented solution can integrate the individual capacities of mobile devices into powerful edge clouds, providing efficient and reliable services for end users in highly dynamic and volatile environments.
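A hedged sketch of the failover behaviour described in the entry above: tasks are offloaded to the edge server while it is reachable and executed peer-to-peer within the device cluster otherwise. All class and method names are illustrative assumptions, not the paper's implementation.

```python
class EdgeServer:
    """Stand-in for the remote edge server."""
    def __init__(self):
        self.reachable = True
    def is_reachable(self):
        return self.reachable
    def execute(self, task):
        return f"edge:{task}"

class PeerCluster:
    """Stand-in for nearby devices cooperating peer-to-peer."""
    def execute_p2p(self, task):
        return f"p2p:{task}"

def dispatch(task, edge_server, peer_cluster):
    """Offload to the edge server when reachable; fall back to P2P execution otherwise."""
    if edge_server.is_reachable():
        return edge_server.execute(task)
    return peer_cluster.execute_p2p(task)

server, cluster = EdgeServer(), PeerCluster()
print(dispatch("detect_objects", server, cluster))   # -> edge:detect_objects
server.reachable = False                              # simulate a network partition
print(dispatch("detect_objects", server, cluster))   # -> p2p:detect_objects
```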
- In the post-pandemic era, global working patterns have been reshaped and the demand for online network services has increased significantly. Cross-data-center content migration has therefore become a relevant problem, drawing greater attention to data backup and recovery planning. Beyond traditional pre-disaster content-redundancy approaches, this work focuses on the challenge of rapid post-disaster content evacuation under the threat of cascading failures. Due to the interdependence of data centers (DCs), inter-DC optical networks, and power grids, disasters may have a domino effect on these infrastructures, with their impact gradually expanding over time and space. In this paper, we propose two trajectory models that capture the dynamic evolution of cascading failures, and a trajectory-based content evacuation (TCE) strategy that exploits the spatiotemporal evolution of cascading failures to minimize content loss. Numerical results show that, when each DC needs to evacuate about 200 TB of content, TCE can reduce content loss by up to 25% compared to baseline strategies.
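As a rough illustration of the evacuation problem (not the paper's TCE algorithm), the sketch below takes a per-DC failure time as predicted by some cascading-failure trajectory model and computes how much content each DC can push out before its predicted failure. Field names and numbers are assumptions made for illustration.

```python
def plan_evacuation(dcs, now=0.0):
    """Per-DC upper bound on evacuable content before each predicted failure.

    dcs: list of dicts with 'name', 'content_tb', 'egress_tbph', and 'fail_time_h',
    the failure time predicted by a cascading-failure trajectory model.
    The real strategy must also coordinate shared paths and destination capacity;
    this sketch treats every DC independently.
    """
    plan, total_lost = [], 0.0
    for dc in dcs:
        window_h = max(0.0, dc["fail_time_h"] - now)               # time left before failure
        saved = min(dc["content_tb"], window_h * dc["egress_tbph"])
        plan.append((dc["name"], saved))
        total_lost += dc["content_tb"] - saved
    return plan, total_lost

dcs = [
    {"name": "DC-A", "content_tb": 200, "egress_tbph": 40, "fail_time_h": 3.0},
    {"name": "DC-B", "content_tb": 200, "egress_tbph": 40, "fail_time_h": 8.0},
]
print(plan_evacuation(dcs))   # DC-A can save 120 TB before failing, DC-B all 200 TB
```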
- With the rapid advancement of edge computing and network function virtualization, it is promising to provide flexible, low-latency network services at the edge. However, due to the vulnerability of edge services and the volatility of edge computing system states, i.e., service request rates, failure rates, and resource prices, it is challenging to minimize the online service cost while guaranteeing availability. This paper considers the problem of online virtual network function backup under availability constraints (OVBAC) for cost minimization in edge environments. We formulate the problem based on the characteristics of the volatile system states derived from real-world data and show the hardness of the formulated problem. We use an online backup deployment scheme named Drift-Plus-Penalty (DPP) with provable near-optimal performance for the OVBAC problem. In particular, DPP needs to solve an integer programming problem at the beginning of each time slot; we propose a dynamic-programming-based algorithm that solves this problem optimally in pseudo-polynomial time. Extensive real-world data-driven simulations demonstrate that DPP significantly outperforms popular baselines used in practice.
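The per-slot decision in a drift-plus-penalty style scheme can be sketched as a knapsack-like dynamic program: choosing k backups for a VNF consumes resources and money but improves availability, which relieves a virtual queue tracking the availability constraint. The objective form, availability model, and parameter names below are assumptions made for illustration; they are not the paper's exact formulation.

```python
def per_slot_backup_dp(vnfs, capacity, V):
    """Knapsack-style DP over the total backup resource budget for one time slot.

    vnfs: list of dicts with 'price' (cost per backup this slot), 'fail_prob',
          'Q' (virtual-queue backlog of the availability constraint),
          'size' (resource units per backup), and 'max_backups'.
    V: drift-plus-penalty weight trading off cost against queue stability.
    Returns (score, [backup count per VNF in input order]).
    """
    NEG = float("-inf")
    # best[c] = (best score, choices) using exactly c resource units so far
    best = [(0.0, [])] + [(NEG, []) for _ in range(capacity)]
    for f in vnfs:
        new = [(NEG, []) for _ in range(capacity + 1)]
        for c in range(capacity + 1):
            score_c, choice_c = best[c]
            if score_c == NEG:
                continue
            for k in range(f["max_backups"] + 1):
                use = c + k * f["size"]
                if use > capacity:
                    break
                avail = 1.0 - f["fail_prob"] ** (k + 1)          # primary + k backups
                score = score_c + f["Q"] * avail - V * f["price"] * k
                if score > new[use][0]:
                    new[use] = (score, choice_c + [k])
        best = new
    return max(best, key=lambda t: t[0])

vnfs = [
    {"price": 2.0, "fail_prob": 0.1, "Q": 50.0, "size": 2, "max_backups": 3},
    {"price": 1.0, "fail_prob": 0.3, "Q": 20.0, "size": 1, "max_backups": 3},
]
score, backups = per_slot_backup_dp(vnfs, capacity=5, V=1.0)
print(backups)   # backup count chosen for each VNF this slot
```

The run time is proportional to the number of VNFs, the resource capacity, and the maximum backup count, i.e., pseudo-polynomial in the capacity, matching the complexity class the abstract mentions.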
- Data center downtime typically centers on IT equipment failure, and storage devices are the most frequently failing components in data centers. We present a comparative study of the hard disk drives (HDDs) and solid-state drives (SSDs) that constitute typical data center storage. Using six years of field data for 100,000 HDDs of different models from the same manufacturer (the Backblaze dataset) and six years of field data for 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that the root failure causes differ from common expectations and remain difficult to discern. For HDDs, we observe that young and old drives do not differ much in their failures; instead, failures can be distinguished by discriminating drives based on the time spent on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that prove surprisingly accurate, achieving high recall and low false-positive rates. These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
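For flavor, here is a minimal sketch of the kind of failure-prediction model described in the entry above: a classifier trained on per-drive telemetry and evaluated by recall and false-positive rate. The synthetic features (head-positioning time, age, workload) stand in for real SMART/workload attributes and are not the paper's feature set or model.

```python
# Illustrative drive-failure classifier on synthetic telemetry.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.gamma(2.0, 1.0, n),    # time spent on head positioning (arbitrary units)
    rng.uniform(0, 6, n),      # drive age in years
    rng.exponential(1.0, n),   # workload intensity
])
# Synthetic labels: failure risk grows with head-positioning time and workload.
p_fail = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 2] - 3.5)))
y = rng.binomial(1, p_fail)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("recall:", recall_score(y_te, pred))
print("false positive rate:", fp / (fp + tn))
# Feature importances are a rough proxy for which telemetry attributes drive failures.
print("feature importances:", clf.feature_importances_)
```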