Title: Scheme of carrier cooperation with coordinated scheduling for faster and lower-cost failure/disaster recovery
Large-scale carrier networks are fundamental ICT infrastructures that support future 5G/6G services, and their resilience is a primary societal concern. In contrast to single-carrier networks (in which one carrier owns multiple networks), multi-carrier network ecosystems (in which the networks in the field are operated by different carriers) require cooperation among the different carriers to achieve resilience against large-scale failures. However, such cooperation is challenging because carriers may not disclose confidential information, e.g., detailed resource availability. In this study, we investigate how to perform cooperative recovery among carriers in the case of large-scale failures/disasters. We propose a two-stage carrier-carrier cooperative recovery planning scheme that incorporates coordinated scheduling for faster recovery. Through numerical evaluation, we confirm the potential benefit of carrier cooperation in reducing both recovery time and recovery cost.
Award ID(s):
2210384
PAR ID:
10531124
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Publisher / Repository:
Optical Society of America
Date Published:
Journal Name:
Journal of Optical Communications and Networking
Volume:
16
Issue:
5
ISSN:
1943-0620; JOCNBB
Format(s):
Medium: X
Size(s):
Article No. B45
Sponsoring Org:
National Science Foundation
More Like this
  1. Large-scale network-cloud ecosystems are fundamental infrastructures that support future 5G/6G services, and their resilience is a primary societal concern for the years to come. In contrast to a single-entity ecosystem (in which one entity owns the whole infrastructure), multi-entity ecosystems (in which the networks and datacenters are owned by different entities) require cooperation among the different entities to achieve resilience against large-scale failures. Such cooperation is challenging because different entities may not disclose confidential information, e.g., detailed resource availability. Carriers are important to the resilience of multi-entity ecosystems, as all the entities rely on carriers’ communication services. Thus, in this study we investigate how to perform cooperative recovery among carriers in case of large-scale failures/disasters. We propose a two-stage cooperative recovery planning scheme that incorporates coordinated scheduling for swift recovery. Through preliminary numerical evaluation, we confirm the potential benefit of carrier cooperation in reducing both recovery time and recovery cost/burden.
  2. Cooperation among telecom carriers and datacenter (DC) providers (DCPs) is essential to ensure the resiliency of network-cloud ecosystems. To enable efficient cooperative recovery in case of a resource crunch, e.g., due to traffic congestion or network failures, we previously studied several frameworks for cooperative recovery among different stakeholders (e.g., telecom carriers and DCPs). Now, we introduce a novel Multi-entity Cooperation Platform (MCP) for implementing cooperative recovery planning, to achieve efficient use of carriers’ valuable optical-network resources during recovery. We adopt Distributed Ledger Technology (DLT), which ensures decentralized and tamper-proof information exchange among stakeholders, to achieve open and fair cooperation. To support diverse types of cooperation, we develop a state machine representing the MCP operation and define the state transitions associated with stakeholders’ cooperation within the state machine. Moreover, we propose a signaling system in MCP to ensure simple and reliable state transitions for stakeholders during cooperative recovery planning in large ecosystems. We experimentally demonstrate a proof-of-concept DLT-based MCP on a testbed. We showcase a DCP-carrier cooperative planning process, showing the flexibility of the proposed MCP to support diverse types of cooperation.
  3. To accommodate the growing demand for cloud services, telecom carriers’ networks and datacenter (DC) facilities form large network–cloud ecosystems (ecosystems for short) that physically support these services. These large-scale ecosystems are continuously evolving and must be highly resilient to support critical services. Open and disaggregated optical-networking technologies promise to enhance interoperability across telecom carriers and DC operators, thanks to their open interfaces in both the data plane and the control/management plane. In the first part of this paper, we focus on a single entity (e.g., a telecom carrier or an emerging telecom/DC partnership company) that owns both the network and DC infrastructures in the ecosystem. We introduce a solution that leverages open and disaggregated technologies to enhance the resilience of the optical networks within a multi-vendor and multi-domain ecosystem. In the second part of this paper, we consider the case in which the networks and DCs are owned by different entities. In this case as well, cooperation among datacenter providers (DCPs) and carriers is crucial to provide failure/disaster resilience to today’s cloud services. However, such cooperation is more challenging because DCPs and carriers, being different entities, may not disclose confidential information, e.g., detailed resource availability. Hence, we introduce a solution to enhance the resilience of such multi-entity ecosystems through cooperation between DCPs and carriers without violating confidentiality.
  4. Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact (by up to 500×) while maintaining negligible overhead (<3%).
  5. The increased complexity of infrastructure systems has resulted in critical interdependencies between multiple networks: communication systems require electricity, while the normal functioning of the power grid relies on communication systems. These interdependencies have inspired an extensive literature on coupled multilayer networks, assuming a hard interdependence, where a component failure in one network causes failures in the other network, resulting in a cascade of failures across multiple systems. While empirical evidence of such hard failures is limited, the repair and recovery of a network requires resources typically supplied by other networks, resulting in documented interdependencies induced by the recovery process. In this work, we explore recovery coupling, capturing the dependence of the recovery of one system on the instantaneous functional state of another system. If the support networks are not functional, recovery will be slowed. Here we collected data on the recovery time of millions of power grid failures, finding evidence of universal nonlinear behavior in recovery following large perturbations. We develop a theoretical framework to address recovery coupling, predicting quantitative signatures different from the multilayer cascading failures. We then rely on controlled natural experiments to separate the role of recovery coupling from other effects like resource limitations, offering direct evidence of how recovery coupling affects a system’s functionality.