

This content will become publicly available on August 1, 2025

Title: Static and Dynamic Load-Triggered Cascading Failure Mitigation for Storage Area Networks

Storage area networks (SANs) are a widely used and dependable solution for data storage. Nevertheless, cascading failures caused by overloading have emerged as a significant risk to SAN reliability, impeding the delivery of the desired quality of service to users. This paper proposes both static and dynamic load-triggered redistribution strategies to alleviate the cascading failure risk during the mission time. Two types of node selection rules, based respectively on the load level and on node reliability, are studied and compared. SAN component reliability is evaluated using the accelerated failure-time model under the power law, and the SAN reliability is then evaluated using binary decision diagrams. A detailed case study of a mesh SAN compares the performance of different cascading failure mitigation schemes using two criteria: the SAN reliability improvement ratio and the resulting SAN reliability after the mitigation.
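As a rough illustration of the ideas named in the abstract, the sketch below pairs a power-law accelerated failure-time reliability function with a single load-triggered redistribution step under the two node selection rules. It is not the paper's method. The Weibull baseline, the parameter values (`alpha`, `beta`, `eta`, `base_load`), the threshold, and all function and node names are assumptions made only for this example.

```python
import math

def aft_reliability(t, load, eta, base_load=1.0, alpha=2.0, beta=1.5):
    """Weibull reliability at time t, with time scaled by a power-law
    acceleration factor (load / base_load) ** alpha (assumed form)."""
    accel = (load / base_load) ** alpha
    return math.exp(-((t * accel / eta) ** beta))

def pick_target(nodes, t, rule):
    """Choose the node that receives the shed load.
    'load' picks the lowest-loaded node; 'reliability' picks the node
    with the highest current reliability under the assumed model."""
    if rule == "load":
        return min(nodes, key=lambda n: nodes[n]["load"])
    if rule == "reliability":
        return max(
            nodes,
            key=lambda n: aft_reliability(t, nodes[n]["load"], nodes[n]["eta"]),
        )
    raise ValueError(f"unknown rule: {rule}")

def redistribute(nodes, source, threshold, t, rule):
    """Move the load above `threshold` from `source` to one selected node.
    Returns a new dict; the input is left unchanged."""
    updated = {name: dict(attrs) for name, attrs in nodes.items()}
    excess = updated[source]["load"] - threshold
    if excess <= 0:
        return updated  # threshold not exceeded, nothing is triggered
    candidates = {n: v for n, v in updated.items() if n != source}
    target = pick_target(candidates, t, rule)
    updated[source]["load"] = threshold
    updated[target]["load"] += excess
    return updated

# Hypothetical three-node example: s1 exceeds the threshold of 1.2.
nodes = {
    "s1": {"load": 1.6, "eta": 1000.0},
    "s2": {"load": 0.8, "eta": 400.0},
    "s3": {"load": 1.0, "eta": 2000.0},
}
by_load = redistribute(nodes, "s1", threshold=1.2, t=100.0, rule="load")
by_reliability = redistribute(nodes, "s1", threshold=1.2, t=100.0, rule="reliability")
```

With these made-up numbers the two rules choose different targets: the load rule sends the excess to `s2`, the least-loaded node, while the reliability rule sends it to `s3`, whose larger assumed `eta` gives it higher reliability despite its higher load.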

 
Award ID(s):
2302094
NSF-PAR ID:
10520414
Author(s) / Creator(s):
; ;
Publisher / Repository:
Ram Arti Publisher
Date Published:
Journal Name:
International Journal of Mathematical, Engineering and Management Sciences
Volume:
9
Issue:
4
ISSN:
2455-7749
Page Range / eLocation ID:
697 to 713
Subject(s) / Keyword(s):
Cascading failure Dynamic scheme Load redistribution Mitigation Static scheme
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Security concerns have been raised about cascading failure risks in evolving power grids. This paper reveals, for the first time, that the risk of cascading failures can be increased at low network demand levels when considering security-constrained generation dispatch. This occurs because critical transmission corridors become very highly loaded due to the presence of centralized generation dispatch, e.g., large thermal plants far from demand centers. This increased cascading risk is revealed in this work by incorporating security-constrained generation dispatch into the risk assessment and mitigation of cascading failures. A security-constrained AC optimal power flow, which considers economic functions and security constraints (e.g., network constraints, N − 1 security, and generation margin), is used to provide a representative day-ahead operational plan. Cascading failures are simulated using two simulators, a quasi-steady state DC power flow model, and a dynamic model incorporating all frequency-related dynamics, to allow for result comparison and verification. The risk assessment procedure is illustrated using synthetic networks of 200 and 2,000 buses. Further, a novel preventive mitigation measure is proposed to first identify critical lines, whose failures are likely to trigger cascading failures, and then to limit power flow through these critical lines during dispatch. Results show that shifting power equivalent to 1% of total demand from critical lines to other lines can reduce cascading risk by up to 80%.
  2. This article formulates risk-based component importance measures (RCIMs) to identify critical components for a bridge subjected to earthquakes. RCIM, unlike traditional reliability-based importance measures, is well suited for complex systems like bridges as it uses a flexible system failure definition based on bridge-level risk consequences or performance objectives. In contrast to traditional notions of risk and reliability related to structural system failures, the RCIMs embrace a broader definition of risk that includes consequences of system failure to society and the environment. System failure is defined as exceedance of a user-defined threshold of system risk (e.g., based on cost, emissions, embodied energy, or other metrics). Our definition of system failure using global performance metrics helps relate component reliability to system reliability explicitly and offers an alternative perspective for aggregating joint failure events of bridge components into relevant damage states. RCIM combines information about the reliability of a component and its contribution to the system performance; hence the importance measures developed offer risk mitigation indicators for decision makers as to which components may require upgrade to achieve a given performance objective. The proposed RCIMs generalize existing importance measures through analytical exploration of the entire space of system configurations with correlated component failures, while also considering multiple risk-based criteria beyond reliability. These RCIMs, demonstrated through a seismic bridge system case study, further show that as the bridge components age and the hazard intensity varies, the relative contribution of the components to system risk also shifts.

     
  3. L. Cromarty, R. Shirwaiker (Ed.)
    The growth of renewable energy technologies creates significant challenges for the stability of the system because of their intermittency. Nonetheless, we can value these technologies with storage systems. We model the supply by a renewable technology, wind, into a storage facility using the leaky bucket mechanism. The bucket is synonymous with storage while the leakage is equivalent to meeting load. Modelica is used to capture: (i) the time-dependence of the state of the bucket based on a physical model of storage; (ii) the stochastic representation of wind energy using wind speed data that is fed into a physical model of a wind technology; and (iii) the load, modeled as a resistor-inductor circuit. The strength of Modelica in using non-causal equations for basic sub-systems that are linked together is harnessed through its libraries. We find that there is a diminishing return to storage. Beyond a certain level of storage, the integration of a reliable baseload power supply is required to diminish the risk due to reduced reliability. The need for storage systems as a hedge against intermittency is dependent on the interplay between the supply volatilities and the stochastic load to guarantee an acceptable level of quality of service and reliability. 
  4. It is essential to study the robustness and centrality of interdependent networks for building reliable interdependent systems. Here, we consider a nonlinear load-capacity cascading failure model on interdependent networks, where the initial load distribution is not random, as usually assumed, but determined by the influence of each node in the interdependent network. The node influence is measured by an automated entropy-weighted multi-attribute algorithm that takes into account both different centrality measures of nodes and the interdependence of node pairs, and then averages over not only the node itself but also its nearest neighbors and next-nearest neighbors. The resilience of interdependent networks under this more practical and accurate setting is thoroughly investigated for various network parameters, as well as for how nodes from different layers are coupled and the corresponding coupling strength. The results can thereby help to better monitor interdependent systems.

     
  5. Data reliability, availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. Compared to the recovery phase of the data repair, which is widely studied and well optimized, the failure identification phase of the data repair is less investigated. Moreover, in a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high-risk stripe (a stripe with many failed chunks), a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low-risk stripe (one with only a few failed chunks), a longer identification time is adopted, thus reducing the repair network traffic. Therefore, the RAS can be improved simultaneously. We use both simulations and a prototype implementation to evaluate RAFI. Results collected from extensive simulations demonstrate the effectiveness and efficiency of RAFI in improving the RAS. We implement a prototype on HDFS to verify the correctness and evaluate the computational cost of RAFI.