Title: Automating Failure Testing Research at Internet Scale
Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected into their production systems. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and scale only with human effort. In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose in adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and we present early results.
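To make the core idea concrete, here is a minimal sketch of the LDFI search loop in Python. The harness `run_request`, the brute-force candidate enumeration, and the `max_faults` bound are hypothetical stand-ins: the system described in the paper derives fault hypotheses from call-graph lineage and a solver rather than by exhaustive enumeration.

```python
from itertools import combinations

def ldfi_search(run_request, max_faults=3):
    """Minimal sketch of the lineage-driven fault injection loop.

    run_request(faults) executes one request with the given call-sites
    disabled; it returns the set of call-sites that supported a
    successful outcome, or None on a user-visible failure.
    """
    lineages = [run_request(frozenset())]  # the fault-free run must succeed
    tried = set()
    while True:
        sites = sorted(set().union(*lineages))
        # A useful fault hypothesis must intersect ("hit") every lineage
        # seen so far; otherwise some known-good path would survive it.
        candidate = next(
            (frozenset(c)
             for k in range(1, max_faults + 1)
             for c in combinations(sites, k)
             if frozenset(c) not in tried
             and all(set(c) & lineage for lineage in lineages)),
            None)
        if candidate is None:
            return None                    # hypotheses exhausted: no bug found
        tried.add(candidate)
        outcome = run_request(candidate)
        if outcome is None:
            return candidate               # a fault set that breaks the request
        lineages.append(outcome)           # survived via an alternate path: learn it
```

Each injection either reproduces a user-visible failure or reveals a new redundant path, which shrinks the space of remaining hypotheses; that feedback loop is what separates lineage-driven search from random exploration.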
Award ID(s): 1652368
NSF-PAR ID: 10053503
Author(s) / Creator(s): Alvaro, Peter; Andrus, Kolton; Sanden, Chris; Rosenthal, Casey; Basiri, Ali; Hochstein, Lorin
Date Published: 2016
Journal Name: Proceedings of the Seventh ACM Symposium on Cloud Computing
Page Range / eLocation ID: 17 to 28
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Continuous Integration (CI) practices encourage developers to frequently integrate code into a shared repository. Each integration is validated by an automatic build and test run so that errors are revealed as early as possible. When CI failures or integration errors are reported, existing techniques are insufficient to automatically locate the root causes, for two reasons. First, a CI failure may be triggered by faults in source code and/or build scripts, while current approaches consider only source code. Second, a tentative integration can fail because of build failures and/or test failures, while existing tools focus on test failures only. This paper presents UniLoc, the first unified technique to localize faults in both source code and build scripts given a CI failure log, without assuming the failure's location (source code or build scripts) or nature (a test failure or not). Adopting an information retrieval (IR) strategy, UniLoc locates buggy files by treating source code and build scripts as documents to search and build logs as search queries. However, instead of naïvely applying an off-the-shelf IR technique to these software artifacts, UniLoc applies various domain-specific heuristics to optimize the search queries, search space, and ranking formulas for more accurate fault localization. To evaluate UniLoc, we gathered 700 CI failure fixes in 72 open-source projects that are built with Gradle. UniLoc effectively located bugs, with an average MRR (Mean Reciprocal Rank) of 0.49, MAP (Mean Average Precision) of 0.36, and NDCG (Normalized Discounted Cumulative Gain) of 0.54, outperforming the state-of-the-art IR-based tools BLUiR and Locus. UniLoc has the potential to help developers diagnose root causes of CI failures more accurately and efficiently.
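The IR formulation lends itself to a compact illustration. The sketch below uses scikit-learn's TF-IDF machinery as an off-the-shelf stand-in; `rank_suspects` and its inputs are invented for the example, and none of UniLoc's domain-specific query, search-space, or ranking heuristics are reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_suspects(build_log, files):
    """Rank candidate buggy files (source code *and* build scripts)
    against a CI failure log: the files are the documents to search,
    and the log is the query.  `files` maps a path to its text."""
    paths = list(files)
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]+")
    documents = vectorizer.fit_transform(files[p] for p in paths)
    query = vectorizer.transform([build_log])
    scores = cosine_similarity(query, documents).ravel()
    return sorted(zip(paths, scores), key=lambda pair: -pair[1])
```

Indexing source files and build scripts together is the point of the unified formulation: the query need not assume in advance which kind of artifact is at fault.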
  2. The widespread application of phasor measurement units (PMUs) has improved grid operational reliability. However, it has also increased the risk of cyber threats such as false data injection attacks, which mislead time-critical measurements and may lead to incorrect operator actions. While a single incorrect operator action might not result in a cascading failure, a series of actions impacting critical lines and transformers, combined with pre-existing faults or scheduled maintenance, might lead to widespread outages. To prevent cascading failures, controlled islanding strategies are traditionally implemented. However, islanding is effective only when the received data are trustworthy. This paper investigates two multi-objective controlled islanding strategies that accommodate data uncertainties under scenarios with either no knowledge or only partial knowledge of a false data injection attack. When attack information is not available, the optimization problem maximizes island observability using a minimum number of PMUs for more accurate state estimation. When partial attack information is available, vulnerable PMUs are isolated to a smaller island to minimize the impact of the attack. Additional objectives ensure the steady-state and transient-state stability of the islands. Simulations are performed on 200-bus, 500-bus, and 2000-bus systems.
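One objective in the no-attack-information case — observing every bus in an island with as few PMUs as possible — reduces, in this simplified reading, to a minimum dominating set over the island's bus graph. The greedy approximation below is a hypothetical stand-in for the paper's multi-objective optimization, which also weighs state estimation accuracy and stability constraints.

```python
def greedy_pmu_placement(adjacency):
    """Greedily place PMUs so every bus in one island is observed.

    adjacency maps each bus to the set of its neighboring buses; a PMU
    at a bus observes that bus and all of its neighbors.
    """
    unobserved = set(adjacency)
    placed = []
    while unobserved:
        # Place the next PMU where it newly observes the most buses.
        best = max(adjacency,
                   key=lambda bus: len(({bus} | adjacency[bus]) & unobserved))
        placed.append(best)
        unobserved -= {best} | adjacency[best]
    return placed

# A 5-bus example island: PMUs at buses 2 and 4 observe everything.
island = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_pmu_placement(island))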
  3. As multi-tenant FPGA applications continue to scale in size and complexity, their need for resilience against environmental effects and malicious actions continues to grow. To ensure continuously correct computation, faults in the compute fabric must be identified, isolated, and suppressed in the nanosecond-to-microsecond range. In this paper, we detail a circuit- and system-level methodology to detect compute failure conditions caused by on-FPGA voltage attacks. Our approach rapidly suppresses incorrect results and regenerates potentially tainted results before they propagate, allowing time for the attacker to be suppressed. Instrumentation includes voltage sensors that detect error conditions induced by attackers. This detection is paired with focused remediation approaches involving data buffering, fault suppression, result recalculation, and computation restart. Our approach has been demonstrated using an RSA encryption circuit implemented on a Stratix 10 FPGA. We show that a voltage attack using on-FPGA power wasters can be effectively detected, and computation halted, within 15 ns, preventing the injection of timing faults. Potentially tainted results are successfully regenerated, allowing fault-free circuit operation. A full characterization of the latency and resource overheads of fault detection and recovery is provided.
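The remediation pipeline — halt on a sensor alarm, discard and regenerate the possibly tainted window, then resume — can be sketched in software, though the real mechanism is on-chip circuitry operating at nanosecond scale. Everything here (`guarded_stream`, `voltage_ok`, the window size) is invented for illustration.

```python
from collections import deque

def guarded_stream(compute, inputs, voltage_ok, window=4):
    """Buffer results before releasing them; on a voltage alarm, wait
    out the attack, then regenerate the whole possibly tainted window
    rather than letting faulty results propagate downstream.
    """
    buffer = deque()                      # (input, result) pairs held back
    for x in inputs:
        buffer.append((x, compute(x)))
        if not voltage_ok():
            while not voltage_ok():
                pass                      # computation halted during the attack
            # Regenerate everything produced near the alarm.
            buffer = deque((i, compute(i)) for i, _ in buffer)
        if len(buffer) > window:
            yield buffer.popleft()[1]     # old enough to be trusted: release
    for _, result in buffer:
        yield result                      # drain once the stream ends
```

In the hardware described, detection and suppression complete before tainted values can propagate; the sketch only illustrates the ordering of detection, suppression, and recomputation.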
  4. We consider a parallel computational model, the Parallel Persistent Memory model, comprising P processors, each with a fast local ephemeral memory of limited size, all sharing a large persistent memory. The model allows each processor to fault at any time (with bounded probability) and possibly restart. When a processor faults, all of its state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are nearly as fast as existing random access memory, are accessible at the granularity of cache lines, and can survive power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. We present several results for the model, using an approach that breaks a computation into capsules, each of which can be safely run multiple times. For the single-processor version, we describe how to simulate any program in the RAM model, the external memory model, or the ideal cache model with expected constant-factor overhead. For the multiprocessor version, we describe how to efficiently implement a work-stealing scheduler within the model such that it handles both soft faults, where a processor restarts, and hard faults, where a processor permanently fails. For any multithreaded fork-join computation that is race free and write-after-read conflict free, with W work, D depth, and C maximum capsule work in the absence of faults, the scheduler guarantees an expected time bound on the model of O(W/P_A + (D·P/P_A) log_{1/(Cf)} W), where P is the maximum number of processors, P_A is the average number, and f ≤ 1/(2C) is the probability that a processor faults between successive persistent memory accesses. Within the model, and using the proposed methods, we develop efficient algorithms for parallel prefix sums, merging, sorting, and matrix multiplication.
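The capsule idea admits a small illustration: if a capsule reads only persistent state, works in ephemeral memory, and becomes visible through a single commit write, then re-running it after a fault is harmless. The sketch below simulates faults with coin flips; `run_capsules`, the dict standing in for persistent memory, and `fault_prob` are all invented for the example.

```python
import random

def run_capsules(capsules, persistent, fault_prob=0.1):
    """Run (name, capsule) pairs so that each can safely execute
    multiple times: a capsule reads only `persistent`, computes in
    ephemeral state, and becomes visible via one commit write.
    """
    for name, capsule in capsules:
        while name not in persistent:          # retry until committed
            result = capsule(persistent)       # ephemeral, side-effect free
            if random.random() < fault_prob:
                continue                       # fault: ephemeral result lost
            persistent[name] = result          # the atomic commit point
    return persistent

# Example: a two-capsule pipeline surviving simulated faults.
pm = {"input": [3, 1, 4, 1, 5]}
run_capsules([("sorted", lambda s: sorted(s["input"])),
              ("total",  lambda s: sum(s["sorted"]))], pm)
print(pm["total"])  # 14
```

Because each capsule's only externally visible effect is its commit write, a fault between commits merely wastes work; it can never leave the persistent state half-updated.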
  5. Electrical and computer engineering technologies have evolved into dynamic, complex systems that profoundly change the world we live in. Designing these systems requires not only technical knowledge and skills but also new ways of thinking and the development of social, professional, and ethical responsibility. A large electrical and computer engineering department at a Midwestern public university is transforming into a more agile, less traditional organization to better respond to student, industry, and societal needs. This is being done through new structures for faculty collaboration, facilitated through departmental change processes. Ironically, an impetus behind this effort was a failed attempt at department-wide curricular reform. This failure led to the recognition of the need for more systemic change, and a project emerged from over two years of effort. The project uses a cross-functional, collaborative instructional model for course design and professional formation, called X-teams. X-teams are reshaping the core technical ECE curricula in the sophomore and junior years through pedagogical approaches that (a) promote design thinking, systems thinking, professional skills such as leadership, and inclusion; (b) contextualize course concepts; and (c) stimulate creative, socio-technical-minded development of ECE technologies. An X-team comprises ECE faculty members, including the primary instructor, an engineering education and/or design faculty member, an industry practitioner, context experts, instructional specialists (as needed to support the process of teaching, including effective inquiry and inclusive teaching), and student teaching assistants. X-teams use an iterative design thinking process and reflection to explore pedagogical strategies. X-teams also serve as change agents for the rest of the department through communities of practice referred to as Y-circles. Y-circles, composed of X-team members, faculty, staff, and students, engage in a process of discovery and inquiry to bridge the engineering education research-to-practice gap. Research studies are being conducted to understand (1) how educators involved in X-teams use design thinking to create new pedagogical solutions; (2) how the middle years affect students' professional ECE identity development as design thinkers; (3) how ECE students overcome barriers, make choices, and persist along their educational and career paths; and (4) the effects of department structures, policies, and procedures on faculty attitudes, motivation, and actions. This paper presents the efforts that led up to the project, including failures and opportunities. It summarizes the project, describes related work, and presents early progress in implementing new approaches.