SRAM-based FPGAs are frequently used for critical functions in space applications. Soft processors implemented within these FPGAs are often needed to satisfy the mission requirements. The open ISA, RISC-V, has allowed for the development of a wide range of open source processors. Like all SRAM-based FPGA digital designs, these soft processors are susceptible to SEUs. This paper presents an investigation of the performances and relative SEU sensitivity of a selection of newly available open source RISC-V processors. Utilizing dynamic partial reconfiguration, this novel automatic test equipment rapidly deployed different implementations and evaluated SEU sensitivity through fault injection. Using BYU’s new SpyDrNet tools, fine-grain TMR was also applied to each processor with results ranging from a 20× to 500× reduction in sensitivity.
more »
« less
TURTLE: A Low-Cost Fault Injection Platform for SRAM-based FPGAs
This paper presents the TURTLE fault injection platform for inserting faults into SRAM FPGAs. The TURTLE system is designed to gather significant fault injection data to test and validate radiation-induced single-event upset (SEU) mitigation techniques for FPGAs. The TURTLE is a low-cost fault injection platform that emulates upsets within the configuration memory (CRAM) of an FPGA through partial reconfiguration. This work successfully implemented the proposed architecture and performed several successful fault injection campaigns on multiple designs and SEU mitigation techniques. Results in this paper show large amounts of data collected from a fault injection campaign used to validate the PCMF SEU mitigation technique. Over 170 million injections were performed using the TURTLE for this campaign.
more »
« less
- Award ID(s):
- 1738550
- PAR ID:
- 10184039
- Date Published:
- Journal Name:
- 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
- Page Range / eLocation ID:
- 1 to 8
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques, such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14\(\times\)in mean time to failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This article applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping.” The application of these techniques improved MTTF of the mitigated design by a factor of 1.54\(\times\)and thus provides a 22.8X\(\times\)MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.more » « less
-
FPGAs are being used in large numbers within cloud computing to provide high-performance, low-power alternatives to more traditional computing structures. While FPGAs provide a number of important benefits to cloud computing environments, they are susceptible to radiation-induced soft errors, which can lead to silent data corruption or system instability. Although soft errors within a single FPGA occur infrequently, soft errors in large-scale FPGAs systems can occur at a relatively high rate. This paper investigates the failure rate of several FPGA applications running within an FPGA cloud computing node by performing fault injection experiments to determine the susceptibility of these applications to soft-errors. The results from these experiments suggest that silent data corruption will occur every few hours within a 100,000 node FPGA system and that such a system can only maintain high-levels of reliability for short periods of operation. These results suggest that soft-error detection and mitigation techniques may be needed in large-scale FPGA systems.more » « less
-
In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/over provisioning (24%). Errors in the fields tracking dependencies between object caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.more » « less
-
Field programmable gate arrays (FPGAs) are used in large numbers in data centers around the world. They are used for cloud computing and computer networking. The most common type of FPGA used in data centers are re-programmable SRAM-based FPGAs. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.more » « less
An official website of the United States government

