skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improving the Effectiveness of TMR Designs on FPGAs with SEU-Aware Incremental Placement
TMR combined with configuration scrubbing is an effective technique to mitigate against radiation-induced CRAM upsets on SRAM-based FPGAs. However, its effectiveness is limited by low-level common mode failures due to the physical mapping of a design to the FPGA device. This paper describes how common mode failures are introduced during the implementation process and introduces an approach for resolving them through a custom incremental placement tool for Xilinx 7-Series FPGAs. Multiple designs across multiple generations of devices are shown to be sensitive to common mode failures. Applying the incremental placement technique yields an improvement of 10,721x over an unmitigated design through fault-injection testing. Radiation testing is then performed to show that the of this technique is 91,500 days in GEO orbit, a 367x improvement over the unmitigated design and a 5x improvement over baseline TMR.  more » « less
Award ID(s):
1738550
PAR ID:
10110501
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Page Range / eLocation ID:
141 to 148
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques, such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14\(\times\)in mean time to failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This article applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping.” The application of these techniques improved MTTF of the mitigated design by a factor of 1.54\(\times\)and thus provides a 22.8X\(\times\)MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities. 
    more » « less
  2. FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments. 
    more » « less
  3. Field programmable gate arrays (FPGAs) are used in large numbers in data centers around the world. They are used for cloud computing and computer networking. The most common type of FPGA used in data centers are re-programmable SRAM-based FPGAs. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average. 
    more » « less
  4. Triple modular redundancy (TMR) is commonly employed to increase the reliability and mean time to failure (MTTF) of a system. This improvement can be shown by using a continuous time Markov chain. However, typical Markov chain models do not model common cause failures (CCF), which is a singular event that simultaneously causes failure in multiple redundant modules. This paper introduces a new Markov chain to model CCF in TMR with repair systems. This new model is compared to the idealized models of TMR with repair without CCF. The fundamental limitations that CCF imposes on the system are shown and discussed. In a motivating example, it is seen that CCF imposes a limitation of 51× on the reliability improvement in a system with TMR and repair compared to a simplex system, (i.e., without TMR). A case study is also presented where the likelihood of CCF is reduced by a factor of 18× using various mitigation techniques. Reducing the CCF compounds the reliability improvement of TMR with repair and leads to a overall system reliability improvement of 10,000× compared to the simplex system as supported by the proposed model. 
    more » « less
  5. This paper presents the TURTLE fault injection platform for inserting faults into SRAM FPGAs. The TURTLE system is designed to gather significant fault injection data to test and validate radiation-induced single-event upset (SEU) mitigation techniques for FPGAs. The TURTLE is a low-cost fault injection platform that emulates upsets within the configuration memory (CRAM) of an FPGA through partial reconfiguration. This work successfully implemented the proposed architecture and performed several successful fault injection campaigns on multiple designs and SEU mitigation techniques. Results in this paper show large amounts of data collected from a fault injection campaign used to validate the PCMF SEU mitigation technique. Over 170 million injections were performed using the TURTLE for this campaign. 
    more » « less