skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, December 13 until 2:00 AM ET on Saturday, December 14 due to maintenance. We apologize for the inconvenience.


Title: The Impact of Terrestrial Radiation on FPGAs in Data Centers
Field programmable gate arrays (FPGAs) are used in large numbers in data centers around the world. They are used for cloud computing and computer networking. The most common type of FPGA used in data centers are re-programmable SRAM-based FPGAs. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.  more » « less
Award ID(s):
1738550
PAR ID:
10316156
Author(s) / Creator(s):
;
Date Published:
Journal Name:
ACM Transactions on Reconfigurable Technology and Systems
Volume:
15
Issue:
2
ISSN:
1936-7406
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. FPGAs are being used in large numbers within cloud computing to provide high-performance, low-power alternatives to more traditional computing structures. While FPGAs provide a number of important benefits to cloud computing environments, they are susceptible to radiation-induced soft errors, which can lead to silent data corruption or system instability. Although soft errors within a single FPGA occur infrequently, soft errors in large-scale FPGAs systems can occur at a relatively high rate. This paper investigates the failure rate of several FPGA applications running within an FPGA cloud computing node by performing fault injection experiments to determine the susceptibility of these applications to soft-errors. The results from these experiments suggest that silent data corruption will occur every few hours within a 100,000 node FPGA system and that such a system can only maintain high-levels of reliability for short periods of operation. These results suggest that soft-error detection and mitigation techniques may be needed in large-scale FPGA systems. 
    more » « less
  2. This paper presents the TURTLE fault injection platform for inserting faults into SRAM FPGAs. The TURTLE system is designed to gather significant fault injection data to test and validate radiation-induced single-event upset (SEU) mitigation techniques for FPGAs. The TURTLE is a low-cost fault injection platform that emulates upsets within the configuration memory (CRAM) of an FPGA through partial reconfiguration. This work successfully implemented the proposed architecture and performed several successful fault injection campaigns on multiple designs and SEU mitigation techniques. Results in this paper show large amounts of data collected from a fault injection campaign used to validate the PCMF SEU mitigation technique. Over 170 million injections were performed using the TURTLE for this campaign. 
    more » « less
  3. TMR combined with configuration scrubbing is an effective technique to mitigate against radiation-induced CRAM upsets on SRAM-based FPGAs. However, its effectiveness is limited by low-level common mode failures due to the physical mapping of a design to the FPGA device. This paper describes how common mode failures are introduced during the implementation process and introduces an approach for resolving them through a custom incremental placement tool for Xilinx 7-Series FPGAs. Multiple designs across multiple generations of devices are shown to be sensitive to common mode failures. Applying the incremental placement technique yields an improvement of 10,721x over an unmitigated design through fault-injection testing. Radiation testing is then performed to show that the of this technique is 91,500 days in GEO orbit, a 367x improvement over the unmitigated design and a 5x improvement over baseline TMR. 
    more » « less
  4. FPGAs are increasingly being used in space and other harsh radiation environments. However, SRAM-based FPGAs are susceptible to radiation in these environments and experience upsets within the configuration memory (CRAM), causing design failure. The effects of CRAM upsets can be mitigated using triple-modular redundancy and configuration scrubbing. This work investigates the reliability of a soft RISC-V SoC system executing the Linux operating system mitigated by TMR and configuration scrubbing. In particular, this paper analyzes the failures of this triplicated system observed at a high-energy neutron radiation experiment. Using a bitstream fault analysis tool, the failures of this system caused by CRAM upsets are traced back to the affected FPGA resource and design logic. This fault analysis identifies the interconnect and I/O as the most vulnerable FPGA resources and the DDR controller logic as the design logic most likely to cause a failure. By identifying the FPGA resources and design logic causing failures in this TMR system, additional design enhancements are proposed to create a more reliable design for harsh radiation environments. 
    more » « less
  5. In recent years, Field Programmable Gate Arrays (FPGAs) have gained prominence in cloud computing data centers, driven by their capacity to offload compute-intensive tasks and contribute to the ongoing trend of data center disaggregation, as well as their ability to be directly connected to the network. While FPGAs offer numerous advantages, they also pose challenges in terms of configuration, programmability, and monitoring, particularly in the absence of an operating system with essential features like the TCP/IP networking stack. This paper introduces an In-band Network Telemetry (INT) approach based on the P4 language for FPGA data plane programming. The goal is to facilitate monitoring and network performance analysis by providing one-way packet delay information. The approach is demonstrated in the Open Cloud Testbed (OCT) and FABRIC testbeds, both offering open access to the research community with greater FPGA availability than commercial clouds. The workflow enables researchers to create custom P4 programs and bitstreams for installation on FPGAs. The paper presents a multi-step approach allowing experimentation within the New England Research Cloud (NERC), testing in OCT, and final deployment in FABRIC, well-suited for one-way delay measurements due to synchronized clocks via GPS time signals. Contributions include the provision of a P4 workflow for FPGAs in a research cloud, a novel FPGA clock-based INT approach, and a comprehensive evaluation through simulation and experiments in the Open Cloud and FABRIC testbeds. 
    more » « less