Radiation Induced Multi bit Upsets in Xilinx SRAM Based FPGAs

Similar documents
Radiation-Induced Multi-Bit Upsets in Xilinx SRAM-Based FPGAs

LA-UR- Title: Author(s): Submitted to: Approved for public release; distribution is unlimited.

Reliability Techniques for Data Communication and Storage in FPGA-Based Circuits

SEU-Induced Persistent Error Propagation in FPGAs

On-Orbit FPGA SEU Mitigation and Measurement Experiments on the Cibola Flight Experiment Satellite

Soft error rate estimations of the Kintex-7 FPGA within the ATLAS Liquid Argon (LAr) Calorimeter

Radiation Effects on Electronics. Dr. Brock J. LaMeres Associate Professor Electrical & Computer Engineering Montana State University

A scalable analytic model for single event upsets in radiation-hardened field. programmable gate arrays in the PHENIX interaction region

Predicting On-Orbit SEU Rates

Scheduling Considerations for Voter Checking in FPGA-based TMR Systems

EECS150 - Digital Design Lecture 26 Faults and Error Correction. Recap

APA750 and A54SX32A LANSCE Neutron Test Report. White Paper

RT54SX72SU Heavy Ion Beam Test Report. August 6, 2004 J.J. Wang (650)

Kintex-7 Irradiation, test bench and results

Elliptic Curve Group Core Specification. Author: Homer Hsing

The Rosetta Experiment: Atmospheric Soft Error Rate Testing in Differing Technology FPGAs

Reliable Computing I

Fault Tolerance. Dealing with Faults

TABLE I EXPECTED PARTICLE FLUX. Situation Flux (cm 2 s 1 ) average worst week worst day worst 5 minutes 1.

Effectiveness and failure modes of error correcting code in industrial 65 nm CMOS SRAMs exposed to heavy ions

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs

Combinational Techniques for Reliability Modeling

Markov Models for Reliability Modeling

Tate Bilinear Pairing Core Specification. Author: Homer Hsing

Multiple Bit Upsets and Error Mitigation in Ultra Deep Submicron SRAMs

Application of the RADSAFE Concept

NPSAT1 Solar Cell Measurement System

Exploiting Partial Dynamic Reconfiguration for On-Line On-Demand Detection of Permanent Faults in SRAM-based FPGAs

A method of space radiation environment reliability prediction

Vidyalankar S.E. Sem. III [CMPN] Digital Logic Design and Analysis Prelim Question Paper Solution

Fault Masking By Probabilistic Voting

Ch 9. Sequential Logic Technologies. IX - Sequential Logic Technology Contemporary Logic Design 1

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives

Protection and characterization of an open source soft core against radiation effects

Using Global Clock Networks

Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware

FPGA-Based Radiation Tolerant Computing. Dr. Brock J. LaMeres Associate Professor Electrical & Computer Engineering Montana State University

DIAGNOSIS OF FAULT IN TESTABLE REVERSIBLE SEQUENTIAL CIRCUITS USING MULTIPLEXER CONSERVATIVE QUANTUM DOT CELLULAR AUTOMATA

CS61C : Machine Structures

CS61C : Machine Structures

Sequential Logic Circuits

R&T PROTON DIRECT IONIZATION

EECS 579: Logic and Fault Simulation. Simulation

Side Channel Analysis and Protection for McEliece Implementations

Problem Set 9 Solutions

Single Event Effects: SRAM

L10 State Machine Design Topics

Novel Devices and Circuits for Computing

Design Methodology and Tools for NEC Electronics Structured ASIC ISSP

Chapter 4. Sequential Logic Circuits

ALU A functional unit

Trust-Based Design and Check of FPGA Circuits Using Two-Level Randomized ECC Structures

Sequential Circuit Analysis

Memory, Latches, & Registers

SINGLE Event Upset (SEU) is caused by radiation induced

Lecture #4: Potpourri

SDR Forum Technical Conference 2007

Introduction. Simulation methodology. Simulation results

Impact of Ion Energy and Species on Single Event Effect Analysis

Q: Examine the relationship between X and the Next state. How would you describe this circuit? A: An inverter which is synched with a clock signal.

15.1 Elimination of Redundant States

Low Power CMOS Design. Master thesis. Amir Hasanbegovic. Exploring Radiation Tolerance in a 90 nm Low Power Commercial Process

VORAGO TECHNOLOGIES. Cost-effective rad-hard MCU Solution for SmallSats Ross Bannatyne, VORAGO Technologies

arxiv: v1 [cs.pf] 12 Jan 2017

Theoretical Modeling of the Itoh-Tsujii Inversion Algorithm for Enhanced Performance on k-lut based FPGAs

System Reliability Analysis. CS6323 Networks and Systems

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

Example: sending one bit of information across noisy channel. Effects of the noise: flip the bit with probability p.

Electron-induced Single-Event Upsets in integrated memory device

Memory Elements I. CS31 Pascal Van Hentenryck. CS031 Lecture 6 Page 1

Evaluating the Reliability of Defect-Tolerant Architectures for Nanotechnology with Probabilistic Model Checking

Radiation Effects in Nano Inverter Gate

DPA-Resistance without routing constraints?

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Modeling and Optimization for Soft-Error Reliability of Sequential Circuits

Mapping Sparse Matrix-Vector Multiplication on FPGAs

Read this before starting!

FPGA accelerated multipliers over binary composite fields constructed via low hamming weight irreducible polynomials

PC100 Memory Driver Competitive Comparisons

4 An Introduction to Channel Coding and Decoding over BSC

Advanced Testing. EE5375 ADD II Prof. MacDonald

Serial Parallel Multiplier Design in Quantum-dot Cellular Automata

Programmable Logic Devices

Hold Time Illustrations

IMPACT OF COSMIC RADIATION ON AVIATION RELIABILITY AND SAFETY

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class

Department of Electrical and Computer Engineering University of Wisconsin Madison. Fall Midterm Examination CLOSED BOOK

Lecture 13: Sequential Circuits, FSM

The Growing Impact of Atmospheric Radiation. Effects on Semiconductor Devices and the. Associated Impact on Avionics Suppliers

Read this before starting!

Soft Error Rate Analysis in Combinational Logic

University of Minnesota Department of Electrical and Computer Engineering

Vidyalankar S.E. Sem. III [EXTC] Digital Electronics Prelim Question Paper Solution ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD = B

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 23, NO. 11, NOVEMBER

Total Ionizing Dose Test Report. No. 11T-RT3PE3000L-CG896-QHR8G

Chapter 15 SEQUENTIAL CIRCUITS ANALYSIS, STATE- MINIMIZATION, ASSIGNMENT AND CIRCUIT IMPLEMENTATION

Mitigating Multi-Bit-Upset With Well-Slits in 28 nm Multi-Bit-Latch

Transcription:

LA-UR-05-6725 Radiation Induced Multi bit Upsets in Xilinx SRAM Based FPGAs Heather Quinn, Paul Graham, Jim Krone, and Michael Caffrey Los Alamos National Laboratory Sana Rezgui and Carl Carmichael Xilinx Corporation Quinn 1

Overview Background Radiation induced single bit (SBU) and multi bit upsets (MBU) Domain crossing events MBU Data Proton cross sections: Virtex, Virtex II, Virtex 4 Heavy ion cross sections: Virtex, Virtex II, Virtex II Pro TMR Studies SBU and MBU simulations SEU simulation of SBU and MBUs Quinn 2

Background For SRAM based reconfigurable devices, the state and configuration data can be subjected to radiation induced changes Most upsets are single bit upsets Recently, in testing, multi bit upsets have been observed Multi bit upsets are particularly challenging for spacebased applications Common mitigation methods, such as triple modular redundancy, can fail in the presence of multi bit upsets Quinn 3

Domain Crossing Events DCE: two or more modules from a TMR'd circuit are affected by one radiation event Routing networks are particularly vulnerable to DCE Single bit upsets might cause two domains to fail by shorting the routing network Multi bit upsets might be even more likely to cause two domains to fail Quinn 4

Research Goals Determine the event rate: Radiation environment: multiple orbits Determine how frequently a single ionized particle causes multiple configuration bits to be upset Feature size: multiple families of FPGAs Radiation type: proton and heavy ion Determine how frequently DCEs occur Designs: image processing, hand TMR'd (Not BYU or Xilinx tool) Mitigation: multiple levels of TMR SBU vs. MBU: different size events Quinn 5

Progress: More Questions than Answers Determine the event rate: no progress Determine how frequently a single ionized particle causes multiple configuration bits to be upset Feature size: multiple families of FPGAs Radiation type: proton and heavy ion Determine how frequently DCEs occur Early simulation results Designs: non triplicated clocks or I/O Isolated I/O data from results Are there clock hits in CLB results? Are multiple domains getting packed into one slice? Probabilistic Model: given what we know about MBUs and our simulation results can we predict MBU DCEs? Quinn 6

Test Methodology Cross Sections: Sample configuration upsets for all architectures Perform upset clustering and resource classification to identify likely MBU targets and events Determine 1 bit and multi bit cross sections TMR Studies: For identified functions, simulate SBU and MBU events on TMR design suite Predict effect of SBUs and MBUs on TMR based on frequency of SBUs, MBUs and DCEs Quinn 7

Proton Analysis Highlights Monoenergic testing at 63.3 MeV MBU Events: Virtex: 0.04% of all events Virtex II: 1.08% of all events Virtex II Pro: 1.32% of all events Virtex 4: 3.06% of all events MBU Orientation: Virtex: row adjacent Virtex II, Virtex II Pro, Virtex 4: column adjacent No major column adjacencies Affected resources: MBUs: CLBs Quinn 8

Proton Bit Cross Sections Device XCV10 XC2V250 XC2V10 XC2VP40 XC4VLX25 1-Bit Bit-Cross- Section (cm 2 /bit) MBU Bit-Cross- Section (cm 2 /bit) SEFI Cross-Section (cm 2 /bit) 6.29x10-14 2.77x10-17 ~4.19x10-19 (config SEFI) 3.06x10-14 3.34x10-16 8.21x10-19 2.30x10-14 2.50x10-16 8.21x10-19 3.68x10-14 4.94x10-16 Unknown 1.22x10-14 3.85x10-16 Unknown Quinn 9

Heavy Ion Highlights At highest tested linear energy transfer: Virtex: saturated at 7.5% MBUs out of all events Virtex II: increasing at 35% MBUs out of all events Virtex II Pro: increasing at 34% MBUs out of all events MBU Orientation: All devices invert from proton results Very rarely does it affect to adjacent columns of resources (i.e., two adjacent CLB columns, adjacent CLB and BRAM columns, etc.) Affected resources: SBU: CLBs, BRAM MBU: CLBs, IOBs, BRAMi Quinn 10

Percent MBU Events Out of All Events Induced by Heavy Ion Radiation Quinn 11

Heavy Ion Radiation Bit Cross Sections Quinn 12

TMR Test Methodology Three TMR designs: edge detection, max filter, min filter Each design has 2 3 implementations that vary module granularity LANL/BYU SEU simulation extended to simulate column MBUs: Observe, characterize, and determine the DCE rate Test both 1 bit and 2 bit column events August Crocker test to observe domain crossing events in the wild Data unanalyzed Quinn 13

TMR Hypotheses: Bitwise Voting No way to recover from concurrent errors in two 1 bit modules feeding the same voter When two modules change values, then the correct value in the third module looks like the incorrect result to the voter If two modules feeding separate voters are affected, TMR should still work Operating Correctly Masking Vote Non-Masking Vote M1 M2 M3 M1 M2 M3 M1 M2 M3 01 10 10 10 Voter Voter Voter Quinn 14 10

TMR Hypotheses: Domain Crossing SBUs Most proton induced upsets are SBUs Even rare SBUs DCEs could be more common than MBU DCEs Routing network possibly vulnerable to domain crossing SBUs [Bernardi 24]: 13% of a design's used bits could cause DCEs Have SBU results but not fully analyzed Operating Correctly M1 A1 B1 M2 A2 01 B2 M3 A2 B2 Open Circuit M1 A1 B1 M2 A2 01 B2 M3 A2 B2 Bridging Short M1 A1 B1 M2 A2 01 B2 M3 A2 B2 01 Voter 01 Voter Floating 01 Voter? 01????? Quinn 15

DCE Cross Section MBU Faults XS SBU Faults CLK IO? Since we can't fully explain all of the SBU faults today, we're focusing on MBU faults only SBU DCE? More CLK? Paired FFs? Quinn 16

TMR Results: MBU Sensitive Bits Device: XC2V250 (Virtex-II) with 1.6 Million Bits Design Unique MBU Bits Max Filter (no TMR) 692 Max Filter (TMR Type 1) 122 Max Filter (TMR Type 2) 133 Max Filter (TMR Type 3) 117 Min Filter (no TMR) 626 Min Filter (TMR Type 1) 134 Min Filter (TMR Type 2) 135 Min Filter (TMR Type 3) 139 Sobel Edge (no TMR) 468 Sobel Edge (TMR Type 2) 151 Sobel Edge (TMR Type 3) 121 Even without triplicated clocks, TMR improves the cross-section 4-6 times Quinn 17

TMR Results: MBU Resource Allocation Design used no BRAM I/O failures have been removed Design CLBs BRAMi Max Filter (TMR Type 1) 88.43% 7.44% Max Filter (TMR Type 2) 85.59% 9.32% Max Filter (TMR Type 3) 82.24% 11.21% Min Filter (TMR Type 1) 90.91% 6.82% Min Filter (TMR Type 2) 78.42% 14.39% Min Filter (TMR Type 3) 83.22% 7.69% Edge Filter (TMR Type 2) 82.12% 9.27% Edge Filter (TMR Type 3) 80.99% 2.48% Quinn 18

Given an MBU what is the probability of a DCE? SBUs MBUs BRAM Voted CLB BRAM BRAMi IOB DCE Out CLB SBU and BRAMi MBU IOB Event Need to determine what percentage of the event space are DCEs Space Voted Out Quinn 19

Probability of an MBU DCE in CLBs and BRAMi: Proton Induced Radiation Events Design BRAMi CLBs Max Filter (no TMR) 0.10% 0.0964% Max Filter (TMR Type 1) 0.07% 0.0159% Max Filter (TMR Type 2) 0.09% 0.0150% Max Filter (TMR Type 3) 0.09% 0.0131% Min Filter (no TMR) 0.23% 0.0872% Min Filter (TMR Type 1) 0.07% 0.0178% Min Filter (TMR Type 2) 0.15% 0.0162% Min Filter (TMR Type 3) 0.09% 0.0176% Edge Filter (no TMR) 0.06% 0.0664% Edge Filter (TMR Type 2) 0.11% 0.0184% Edge Filter (TMR Type 3) 0.02% 0.0145% Quinn 20

Probability of a DCE in CLBs and BRAMi: Heavy Ion Induced Radiation Events Quinn 21

Lessons Learned Perfect TMR is hard to get Floorplanning is very important Pairs of flip flops from multiple domains could end up in one slice Half latch removal could cause single points of failure Would XTMR help? Probably, but you'll need to triplicate I/O and control clock skew Reasonable TMR helps some 4 6X improvement in the MBU cross section The most common V II MBU might not be bad Column MBUs very low (~150 DCEs) Row MBUs not tested Quinn 22

Conclusions and Future Work Space based applications need to be aware of possible domain crossing events TMR might not be perfect MBUs a factor; SBUs constant background noise of faults TMR aware floorplanning and routing tools [RoRa] need to be explored for high reliability systems Future work: Eliminate clocking issues from results Designs: triplicated clocks Analysis: identify routing network DCEs Radiation environments Reliability of reconfigurable architectures Optimized scrub rates Quinn 23