EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs


EECS150 - Digital Design
Lecture 26 - Faults and Error Correction
April 25, 2013
John Wawrzynek

Types of Faults in Digital Designs
- Design bugs (function, timing, power draw)
  - detected and corrected at design time through testing and verification (simulation, static checks)
  - sometimes in 3rd-party design blocks
- Manufacturing defects (violation of design rules, impurities in processing, statistical variations)
  - post-production testing for sorting
  - spare on-chip resources for repair
- Runtime failures (physical effects and environmental conditions)
  - assuming the design is correct and there are no manufacturing defects

Runtime Faults
- All digital systems suffer occasional runtime faults.
- Fault-tolerant design methodologies are employed to tolerate faults in critical applications (avionics, space exploration, medical, ...).
- Error detection and correction is commonly used in memory systems and communication networks.
- Deeply scaled CMOS devices will suffer reliability problems due to a variety of physical effects (processing, aging, environmental susceptibility).
- Lower supply voltage for energy efficiency makes matters worse.

Physical Fault Mechanisms
IC faults can be classified as permanent, transient, and intermittent:
- Permanent faults reflect irreversible physical changes (like a fused wire or a shorted transistor).
- Transient faults are induced by temporary environmental conditions (like cosmic rays and electromagnetic interference).
- Intermittent faults occur due to unstable or marginal hardware (for example, a temporary ΔVt resulting in a timing error).
Intermittent faults often occur repeatedly at the same location, while transients affect random locations. Intermittents often occur in bursts. Intermittent faults track changes in voltage and temperature, and may become permanent.

Physical Fault Mechanisms
Intermittent: aging, voltage/temperature dependent
- NBTI (negative bias temperature instability) and PBTI
- HCI (hot carrier injection)
- TDDB (time-dependent dielectric breakdown)
- Electromigration

Physical Fault Mechanisms
- NBTI, PBTI, and HCI increase Vt and decrease mobility. This leads to decreased performance, lower noise margins, and mismatching (in SRAM).
- TDDB causes soft or hard gate shorts, resulting in degraded transistor performance, and can lead to complete transistor failure.
- Electromigration reduces interconnect conductivity and can lead to an open circuit.
[Ghosh and Roy: Parameter Variation Tolerance and Error Resiliency: New Design Paradigm for the Nanoscale Era, Proceedings of the IEEE, Vol. 98, No. 10, October 2010]


2. A Description of SEEs (single-event effects)
The Physical Mechanism
[Figure label: charge collection volume]
The incident particle generates a dense track of electron-hole pairs, and this ionization causes a transient current pulse if the strike occurs near a sensitive volume.

3. Sources of SEEs
- Usually, SEEs have been associated with space missions because of the absence of the atmospheric shield:
  - cosmic rays
  - protons from solar flares
- Unfortunately, our quiet oasis seems to be vanishing since the enemy is knocking on the door:
  - alpha particles from vestigial U or Th traces
  - atmospheric neutrons and other cosmic rays


Fault Models
Low-level fault models:
For logic circuit nodes
1. Permanent stuck at 0 or 1
2. Glitches
3. Slow transitions
For memory blocks (and flip-flops, registers)
1. Permanent stuck at 0 or 1
2. Hold failure
3. Read upset
4. Slow read
5. Write failure

A Fault-Tolerant Design Methodology
Triple Modular Redundancy (TMR)
[Figure labels: Logic Block (x4), Voting Circuit]
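As a rough illustration of the voting circuit in TMR, the sketch below takes a bitwise majority over the outputs of three copies of a logic block. The function name majority_vote and the use of Python integers to stand for multi-bit outputs are assumptions made for this example; they are not taken from the lecture.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant outputs; each bit position is voted independently."""
    return (a & b) | (a & c) | (b & c)


# Two copies produce the correct output; the third has a fault in two bit positions.
good = 0b1010
faulty = 0b0110
print(format(majority_vote(good, good, faulty), "04b"))  # prints 1010
```

A single faulty copy is outvoted. Two copies that agree on the same wrong bit are not masked.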

Error Correction Codes (ECC)
- Memory systems exhibit errors (accidentally flipped bits).
  - Large concentration of sensitive nodes.
  - Soft errors occur occasionally when cells are struck by alpha particles or other environmental upsets.
  - Less frequently, hard errors can occur when chips permanently fail.
- Where perfect memory is required (servers, spacecraft/military computers, ...), memories are protected against failures with ECCs.
- Extra bits are added to each data word.
  - The extra bits are used to detect and/or correct faults in the memory system.
  - In general, each possible data word value is mapped to a unique code word. A fault changes a valid code word to an invalid one, which can be detected.
Spring 2013 EECS150 Lec26-fault

Simple Error Detection Coding
Parity bit
- Each data value, before it is written to memory, is tagged with an extra bit to force the stored word to have even parity:
    b7 b6 b5 b4 b3 b2 b1 b0 p
- Each word, as it is read from memory, is checked by finding its parity (including the parity bit):
    c = b7 ⊕ b6 ⊕ b5 ⊕ b4 ⊕ b3 ⊕ b2 ⊕ b1 ⊕ b0 ⊕ p
- A non-zero parity indicates an error occurred:
  - two errors (on different bits) are not detected (nor is any even number of errors)
  - odd numbers of errors are detected.
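The parity scheme can be tried out in a few lines. This is a minimal sketch, assuming a word is represented as a Python list of 0/1 values with the parity bit appended last; the helper names parity, add_even_parity, and check are made up for the example.

```python
def parity(bits):
    """XOR of all bits: 0 if the number of 1s is even, 1 if it is odd."""
    p = 0
    for b in bits:
        p ^= b
    return p


def add_even_parity(data_bits):
    """Append a parity bit so the stored word has even parity."""
    return list(data_bits) + [parity(data_bits)]


def check(word):
    """Check bit c over the whole stored word (data plus parity bit); non-zero means error."""
    return parity(word)


word = add_even_parity([1, 0, 1, 1, 0, 0, 1, 0])
print(check(word))        # prints 0: no error

one_error = list(word)
one_error[3] ^= 1
print(check(one_error))   # prints 1: single error detected

two_errors = list(one_error)
two_errors[6] ^= 1
print(check(two_errors))  # prints 0: double error goes undetected
```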

Hamming Error Correcting Code
- Use more parity bits to pinpoint bit(s) in error, so they can be corrected.
- Example: single error correction (SEC) on 4-bit data
  - 3 parity bits with 4 data bits results in a 7-bit code word
  - 3 parity bits are sufficient to identify any one of 7 code word bits
  - overlap the assignment of parity bits so that a single error in the 7-bit word can be corrected
- Procedure: group parity bits so they correspond to subsets of the 7 bits:
  - p1 protects bits 1, 3, 5, 7
  - p2 protects bits 2, 3, 6, 7
  - p3 protects bits 4, 5, 6, 7

    Bit position:  1   2   3   4   5   6   7
    Bit:           p1  p2  d1  p3  d2  d3  d4

    Bit position numbers (binary = decimal):
    p1:  001 = 1,  011 = 3,  101 = 5,  111 = 7
    p2:  010 = 2,  011 = 3,  110 = 6,  111 = 7
    p3:  100 = 4,  101 = 5,  110 = 6,  111 = 7

- Note: number bits from left to right.

Hamming Code Example
    Bit position:  1   2   3   4   5   6   7
    Bit:           p1  p2  d1  p3  d2  d3  d4
- Note: parity bits occupy power-of-two bit positions in the code word.
- On writing to memory: parity bits are assigned to force even parity over their respective groups.
- On reading from memory: check bits (c3, c2, c1) are generated by finding the parity of each group and its parity bit. If an error occurred in a group, the corresponding check bit will be 1; if no error, the check bit will be 0.
- The check bits (c3, c2, c1) form the position of the bit in error.
- Example: c = c3 c2 c1 = 101
  - error in 4, 5, 6, or 7 (by c3 = 1)
  - error in 1, 3, 5, or 7 (by c1 = 1)
  - no error in 2, 3, 6, or 7 (by c2 = 0)
  - Therefore the error must be in bit 5. Note that the check bits point to 5.
- By our clever positioning and assignment of parity bits, the check bits always address the position of the error!
- c = 000 indicates no error.
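The write and read procedures above can be sketched in code. The following is an illustrative version only: it uses the bit layout from the slides (p1 p2 d1 p3 d2 d3 d4, positions 1 to 7, numbered left to right), while the function names hamming74_encode and hamming74_correct and the list-of-bits representation are assumptions for the example.

```python
def hamming74_encode(d1, d2, d3, d4):
    """Assign parity bits to force even parity over each group."""
    p1 = d1 ^ d2 ^ d4   # group 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # group 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # group 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected word, error position); position 0 means no error was found."""
    w = [0] + list(word)              # pad so w[1]..w[7] match the slide numbering
    c1 = w[1] ^ w[3] ^ w[5] ^ w[7]
    c2 = w[2] ^ w[3] ^ w[6] ^ w[7]
    c3 = w[4] ^ w[5] ^ w[6] ^ w[7]
    position = 4 * c3 + 2 * c2 + c1   # check bits c3 c2 c1 read as a binary number
    if position:
        w[position] ^= 1              # flip the bit the check bits point to
    return w[1:], position


stored = hamming74_encode(1, 0, 1, 1)
read_back = list(stored)
read_back[4] ^= 1                     # upset bit position 5 (list index 4)
fixed, position = hamming74_correct(read_back)
print(position)         # prints 5
print(fixed == stored)  # prints True
```

The example reproduces the c = 101 case from the slide: the check bits point at position 5.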

Hamming Error Correcting Code
- Overhead involved in a single error correction code:
  - Let p be the total number of parity bits and d the number of data bits in a (p + d)-bit word.
  - If the p error correction bits are to point to the error bit (p + d cases) plus indicate that no error exists (1 case), we need:
      2^p >= p + d + 1, thus p >= log2(p + d + 1)
  - For large d, p approaches log2(d).
- Adding one extra parity bit covering the entire word can provide double error detection:
    Bit position:  1   2   3   4   5   6   7   8
    Bit:           p1  p2  d1  p3  d2  d3  d4  p4
- On reading, the C bits are computed (as usual) plus the parity over the entire word, P:
    C = 0,  P = 0:  no error
    C != 0, P = 1:  correctable single error
    C != 0, P = 0:  a double error occurred
    C = 0,  P = 1:  an error occurred in the p4 bit
- Typical modern codes in DRAM memory systems: 64-bit data blocks (8 bytes) with 72-bit code words (9 bytes), which results in SEC, DED.
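The overhead inequality is easy to evaluate numerically. The sketch below searches for the smallest p with 2^p >= p + d + 1; the function name sec_parity_bits is an assumption for this example.

```python
def sec_parity_bits(d: int) -> int:
    """Smallest p such that 2**p >= p + d + 1 (single error correction)."""
    p = 0
    while (1 << p) < p + d + 1:
        p += 1
    return p


for d in (4, 8, 32, 64):
    p = sec_parity_bits(d)
    # One more parity bit over the whole word adds double error detection.
    print(d, p, d + p + 1)
```

For d = 4 this gives p = 3, the 7-bit code above. For d = 64 it gives p = 7, and the extra whole-word parity bit brings the code word to 72 bits, consistent with the DRAM figure quoted on the slide.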

Linear Feedback Shift Registers (LFSRs)
- These are n-bit counters exhibiting pseudo-random behavior.
- Built from simple shift registers with a small number of xor gates.
- Used for:
  - random number generation
  - counters
  - error checking and correction
- Advantages:
  - very little hardware
  - high-speed operation
- Example 4-bit LFSR: [figure]

4-bit LFSR
- The circuit counts through 2^4 - 1 different non-zero bit patterns.
- The leftmost bit decides whether the 10011 xor pattern is used to compute the next value or if the register just shifts left.
- Can build a similar circuit with any number of FFs; may need more xor gates.
- In general, with n flip-flops, 2^n - 1 different non-zero bit patterns.
- (Intuitively, this is a counter that wraps around many times and in a strange way.)
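The 4-bit LFSR can be simulated directly from the description above. This sketch assumes the register is held in a Python integer, shifts left each cycle, and applies the 10011 pattern whenever a 1 is shifted out of the leftmost flip-flop; the name lfsr4_step is made up for the example.

```python
def lfsr4_step(state: int) -> int:
    """One clock of the 4-bit LFSR: shift left, then xor with 10011 if a 1 was shifted out."""
    state <<= 1
    if state & 0b10000:
        state ^= 0b10011
    return state


state = 0b0001
seen = []
for _ in range(15):
    seen.append(state)
    state = lfsr4_step(state)

print(" ".join(format(s, "04b") for s in seen))
print(len(set(seen)))  # prints 15: every non-zero 4-bit pattern appears once
print(state)           # prints 1: the register is back at its starting value
```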

Primitive Polynomials
    x^2 + x + 1                  x^12 + x^6 + x^4 + x + 1     x^22 + x + 1
    x^3 + x + 1                  x^13 + x^4 + x^3 + x + 1     x^23 + x^5 + 1
    x^4 + x + 1                  x^14 + x^10 + x^6 + x + 1    x^24 + x^7 + x^2 + x + 1
    x^5 + x^2 + 1                x^15 + x + 1                 x^25 + x^3 + 1
    x^6 + x + 1                  x^16 + x^12 + x^3 + x + 1    x^26 + x^6 + x^2 + x + 1
    x^7 + x^3 + 1                x^17 + x^3 + 1               x^27 + x^5 + x^2 + x + 1
    x^8 + x^4 + x^3 + x^2 + 1    x^18 + x^7 + 1               x^28 + x^3 + 1
    x^9 + x^4 + 1                x^19 + x^5 + x^2 + x + 1     x^29 + x + 1
    x^10 + x^3 + 1               x^20 + x^3 + 1               x^30 + x^6 + x^4 + x + 1
    x^11 + x^2 + 1               x^21 + x^2 + 1               x^31 + x^3 + 1
                                                              x^32 + x^7 + x^6 + x^2 + 1

Galois Field vs. Hardware
- Galois field: multiplication by x. Hardware: shift left.
- Galois field: taking the result mod p(x). Hardware: XOR-ing with the coefficients of p(x) when the most significant coefficient is 1.
- Galois field: obtaining all 2^n - 1 non-zero elements by evaluating x^k for k = 1, ..., 2^n - 1. Hardware: shifting and XOR-ing 2^n - 1 times.

Building an LFSR from a Primitive Polynomial
- For a k-bit LFSR, number the flip-flops with FF1 on the right.
- The feedback path comes from the Q output of the leftmost FF.
- Find the primitive polynomial of the form x^k + ... + 1.
- The x^0 = 1 term corresponds to connecting the feedback directly to the D input of FF1.
- Each term of the form x^n corresponds to connecting an xor between FF n and FF n+1.
- 4-bit example, uses x^4 + x + 1:
    x^4:  FF4's Q output
    x:    xor between FF1 and FF2
    1:    FF1's D input
- To build an 8-bit LFSR, use the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 and connect xors between FF2 and FF3, FF3 and FF4, and FF4 and FF5.
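The construction rule can be checked in software for any polynomial in the table. In this sketch a polynomial is encoded as an integer whose bit i is the coefficient of x^i (so x^8 + x^4 + x^3 + x^2 + 1 is 0b100011101); that encoding and the names lfsr_step, lfsr_period, and xor_positions are assumptions for illustration.

```python
def lfsr_step(state: int, poly: int, k: int) -> int:
    """Multiply by x (shift left), then reduce mod p(x) if the x^k coefficient became 1."""
    state <<= 1
    if (state >> k) & 1:
        state ^= poly
    return state


def lfsr_period(poly: int, k: int, seed: int = 1) -> int:
    """Number of clocks until the register returns to the seed value."""
    state = lfsr_step(seed, poly, k)
    count = 1
    while state != seed:
        state = lfsr_step(state, poly, k)
        count += 1
    return count


def xor_positions(poly: int, k: int):
    """Each term x^n with 0 < n < k places an xor between FF n and FF n+1."""
    return [n for n in range(1, k) if (poly >> n) & 1]


poly8 = 0b100011101                # x^8 + x^4 + x^3 + x^2 + 1
print(xor_positions(poly8, 8))     # prints [2, 3, 4]
print(lfsr_period(poly8, 8))       # prints 255
print(lfsr_period(0b10011, 4))     # prints 15
```

The xor positions [2, 3, 4] correspond to the xors between FF2 and FF3, FF3 and FF4, and FF4 and FF5 named on the slide, and the period 2^8 - 1 = 255 is what a primitive polynomial should give.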

Error Correction with LFSRs
- XOR Q4 with the incoming bit sequence. Now the values of the shift register don't follow a fixed pattern; they depend on the input sequence.
- Look at the value of the register after 15 cycles: 1010
- Note the length of the input sequence is 2^4 - 1 = 15 (same as the number of different non-zero patterns for the original LFSR).
- The binary message occupies only 11 bits; the remaining 4 bits are 0000. They would be replaced by the final result of our LFSR: 1010
- If we run the sequence back through the LFSR with the replaced bits, we would get 0000 for the final result.
- The 4 parity bits neutralize the sequence with respect to the LFSR.

    Input sequence                     Final register value
    1 1 0 0 1 0 0 0 1 1 1  0 0 0 0     1 0 1 0
    1 1 0 0 1 0 0 0 1 1 1  1 0 1 0     0 0 0 0

- If the final register value is not all zeros, an error occurred in transmission.
- If the number of parity bits = log2 of the total number of bits, then single-bit errors can be corrected.
- Using more parity bits allows more errors to be detected.
- Ethernet uses 32 parity bits per frame (packet), computed with a 32-bit LFSR (CRC-32).
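The example on this slide can be reproduced with a short simulation. This is a sketch under one assumption about the circuit: each incoming bit is xored with Q4 at the input of FF1, which makes the register compute the remainder of the input sequence divided by x^4 + x + 1. The names lfsr_remainder, POLY, and K are made up for the example.

```python
POLY = 0b10011   # x^4 + x + 1
K = 4            # number of flip-flops


def lfsr_remainder(bits, poly=POLY, k=K):
    """Clock the bit sequence through the LFSR and return the final register value."""
    reg = 0
    for bit in bits:
        reg = (reg << 1) | bit        # shift left, new bit enters at FF1
        if (reg >> k) & 1:            # a 1 left the leftmost FF: apply the xor pattern
            reg ^= poly
    return reg


message = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]             # 11 message bits
parity = lfsr_remainder(message + [0] * K)              # message followed by 0000
parity_bits = [(parity >> i) & 1 for i in reversed(range(K))]
codeword = message + parity_bits

print(format(parity, "04b"))                            # prints 1010
print(format(lfsr_remainder(codeword), "04b"))          # prints 0000

# Flip each of the 15 bits in turn and record the resulting register value.
syndromes = {}
for i in range(len(codeword)):
    corrupted = list(codeword)
    corrupted[i] ^= 1
    syndromes[lfsr_remainder(corrupted)] = i
print(len(syndromes))                                   # prints 15
```

Every single-bit error leaves a different non-zero value in the register, which is why a 15-bit sequence with 4 parity bits allows single-bit errors to be located and corrected.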