Cause Analysis Focusing on Misoperations Ed Ruck, Rick Hackman WECC Misoperations Workshop June 5-6, 208
Why conduct a Root Cause Analysis 2
What is a Misoperation Informal definition System does not activate when it should o Example: A line experiences a phase to ground fault halfway between the two terminals, but the relay does not detect the fault. System takes action when it should not o Example: Buckhead-Chatooga trips high-speed for a fault on the Chatooga- Peachtree line. Can be associated with Protection System Control System Remedial Action Scheme (RAS) Not limited to what is reportable through NERC tools such as MIDAS 3
Selection of methods (cont d) METHOD WHEN TO USE ADVANTAGES DISADVANTAGES REMARKS Task Analysis Events and Causal Factor Analysis Change Analysis Use whenever the problem appears to be the result of steps taken in a task (just about all the time) Use for multi-faceted problems with long or complex causal factor chain Use when cause is obscure. Especially useful in evaluating equipment failures Shows the steps which should have been taken. Provides visual display of analysis process. Identifies probable contributors to condition. Simple 6-step process Requires personnel and (possibly) equipment time to be performed correctly and completely Time-consuming and requires familiarity with process to be effective. Limited value because of the danger of accepting wrong obvious answer. Should be conducted as both a Cognitive Task Analysis (what was the person thinking while conducting the task) and a Contextual Task Analysis (what was going on while the task was being done). Requires a broad perspective of the event to identify unrelated problems. Helps to identify where deviations occurred from acceptable methods. A singular problem technique that can be used in support of a larger investigation. All root causes may not be identified. Barrier Analysis Used to identify barrier and equipment failures, and procedural or administrative problems. Provides systematic approach. Requires familiarity with process to be effective. This process is based on the MORT Hazard/Target concept 4
Selection of methods (cont d) METHOD WHEN TO USE ADVANTAGES DISADVANTAGES REMARKS Used when there is a shortage of experts to Can be used with limited If this process fails to ask the right questions prior training. Provides a May only identify area of identify problem areas, and whenever the list of questions for cause, not specific seek additional help or problem is a recurring specific control and causes. use cause-and-effect one. Helpful in solving management factors. analysis. programmatic problems. MORT/Mini-MORT Human Performance Evaluations (HPE) Kepner-Tregoe Use whenever people have been identified as being involved in the problem cause. Use for major concerns where all aspects need thorough analysis Thorough analysis Highly structured approach focuses on all aspects of the occurrence and problem resolution. None if process is closely followed. More comprehensive than may be needed Requires HPE training. Requires Kepner-Tregoe training. Fault Tree Analysis Normally used for equipment-related problems Provides a visual display of causal relationships, Does not work well when human actions are inserted as a cause Uses Boolean algebra symbology to show how the causes may combine for an effect Cause and Effect Charting (e.g., Reality Charting ) Useful for any type of problem. Visual display showing cause sequence. Provides a direct approach to reach causes of primary effect(s). May be used with barrier/change analysis. Focus is on best solution generation. May not provide entire background to understand a complex problem. Requires experience/knowledge to ask all the right questions. Requires knowledge of the Apollo Root Cause Analysis techniques. Apollo RealityCharting software may be used as a tool to aid problem resolution. 5
Something happens Brief Description of Event: At 0645 a initiating fault on the 500 kv B-G # line resulted in: The 500kV B-G # line locking out at both ends The 500 kv C-B # and #2 lines being open ended Tripping of Generation C Units, 2, 3 and 4 on overspeed Note: The B-G # line was patrolled to investigate and determine the cause of the initiating fault and no cause found. The line was successfully returned to service. Identify contributing causes of the event to the extent known. Identify any Protection System misoperations to the extent known. Identify any GADS, DADS, TADS, or Protection System misoperations reports that will be submitted. An initiating fault on the 500 kv B G # line resulted in potential misoperation relay openings of CB #5 and CB #6 breakers at Station C. Station C CB #5 and #6 relays have been taken out of service and are being tested to confirm whether or not they mis-operated. Potential Relay Misoperation of CB #5 and CB #6 breakers at Station C. GADS and TADS will be reported. Any misoperations identified will be reported. Narrative A single line to ground fault on 500kV line BG resulted in the line locking out after an unsuccessful reclose attempt at the station B end. Coincident with the initial trip, Station C breaker #5 tripped. Coincident with the reclose at Station B, the Station C #6 breaker tripped. Upon losing both lines from Generating Station C, all four units tripped on overspeed. 6. If a one-line diagram is included, please provide an explanation. Single lines showing Stations C, B and G.
Initial line-up Generating Station C 5 7 Switching Station B 9 Switching Station G 2 3 2 3 6 8 0 2 4 4 7
06:45:00.000 Initial Fault Generating Station C Switching Station B 5 7 9 Switching Station G 2 3 2 3 6 8 0 Fault 2 4 4 8
What should happen next? If fault is temporary and clears:? If fault is permanent (does not clear): 9
06:45:00.000 Initial Fault Generating Station C Switching Station B 5 7 9 Switching Station G 2 3 2 3 6 8 0 Fault 2 4 4 0
06:45:00.005 Is this what you expected? Generating Station C 5 7 Switching Station B 9 Switching Station G 2 3 2 3 6 8 0 Fault 2 4 4
06:45:00.300 Testing fault 2 3 Generating Station C 2 3 5 7 6 Switching Station B 8 9 Reclose 0 Fault Switching Station G 2 4 4 2
06:45:00.305 Fault continues Generating Station C 5 7 Switching Station B 9 Switching Station G 2 3 2 3 6 8 0 Fault 2 4 4 3
06:45:00.305 Is this what you expected? Generating Station C 5 7 Switching Station B 9 Switching Station G 2 3 2 3 6 8 0 Fault 2 4 4 4
So what happened?. Initial fault occurred 2. CB5 tripped 3. CB6 tripped 4. Loss of 4 generators 5
Problem Statement What: When: Where: Loss of generation 0/0/207 at 06:45 Generating Plant C; ABC Utilities Significance: Loss of 2000MW of generation 6
How to visualize what is known 7
Where do you go next? 8
What are possible causes of these CB trips? 9 Breaker problems or actions Operator action from EMS Operator action locally Mechanism failure and release Animal intrusion Trip coil malfunction Improper latching Mechanical agitation of breaker cabinet Relay problems or actions Proper relay operation Correct Relay action due to faulted condition on bus, transformer, or line Incorrect relay operation Faulty relay Incorrect relay signal (no system condition exists); induced signal Relay settings issue Wiring issues Mechanical agitation of EM Relays Protections System problems or actions Relay settings issue (over-reach scenario) Brainstorming
SF6 Breaker Failure Modes & Mechanisms 20
What could you do to prove or disprove any of these potential cause(s)? Breaker problems or actions Relay problems or actions Protections System problems Brainstorming 2
What was learned? After initial review of the events it was determined the C end of the 500kV # and #2 line ground instantaneous relays mis-operated for the fault on the 500kV # between B and G. More extensive analysis revealed the cause of the SLYP/ SLCN mis-operations was a setting issue. A review of history shows the relay settings were correct when established, but over time, with system changes, had not been kept correct for the changing system conditions. 22
So what happened and why?. Initial fault occurred Unknown, after patrol of line with nothing found???? 2.CB5 tripped Misoperation of Instantaneous Ground-overcurrent, due to incorrect settings Settings were not updated as changes to system occurred 3.CB6 tripped Misoperation of Instantaneous Ground-overcurrent, due to incorrect settings Settings were not updated as changes to system occurred 4.Loss of 4 generators Overspeed trip (loss of outlet path) as designed 23
Visual representation of final results 24
final results drill down 25
Food for thought Why should we use Root Cause Analysis? Why not use Apparent Cause Analysis like the Why Staircase? What is the goal of a Apparent Cause Analysis? What is the goal of a Root Cause Analysis? What prevents your company from performing Root Cause Analysis? 26
DINNER! 27
Apparent Cause Analysis (ACA) Example of a Why Staircase for a Loss of cooling water Lost cooling water to equipment Why? Why? Pump Bearing Failed Why? Inadequate lubrication Why? Not enough oil in Reservoir Why? Operator did not replenish oil to the correct level Sight glass installed upside down Why? Why? Mechanic put in a new sight glass during overhaul - without a drawing in the work package Document Control Back Log 28
Apparent Cause Analysis (ACA) Example of a Why Staircase for a leak that occurred during International Space Station (ISS) EVA 22 Why? Water observed in helmet at end of EVA 22 Why? Water from drink bag leaked fix Bite valve not reclosing correctly Replace Bite Valve and water bag Solution? 29
ISS EVA 23 Information Another (larger) water leak occurred -.5 liters of water leaked in helmet during EVA 23 Could have been a fatal accident Inorganic materials causing blockage of the drum holes in the space suit water separator, resulting in water spilling into the vent loop 30
ISS EVA 23 Root Causes for Water Leak Program emphasis was to maximize crew time on orbit for utilization. The strong emphasis on utilization was leading team members to feel that requesting on-orbit time for anything non-science related was likely to be denied and therefore tended to assume their next course of action could not include on-orbit time. ISS Community perception was that drink bags leak. When presented with the suggestion that the crew member s drink bag leaked out the large amount of water that was found in EV2 s helmet after EVA 22, no one in the EVA community challenged this determination and investigated further. Flight Control Team s perception of the anomaly report process as being resource intensive made them reluctant to invoke it. Several ground team members were concerned that if the assumed drink bag anomaly experienced at the end of EVA 22 were to be investigated further, it would likely lead to a long, intensive process that would interfere with necessary work needed to prepare for the upcoming EVA 23, and that this issue would likely not uncover anything significant enough to justify the resources which would have to be spent. 3
ISS EVA 23 Root Causes Cont. No one applied knowledge of the physics of water behavior in zero-g to water coming from the PLSS vent loop. The ground teams did not properly understand how the physics of water behavior inside the complex environment of the EMU helmet would manifest itself. The occurrence of minor amounts of water in the helmet was normalized. Despite the fact that water carryover into the helmet presented a known hazard the ground teams grew to accept this as normal EMU behavior. 32
Reliability Drifting to Failure* Hi Expectations: Desired approach to work (as imagined) Normal Practices: Work as actually performed (allowed by mgmt!) Stated Expectations Real Margin for Error Drift Error Normal Practice Hidden hazards, threats, unusual conditions, & system weaknesses Latent Error Inconspicuous and seemingly harmless buildup of hidden error and organizational weaknesses Lo 33 Time * Adapted from Muschara Error Management Consulting, LLC
Additional info: ISS EVA 23 RCA Additional reading https://spaceflight.nasa.gov/outreach/significantincidentseva/assets/suit_water_intrus ion_mishap_investigation_report.pdf https://sma.nasa.gov/docs/default-source/safety-messages/safetymessage-204-04-07- isseva23suitwaterintrusion-vits.pdf?sfvrsn=6 34
In another place, another time Brief description of event: On June 23, 207, the Station A 38kV breaker B breaker failure protection operated when Capacitor Bank C was closed. The breaker failure operation cleared the 38kV bus by opening breakers A, C, D, E and F. 35
Initial -up (23 June 207) Station A A B C D E F Cap C 36
Shunt Capacitor C Closed in, for voltage control Station A A B C D E F Cap C 37
Breaker Failure, B Station A X A B C D E F Cap C 38
Bus Cleared Station A X A B C D E F Cap C 39
What could cause this?? Discussion 40
Final Report will this happen again? Brief description of event: On June 23, 207, the Station A 38kV breaker B breaker failure protection operated when Capacitor Bank C was closed. The breaker failure operation cleared the 38kV bus by opening breakers A, C, D, E and F. Identify contributing causes of the event to extent known: The cause remains unknown. Identify any protection system misoperations to the extent known: Station A breaker B breaker failure protection Narrative: When shunt capacitor C was closed in, for voltage control, breaker B breaker failure operated de-energizing the Station A 38kV bus. After exhaustive testing, no cause was found and everything was put back in service. 4
Final Outcome for THIS event Nothing Found Put back into Service 42
In same place, another time Brief description of event: On August 2, 207, the Station A 38kV breaker B breaker failure protection operated when Capacitor Bank C was closed. The breaker failure operation cleared the 38kV bus by opening breakers A, C, D, E and F. Identify contributing causes of the event to extent known: The cause remains unknown. Identify any protection system misoperations to the extent known: Station A breaker B breaker failure protection Narrative: When shunt capacitor C was closed in, for voltage control, breaker B breaker failure operated de-energizing the Station A 38kV bus. After exhaustive testing, no cause was found and everything was put back in service. 43
Initial -up (2 August 207) Station A A B C D E F Cap C 44
Shunt Capacitor C Closed in, for voltage control Station A A B C D E F Cap C 45
Breaker Failure, B Station A X A B C D E F Cap C 46
Bus Cleared Station A X A B C D E F Cap C 47
What could cause this?? Findings: Breaker failure initiate contact was partially conducting at times. This contact is a hybrid solid state contact that would build up voltage and initiate breaker failure randomly. A 0k ohm resistor was connected from the contact to negative to pull down this stray voltage build up, and not initiate breaker failure randomly (LL2030703) Manufacturer changed the design incorporate the same fix being implemented in the field. 48
In same place, another time Brief description of event: On August 2, 207, the Station A 38kV breaker B breaker failure protection operated when Capacitor Bank C was closed. The breaker failure operation cleared the 38kV bus by opening breakers A, C, D, E and F. Identify contributing causes of the event to extent known: Leakage current through a solid state contact erroneously triggered the Breaker B breaker failure. Identify any protection system misoperations to the extent known: Station A breaker B breaker failure protection Narrative: When shunt capacitor C was closed in, for voltage control, breaker B breaker failure operated de-energizing the Station A 38kV bus. After extensive testing it was determined that the breaker failure initiate contact from the SEL 32 relay was partially conducting at times. This contact is a hybrid solid state contact that would build up voltage and initiate breaker failure randomly. A 0k ohm resistor was connected from the contact to negative to pull down this stray voltage build up, and not initiate breaker failure randomly. 49
Ed Ruck Senior Reliability Engineer, RRM 302-376-569 office 847-62-0487 cell Ed.Ruck@nerc.net Questions??? and Answers!!! Jule Tate Associate Director, Event Analysis, RRM 404-446-2572 office 609-65-775 cell Jule.Tate@nerc.net Andy Slone Senior Reliability Engineer, RRM 404-446-979 office 404-99-8303 cell Andrew.Slone@nerc.net Rich Bauer Associate Director, Event Analysis, RRM 404-446-9738 office 404-357-9843 cell Rich.Bauer@nerc.net Wei Qiu Senior Reliability Engineer, RRM 404-446-962 office 404-532-5246 cell Wei.Qiu@nerc.net Rick Hackman Senior Reliability Advisor, RRM 404-446-9764 office 404-576-5960 cell Richard.Hackman@nerc.net 50
5
52
53
Dinner Discussion Why should we use Root Cause Analysis? We want to understand what happened and prevent it from happening again Why not use Apparent Cause Analysis like the Why Staircase? Apparent cause analysis only fixes the one issue We may not have identified the correct issue. Are we confident that what you found won t happen elsewhere? What is the goal of an Apparent Cause Analysis To fix the problem you have right now. What is the goal of a Root Cause analysis? To find effective correction actions that will prevent the problem from happening again. Are within our control and do not cause unexpected problems. What prevents your company from performing Root Cause Analysis? There is a perception that RCA needs is costly and requires extensive training of all parties involved. 54