Simplifying Cause Analysis: A Structured Approach 1944480463, 9781944480462

A step-by-step approach to your successful cause investigation. When the challenge is to get to the heart of a problem,

342 44 8MB

English Pages 236 Year 2019

Table of contents :
Cover......Page 1
Title page......Page 2
Copyright......Page 3
Acknowledgments......Page 4
Table of Contents......Page 5
Introduction......Page 12
References......Page 15
1.1 Two Tools to Begin the Investigation......Page 16
1.1.1 Investigation Effort and Rigor Assessment Tool......Page 17
1.1.2 Checklist Tool......Page 22
2.1 Conceptual Basis for the Cause Road Map......Page 27
2.1.1 Find the Cause Model Compared to Swiss Cheese Model......Page 28
2.2.1 Major Causal Factor Groups (Map 1) of Cause Road Map......Page 32
2.4 Cause Road Map as a Data Trending Source......Page 34
References......Page 36
3.1 Human Performance Related Maps......Page 37
3.1.1 Cause Road Map − Map 2 (Human Errors)......Page 38
3.1.2 Cause Road Map – Map 3 (Error Drivers)......Page 42
3.1.3 Cause Road Map − Map 4 (Flawed Defenses)......Page 49
3.1.4 Cause Road Map – Map 5 (Failed Oversight/Flawed Assessments)......Page 54
3.1.5 Cause Road Map – Map 6 (Latent Errors/Flawed Decisions)......Page 56
References......Page 64
4.1 Machine/Material/Hardware Failure Related Maps......Page 65
4.1.1 Machine/Material/Hardware Failure Related Map A......Page 67
4.1.2 Machine/Material/Hardware Failure Related Map B......Page 72
References......Page 77
5.1 Using the Cause Road Map......Page 78
5.1.1 Step 1......Page 80
5.1.2 Step 2 – Human Error......Page 81
5.1.3 Step 3 – Error Drivers......Page 82
5.1.4 Step 4 – Flawed Organizational or Programmatic Defenses......Page 83
5.1.5 Step 5 – Flawed Assessment Capability......Page 84
5.1.6 Step 6 – Latent Management Practice Weaknesses......Page 85
5.1.7 Step 7 – Repeat for another observation......Page 86
5.1.8 Step 8 – Safety Culture......Page 87
5.2.1 Human Performance Evaluation Briefing Report......Page 88
5.2.2 Hardware/Material/Design Failure Evaluation Briefing Report......Page 89
5.2.3 Human Performance Evaluation driven from previous Failure Evaluation......Page 91
5.3 Capturing the Details......Page 92
References......Page 99
6.1 Theoretical frame......Page 100
6.1.2 Rule-Based......Page 101
6.1.4 Slips, Mistakes, and Violations......Page 102
6.2.1 Skill-Rule-Knowledge (Balance Beam) Tool......Page 103
6.2.1.1 Corrective Actions Checklist for Skill-Based Errors......Page 109
6.2.1.3 Corrective Actions Checklist for Knowledge-Based Errors......Page 110
6.2.2 Substitution Test Tool......Page 111
References......Page 113
Chapter 7: Effective Interviewing......Page 114
7.1 Planning the Interview......Page 115
7.3.1 Open-ended Questions......Page 118
7.3.3 Follow-up Questions or Questioning to the Void......Page 119
7.3.4 Other Tips to a Successful Interview......Page 120
7.5 Recording......Page 121
References......Page 123
8.1 Tool Types and Use Matrix......Page 124
8.2 Comparative Event Line......Page 126
8.3 Failure Modes and Effects Analysis......Page 129
8.4 Barrier Analysis......Page 132
8.4.1 Classifying Barrier Functions......Page 133
8.4.3 Barrier Analysis Display Format......Page 134
8.4.4 Advantages and Weaknesses of Barrier Analysis Tool......Page 136
8.5.1 The Five Steps in Change Analysis......Page 137
8.5.2 A Cause Analysis Example......Page 139
8.6 Task Analysis......Page 143
8.6.2 Steps in Task Analysis......Page 144
8.6.3 Task Analysis Example......Page 145
8.6.4 Advantages and Weaknesses of Task Analysis Tool......Page 146
8.7.1.1 Approach to the Seven Events......Page 147
8.7.1.2 Suggested Common Cause Analysis Method......Page 148
8.8 The 5 Why’s/Why Staircase......Page 153
References......Page 156
9.1 Causal Factor Trees......Page 157
9.1.1 Advantages and Disadvantages of Causal Factor Tree......Page 159
9.2 Events and Causal Factors (E&CF) charts......Page 160
9.2.2 Considerations......Page 161
9.2.4 Formatting......Page 162
9.2.5 The Dump Truck Accident E&CF Chart......Page 165
9.2.6 E&CF Charting Advantages and Disadvantages......Page 167
References......Page 168
10.1 Investigation Tools Integration Protocol......Page 169
10.2 Commonalities Matrix......Page 171
11.1 Examining the Broader Implications of an Event......Page 177
11.2 Step 1: Define Same and Similar Conditions......Page 179
11.2.2 Extent of Cause Assessment......Page 180
11.3 Step 2: Determine the Potential Consequences and Risks......Page 181
12.1 Eight Basic Elements of Corrective Actions......Page 189
12.2 Effective Corrective Action Structure Elements......Page 190
12.3 Ten Criteria for Effective Corrective Actions......Page 191
12.4 SMART Corrective Actions......Page 193
12.6 Relative Effectiveness of Types of Corrective Actions......Page 194
12.7 Measurable Corrective Actions......Page 195
12.8 Assigned and Accepted Corrective Actions......Page 196
12.9 Realistic Corrective Actions......Page 197
12.10 Timely Corrective Actions......Page 198
12.11.1 Assigning Effectiveness Reviews......Page 200
Recommended Reading......Page 205
13.1 Points to Remember when Reporting Investigation Results......Page 206
13.2 Use of Graphics and Pictures in Reports......Page 207
13.3.1 Executive Summary......Page 208
13.3.2 Problem Statement......Page 209
13.3.3 Investigation Scope......Page 210
13.3.7 Event Significance......Page 211
13.3.8 Review of Operating Experience......Page 212
13.3.10 Use of Charts and Graphics......Page 213
13.3.12 Human Performance Précis......Page 214
References......Page 217
Chapter 14: Final Thoughts......Page 218
About the Author......Page 221
Credits......Page 223

Recommend Papers

Computer Architecture: A Structured Approach 0122208501, 9780122208508

Computer Architecture: A Structured Approach (A.P.I.C. studies in data processing ; no. 15)

201 44 9MB Read more

Root Cause Analysis In Process-Based Industries

115 103 5MB Read more

Structured finance: the object oriented approach 0470026383, 9780470026380, 9780470512722

Structured Finance: The Object Orientated Approach is aimed at both the finance and IT professionals involved in the str

330 69 2MB Read more

Evidence: A Structured Approach (Aspen Casebook) [5 ed.] 1543813615, 9781543813616

121 76 741KB Read more

Troubleshooting and Root Cause Failure Analysis 9780831195892, 9780831136659, 9780831195908, 9780831195885

320 62 5MB Read more

How to Prove It: A Structured Approach [3° ed.] 110842418X, 9781108424189

Proofs play a central role in advanced mathematics and theoretical computer science, yet many students struggle the firs

315 92 8MB Read more

Supply Chain Project Management: A Structured Collaborative and Measurable Approach [1 ed.] 157444350X, 9781574443509, 9780203501474

SCM doesn't change management goals, but relies on new knowledge, practices, and skills to better achieve those goa

447 89 2MB Read more

How to Prove It: A Structured Approach [3 ed.] 110842418X, 9781108424189

Table of contents : Contents Preface to the Third Edition Introduction 1 Sentential Logic 1.1 Deductive Reasoning and Lo

521 86 3MB Read more

How to Prove It: A Structured Approach [3 ed.] 110842418X, 9781108424189

Table of contents : Half Title page Title page Copyright page Dedication Contents Preface to the Third Edition Introduct

912 62 9MB Read more

How to prove it: a structured approach [2 ed.] 9780521675994, 0521675995, 0521861241, 9780521861243

Many students have trouble the first time they take a mathematics course in which proofs play a significant role. This n

347 105 2MB Read more

Simplifying Cause Analysis: A Structured Approach
1944480463, 9781944480462

Author / Uploaded
Chester D Rowe
Kristen Noakes-Fry

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

Sim plifying Ca use Analysis: A St ruc t ure d Approa c h

Che st e r D. Row e Kristen Noakes-Fry, ABCI, Editor ISBN: 978-1-944480-46-2 (Perfect Bound) ISBN: 978-1-944480-47-9 (ePub) ISBN: 978-1-944480-48-6 (PDF eBook)

[email protected] www.rothstein.com Keep informed about Rothstein Publishing: www.facebook.com/RothsteinPublishing www.linkedin.com/company/rothsteinpublishing www.twitter.com/rothsteinpub

Upon registration, paid purchasers of this book are entitled to a free download of the licensed Interactive Cause Analysis Tool. See instructions on back page.

COPYRIGHT ©2018, Rothstein Associates Inc. All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by any means, electronic, mechanical, photocopying, recording or otherwise, without express, prior permission of the Publisher. No responsibility is assumed by the Publisher or Authors for any injury and/or damage to persons or property as a matter of product liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Local laws, standards, regulations, and building codes should always be consulted first before considering any advice offered in this book.

ISBN: 978-1-944480-46-2 (Perfect Bound) ISBN: 978-1-944480-47-9 (ePub) ISBN: 978-1-944480-48-6 (eBook - PDF) Library of Congress Cataloging Number 2017951738

Brookfield, Connecticut USA 203.740.7400

[email protected] www.rothstein.com

Upon registration, paid purchasers of this book are entitled to a free download of the licensed Interactive Cause Analysis Tool. See instructions on back page.

De dic at ion This book is dedicated to: •

•

My father, Kenneth H. Rowe. He gave me more than life. He ignited my quest for learning and taught me I could do anything I set my mind to. My mother, Ruth G. Rowe. She gave me, my brothers, and my sister everything we needed, even to her detriment.

Ack now le dgm e nt s Steven Davis of Fulton, NY, was my business partner for many years, and he helped develop the Cause Road Map concept. Steve and I used and taught this concept together for many years. Steve was co-author of my previous book, The Excellence Engine Tool Kit. David Collamore of Hampton, NH, was my second business partner and is one of my cousins. Dave was the best sounding board for my ideas I ever had.

T a ble of Cont e nt s DEDI CAT I ON ACKN OWLEDGM ENT S I NT RODUCT I ON Re fe re nc e s

CHAPT ER 1 : GET T I N G ST ART ED 1 .1 T w o T ools to Be gin the I nve stiga tion 1 .1 .1 I nve stiga tion Effort a nd Rigor Asse ssme nt T ool 1 .1 .2 Che ck list T ool

CHAPT ER 2 : T HE CAUSE ROAD M AP T AX ON OM Y 2 .1 Conce ptua l Ba sis for the Cause Road Ma p 2 .1 .1 Find the Ca use Mode l Compa red to Sw iss Chee se Mode l 2 .2 I ntroduc ing the Cause Road Ma p 2 .2 .1 Ma jor Causa l Fa c tor Groups (Map 1 ) of Ca use Road Ma p 2 .3 Cause Road Ma p De fine d a s a T a xonom y 2 .4 Cause Road Ma p a s a Da ta T re nding Sourc e Re fe re nc e s

CHAPT ER 3 : H UM AN PERFORM ANCE RELAT ED M APS 3 .1 Hum an Pe rforma nc e Re la ted Ma ps 3.1.1 Cause Road Map − Map 2 (Human Errors) 3 .1 .2 Ca use Roa d Map – Ma p 3 (Error Drive rs)

3.1.3 Cause Road Map − Map 4 (Flawed Defenses) 3 .1 .4 Ca use Roa d Map – Ma p 5 (Fa ile d Ove rsight/Fla w e d Asse ssments) 3 .1 .5 Ca use Roa d Map – Ma p 6 (La tent Errors/Fla w e d De c isions) Re fe re nc e s

CHAPT ER 4 : M ACHI NE/M AT ERI AL/HARDWARE FAI LURE M APS 4 .1 Ma chine/Ma te ria l/Ha rdw a re Fa ilure Re la te d Maps 4 .1 .1 Ma chine /Ma te ria l/Ha rdw a re Fa ilure Re la te d Map A 4 .1 .2 Ma chine /Ma te ria l/Ha rdw a re Fa ilure Re la te d Map B Re fe re nc e s

CHAPT ER 5 : USI N G T HE CAUSE ROAD M AP 5 .1 Using the Ca use Road Ma p 5 .1 .1 Ste p 1 5 .1 .2 Ste p 2 – Hum an Error 5 .1 .3 Ste p 3 – Error Drive rs 5 .1 .4 Ste p 4 – Fla w e d Orga niza tiona l or Progra m ma tic Defe nse s 5 .1 .5 Ste p 5 – Fla w e d Asse ssme nt Ca pa bility 5 .1 .6 Ste p 6 – La tent Ma nage me nt Pra c tice Wea kne sse s 5 .1 .7 Ste p 7 – Re pe a t for anothe r obse rva tion 5 .1 .8 Ste p 8 – Safe ty Culture 5 .2 Displa ying the Re sults 5 .2 .1 Huma n Pe rform anc e Eva lua tion Brie fing Report 5 .2 .2 Ha rdw a re /Ma te ria l/De sign Fa ilure Eva lua tion Brie fing Report 5 .2 .3 Huma n Pe rform anc e Eva lua tion drive n from pre vious Fa ilure Eva lua tion

5 .3 Capturing the De ta ils Re fe re nc e s

CHAPT ER 6 : H UM AN PERFORM ANCE BASI CS 6 .1 T heore tic a l fra me 6 .1 .1 Sk ill-Ba sed 6 .1 .2 Rule -Ba se d 6 .1 .3 Know ledge -Ba sed 6 .1 .4 Slips, Mista k e s, a nd V iola tions 6 .2 T ools & Inte rve ntions Supporting this T he ore tica l Fra me w ork 6 .2 .1 Sk ill-Rule -Know le dge (Ba lance Bea m ) T ool 6 .2 .1 .1 Corre c tive Ac tions Chec k list for Sk ill-Ba se d Errors 6 .2 .1 .2 Corre c tive Ac tions Chec k list for Rule -Ba sed Errors 6 .2 .1 .3 Corre c tive Ac tions Chec k list for Know le dge -Ba sed Errors 6 .2 .2 Substitution T e st T ool Re fe re nc e s

CHAPT ER 7 : EFFECT I V E I NT ERV I EWI N G 7 .1 Pla nning the I nte rvie w 7 .2 Opening the I nte rvie w 7 .3 Conduc ting the I nte rvie w 7 .3 .1 Open-ende d Que stions 7 .3 .2 Close d-e nde d Que stions 7 .3 .3 Follow -up Que stions or Que stioning to the V oid 7 .3 .4 Othe r T ips to a Succ e ssful I nte rvie w

7 .4 Closing the Inte rvie w 7 .5 Rec ording Re fe re nc e s

CHAPT ER 8 : ANALY SI S T OOLS AND T ECH NI QUES 8 .1 T ool T ype s a nd Use Ma trix 8 .2 Com pa ra tive Eve nt Line 8 .3 Fa ilure Modes a nd Effec ts Ana lysis 8 .4 Ba rrie r Ana lysis 8 .4 .1 Cla ssifying Ba rrie r Func tions 8 .4 .2 T ypica l Ba rrie r Ana lysis Chec k list Que stions 8 .4 .3 Ba rrie r Ana lysis Displa y Form a t 8 .4 .4 Advanta ge s a nd Wea kne sse s of Ba rrie r Ana lysis T ool 8 .5 Cha nge Ana lysis 8 .5 .1 T he Five Ste ps in Change Ana lysis 8 .5 .2 A Ca use Ana lysis Ex a mple 8 .5 .3 Advanta ge s a nd Wea kne sse s of Change Ana lysis T ool 8 .6 T a sk Ana lysis 8 .6 .1 T w o T ype s of T a sk Ana lysis 8 .6 .2 Ste ps in T a sk Ana lysis 8 .6 .3 T a sk Ana lysis Ex a mple 8 .6 .4 Advanta ge s a nd Wea kne sse s of T a sk Ana lysis T ool 8 .7 Com m on Cause Ana lysis 8 .7 .1 Using Com mon Ca use Ana lysis: Nuc lea r Pow e r Sta tion Ex a mple

8 .7 .1 .1 Approa ch to the Se ve n Events 8 .7 .1 .2 Sugge ste d Com mon Ca use Ana lysis Me thod 8 .8 T he 5 Why’s/Why Sta irca se Re fe re nc e s

CHAPT ER 9 : EV ENT M ODELI N G AND ANALY SI S T OOLS 9 .1 Causa l Fa c tor T re e s 9 .1 .1 Advanta ge s a nd Disadvanta ge s of Ca usa l Fac tor T re e 9 .2 Eve nts and Ca usa l Fac tors (E& CF) cha rts 9 .2 .1 De fining Proble m s 9 .2 .2 Conside ra tions 9 .2 .3 How to De ve lop the E& CF Cha rt 9 .2 .4 Form a tting 9 .2 .5 T he Dump T ruck Acc ident E& CF Cha rt 9 .2 .6 E& CF Cha rting Advanta ge s a nd Disadvanta ge s Re fe re nc e s

CHAPT ER 1 0 : EV ENT M ODELI NG AND ANALY SI S T OOLS 1 0 .1 Inve stiga tion T ools I nte gra tion Protocol 1 0 .2 Com mona litie s Ma trix

CHAPT ER 1 1 : EX T ENT OF CONDI T I ON AND EX T ENT OF CAUSE 1 1 .1 Exa m ining the Broade r I m plica tions of a n Event 1 1 .1 .1 Thre e -Step Approac h to I mplic a tions 1 1 .2 Ste p 1 : De fine Sa me a nd Sim ila r Conditions 1 1 .2 .1 Ex te nt of Condition Asse ssm ent

1 1 .2 .2 Ex te nt of Ca use Asse ssm ent 1 1 .3 Ste p 2 : De te rm ine the Potentia l Conseque nc e s and Risk s 1 1 .4 Ste p 3 : Ac tions Ma trix

CHAPT ER 1 2 : CORRECT I V E ACT I ONS 1 2 .1 Eight Ba sic Ele me nts of Corre c tive Ac tions 1 2 .2 Effe c tive Corre c tive Ac tion Struc ture Ele me nts 1 2 .3 T en Crite ria for Effec tive Corre c tive Ac tions 1 2 .4 SMART Corre c tive Ac tions 1 2 .5 Spe c ific Corre c tive Ac tions 1 2 .6 Re la tive Effec tive ne ss of T ype s of Corre c tive Ac tions 1 2 .7 Mea surable Corre c tive Ac tions 1 2 .8 Assigne d and Ac ce pted Correc tive Ac tions 1 2 .9 Re a listic Correc tive Ac tions 1 2 .10 T im e ly Corre c tive Ac tions 1 2 .11 Comprehensive Effec tive ne ss Re vie w Pla n 1 2 .11 .1 Assigning Effec tive ne ss Re vie w s Re com me nded Rea ding

CHAPT ER 1 3 : DOCUM ENT AT I ON AND REPORT I N G 1 3 .1 Points to Re m em be r w he n Re porting I nve stiga tion Re sults 1 3 .2 Use of Gra phic s a nd Pic ture s in Re ports 1 3 .3 Re port Spe c ific s 1 3 .3 .1 Ex ec utive Summ a ry 1 3 .3 .2 Proble m Sta te me nt

1 3 .3 .3 Inve stiga tion Scope 1 3 .3 .4 Eve nt Discussion 1 3 .3 .5 Compensa tory and I mm edia te Ac tions 1 3 .3 .6 Ex te nt of Condition/Cause 1 3 .3 .7 Eve nt Significa nc e 1 3 .3 .8 Re vie w of Ope ra ting Ex pe rie nce 1 3 .3 .9 Pre se nting Re sults 1 3 .3 .9 .1 Be fore the Pre se nta tion 1 3 .3 .9 .2 Pre se nta tion Sce na rios 1 3 .3 .10 Use of Cha rts and Gra phic s 1 3 .3 .11 Com m ona litie s Ma trix 1 3 .3 .12 Huma n Pe rform a nc e Préc is Re fe re nc e s

CHAPT ER 1 4 : FI NAL T HOU GHT S ABOUT T HE AUT H OR CREDI T S H OW T O GET Y OUR FREE DOWNLOAD

I nt roduc t ion During a return trip from a conference on root cause analysis, I had the pleasure of sitting next to a mother who was taking her young daughters on their very first flight in a jet plane. As is typical, the three-year-old’s “tool” for finding the “cause” for our plane’s takeoff was to simply ask her mother “why?” While the mother’s answer that jet planes just needed to go very fast was sufficient to satisfy her daughter, aircraft engineers and scientists have gained a much more fundamental understanding of flight. Fortunately, the aviation industry is now much more concerned with why planes have accidents. In fact, because of the adverse social and economic impacts of accidents, most companies and industries are very concerned with preventing failures. This book is intended for experienced individuals with some familiarity with cause analysis projects who are looking for simple and efficient cause investigation methodology. From this prospective, what is needed is an effective and insightful way of asking “why?” While there are numerous investigation tools for identifying the causes for problems, this book introduces a tool (the Cause Road Map) that is rigorous, yet is still intuitively easy to use and remember. As shown in Figure 0-1, studies have shown that most accidents and equipment failures are the result of some sort of human error.

Figure 0-1. Event Causes – Human vs Equipment There are already many scientific tools to help us understand the physical causes for machine failures; the challenge now is to find a way of investigating human performance failure modes. Driven in part by the public outcry over seminal events such as airplane accidents and nuclear power plant accidents, several pioneers, like Jens Rasmussen and Dr. James Reason, have provided several powerful conceptual models to enable us to understand why humans are often a major source of slips, lapses, and mistakes. This book will translate some of these pioneering conceptual models into easy-to-use, cause investigation-related tools and templates. • Chapter 1 will lay out a process to determine the level of effort should encompass. Having identified the kind of investigation needed, a strong, clear structure for conducting it is essential. • Chapters 2 through 5 presents a new and innovative structure to identify the underlying causes for the event (The Cause Road Maps) and conducting the investigation, while Chapter 6 introduces some conceptual human performance models and how to begin focusing on the human behaviors involved. Understanding that human behaviors are a critical part of any investigation, it necessarily follows that carefully interviewing personnel, as discussed in Chapter 7, is of utmost importance. Techniques and tools for this are presented in Chapter 8. • Finally, Chapters 9 through 13 detail how to “put the pieces together:” They analyze and model the event, determine corrective action, and document the investigations and findings.

As noted above, the book introduces the use of the Cause Road Map, which I developed over my years in the business. (Also copyrighted as The Cause Roadmap©). This multi-function event investigation tool provides a structured approach to finding the underlying causes for events. It will provide a comprehensive taxonomy for every cause investigation. It is not, however, intended to be used alone. The Cause Road Map requires the use of other tools provided in this book and by others to organize, analyze, and present the results of your investigation. This book will also present: • Investigation Rigor Selection Tool • Tool Types and Use Matrix • Common Cause Analysis • Commonalities Matrix • Interviewing Techniques • Comparative Event Line • Hardware/Material/Design Failure Evaluation Summary • Failure Modes and Effects Analysis Worksheet • Performance Evaluation Summary • Barrier Analysis • Event and Causal Factor Chart • The Why Tree (Described by and used with permission from Dr. William B. Corcoran) – 5 Why’s, Two Factor & Structured • Change Analysis • Task Analysis Worksheet • Corrective and Preventive Action Development Guidelines • A Rigorous Cause Analyses (RCA) Template • Structured Cause Investigation Report Template • Presentation Tips • A graphic to enable recognition and management of Knowledge-based decisions. Recently, the European Joint Research Center-Institute for Energy (JRC-IE) published a comparative analysis of event investigation methods, tools and techniques. One of their principal conclusions was: “Unstructured processes of root cause analysis put too much emphasis on opinions, take too long, and do not produce effective corrective measures or lasting results” (Ziedelis & Noel, 2011). The purpose of this book is to reduce the time it takes to perform a thorough and insightfull investigation through the use of simple, easy-to-use tools. Although this JRC-IE report did not address the Cause Road Map taxonomy, the above conclusion prompted a critical re-assessment of the structure, precision, and comprehensiveness of the tool set that I presented in my earlier book, The Excellence Engine Tool Kit. The new content in this current book resulted in part from this re-assessment as well as some more recent real-world experiences.

Re fe re nc e s Ziedelis, S. & Noel, M. (2011). Comparative analysis of nuclear event investigation methods, tools and techniques. Retrieved from http://publications.jrc.ec.europa.eu/repository/bitstream/111111111/16341/1/reqno_jrc62 929_jrc-str_fv2011-0513.pdf%5B1%5D.pdf

Cha pt e r 1 Ge t ting St art e d Your organization has just experienced an event. Now you should decide how to go about conducting an investigation that will determine the cause(s) for the event and determine what corrective actions are needed. Before you begin, you need to ask: • What level of effort (people, money, and other resources) do you need to dedicate to the investigation? • Is this a small event with minor consequences? • Is it a catastrophic event with major equipment damage, or is it a precursor event that could have had significant consequences? • What level of management involvement is needed? • Is an investigation team needed? If so, who should lead it? • How big should the investigation team be? • Does this problem have widespread implications? What are some of them? 1 .1 T w o T ools t o Be gin t he I nve st iga t ion This chapter introduces two tools to help your investigation get started in an orderly fashion. • Investigation Effort and Rigor Assessment Tool: The purpose of the first tool is to: o Determine the level of effort to devote to the investigation. o Help minimize the possibility of emotions dictating the level of investigation. o Recommend the level of investigative effort needed based upon consequences, risks, and uncertainties of the cause.

•

Checklist Tool: The second tool is a checklist to help you and your investigation team organize your investigation effort. The checklist serves as a tickler to remind you of the things that should be considered for a successful investigation.

What type of event investigation tools should you use to help identify the cause(s) for the event? Investigation tools are not effective in certain situations. Beginning with the Investigation Effort and Rigor Assessment Tool, before conducting any cause investigation or analysis, you will need to assess the level of effort needed and determine the rigor needed. To avoid “emotional” reactions (i.e., unwarranted levels of effort), as well as missed opportunities (i.e., not enough effort), the investigation needs to be as risk-informed as possible. 1 .1 .1 I nve st iga t ion Effort a nd Rigor Asse ssm e nt T ool The Investigation Rigor Selection Tool (Figure 1-1) provides you with the framework for a riskinformed decision by first assessing the actual or potential consequences or attributes. • Actual or Potential Consequences. The Actual or Potential consequences of the event or condition under consideration (measured as High, Medium, Low, or None): o The probability or likelihood of recurrence for this event or condition, assuming that no corrective actions are taken (Measured as High, Medium or Low). o The level to which corrective and preventative actions are Already Known or can be quickly determined without a cause investigation (Measured as Yes, No or Partial). o The level to which the cause or causes of the event or condition are Already Known or can be quickly determined without a cause investigation (Measured as Yes, No, or Partial).

Note: As the consequences associated with different types of events and conditions differ among various industries and companies, this attribute will require you to develop a set of facility-specific examples for High, Medium, Low and No Consequence events or conditions. Once you have completed the level of risk matrix and uncertainty matrix, you will be ready to select the level of investigation.

Figure 1-1. Investigation Effort and Rigor Assessment Tool

• Level of Effort Output from Rigor Selection Tool The output (Level of Effort) from this Investigation Rigor Selection Tool is one of the following: Com m on T e rm s Root Ca use : Among the many definitions of this term I’ve seen, I prefer the following: “The fundamental cause (or causes) of a chain of events which leads to an outcome or effect of interest.” The investigative effort to discover the root causes of an event is generally referred to as a Root Cause Analysis (RCA). Appa re nt Ca use : In my presentation at the 2007 IEEE/HPRCT joint conference (Apparent Cause Evaluation Myths by Stephen M. Davis and Chester D. Rowe © 2007, The Excellence Engine LLC) I listed 11 different definitions for this term. Since this presentation I’ve identified many more. The only commonality between many of these definitions is use of the phrase “the most probable cause”. The problem is that a “probable cause” is not a “learned cause”. A “probable cause” is basically a “best guess” cause. Therefore, I reject all existing definitions in favor of:

a. Rigorous Analysis (or Rigorous Cause Analysis): A rigorous analysis typically identifies causal factors generally defined as including Latent Organizational Weaknesses.

Latent Organization example: The organization’s management team has not enforced the company’s industrial safety program. This situation has resulted in the recent lost time accident and the industrial accident rate exceeding the company’s accident rate goal. Another example: The maintenance planning, maintenance and engineering organizations are not effectively aligned to implement preventive maintenance plans. This situation has caused delayed and missed preventive maintenance activities resulting in high unavailability of important equipment.

“The cause (or causes) within the chain of events closest to (or proximate to) the outcome or effect of interest.” The investigative effort to discover the apparent causes of an event is generally referred to as an Apparent Cause Evaluation (ACE).

b. Structured Analysis (or Structured Cause Analysis): A structured investigation typically identifies causal factors known as Apparent Causes or Proximate Causes and is generally limited to the identification of error drivers/precursors and Flawed Defenses for human performance-related issues and to hardware failure modes for equipment problems.

Flawed Defense example: Training for check valve maintenance is general in nature and does not address the unique design of the check valve with the fixed hinge design.

Note: A Structured Analysis can identify causal factors just as fundamental as a Rigorous Analysis. Doing so, however, is less likely because the amount of evidence typically collected and analyzed for a Structured Investigation may not be sufficient for you to support the most fundamental causes.

Note: A Common Cause Analysis can be performed within the context of either a Rigorous Analysis or Structured Investigation. Refer to Chapter 5 for a Common Cause Analysis methodology. Note: Rigorous Analysis and Structured Analysis often identify factors that did not directly result in the observed event, but did contribute to its significance, timing, impact or extent. These factors are known as Contributing Causes. Contributing causes should have corrective actions to address. Refer to Chapter 9 for information on corrective actions. c. Limited Inquiry and Correction: A Limited Inquiry typically is simply a checklist-driven “prompt investigation” for human performance errors. You may also use it as a systematic troubleshooting effort for hardware failures. Limited inquiries typically identify the causal factors generally defined as Direct Causes. Human Performance Error example: The operator did not follow the procedure as written. Equipment example: A pump is not pumping at rated capacity. An investigation found a worn impeller wear ring.

d. Correction Only: Your Correction Only effort typically includes conditions in which actions are already completed, do not need to be tracked to completion, or are simply not warranted because of the low consequences and significance. Correction Only Example: A burned-out airport taxiway light. It needs to be promptly replaced, but there is no need to investigate why it burned out. However, if there is a preventive maintenance program to periodically replace the lights, an investigation would be warranted to verify the program is being properly implemented or if the light is burning out frequently.

To better understand how the tool works let’s screen an event using the Investigation Rigor Selection Tool to determine the level of rigor for the investigation. Another Example: A phase to ground fault, located near the line side connection on a molded case circuit breaker, resulted in a near miss injury to the operator. A catastrophic failure of the Motor Control Center breaker bucket occurred when the operator closed the breaker. The operator was protected by proper Personal Protective Equipment and by standing to the side while operating the molded case circuit breaker. Although the molded case circuit breaker tripped, collateral damage to the other phases of the incoming line resulted in an approximately onehour delayed trip of the Unit Substation feeder breaker and temporary loss of other important plant equipment.

You can see in this example in Figure 1-2 that there is still subjectivity with using the Investigation Effort and Rigor Assessment Tool. However, as an investigator, you can start

thinking about the event in terms of the consequences, risk, and uncertainty before recommending the level of investigative effort. Previous use of this method in a wide variety of investigations has shown that, with practice using the tool, organizations can make consistent recommendations and help focus limited resources on investigating the events that truly warrant the effort.

Figure 1-2. Investigation Effort and Rigor Assessment Tool Example

1 .1 .2 Che c k list T ool Every successful investigation starts with a plan. When constructing a building, the contractor starts with a blueprint. The same is true with an event investigation. Right after an event occurs, there is often confusion and finger pointing. Frequently, important information is lost or mishandled. Just like a crime scene, the site of the event needs to be secured and information gathered to help determine cause. You and your investigation team need some sort of blueprint to conduct the event investigation. The blueprint needs to meet the organization’s programmatic requirements, and it must result in corrective actions recommendations that will eliminate the causal factors. Where does the investigation blueprint come from? Event investigation programmatic requirements such as report format, approvals required, and timeliness are usually contained in the organization’s administrative procedures. Many organizations employ checklists for other blueprint items. The following are typical checklist topics. • Manager Sponsor Determine who will be assigned responsibility for the overall quality, scope, and thoroughness of the investigation. The manager sponsor is often the person responsible for approving the investigation report. • Brief Event Description Provide a brief description of what happened. • Problem Statement Write a short statement, usually in object-defect formant, that will define the focus of the investigation (see Chapter 10 for additional information). • Investigation Scope (including extent of condition assessment scope) Describe how widespread the problem is. Define any related programs, processes, equipment or organizations affected by this problem. A key to a successful investigation is to define clearly the boundaries of the investigation (see Chapter 8 for a tool to help define the investigation scope). • Investigation Lead (and, if appropriate, team members) Decide who should be on the investigation team and what their roles should be. • Quarantine list Determine what information related to the event need to be segregated for a more detailed investigation.

 Scene Barricade the scene to limit accessibility.

 Evidence Isolate broken/damaged parts to prevent handling. Secure logs, charts recordings, computer data, and security video to prevent overwriting or tampering.

• Interviewee list Determine who needs to be interviewed. From the list, develop an interview plan. Who will conduct each interview, what questions will be asked, and in what order will the interviews be conducted.  Witnesses

 Management

 Participants

 Responders

• List of information to be gathered Determine what information is needed to conduct the investigation. The following is a list of the types of information required for event investigations.  Written statements Do written statements need to be obtained from personnel involved with the event or witnesses?  Records •

• •

•

• •

•

•

 Computer data •

•

Past

•

Past

Current

Past Work Orders

 Chart recordings

Past Modifications

 Parts

Procedure Changes

 Work Orders

Organization Changes

 Debris (Damaged Equipment) Caution − preserve evidence to minimize handling.

Area access records Correspondence

 Environment •

 Logs • Current

Temperature Lighting

Reference material • •

Expert Assistance List

Atmosphere Other Hazards

• • •

Drawings Specifications Technical manuals

 Manufacture

 Chemistry

 Engineering

 Transportation

• • •

Procedures Photographs Evidence

 Training

 Metallurgy

 Operations

 Vibration

 Other

•

Investigation Command Headquarters Determine what tools the team needs. This list is structured (more or less) for use by a team of investigators, but it can also be applied to single person investigations.  “Dedicated War Room” (For Root Causes)

 Photocopier

 Team meeting Schedule (For Root Causes)

 Evidence filing system

 Management Briefing Schedule (For Root Causes)

 Team Log

 Camera

 Whiteboards

 Telephones

 Administrative support person(s)

 Computers

•

Investigation Plan Determine what event investigation tools should be used by the investigation person or team.  Re-enact •

Simulator

•

Dry Run

 Arrange Timeline chronologically or sequentially •

Create a Comparative Event Line

•

Use the Cause Road Map for every Comparative Event Line disparity

•

Use Interviews, Barrier Analysis, Change Analysis or other techniques / tools familiar to the investigator or team to help answer Cause Road Map questions.

 Display the results with an Event and Causal Factor Chart  Display the results with Why Staircase Tree Diagrams

The lists may appear overwhelming. However, not every item needs to be considered for every investigation. It is meant to be a tickler to help you, as the investigator, to think of things that should be considered an investigation. The checklist should be tailored to fit the needs of your investigation. For significant events, most of the items should be considered.

Sum m a ry

 This chapter provided the Investigation Effort and Rigor Assessment Tool to help determine the level of effort appropriate for an investigation of an event based on its actual and potential consequences as well as the likelihood of its recurrence. The output from this tool is one of the following: 1. Rigorous Analysis: Intended to identify and rigorously demonstrate the most fundamental causes (i.e. Root Causes or Fundamental Causes). I also use the term Rigorous Analysis to make a clear distinction between the cause investigation Process and the Results of this process. 2. Structured Analysis: Identifies causal factors known as Apparent Causes or Proximate Causes, which are generally limited to the identification of error drivers/precursors and flawed defenses for human performance-related issues and to hardware failure modes for equipment problems. However, this level of effort can identify the most fundamental causes for an event. Again, I use the term Structured Analysis to make a clear distinction between this cause investigation Process and the Results of this process. 3. Limited Inquiry and Correction: A Limited Inquiry typically is just a checklist-driven, “prompt investigation” for human performance errors. It may also be a systematic troubleshooting effort for hardware failures. Limited inquiries typically identify causal factors generally defined as Direct Causes. 4. Correction Only: The Correction Only effort typically includes conditions where actions have been completed already, do not need to be tracked to completion, or are simply not warranted because of their low consequences and significance.

 This chapter provided a checklist to help organize your cause investigation efforts.

Look ing Forw a rd The next chapter will introduce the Cause Road Map© Taxonomy, its six causal factor groups, and the basis for this taxonomy. T a x onom y: Derived from Greek word ‘taxis,’ meaning arrangement or division, and ‘nomos,’ meaning law, is the science of classification per a predetermined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis or information retrieval.

Cha pt e r 2 T he Ca use Roa d M a p T a x onom y In Chapter 1, you saw how to determine both the level of effort and the accompanying type of investigation needed to identify the cause of an event. You were introduced to various tools that you can use to help gather information during the investigative process and the optimum function(s) for each tool. However, as an investigator, how do you know how to begin using these tools? This chapter introduces: • The conceptual basis for the Cause Road Map. • The major causal factor groups (Map 1) of the Cause Road Map. • The data trending capabilities of the Cause Road Map taxonomy. 2 .1 Conc e pt ua l Ba sis for t he Ca use Roa d M a p An earlier effort to develop a conceptual event model for determining the cause of an incident was the Find the Cause Model (Figure 2-2). This model was built on: 1. US Department of Energy (DOE) Barrier Analysis Model (2000) 2. Ishikawa’s Fishbone (Cause & Effect) Diagram (1985). 3. James Reason’s Swiss Cheese Model (2000). 4. The Institute of Nuclear Power Operations (INPO) Anatomy of an Event Model. While the first two works contributed to the model used in this book, you do not need to understand them to understand the Cause Road Map. Therefore, we will discuss only the elements from the last two: James Reason’s Swiss Cheese Model and the INPO Anatomy of an

Event Model that I first used to develop my Find the Cause Model and then used to develop the Cause Road Map. Figure 2-1 is my interpretation of James Reason’s Swiss Cheese Model.

Figure 2-1. Rowe Version of the “Swiss Cheese” Model

2 .1 .1 Find t he Ca use M ode l Com pa re d t o Sw iss Che e se M ode l My Find the Cause Model (Figure 2-2 and Table 2-1) adds more detail to my version of Reason’s Swiss Cheese Model in Figure 2-1. The Find the Cause Model in Figure 2-2 has six primary elements: 1. Human Errors (i.e., observable errors, active errors, or performance factors). • Equates to Figure 2-1“Unsafe Acts.”

2.

3.

4.

5.

6.

Human Error Drivers (i.e., pre-existing setup factors such as habits or work practices that drive human performance). • Equates to Figure 2-1 “Preconditions for Unsafe Acts.” Flawed Organizational or Programmatic Defenses (i.e., procedures, instructions, training, supervisors, corrective action program use, etc.). • Equates to Figure 2-1 “Unsafe Supervision.” Flawed Assessment Capability (i.e., inspections, self-assessments, management observations etc.). • No equivalent within Figure 2-1 model. Latent Management Practice Weaknesses (i.e., management behaviors and decisions that drive the organization). • Equates to Figure 2-1 “Organization Influences.” Machine/Material / Hardware Failures (i.e., broken and malfunctioning items).

•

Again, there is no equivalent within Figure 2-1 model.

Figure 2-2. Find the Cause Mode

Table 2-1. Six Areas from Which Problems Originate Human Error (Symptoms) Human errors are often symptomatic of underlying problems. Human errors are observable Actions of an individual or individuals that were incorrect or inappropriate For the Specific circumstances or conditions at a given time. Included are performance errors or gaps, such as work practices, communication, coordination and planning.

Human Error Drivers (Error Precursores) Error drivers are the pre-existing conditions that cause human reactions to their environment. These SETUP conditions existed before and were the immediate drivers for the observed human error. Human error drivers include: Task Demands Work Environment Individual Capabilities Human Nature

Flawed Organizational OR Programmatic Defenses (Failed Barriers) Programs and processes are the barriers that organizations establish to ensure consistent performance, and to prevent individuals from failing. This element includes: • procedures / programs / training • man-machine interface •

supervision / organizational structure

•

corrective action program

Flawed Assessment Capability (Failed Oversight) Many organizations have a process or an internal organization to monitor and assess their performance. Effective assessment processes will catch and correct weaknesses and small problems before they become consequential. This element looks at failed oversight attributes including: • test & inspections • quality inspections / surveillances / audits Latent Management Practice Weaknesses (Latent Errors / Flawed Decisions) Manager’s primary duty is to make decisions. This element includes: • command and control • organizational design / responsibility assignment • •

communications guidance

• •

planning resource allocation/availability

Machine / Material / Hardware Failure Related (Hardware Failures) This element looks for equipment failures in the following areas: • Expected, external or other failures • Inappropriate service conditions • • •

Inadequate maintenance Improper assembly or installation Improper operations

• • •

Inadequate design Inadequate manufacture Inadequate materials

2 .2 I nt roduc ing t he Ca use Roa d M a p By itself, the Find the Cause Model does not prove to be a sufficient investigation tool for cause investigations. However, the Find the Cause Model does form the basis for a new, multi-function tool, the Cause Road Map, which provides you a means to: • Identify the categories of questions to ask when planning interviews and gathering data. • Assess and validate causes during an investigation. • Select cause codes for use in trending. • Check an analysis for quality, rigor, or missing elements/information. Building on the Find the Cause Model concepts, the Cause Road Map is a series of hierarchical logic steps (displayed in six different “Maps”) created to aid you in identifying the underlying causes for both equipment failures and human performance errors. Each of the Cause Road Map figures also provides a tie to associated cause codes. These codes, in turn, provide you with an intuitive framework (each code is a mnemonic for the causal factor it represents) for discussion, trend analysis, and information retrieval. 2 .2 .1 M a jor Ca usa l Fa c t or Groups (M a p 1 ) of Ca use Roa d M a p At this point, it is important to emphasize that no single model or tool can cover all possibilities. The Cause Road Map (Figure 2-3) was built to be as intuitive and easy to use and remember as possible, while still covering as many human performance and hardware failure modes as possible. By necessity, the primary balancing points in The Cause Road Map represent a compromise between ease of use and depth of detail. Other balancing points include: • The Likelihood of Each Failure Mode For example, willful criminal acts are not covered in depth because they are unlikely within the types of events that would typically be investigated using the Cause Road Map. Likewise, hardware failure modes are only those for which we already have a significant amount of data.

•

The Lucidity and Feasibility of Corrective Actions

For example, the Cause Road Map does not suggest causal factors driven by an individual’s psychosis because corrective actions for psychiatric disorders typically fall outside the capabilities of management control systems. Likewise, corrective actions for hardware failures must be within the capabilities of existing technologies (e.g., requesting a grant for basic scientific research is not an option).

The Cause Road Map guides you, as an investigator, to look at the event in a systematic manner. For example, if the event involves failed equipment, you will first examine the hardware failure modes from the most likely failure modes to the least likely failure mode, and then, if applicable, examine the human performance failure modes. For human performance failures, the Cause Road Map guides you to look first at the individual, next at error drivers, then at flawed defenses,

then at assessment and inspections, and finally at management practices.

Figure 2-3. Cause Road Map – Major Causal Factor Groups – Map 1 The Cause Road Map in Figure 2-3 may be compared to a road map of an entire country. Such a map does not provide sufficient detail to find a specific city street; thus, a detailed city street map is needed. Similarly, the initial look at the Cause Road Map does not provide the needed level of detail. To proceed with your investigation, you need the “street maps,” and so the full Cause Road Map contains eight maps: • Map 1 – The six major causal factor groups. • Map 2 – Human Error details. • Map 2 – Error Driver details.

• • • •

Map 4 – Flawed Defense details. Map 5 – Flawed Assessment Capability details. Map 6 – Latent Management Practice Weakness details. Map A – Machinery/Material/Hardware Related details.

• Map B – Machinery/Material/Hardware Related continued details.

2 .3 Ca use Roa d M a p De fine d a s a T a x onom y The Cause Road Map is also constructed to be a taxonomy as well as a cause investigation tool. Toward this end, the map’s Cause Codes and graphical structure separate the various causal factors into groups with common characteristics and then sub-divides these groups in to a hierarchical order based on their relative significance or importance. To be considered a true taxonomy, the Cause Road Map needs to “include all possibilities.” That requirement is satisfied by the fact that: • Nearly every section contains an “other” group or subgroup. • Its structure can and should be augmented with subgroups specific to the user’s needs. With this taxonomy concept in mind, your maxim as an investigator should be: Caution: Performing a cause investigation/evaluation outside of a valid taxonomy is unlikely to achieve a valid conclusion.

2 .4 Ca use Roa d M a p a s a Da t a T re nding Sourc e The Cause Road Map is also a trending code source. Many organizations have failure reporting or corrective action programs, which include a coding method to collect demographic data about events for future reporting and trending. Users often retrieve demographic codes, such as equipment type, responsible organization, batch numbers, and problem cause, which are then sorted to look for emerging performance trends. The Cause Road Map provides you with an easy method to select a problem cause code for trending purposes. In addition to being a taxonomy, the Road Map contains a unique code for each stop on the map. Each code is labeled as a cause type and is a mnemonic: • All the equipment Cause Codes start with the letter M, for Machinery. • Individual or team human performance causes start with the letter H for Human. • Equipment causes related to manufacturing problems are label as MU for Machine ManUfacture. • Individual human performance causes related to document use are labeled HID for Human Individual Document. In support of your organization’s more specific needs, a finer level of detail can be obtained by simply appending another character to the end of the existing code sequence. For example:

If an organization needed to be more specific regarding the type of document that was not used properly, adding a P and an A to HID to make HIDPA would translate to Human Individual Document Procedure Administrative.

Because the Cause Road Map is a taxonomy and uses mnemonics for cause codes, it is an investigation tool that will help you to ensure consistency with cause code selection, which, in turn, will provide better trending results. The power of this type of taxonomy for insightful trending has been tested and proven more effective that the alphabetically or numerically ordered listings of codes that is typical for most trending programs.

Sum m a ry

 The Find the Cause Model was built on the Institute of Nuclear Power Operations’ (INPO) Anatomy of an Event Model; James Reason’s Swiss Cheese Model, the Department of Energy’s (DOE) Barrier Analysis Model; and Ishikawa’s Fishbone (Cause & Effect) Diagram.  The Cause Road Map turned the Find the Cause Model into an effective investigation tool for cause investigations.  The Cause Road Map was built to be as intuitive and easy to use and remember as possible, yet still cover as many human performance and hardware failure modes as possible.  The Cause Road Map’s cause-coding taxonomy has been tested and proven more effective that the alphabetically or numerically ordered listings of codes that are typically used.

Look ing Forw a rd

 Chapter 3 will provide details for the following individual maps: • Map 2 – Human Error details. • Map 2 – Error Driver details. • Map 4 – Flawed Defense details. • Map 5 – Flawed Assessment Capability details. • Map 6 – Latent Management Practice Weakness details.

Re fe re nc e s Ishikawa. K. (1985) What is total quality control. (Lu, D. J., trans.). Englewood Cliffs, NJ: Prentice-Hall Inc. Reason, J. (2000, March 18). Human error: models and management. BMJ 2000;320:768-770. US Department of Energy. (2000). Accident investigation workbook. Retrieved from http://158.132.155.107/posh97/private/AccidentPhenonmenon/investigationworkbook.pdf

Cha pt e r 3 H um a n Pe rform a nc e Re la t e d M a ps In Chapter 2, we introduced the Cause Road Map, a summary of how it was developed and its trending capabilities. This chapter introduces five human performance-related detailed maps and some examples for each. 1. Map 2 − Common Individual or Team Related (Symptoms / Active) Errors. 2. Map 3 − Common Error Drivers (Pre-existing / Setup Conditions). 3. Map 4 − Common Flawed Organizational or Programmatic Defenses. 4. Map 5 − Common Assessment Capability Team Related Errors. 5. Map 6 − Common Latent Organizational Weaknesses. 3 .1 H um a n Pe rform a nc e Re la t e d M a ps The Major Causal Factor Groups (Map 1) of the Cause Road Map were introduced in Chapter 2. This chapter introduces Maps 2-6, which display five human performance-related casual factor groups. The maps on the following pages display each of these human performance-related causal factor groups in detail, giving examples for each major group.

3 .1 .1 Cause Road Map − Map 2 (Human Errors)

Figure 3-1. Map 2 Icons (LTA=Less Than Adequate) The following Human Errors – Symptoms/Active Errors map (Figure 3-2) recognizes that direct human performance errors occur in three basic modes: 1. Direct active Individual errors. 2. Direct active Team errors (i.e., errors made collectively by multiple individuals). 3. First order indirect errors driven by the way the Work or task was planned. •

These work task planning errors are included in this map because they are the next logical causal factors for many hardware or material failures.

Figure 3-2. Human Errors – Symptoms/Active Errors (Map 2) with Examples

Table 3-1. Common Individual or Team Related (Symptoms/Active) Errors. Common Individual or Team Related (Symptoms/Active) Errors − Page 1 Individual Work Description Practices or Habits (HI) An individual failed to use or did not properly use an appropriate Self-checking/Error detection or verification technique. This includes the improper use of Stop-Think-ActError Detection HIE Review (STAR), “X” the step procedure place keeping or any other error detection (Individual) technique that the worker has been trained to use. An up-to-date and accurate Document was not referred to or used as intended. This includes failures to review “reference use only” procedures for infrequently performed tasks as well as missing a step in a system assembly procedure. An “accurate HID Document Use document” can take many forms including procedures, work orders, technical manuals or any documented technically accurate set of instructions for performing a task. Equipment / Material HIU

Use Worker

Preparation

Other Individual

HIP

HIO

Errors

Team Coordination or Work Performance (HT)

HTVV

Verbal

HTVU

Communication HTVT HTVC HTVO

An individual was not properly eqUipped for a task. This includes using the wrong tools or parts as well as not using the proper personnel protection equipment. An individual was physically or mentally unPrepared for the task. This includes tasks that are beyond the physical or mental abilities of the worker as well as when the worker did not study lesson materials needed to perform a task. An individual’s work practices, habits or Other individual work practice errors not covered by another of the previous criteria were inappropriate for the task. This includes an individual’s active error or failure to act because of bad habits or practices. Description V Verbal communications to or between the task performers were inaccurate or incomplete. This includes errors like failing to use: - clarifying questions - read back and verify - phonetic alphabet and numeric clarifications U Task performers failed to fully or accurately Understand a verbal communication. T Verbal communications with or between task performers were unTimely. C Verbal communications were conducted in a place or under Conditions that did not allow the task performers to fully understand them. O Some Other verbal communication error was made that resulted in a task performance error.

Common Individual or Team Related (Symptoms/Active) Errors − Page 2 Team Coordination or Work Performance (HT)

Error Detection

HTE

(Team)

Coordination or

HTC

Coaching

Other Team Errors

HTO

Work Organization or Planning Related (HW)

Work Planning

Insufficient Time was given to do the task properly

HWP

P

Insufficient time was given for the performers to adequately Prepare for the task.

HWI

Work Organization Work

Planning Errors

Description T

HWA

Other

The task team failed to use peer-checking, independent verifications or another verification technique properly. This includes the improper use of Peer Checking, Independent Checking or any other peer or oversight related Error detection technique that workers have been trained to use. Coordination or interactions with or between task performers or with supervision was inappropriate or inadequate. This includes missed peer-to-peer or Supervisory coaching opportunities as well as task delays or task performance mistakes due to miss-sequenced. Other team related task performance errors not covered by another of the previous criteria.

HWT

HWS

or Oversight

Description

HWO

HWW

I At the point of implementation, a work document or task Instruction did not adequately address the task to be performed or the conditions under which it was to be done. S Work planning, Scoping or preparation was deficient or otherwise inappropriate for the task performance Conditions or the task performer’s capabilities. A Task Assignments were inappropriate for the number of task performers or for their skills or capabilities. O Task Oversight was inadequate for the number of task performers or their skills and capabilities. This includes inexperienced on untrained supervisors as well as active oversight errors by experienced supervisors. Other Work planning or organization errors not covered by another of the previous criteria.

3 .1 .2 Ca use Roa d M a p – M a p 3 (Error Drive rs)

Figure 3-3. Map 3 Icons In the Error Drivers – Error Precursors/Setup Factors (Figure 3-4) map and Table 3-2, I have expanded, re-ordered, and refined the TWIN behavioral model published by the Institute for Nuclear Power Operations (INPO) (2006). I based this refinement on my own research, and it was included in a previous version of the Cause Road Map (copyright 2005). TWIN is shorthand for: 1. Task Demands – specific mental, physical, and team requirements to perform an activity that may either exceed the capabilities or challenge the limitations of human nature of the individual assigned to the task; for example, excessive workload, hurrying, concurrent actions, unclear roles and responsibilities, or vague standards. 2. Work Environment – general influences of the workplace, organizational, and cultural conditions that affect individual behavior; for example, distractions, awkward equipment layout, complex procedures, at-risk norms and values, or cavalier workgroup attitudes toward various hazards. 3. Individual Capabilities – unique mental, physical, and emotional characteristics of a particular person that fail to match the demands of the specific task; for example, unfamiliarity with the task, unsafe attitudes, lack of education, lack of knowledge, unpracticed skills, unsociability, inexperience, health and fitness problems, poor communication practices, or low self-esteem. 4. Natural Tendencies – generic traits, dispositions, and limitations common to all human beings that may incline individuals to err under unfavorable conditions; for example, habit, short-term memory, fatigue, stress, complacency, or mental shortcuts.

Table 3-2. TWIN Model Institute for Nuclear Power Operations, 2006 • • • • • • • • • • • • • • • •

Task Demands Tie pressure (in a hurry) High workload (large memory) Simultaneous multiple actions Repetitive actions/monotony Irreversible actions Interpretation requirements Unclear goals, roles, or responsibilities Lack or unclear standards Work Environment Distractions/interruptions Changes/departure from routine Confusing displays or controls Work-arounds/Out of Service instrumentation Hidden system/equipment response Unexpected equipment conditions Lack of alternative indication Personality conflict

• • • • • • • • • • • • • • • •

Individual Capabilities Unfamiliarity with task/first time Lack of knowledge (faulty mental model) New techniques not used before Imprecise communication habits Lack of proficiency/inexperience Indistinct problem-solving skills Unsafe attitudes Illness or fatigue; general poor health Human Nature Stress Habit patterns Assumptions Complacency/overconfidence Mind-set (intentions) Inaccurate risk perception Mental shortcuts of biases Limited short-term memory

The TWIN model in Table 3-2 has been presented as a standalone tool in many human performance or cause investigation related documents or manuals since it was first introduced by INPO. In fact, many Nuclear Power Plants use the INPO version of the TWIN model as a form or checklist within their cause investigation procedures and manuals. In my experience, when TWIN model-based forms and checklists are completed, most investigators select several of the error precursors or setup factors listed. However, when the checklist is applied to an actual event, investigators seldom fully investigate more than one of the precursors or setup factors. They do not use more because typically, the results are not required to be fully integrated into the cause-and-effect logic chain for the event. Therefore, I integrated a refined version of the TWIN model directly into the Cause Road Map.

Figure 3-4. Error Drivers – Error Precursors/Setup Factors (Map 3) with Examples

Table 3-3. Common Error Drivers (Pre-existing/Setup Conditions) Common Error Drivers (Pre-existing / Setup Conditions) − Page 1 Individual Capabilities (EI)

Description HABIT OR ATTITUDES EIHH

H

Actions driven by poor work abits or Attitudes including: • Imprecise communication • “Can Do” attitude for crucial tasks

EIHA

• Inappropriate values / motivation

H The task performer’s communication Habits, values or A Attitude were inappropriate for the task. Examples include: • Verbal communication habits or means that do not enhance accurate understanding by all members involved in an exchange of information. • Personal belief in prevailing importance of accomplishing the task (production) without consciously considering associated hazards. • Perception of invulnerability while performing a particular task. • Personal belief that accomplishing a task to the best of your abilities is not important.

TRAINING OR QUALIFICATIONS

Actions driven by a lack of

EITT

Training or Qualifications for the task including: • Inadequate training • Lack of proficiency • Inexperience • Unfamiliarity with task • First time performance • Indistinct problem-solving skills

T The task performer’s knowledge, Training or experience level was inappropriate for the task. Examples include: • Did not receive sufficient instruction / training to adequately understand or to establish an accurate mental model of the task to be performed. • Unawareness of factual information necessary for successful completion of task. Lack of practical knowledge about the performance of a task. • Degradation of Knowledge or skill with a task due to infrequent performance of the activity.

Q The task performer’s Qualifications, skills, proficiency or problem solving abilities were inadequate for the task. Examples include:

EITQ

• Unawareness of task expectations or performance standards. First time to perform a task (never; not performed in given time; serious procedure change). • Unsystematic response to unfamiliar situations; inability to develop strategies to resolve problem scenarios without excessive use of trial-and-error or reliance on previously successful solutions. • Unable to cope with changing plant conditions.

INDIVIDUAL CAPABILITIES Actions driven by inadequate

Individual Capabilities including:

• Unawareness of critical parameters • Less than adequate attention to details • Inadequate mental tracking (forgot to do) • Illness / Fatigue • Other individual capabilities drivers

EIO

Other individual capabilities drivers not covered by another of the previous criteria. Examples include: • An overload of the mind’s “storage buffer” for Recalling short-term information. • Degradation of a person’s physical or mental abilities due to a Sickness, disease, or debilitating injury. • Lack of adequate physical rest (Fatigue) to support acceptable mental alertness and function. • Individual capabilities that adversely impact the ability to use the human senses or to concentrate during performance of a task.

Common Error Drivers (Pre-existing/Setup Conditions) − Page 2 Natural Tendencies / Human Description Nature (EN**) ASSUMPTIONS, MIND SETS OR EMOTIONAL STATE The task performer's flawed decisions were motivated by Assumptions, mindsets, tunnel vision, complacency or similar natural tendencies. Examples include: • Human tendency to look for or see patterns in unfamiliar situations; application of thumb-rules or “habits of mind” (Heuristics) to explain unfamiliar situations: Including: Actions driven by

Assumptions, Mind Sets or Emotions including: • Mental biases • Mind set (intentions) • Tunnel vision (lack of big picture)

ENA

• Complacency / Overconfidence • Inaccurate risk perception

 confirmation bias

 similarity bias

 overload bias

 order bias

 oversimplification bias

 close in time

 frequency bias

• Suppositions made without verification of facts, usually based upon perception of recent experience; Believed to be fact; Stimulated by inability of human mind to perceive all facts pertinent to a decision; similar to attempting to see all the objects in a locked room through a door’s keyhole. • Tendency to “see” only what the mind is tuned in to see, preconceived idea. Information that does not fit a Mindset may not be noticed and vice versa; may miss information that is not expected or may see something that is not there; contributes to difficulty in detecting one’s own error(s). • Ingrained or automated pattern of actions attributable to Repetitive nature of a well-practiced task; Inclination formed for particular train/unit due to similarity to past situations or recent work experience. • Human tendency to ignore anything not directly related to the immediate task at hand. • A “Pollyanna” effect leading to a presumption that all is well in the world and that everything is ordered as expected. Self-satisfaction or Overconfidence, with a situation unaware of actual hazards or dangers; particularly evident after 7-9 years on the job. Underestimating the difficulty or complexity of a task based upon past experiences with task. • Personal appraisal of hazards and uncertainty based on either incomplete information or assumptions. Unrecognized or inaccurate understanding of a potential consequence or danger; degree of risk-taking behavior based upon individual’s Perception of possibility of error and understanding of consequences.

ATTENTIVENESS OR COGNITIVE CAPABILITIES Actions driven

ENC

The task performer's ability to make decisions impaired by a limited attention span, memory lapse, confusion, or other diminished Cognitive capabilities. Examples include: • The mind’s “workbench” for problem-solving and decision-making; the temporary, attention-demanding storeroom we use to remember new information involved during learning, storing, and recalling information; Forgetful; unable to accurately attend to more than 2 or 3 channels of information (or 5 to 9 bits of data) simultaneously. • Human tendency to limit the time spent in deep Concentration.

ENO

Emotional or mental decision-making impairments not covered by anOther of the previous criteria.

by inadequate Cogitative capabilities or Attentiveness including: • Limited short-term memory • Limited attention span (inattention to detail)

Other / Impaired decision making

Common Error Drivers (Pre-existing/Setup Conditions) − Page 3 Work Environment (EW**) Description EQUIPMENT DISPLAYS, LAYOUT OR OTHER HUMAN FACTORS DESIGN D The task performer(s) was/were confused by a Display or indication or the way the information was provided. Examples include: • The characteristics of installed Displays and controls confused or exceed the working memory capability of the individual. Some possible examples include:

Work environment driven errors including: • Confusing displays / Controls • Lack of alternative indication or insufficient information presented • Hidden system response • Poor equipment layout; poor access; poor placement; poor location • Unexpected equipment condition • Work-arounds / Out of Service (OOS) instrumentation • Inadequate design or omission of required control / display labels

EWED

 missing or vague content (insufficient or irrelevant)

 insufficient identification of displayed process information

 lack of indication of specific process parameter

 controls placed close together without obvious ways to discriminate conflicts between indications

 illogical organization and/or layout

EWEL

• Displays or Labels are either missing or otherwise do not provide the user with the needed information where/when it is needed. • System Response invisible to individual after manipulation; lack of information conveyed to individual that previous action had any influence on the equipment or system. L The task performer(s) was/were hindered by poor equipment Layout, access, placement, location or by a system response. Examples not limited to the following: • The Location or placement of equipment inhibits its operation or monitoring its function. • System or equipment status not normally encountered creating an unfamiliar situation for the individual. • Uncorrected equipment deficiency or programmatic defect requiring compensatory or non-standard action by a worker to comply with a requirement; long-term material condition problems. • Inability to compare or confirm information about system or equipment state due to Absence of instrumentation.

PHYSICAL ENVIRONMENT V EnVironmental conditions that adversely impact the ability to use the human senses or to concentrate during performance of a task. Examples include: Errors driven by the physical enVironment including: • Lighting, temperature, humidity, human factors (noise, space, etc.) • Distractions / Interruptions

EWVV

 Inadequate lighting

 Cramped conditions

 Uncomfortable temperature

 Uncomfortable humidity  Poor workplace layout

 Adverse weather

 Excessive noise

 Uncomfortable protective clothing

 Too many people in the area

 High radiation associated with task

 Exposed industrial hazards

 Respiratory protection equipment required  Special industrial safety equipment required

EWVD

D Conditions of either task or work environment requiring the individual to stop and restart a task sequence Diverting one’s attention to and from the task at hand.

Common Error Drivers (Pre-existing/Setup Conditions) − Page 4 Work Environment (EW**)

Errors driven by Other environmental conditions including: • Changes / Departure from routine • Personality conflict

Description OTHER WORK ENVIRONMENT CONDITIONS

EWOF

EWOC

F The task performer(s) was/were hindered by work schedule changes, Fatigue or similar working conditions. Examples include: • Departure from a well-established routine. • Unfamiliar or unforeseen task or jobsite conditions that potentially disturb individual’s understanding of task or equipment status. C The task performer(s) was/were affected by personality Conflicts. Examples include: • Incompatibility between two or more individuals working together on a task causing a distraction from task due to preoccupation with personal difference with another individual.

Task Demands (ET**) Description TASK INSTRUCTIONS / GUIDANCE OR COMMUNICATIONS REQUIREMENTS Errors driven by the

Task instructions including: • Inappropriate order • Interpretation requirements • Unclear goals, roles, or responsibilities • Lack of or unclear standards

ETTG

G The task instructions/Guidance were miss-ordered or described with insufficient clarity for easy use by the actual task performer(s). Examples include: • Guidance, primarily verbal, provided out of its appropriate order. • Situations requiring “in-field” diagnosis leading to misunderstanding or application of wrong rule or procedure. • Unclear work objectives or expectations. • Uncertainty about the duties an individual is responsible for in a task which involve other individuals. • Duties which are incompatible with other individuals. • Ambiguity or misunderstanding about acceptable behaviors or results; If unspecified, standards default to those of the front-line worker (good or bad).

WORKLOAD, SCHEDULE, TASK DESIGN (STRESS RELATED) Errors driven by a work

Schedule including:

• Perceived time pressure (in a hurry) • High workload (memory requirements) • Complexity / High information flow • Simultaneous, multiple tasks • Irrecoverable / Irreversible actions

ETSW

ETSS

W The number and type of tasks (Workload) assigned were inappropriate for the individual(s) assigned to do them. Examples include: • The amount or complexity of information presented exceeds the individual’s ability to use resulting in mental overload. • Performance of two or more activities, either mentally or physically possibly resulting in: divided attention, mental Overload, or reduced vigilance on one or the other. • Mental demands on individual to maintain high levels of concentration; e.g., scanning, interpreting, deciding, while requiring recall of excessive amounts of information (either from training or earlier in the task). S Insufficient time was Scheduled / allotted for the number and type of tasks and the individual(s) assigned to do them. Examples include: • Urgency or excessive pace required to perform action or task, usually in less time than possible. No spare time • Insufficient time allotted for information exchange at the job-site to help individual reach and maintain an acceptable level of alertness.

Common Error Drivers (Pre-existing/Setup Conditions) − Page 5 Task Demands (ET**)

Other task demands or stresses

Description OTHER TASK DEMANDS

ETO

Excessive or otherwise inappropriate (for the actual task performers) task demands or stresses not covered by anOther of the previous criteria. Examples include: • Inadequate level of mental activity due to performance of repeated actions; boring. • Action that, once taken, cannot be recovered without some significant delay despite best efforts. No obvious means of reversing an action.

3 .1 .3 Cause Road Map − Map 4 (Flawed Defenses)

Figure 3-5. Map 4 Icons The following Flawed Defenses (Figure 3-6) map classifies the various programs and process used to manage human activities including: 1. Various Programs and program to program interfaces. 2. Organizational structures, supervision and organizational interfaces. 3. Human to Machine interfaces (including data/information provided). 4. Use of Corrective Action programs.

Use of a Corrective Action program is separate because the corrective actions developed from the program frequently require changes to be made to other programs.

Figure 3-6. Flawed Defenses (Map 4) with Examples

Table 3-4. Common Flawed Organizational or Programmatic Defenses Common Flawed Organizational or Programmatic Defenses – Page 1 Procedures, Training Program or Program to Program Interfaces (DP**)

Description I

DPPI

Procedures (or any written task completion guidance)

W A procedure / instruction / work document was difficult to read, understand or follow because of the Wording, grammar or structure used. Examples include: DPPW • Procedure contains human factors (Presentations) deficiencies. • Unclear / complex Wording or grammar. • Inadequate Sequence, format, branching or user aids. DPID

Program-to-Program

Interface Breakdowns

A procedure/Instruction / work document was Inaccurate, unclear, too generic or otherwise did not provide the needed information. Examples include: • Technical inaccuracies or omission of relevant information. • Not designed for less practiced users or content inadequate for intended users.

DPIC DPIM

DPTF

DPTC

Training Breakdowns

DPTM

DPTS DPTN

D The interface between programs or procedures was inadequately described or Defined. C Changes in one program or procedure were inadequately Coordinated with interfacing programs or procedures. M A procedure, document or reference was not developed or not properly Maintained. F A training program was Flawed by inadequate course material, inappropriate presentation methods or poor coordination. Examples include: • Inadequate relation of training task to overall Plant operations. • Improperly coordinated with change Implementation. • Presentation of course materials inadequate. C The Content of training was inappropriate for its intended purpose or for the students. . Examples include: • Training Content did not adequately address procedures / references / expectations, etc. • Absence of training objectives. M The Training Method used was inappropriate for its intended purpose or for the students. Examples include: • Training method did not adequately address Refresher training, hands-on experience, etc. • Inadequate assessment of task proficiency. S The fidelity of a Simulator or simulation was inaccurate or flawed. N Needed training was not given.

Common Flawed Organizational or Programmatic Defenses – Page 2 Procedures, Training Program or Program to Program Interfaces (DP**)

Description

Other Program or

O There was anOther procedure, program, process interface or training course failure or breakdown. Examples include: • Programmatic or procedural deficiencies not addressed by another of the previous criteria. • Poorly written or non-existent documented guidance not addressed by another of the previous criteria. • Unclear or non-existent interfaces between guidance documents not addressed by another of the previous criteria.

Procedure Breakdowns

DPO

Organizational Supervision, Structure or Interfaces (DO***)

Command and Control

Other Organizational Breakdowns

Description

O Supervisory Oversight was missing or failed to provide the necessary command and control. Examples include: • No or insufficient Supervision. DOCO • Assignment of too many Administrative duties to immediate supervisor. • Inadequate supervision or command and Control. Active supervisory errors by a supervisor are properly classified as “Work Organization or Oversight” errors. S Worker Selection or task assignment was inappropriate. Examples include: • Unqualified worker selected. DOCS • Improper / inadequate task assignment, task schedule emphasis or poor task allocation. P The work status or Progress was inadequately monitored or controlled. Examples include: • Scheduling and Planning (including inputs to the schedule) less than adequate. DOCP • Progress / status (of work) not adequately tracked. • Inadequate control or specification of Maintenance activities. • Inadequate control or specification of NDE / Quality Inspection activities. Supervisory mistakes, oversight errors, organizational structure weakness or Other DOO organization-to-organization interface breakdowns not covered by another of the previous criteria.

Common Flawed Organizational or Programmatic Defenses – Page 3 Organizational Supervision, Structure or Interfaces (DO***)

Description

D The interfaces between organizations and programs were inadequately Described or defined. Examples include: • Interface procedure was needed but has not been developed. C Organizational Changes were inadequately coordinated with interfacing organizations or programs. • Commitment to program implementation less than adequate. DOPC • Procedure changes not made apparent to user. • Approved proposal / document with inadequate review. DOPD

Organization to Program Interface Breakdowns

Organization to Organization

Interface Breakdowns Use of OverTime (Fatigue Rule)

DOIC

C The coordination or interface between organizations was hampered by Conflicts or organizational structures.

DOTO O There was frequent use of Overtime (cumulative fatigue).

Human to Machine or Task or Information Interfaces (Human Factors) (DH***)

Description

Human to Machine or to Task Interface Breakdowns

C A display, label or indication provided Confusing, inaccurate or insufficient information to the operator. Examples not limited to the following: • Non-task information distracted from use of task information. DHMC • Controls/ displays/ labels not maintained/ functional or accurate. • “Scene Clutter” or other distracting information displays. H A control provided an unexpected or Hidden response when operated. Examples not limited to the following: DHMH • The specified tool or control sequence of use was inappropriate for the task. • Unanticipated interaction of systems or component. D A tool, control, display, label, indication or system component was designed or arranged so that it was counterintuitive or needlessly Difficult to use or operate. Examples not limited to the following: • The placement of tools or controls for the task was beyond ergonomic specifications or standards. • The placement of tools or controls was inappropriate for the frequency they are DHMD needed to perform the task. • Inadequate design or omission of required controls/displays/labels or Alarms/Annunciators less than adequate • System/Component configuration problem (as built/ documentation), walk down, functional design deficiency.

Information / Data / Software Problems

DHIF DHIC

Other Human Factors

DHO

Related Errors

F A computer or software application Failure led to a human performance error. C Computerized data Corruption or inaccuracies led to a human performance error. Any man-machine interface, man-task interface or information transmitting equipment deficiency problem that results in a human performance error that is not covered by anOther of the previous criteria.

Common Flawed Organizational or Programmatic Defenses – Page 4 Corrective Action Program Use (DC***) DCAC

DCAS Breakdowns in the use of a Corrective A ction Program DCAA

DCAU

Other Learning Culture

DCOU

Failures DCOO

Description C Problem Causes were not determined. This includes causes of previous industry or in-house problems that were not determined or used. S Cause investigations were too Shallow or too narrowly focused to identify more fundamental causal factors. Examples include: • Repetitive program failures without correction. • Cause analysis that was inconclusive or not rigorous enough. A Implemented corrective Actions failed to correct the problem or prevent recurrences. Examples include: • This includes actions that were not SMART. (See Chapter 9) • Corrective actions for previously identified problem or the identified causes of a previous event were not adequate to prevent Recurrence. • This includes when NO action was previously planned. U Implemented corrective actions were Untimely or not implemented as planned. There was another corrective action program or learning culture failure. Examples include: U Failures to Use a corrective action program or action tracking system to implement corrective actions from a self-assessment finding / observation belong in this category. Note that failures to use a Self-Assessment program to identify problems and improvement opportunities are properly classified under the “Self-Assessment Program use” (ASA) category. O Any Other Corrective Action Program, procedure or software or action tracking system use failures not covered by another of the previous criteria.

3 .1 .4 Ca use Roa d M a p – M a p 5 (Fa ile d Ove rsight /Fla w e d Asse ssm e nt s)

Figure 3-7. Map 5 Icons The following Failed Oversight/Flawed Assessments (Figure 3.8) map classifies the various quality inspection, verification, assessment and learning programs including: 1. Inspection and testing activities. 2. Quality verifications or operational experience reviews. 3. Self-assessment activities.

Figure 3-8. Failed Oversight/Flawed Assessment Capabilities (Map 5) with Examples

Table 3-5. Common Assessment Capability Team Related Errors Common Assessment Capability Team Related Errors Test Performance/ Quality Inspection (AIT)

Description AITT

Test / Inspection planning and performance

AITI AITO

Quality Verification, Inspection or OE Use (AQV) AQVV Quality Verification Methods and learn from experience

AWVI AQVO AQVU

Self-Assessment, Cause Analysis, Observation Use (ASA)

Self-A ssessment planning, use and rigor

T A post-maintenance or modification Test was inadequately planned or improperly run. I A Quality Inspection, NDE Exam, or other quality verification test was inadequately planned of improperly done. O Other Inspection or Testing related errors.

Description V A Quality Audit, Surveillance or Verification activity was inadequately planned, ineffectively performed or insufficiently critical. I A Quality Control Inspection was inadequately planned, ineffectively performed or insufficiently critical. O Operating Experience information was either not reviewed or not effectively used. U Other quality verification, audit, inspection or operating experience program Use errors.

Description ASAS

S A Cause Analysis or Self-Assessment effort was inadequately planned, ineffectively performed or was insufficiently critical.

ASAC

C A Cause Analysis or Self-Assessment Program inadequately implemented.

ASAO

O Job Observations, Self-Assessments, Lessons Learned or other effectiveness assessment efforts were ineffectively performed, not performed or insufficiently critical.

ADAL

L Other cause analysis, self-assessment or Learning from the past errors.

3 .1 .5 Ca use Roa d M a p – M a p 6 (La t e nt Errors/Fla w e d De c isions)

Figure 3-9. Map 6 Icons

The Cause Road Map 6 was derived from the following generally accepted principal business management functions: • Directing. • Controlling. • Planning. • Organizing. • Staffing. • Leading.

The first five of these management functions were first identified by Henri Fayol – often described as the “father” of modern management – in several of his books and papers. The “leading” function was added by several newer theorists. Many current business management books explain these functions in detail. This Latent Management Practice Weaknesses – Latent Errors/Flawed Decisions (Figure 3-10) map is a distillation of the above general business management functions and a refinement of a tool I developed to create checklists for management effectiveness audits. This map categorizes the above basic management functions as follows: 1. Guidance provided by upper management / company officers • Equates to the classic “directing” function. 2. Upper management/company officers Command and control activities • Equates to the classic “controlling” function. 3. Managements Planning activities • Equates to the classic “planning” function. 4. Assignment of Organizational responsibilities and structures • Equates to the classic “organizing” and “staffing” functions. 5. Management’s communications. • Equates to the “leading” functions.

Figure 3-10. Latent Management Practice Weaknesses – Latent Errors/Flawed Decisions (Map 6) with Examples

Table 3-6. Common Latent Organizational Weaknesses Common Latent Organizational Weaknesses – Page 1 Management’s Guidance (LG**) LGS

LGM

Management Guidance/Directing Decisions

LGP LGO LGC

LGD LGT LGF

Description Management failed to clearly define or fully enforce performance Standards, policies or other expectations. Examples include: • Policy guidance, management expectations, job performance standards were not well defined or enforced. Management failed to timely implement needed changes or failed to recognize the need for change (Change Management). Examples include: • Accuracy / effectiveness of Program or Design change not verified or not validated. • Risks, consequences or interactions associated with Program or Design change not adequately reviewed or assessed. • Untimely implementation • Program or Design change-related documents, training equipment not developed or not adequate. Management failed to establish meaningful Priorities for needed program or process changes. Other management guidance or change management efforts were ineffective. Management’s actions did inhibit Coordination between organizations (i.e. Silo effect). Management did Delegate duties or responsibilities inappropriately. Examples include: • Talents / innovations of subordinates not used. • Inappropriate delegation of authority. Management’s Team building efforts were ineffective or counterproductive. Management’s morale / conFidence building efforts were ineffective or counterproductive. Examples include: • Punitive response to unintentional actions or other actions with significant adverse effects on staff morale.

Common Latent Organizational Weaknesses – Page 2 Management’s Command and Control (LC**)

LCT

Description Management failed to Take corrective action or to enforce a program. Examples include: • Failure of Management to take action when clearly warranted. •

Policy not adequately enforced.

Other management control or review errors were made. Examples include: • Delayed implementation of corrective actions. LCO

•

A poorly designed or inconsistently enforced Corrective Action Program.

Management failed to Monitor the progress, status or effectiveness of a program or process. Examples include: • Progress / status / effectiveness (of programs or processes) not adequately tracked.

Management Command and Control Decisions LCM

LCS

LCE

Management’s Planning (LP**)

•

Management follow-up or monitoring of activities did not identify problems.

•

Program improvement management / monitoring less than adequate.

Management failed to Solve a problem. Examples not limited to the following: • Any significant impairment of Management's problem solving ability. Management failed to acknowledge identified performance Errors. Examples include: • Any Management failure to acknowledge identified performance errors.

Description Other management planning failures were made. Examples include: • Needed changes are not approved or not funded.

LPO

•

Management complacency.

Budgets, policies or goals were established without adequate Analysis. Examples include: • Did not have sufficient information to support policy decision.

Management Planning Decisions LPA

• •

LPPF

Budgets, policies or goals established without adequate analysis (e.g. cost benefit). Cost/benefit justified investigation inconclusive.

F Established policies were inadequate or inappropriately Focused. Examples include:

•

LPPD

LPB

LPG

LPI

Established goals are not meaningful or are not focused on performance improvement.

D Established policies, expectations or job performance standards were poorly Defined or misunderstood. Examples include: • Directions or expectations in a policy or standard are counterproductive. No or insufficient resources (Budget, personnel etc.) were actually provided. Examples include: • Insufficient resources applied (personnel, money, etc.) •

Inadequate baseline staffing.

No or insufficient Goals were actually established for an important program, process or attribute. Examples include: • Goals & objectives did not cover all known problems. •

Did not include all inputs in goal/objectives.

Innovative solutions were discouraged or ignored.

Common Latent Organizational Weaknesses – Page 3 Management’s Organizational Structure or Responsibility Assignment (LO**)

LOR

LOD

Management Organizing Decisions

LOQ

LOS

LOO

Management’s Communications (LM**) LMU LMP LMX LMW Management Communications Errors

Description Personnel roles or Responsibilities were poorly defined or poorly enforced. Examples include: • Responsibility of personnel was not well defined / understood or personnel were not held accountable. Position Descriptions were poorly defined or inappropriate to the job functions. Examples include: • Position descriptions do not address required job functions. Position Qualifications were poorly defined or inappropriate to the job functions. Examples include: • Established position qualifications do not address required skills, training or experience. An organization’s Structure was inadequate. Examples include: • Organizational structure inadequate to support required functions. Other organizational structure or responsibility assignment actions were less than effective. Examples include: • Accountability for program implementation lacking.

Description Communication with Upper management was ineffective, inaccurate, incomplete, or did not occur. Communications with management Peers was ineffective, inaccurate, incomplete, or did not occur. Communications with regulators or eXternal groups was ineffective, inaccurate, incomplete, or did not occur. Communications with Workers or subordinates was ineffective, inaccurate, incomplete, or did not occur. Other management communications were less than effective. • Did not ensure sufficient interdepartmental communications. •

LMO

•

Communications that demonstrated an insufficient awareness of the impact of actions/ decisions on nuclear safety or reliability. Any consequentially inadequate or incomplete communications with or between members of Management resulting organizational isolation (i.e. "Silo" effect).

Sum m a ry This chapter provided details for the following individual human performance related maps.  Map 2 Individual or Team Related (Symptoms/Active Errors).  Map 3 Error Drivers (Pre-Existing/Setup Conditions).  Map 4 Flawed Organizational or Programmatic Defenses.  Map 5 Failed Oversight/Flawed Assessment Capabilities.  Map 6 Latent Organizational Weaknesses.

Look ing Forw a rd Chapter 4 will provide details for the following individual maps:  Map A – Machinery/Material/Hardware Related details.  Map B – Machinery/Material/Hardware Related continued details.

Re fe re nc e s Institute for Nuclear Power Operations (INPO). (2006). Human performance reference manual. Atlanta, GA: Author. (Note: The TWIN model was contained in unpublished INPO documents dating from 1997.)

Cha pt e r 4 M a c hine /M a t e ria l/H a rdw a re Fa ilure M a ps

Chapter 3 introduced five human performance-related detailed maps and some examples for each. This chapter introduces two Machine/Material/Hardware Failure related maps, giving some examples for each. 4 .1 M a c hine /M a t e ria l/Ha rdw a re Fa ilure Re la t e d M a ps The Cause Road Map also includes a map to help diagnose equipment problems. The hardware failure classifications displayed on the Machine/Material/Hardware Failure Related Maps A and B are based on Machinery Failure Analysis and Troubleshooting (Bloch & Geitner, 1994) and are arranged from most frequent to least frequent failure modes. The equipment maps in this chapter will be used in the same way as the human performance maps in Chapter 3. Figure 4-1 is a section extracted from Cause Road Map 1 (Figure 2-3 from Chapter 2). The “continuation” arrows that connect to the “human performance related” icon display the fact that most hardware failure modes have human performance-related implications which may need to be investigated

Figure 4-1. Map A and B connections from Figure 2-3 in Chapter 2

4 .1 .1 M a c hine /M a t e ria l/Ha rdw a re Fa ilure Re la t e d M a p A

Figure 4-2. Map A Icons (LTA=Less Than Adequate) Machine/Material/Hardware Failure Related Map A (Figure 4-3) recognizes the first four most frequent and basic failure mode categories: 1. Non-repeatable Expected, External or Other unexpected or unpredictable conditions including weather related, abnormal aging, inappropriate storage. 2. Inappropriate or inadequate Maintenance related failures. 3. Inappropriate or inadequate A ssembly related failures including mis-adjustments, misalignments and rigging related errors. 4. Inappropriate or inadequate Operation or equipment monitoring related failures.

Figure 4-3. Failures Roadmap – Machine/Material/Hardware Failure Related (Map A) with Examples

Table 4-1. Common Expected, External or Other Failures Machine/Material/Hardware Failure Related (Map A) Page 1 Code

MEN MEC

Description Common Expected, External or Other Failures

•

Non-repeatable System/Component Anomaly

•

Degraded sub-component contributed to failure

External or weather conditions caused failure

MEW

MEA MEH MES MEU

•

Weather

•

Hurricane

•

Tornado

•

Earthquake

•

Flooding

•

Severe straight straight-line winds

•

Environmental conditions

•

Animal interference

•

Direct lighting strike

•

Grid power disturbance

•

Grid lightning strike

•

Collision

•

Accelerated or abnormal aging or unexpected embrittlement caused failure

•

Inadequate foreign material exclusion or poor housekeeping caused failure

•

Less than adequate warehouse storage caused failure

•

Unusual plant conditions or configuration caused failure

Other expected or external failures

MEO

•

Planned train outage

•

Leakage

•

Cut/abraded

•

Jetting from tube rupture

•

Cracked

•

Interference

•

Out of adjustment / /alignment

•

Spurious, non-repeatable operation of this device

•

Breaker tripped free unexpectedly

•

Run-to-failure device, cause not investigated

•

Scratched

•

Design prevents identifying initiating device

•

Torn

•

Device condition prevented root cause investigation

•

Loose

•

Spurious operation of device

Machine/Material/Hardware Failure Related (Map A) Page 1 Code

Description •

MMD MMU MMC MMV MMP MMF MMW

Common Expected, External or Other Failures Liquid accumulation

Common Maintenance Related Failures

•

Controls/displays/labels/guards/physical barriers not maintained or non- functional

•

Uncorrected equipment problems (e.g. problems noted during maintenance that were not fixed)

•

Less than adequate or improper corrective maintenance caused failure

•

Less than adequate preventative maintenance caused failure

•

No preventive maintenance performed

•

Maintenance did not fix the problem

•

Work in proximity contributed to or caused failure

Machine/Material/Hardware Failure Related (Map A) Page 2 Common Assembly or Installation Related Failures Code

Description Improper assembly or reassembly of component caused failure

MAA

MAM MAS MAR

MOM

•

Over torqued

•

Excessive force applied

•

Under torqued

•

Cross threading

•

Wrong size

•

Over tightened

•

Thread form, fit, or pitch mismatch

•

Cold/Defective solder joint

•

Improper surface preparation

•

Insufficient, Incorrect or Excessive Sealants Applied

•

Maladjustment/misalignment

•

Wrong sequence of fabrication

•

Inadequate rigging practices

•

Common Equipment Operation Related Failures Operating parameters not properly monitored or evaluated/erratic performance not noted/maintenance not requested

Not operated within design parameters or per procedure. Refer to the following examples:

MOD

•

Over pressurized

•

Over heated

•

Under pressurized

•

Over cooled

•

Over filled

•

Unevenly heated

•

Under filled

•

Unevenly cooled

•

Not turned while out of service

•

Over speed

•

Extended ops in nearly closed position •

•

Loss of control

•

Over extended Externally damaging conditions not corrected

4 .1 .2 M a c hine /M a t e ria l/Ha rdw a re Fa ilure Re la t e d M a p B

Figure 4-4. Map B Icons.

Machine/Material/Hardware Failure Related Map B (Figure 4-5) recognizes the last four basic failure modes: 1. Failures related to the Service conditions under which the equipment was operated such as improper lubrication, inappropriate operating temperatures or pressures or flow or electrical service. 2. Failures related to inadequate equipment, components or system D esign or a flawed design change process. 3. Failures related to inadequate equipment, components or system Manufacturing. This failure mode simply recognizes the fact that failures by a separate manufacturing company typically cannot be investigated further. 4. Failures related to inadequate Material supplied by a separate company. Again, this failure mode simply recognizes the fact that inadequate materials supplied by a separate company typically cannot be investigated further.

Figure 4-5. Failures Roadmap – Machine /Material /Hardware Failure Related (Map B) with Examples

Table 4-2. Machine/Material/Hardware Failure Related (Map B) Failures Machine/Material/Hardware Failure Related (Map B) Page 1 Code

Description Common Service Conditions Related Failures

MSC

MSL

MSP

MST

MSF

MSV

MSS

Corrosive or adverse environment caused failure • Contact corrosion/oxidation • Transgranular corrosion • Intergranular stress corrosion cracking • Galvanic corrosion • Thermal gradient stress corrosion cracking • Erosion • Stress corrosion cracking • Crevice corrosion cracking • Scaling • Corrosion product buildup in gap • Primary water stress corrosion cracking • Chemical attack • Pitting corrosion • General corrosion • Microbiologically induced corrosion • Oxidation Inadequate/loss of lubrication • Sticking/binding • Seizure • Rubbing • Galling • Slipping • Excessive Wear • Friction Excessive/inadequate pressure, differential pressure or pressure transient • Pressure binding • Rupture • Insufficient Motive Force due to Increased Pressure Excessive/inadequate temperature, differential temperature or temperature transient • Rotor bowing • Thermal stress Excessive/inadequate flow – Flow cut • Cavitation • Water hammer • Steam/flow cut • Steam impingement • Flow impact • Vapor bound • Insufficient Motive Force due to Increased Flow Excessive vibration • AVB vibration • Vibration • Vibration-flow induced Excessive force/applied stress • Bent • Fatigue • Structural deformation • Impact • High stress • Loading • Biofouling • Dirt/debris • Particulate contamination • Clogged/blocked • Particulate buildup in throat

Machine/Material/Hardware Failure Related (Map B) Page 2 Code

Description Common Service Conditions Related Failures (Continued) Electric service failures

MSE

• • • • • • • • • •

Under rated Insulation breakdown Crowbarred power supply Fused connection Degraded voltage Excessive discharge Excessive temperature sensitivity Fluctuating loads Defective circuit Battery will not charge

• • • • • • • • • •

Calibration/Set point drift Short/ground Invalid input Noise Open Overload Radio frequency interference Static discharge Intermittent loss of power Self-discharge

Common Design Related Failures Analysis/calculation deficiency

MDA

• •

Design analysis had inadequate failure evaluation • Design analysis had inadequate safety review •

Design change inadequate

MDC

• • • • •

Under rated Improper component selection Inadequate specifications Inadequate supports installed Inadequate comparability

Inadequate design (inputs, considerations)

MDD

• • • •

General design inadequacy Plant/system availability not considered in design Design basis not adequately considered Design Conditions Did Not Bound Anticipated/Actual Service Conditions

• • • • • •

Design analysis had inadequate independent review Improper construction code or standard applied Inadequate maintainability Does not address original problem resolution Not designed for required conditions Design convention not followed

•

Omission of required controls, displays or labels Unanticipated interaction of system or components Equipment reliability not adequately addressed

Inadequate review of aggregate effect of all • changes Inadequate/improper sequence of multiple changes • Inadequate post modification testing specified Testing Did Not Bound Anticipated/Actual Service • Conditions •

Change improperly coordinated with implementation Design changes not implemented in a timely fashion Unauthorized or un-reviewed modification Not All Required/Critical Functions Were Tested

Inadequate design implementation or testing •

MDI

MDM MDR MDW

• • • • • •

Misapplication or interpretation of design codes, standards, design basis, etc. Inadequate independent review, safety review, or failure modes and effects evaluation Inadequate field walk down input to design change for operability, maintainability, constructability, and testability

Machine/Material/Hardware Failure Related (Map B) Page 3 Code MUD MUH MUM

MLD MLE MLM MLS

Description •

Common Manufacturing Related Failures

•

Inadequate original design

•

Manufacturing, installation/construction errors or deficiencies in original materials

•

•

Improper heat treatment Fabrication deficiencies

Common Material Related Failures

•

Manufacturing, installation/construction errors or deficiencies in original materials

•

Improper storage environment

•

Defective material

•

Wrong material/improper material selection Damaged during shipping

Sum m a ry

 This chapter provided details for the two Machine/Material/Hardware Failure Related maps.

Look ing Forw a rd

 Next, Chapter 5 will provide instructions and examples for using the Cause Road Map and the individual supporting maps.

Re fe re nc e s Bloch, H.P. & Geitner, F.K. (1994). Machinery failure analysis and troubleshooting. Houston, TX: Gulf Publishing Company.

Cha pt e r 5 U sing t he Ca use Roa d M a p Chapter 4 introduced two Machine/Material/Hardware Failure related maps and some examples for each. This chapter shows you how to use the Cause Road Map and its supporting detailed maps using a real-world example. 5 .1 U sing t he Ca use Roa d M a p To use the Cause Road Map, start with an observable error and follow the guidelines below: • Drive only one observation through at a time. An observation is a discrete event or condition, which you will need to define as specifically as possible. Typically, an observation is an event or condition that differs from what should have happened or was expected. (Later, in Chapter 8, you will see how you can use the Comparative Event Line tool to identify these disparities.) •

• •

Decide if the selected observation is a machine, a material, or a hardware failure, or if it is related to human performance. Start down the appropriate route. This is a simple binary choice. Either something is broken or some human performance error (from the shop floor to in the corner office) occurred. If the observation is related to an observed or observable (active) error, then start at the far left of the map with the Human Errors section. Proceed by collecting data or using available data to answer one or more of the listed Stop Sign questions or questions from the supporting tables (Tables 3-1, 3-3 through 3-6, 4-1 & 42).

•

Stop when you have no more information available or when seeking answers to deeper questions will not change the actions needed to correct the original problem or prevent the occurrence/recurrence of similar, high consequence events. The Cause Road Map has STOP signs at the points where these conditions are typically met.

As we begin using the Cause Road Map, read this hypothetical example carefully, since these details will be used in the analysis steps covered in this chapter

Dump Truck Accident Example Ajax Construction Company was awarded a contract to build a condominium on a hill overlooking the city. Prior to initiation of the project, the company developed a comprehensive safety program covering all aspects of the project. Construction activities began on Monday, October 4, 1978, and proceeded without incident through Friday, October 8, 1978, at which time the project was shut down for the weekend. At that time, several company vehicles, including a 2 1/2-ton dump truck, were parked at the construction site. On Saturday, October 9, 1978, a nine-year-old boy, who lives four blocks from the construction site, climbed the hill and began exploring the project site. Upon finding the large dump truck unlocked, the boy climbed into the cab and began playing with the vehicle controls. He apparently released the emergency brake, and the truck began to roll down the hill. The truck rapidly picked up speed. The boy was afraid to jump out and did not know how to apply the brakes. The truck crashed into a parked car at the bottom of the hill. The truck remained upright, but the boy suffered serious cuts and a broken leg. The resultant investigation revealed that, although the safety program specified that unattended vehicles would be locked and the wheels chocked, there was no verification that these rules had been communicated to the drivers. An inspection of the emergency brake mechanism revealed that it was worn and easily disengaged. It could be disengaged with only a slight bump. There had been no truck safety inspections or requests to adjust the emergency brake linkage.

Several observations can be made regarding this event. For example: 1. The truck driver did not chock the wheels of the truck. 2. The emergency brake is worn and easily disengaged.

5 .1 .1 St e p 1 We will first evaluate the observation that “The driver did not chock the dump truck’s wheels” using the Cause Road Map.

Figure 5-1. Dump Truck Accident Example Step 1 (LTA=Less Than Adequate)

5 .1 .2 St e p 2 – H um a n Error Consider the question: “Was coordination or interaction with or between task performers and supervision inappropriate or inadequate for the task?” from Cause Road Map 2. Using details from the Coordination or Coaching (Code HTCC) entry in Table 3-1 results in the conclusion: “Neither the driver nor anyone else was tasked with locking and chocking vehicles.”

Figure 5-2. Dump Truck Accident Example Step 2

5 .1 .3 St e p 3 – Error Drive rs Consider the question: “Was the task performer’s knowledge or experience level inappropriate for the task?” from Cause Road Map 3. Using details from the Lack of knowledge (inadequate training) (Code EITT) entry in Table 3-3 results in the conclusion: “The driver had not attended a Driver Safety Training course.”

Figure 5-3. Dump Truck Accident Example Step 3

5 .1 .4 St e p 4 – Fla w e d Orga niza t iona l or Progra m m a t ic De fe nse s Consider the question: “Why was needed training not given?” from Cause Road Map 4. Sing details from the Training Breakdowns (Code DPTN) entry in Table 3-4 results in the conclusion: “No Driver Safety Training courses were scheduled or given.”

Figure 5-4. Dump Truck Accident Example Step 4

5 .1 .5 St e p 5 – Fla w e d Asse ssm e nt Ca pa bilit y Consider the question: “Was use of Self-Assessments to improve performance LTA?” from Cause Road Map 5. When this question is answered with details from the Self-Assessment planning, use, and rigor (Code ASAS) entry from Table 3-5, the conclusion is: “The company did not monitor safety program infractions or training attendance. This is a significant loss of performance monitoring capability.”

Figure 5-5. Dump Truck Accident Example Step 5

5 .1 .6 St e p 6 – La t e nt M a na ge m e nt Pra c t ic e We a k ne sse s Consider the question: “Was Management’s Planning LTA?” from Cause Road Map 6. When this question is answered with details from the Management Planning Decisions entry (Code LPGG) from Table 3-6, the conclusion is: “Management did not make safety a priority.”

Figure 5-7. Dump Truck Accident Example Step 6

5 .1 .7 St e p 7 – Re pe a t for a not he r obse rva t ion The conclusion that the company’s management team had not made safety a priority is supported by a documented trail of facts. By using a systematic approach to find the basic cause(s) for the event, appropriate actions can be taken to prevent a future similar event. Figure 5-8 examines the observation that “The emergency brake was worn and easily disengaged.” See Table 5-8 for a human performance evaluation for why no maintenance or inspections had been performed on the emergency brake mechanism.

Figure 5-8. Dump Truck Emergency Brake Example

5 .1 .8 St e p 8 – Sa fe t y Cult ure The conclusion that “Management did not make safety a priority” is as far as most other popular root cause methods will get you. If fact, because a member of the company’s management team is typically the sponsor for the cause investigation, asking this kind of question is often discouraged. Figure 5-9 uses the Cause Road Map to ask why the management team did not make safety a priority.

Figure 5-9. Dump Truck Accident Example Step 7

Glossary Safety Culture: The Advisory Committee on the Safety of Nuclear Installations (1993) defined this term as: “The safety culture of an organization is the product of individual and group perceptions, competencies, and patterns of behavior that determine the commitment to, and the style and proficiency of, an organization’s health and safety management. Organizations with a positive safety culture are characterized by communications founded on mutual trust, by shared perceptions of the importance of safety and by confidence in the efficacy of preventive measures.”

Figure 5-9 showed how the management team’s complacency (refer to Table 3-3 in Chapter 3) drove their failure to make industrial safety a priority. The key point here is knowing that organizational complacency was a key driver for this event. This leads directly to corrective actions to prevent complacency in the future and thus prevents many other types of future events. More than a dozen organizations around the world have developed safety culture assessment tools. Traits of a Healthy Nuclear Safety Culture by the Institute for Nuclear Power Operations (2012). highlights organizational complacency (Question Attitude attribute QA.4) as a key attribute of a good safety culture. In fact, many of the common error drivers describe in Table 33 correlate directly with one or more good safety culture attributes contained in this document, which I recommend highly as the most comprehensive discourse on this topic I know of. 5 .2 Displa ying t he Re sult s 5 .2 .1 Hum a n Pe rform a nc e Eva lua t ion Brie fing Re port To display the results of the Cause Road Map analysis effectively (shown in the previous diagrams), a more useful presentation form is shown in the following report form. This report form provides a way of detailing the facts and evidence used to support each “stop” (i.e. Causal Factor) along a “route” taken through the Cause Road Map. Use a separate presentation form for each observation (route through the Cause Road Map). Once all observations have been documented in this manner, you will find it a relatively simple matter to translate these results into a report. See the example in Table 5-1.

Table 5-1. Human Performance Evaluation Briefing Report for Wheels Not Chocked Observation Error # Human Errors

1

Error Description The wheels on the dump truck were not chocked.

Neither the driver nor anyone else was tasked with locking the doors or chocking the wheels. This was a team performance error.

Human Error Drivers

Flawed Organizational or Programmatic Defenses

The driver had not attended a safety training course. This resulted in the driver not knowing the safety rules.

No driver safety training courses were scheduled or given. This is a training program failure.

Flawed Assessment Capabilities

Latent Management Practice Weaknesses

The company did not monitor safety program infractions or training attendance. This is a significant loss of performance monitoring capability.

Management did not make safety a priority. This is a latent decision that is the most fundamental cause for this observation and is supported by the fact that other observations driven through the Cause Road Map end in this same fundamental cause. (Instructions for completing this Briefing Report will be provided later.)

When you present the results of the investigation in this manner, management can see easily how the investigator reached the underlying or root cause(s). In other words, the chain of logic is clear. 5 .2 .2 Ha rdw a re /M a t e ria l/De sign Fa ilure Eva lua t ion Brie fing Re port The example Hardware/Material/Design Failure Evaluation Report shown in Table 5-2 displays the factors leading to this lack of maintenance.

Table 5-2 Hardware/Material/Design Failure Evaluation Report for Worn Brakes Observation Human Failure Observation # 1 Defect/Failure Description

The truck's emergency brake was badly worn and easily disengaged Expected or Other

Performance Error Number

This fault was not caused by expected service conditions.

Faulty Maintenance

The emergency brake linkage had not been inspected or adjusted.

Faulty Assembly/Installatio

No evidence of a faulty assembly or installation was discovered.

Faulty Operation

Untended Service Conditions

Faulty Design

Faulty Manufacturing

Faulty Material

2

No evidence of a faulty assembly or installation was discovered. No evidence of inappropriate services conditions was discovered. No evidence of a faulty design was discovered.

No evidence of a faulty manufacturing was discovered.

No evidence of faulty materials was discovered. (Instructions for completing this Briefing Report will be provided later.)

5 .2 .3 Hum a n Pe rform a nc e Eva lua t ion drive n from pre vious Fa ilure Eva lua t ion From the Hardware/Material/Design Failure Evaluation Briefing Report shown in Table 5-2, faulty maintenance was flagged as the cause for the worn condition, which leads to another human error condition requiring further evaluation. Table 5-3 displays the factors leading to this lack of maintenance. Table 5-3. Human Performance Evaluation Briefing Report for No Maintenance Observation Error # 2 Error Description No maintenance or inspections had been performed on the emergency brake mechanism. Human Errors No one was tasked with inspecting the truck’s braking system condition. The driver did not request maintenance on the emergency brake linkage. Human Error Drivers

No schedule or plan implemented to periodically inspect the condition of the truck’s safety systems (i.e., emergency brake). Interviewees indicated that since the truck had a current government road safety sticker the truck was safe to operate.

Flawed Organizational or Programmatic Defenses

No equipment safety inspection program had been implemented as mandated by the site safety manual. The government road safety sticker inspections only ensure that the emergency brakes hold the vehicle.

Flawed Assessment Capabilities

The company did not monitor safety program implementation. This is a significant loss of performance monitoring capability.

Latent Management Practice Weaknesses

Management did not make safety a priority. This is a latent decision that is the most fundamental cause for this observation and is supported by the fact that other observations driven through the Cause Road Map end in this same fundamental cause.

5 .3 Ca pt uring t he De t a ils The previous Briefing Reports do not contain all the details needed for trending or investigation reporting. Therefore, I have expanded the Human Performance and Equipment Failure Reports to include additional information and codes to assist with trending. Tables 5-4 through 5-7 show and provide instructions for completing these Expanded Human Performance Evaluation Summary Reports including examples from the Dump Truck Accident.

Table 5-4. Expanded Human Performance Evaluation Summary Report Instructions Key Enter the 3 or 4-digit code assigned to the Cause Road Map question for which evidence existed to answer in the affirmative. Enter the text for the affirmative answer selected as detailed in the DESCRIPTION section of the tables associated with the selected 2 Cause Code Text: detail level Cause Map. Example: CODE = HTCC, Cause Code Text = Coordination or interactions with or between task performers or with supervision was inappropriate or inadequate.

1

3

CODE = Cause (Taxonomy) Code:

Supporting Information;

Enter the evidence that supports the affirmative answer.

Entering an “X” indicated that this causal factor is considered a Direct Cause for the observed condition/event. CC = Contributing Cause Entering an “X” indicated that this causal factor is considered a 5 Flag: Contributing Cause for the observed condition/event. Entering an “X” indicated that this causal factor is considered an 6 PC = Proximate/Apparent Cause: Apparent or Proximate Cause for the observed condition/event. FC = Fundamental/Root Entering an “X” indicated that this causal factor is considered a 7 Cause: Fundamental or Root Cause for the observed condition/event. ACT = Corrective Actions Entering an “X” indicated that corrective actions for this causal 8 Required: factor are warranted Note: The process for developing corrective and preventive actions will be discussed in detail later. 9 HE Link to the previously identified Human Error that this causal factor is related to by entering the related CODE. 10 ED Link to the previously identified Error Driver that this causal factor is related to by entering the related CODE. 11 FD Link to the previously identified Flawed Defenses that this causal factor is related to by entering the related CODE.

4

DC = Direct Cause Flag:

Table 5-4, Continued Human Errors CODE

Cause Code Text

Supporting Information

DC

CC

PC

FC

AC

1

2

3

4

5

6

7

8

CODE

Cause Code Text

DC

CC

PC

FC

AC

Error Drivers 1

Supporting Information

HE

2 3 9 4 5 6 7 Flawed Organizational or Programmatic Defenses

CODE

Cause Code Text

1

2

CODE

Cause Code Text

1

Supporting Information

ED

DC

CC

3 10 4 5 Flawed Assessment Capabilities Supporting Information

FD

DC

CC

8

PC

FC

AC

6

7

8

PC

FC

AC

7

8

2 3 11 4 5 6 Latent Management Practice Weaknesses

CODE

Cause Code Text

Supporting Information

FD

DC

CC

PC

FC

AC

1

2

3

11

4

5

6

7

8

Table 5-5. Expanded Human Performance Evaluation Summary Report for Wheels not Chocked Example Example Observation Id: Report No. 1 Human Performance Evaluation – Summary Observation Description

The Wheels of the truck were not chocked. CF=Safety Culture Flag, DC = Direct Cause; CC = Contributing Cause; PC = Apparent/Proximate Cause; FC = Fundamental/Root Cause; ACT = Corrective Actions Required

Human Errors CODE

Cause Code Text

HTCC

Coordination or interactions with or between task performers or with supervision was inappropriate or inadequate.

Supporting Information Neither the driver nor anyone else was tasked with locking and chocking vehicles. This is a team coordination error.

DC

X

CC

PC

FC

AC

Table 5-5, Continued Error Drivers CODE

EITT

Cause Code Text A task performer’s knowledge or experience level was inappropriate for the task.

Supporting Information

HE

DC

The driver had not attended a Driver Safety Training course. This resulted in the driver not knowing the safety rules.

HTCC

X

CC

PC

FC

AC

FC

AC

Flawed Organizational or Programmatic Defenses CODE

Cause Code Text Needed training was not given.

DPTN

Supporting Information

ED

No Driver Safety Training courses were scheduled or given. This is a Training Program failure.

EITT

DC

CC

PC

X

X

Flawed Assessment Capabilities CODE

Cause Code Text

ASAS

A Self-Assessment or Cause Analysis Program was inadequately or ineffectively implemented.

Supporting Information

FD

The company did not monitor safety program infractions or training attendance. This is a significant loss of performance monitoring capability.

DPTN

DC

CC

PC

FC

X

AC

X

Latent Management Practice Weaknesses CODE

Cause Code Text

LPGG

Established goals were inadequate or inappropriately focused.

Supporting Information

FD

Management did not make safety a priority. This is a latent decision that is the most fundamental cause for this observation and is supported by the fact that other observations driven through the Cause Road Map end in this same fundamental cause.

DPTN

DC

CC

PC

FC

AC

X

X

Table 5-6. Expanded Human Performance Evaluation Summary Report for No Maintenance Example Report No.

Example

2

Observation Id:

Human Performance Evaluation – Summary Observation Description

No maintenance or inspections had been performed on the emergency brake mechanism. CF=Safety Culture Flag, DC = Direct Cause; CC = Contributing Cause; PC = Apparent/Proximate Cause; FC = Fundamental/Root Cause; ACT = Corrective Actions Required

Human Errors CODE HTCC

HIEE

Cause Code Text

D C

Supporting Information

Coordination or interactions with or between task performers or with supervision was inappropriate or inadequate. An individual failed to use or did not properly use an appropriate self-checking/Error detection or verification technique

No one was tasked with inspecting the truck’s braking system condition.

X

The driver did not request maintenance on the emergency brake linkage.

X

C P F C C C

A C

C P F C C C

A C

C P F C C C

A C

X

X

C P F C C C

A C

X

X

Error Drivers CODE

ETTG

ENA

Cause Code Text

Supporting Information

The task instructions were misordered or described with insufficient clarity for easy use by the task performer(s).

No schedule or plan implemented to periodically inspect the condition of the truck’s safety systems (i.e., emergency brake). Interviewees indicated that since the truck had a current government road safety sticker the truck was safe to operate.

The task performer's flawed decisions were motivated by Assumptions, mindsets, tunnel vision, complacency or similar natural tendencies.

H E

D C

H T C C

X

H I E E

X

Flawed Organizational or Programmatic Defenses CODE

Cause Code Text

Supporting Information

DPIM

A procedure, document or reference was not developed or not properly Maintained.

No equipment safety inspection program had been implemented as mandated by the site safety manual. The government road safety sticker inspections only ensure that the emergency brakes hold the vehicle.

DOOO

Supervisory mistakes, oversight errors, organizational structure weakness or Other organization-to-organization interface breakdowns not covered by another of the previous criteria.

E D

D C

E T T G

E N A

X

Flawed Assessment Capabilities CODE

ASAS

Cause Code Text A Self-Assessment or Cause Analysis Program was inadequately or ineffectively implemented.

Supporting Information The company did not monitor safety program implementation. This is a significant loss of performance monitoring capability.

F D D P I M

D C

Table 5-6. Continued Latent Management Practice Weaknesses CODE LPGG

Cause Code Text Established goals were inadequate or inappropriately focused.

Supporting Information

F D

Management did not make safety a priority. This is a latent decision that is the most fundamental cause for this observation and is supported by the fact that other observations driven through the Cause Road Map end in this same fundamental cause.

D P I M

D C

C P F C C C

A C

X X

Table 5-7. Expanded Hardware /Material /Design Failure Evaluation Summary Report Instructions Key CODE = Cause Enter the 3 or 4 digit code assigned to the Cause Road Map question 1 (Taxonomy) Code:

for which evidence existed to answer in the affirmative. Enter the text for the affirmative answer selected as detailed in the 2 DESCRIPTION section of the tables associated with the selected Cause Code Text: detail level Cause Map. Example: CODE = MMV, Cause Code Text = Less than adequate preventative maintenance caused failure.

3

Supporting Information;

4

HU# = Human Performance

5

DC = Direct Cause Flag:

6 7 8 9

CC = Contributing Cause Flag: PC = Proximate/Apparent Cause: FC = Fundamental/Root Cause: ACT = Corrective Actions Required:

Enter the evidence that supports the affirmative answer. Enter the number of the Human Performance Evaluation prompted by the investigation of this failure. Entering an “X” indicated that this causal factor is considered a Direct Cause for the observed condition/event. Entering an “X” indicated that this causal factor is considered a Contributing Cause for the observed condition/event. Entering an “X” indicated that this causal factor is considered an Apparent or Proximate Cause for the observed condition/event. Entering an “X” indicated that this causal factor is considered a Fundamental or Root Cause for the observed condition/event.

Entering an “X” indicated that corrective actions for this causal factor are warranted

Table 5-7, Continued CODE

Cause Code Text

Supporting Information

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

1

2

3

Table 5-8. Expanded Hardware/Material/Design Failure Evaluation Summary for Worn Emergency Brake Example Defect/Failure Description The truck’s emergency brake mechanism revealed that it was worn and easily disengaged. HU# = Human Error Observation, DC = Direct Cause; CC = Contributing Cause; PC = Apparent/Proximate Cause; FC = Fundamental/Root Cause; ACT = Corrective Actions Required

CODE

Cause Code Text

MMV

Less than adequate preventative maintenance caused failure

Supporting Information The emergency brake linkage had not been inspected or adjusted.

HU#

2

DC

CC

X

PC

FC

ACT

X

Sum m a ry This chapter described:  How to use the Cause Road Map.  How to display the results.  How to capture the details.

Look ing Forw a rd Most events discussed in this chapter were the direct result of some type of human error. In Chapter 6, we will discuss some models for human performance/behavior that frequently drive these errors.

Re fe re nc e s Advisory Committee on the Safety of Nuclear Installations (ACSNI). (1993). Organizing for safety. UK: HSE Books. Institute for Nuclear Power (INPO). (2012). Traits of a healthy nuclear safety culture. Retrieved from https://www.nrc.gov/docs/ML1303/ML13031A707.pdf

Cha pt e r 6 H um a n Pe rform a nc e Ba sic s In earlier chapters, I described how to use the Cause Road Map and the supporting detailed map. Now we need some tools to help answer the questions asked within the human performancerelated maps. 6 .1 T he ore t ic a l fra m e In the early 20th century, Sigmund Freud concluded that errors were a product of the unconscious drives of the person. More recently, Donald A. Norman (1988) studied both cognitive and motor aspects of error and differentiated between two types of errors, Slips and Mistakes. This classification system is also known as the hybrid classification. • Slips are actions, errors or error of execution that are triggered by a person’s experiences, memories and organized knowledge. Slips are unintended failures of execution that occur almost every day in our lives because attention is not fully applied to the task in hand. • Mistakes are errors of thought in which a person’s cognitive activities lead to actions or decisions that are contrary to what was intended. Mistakes can result from the shortcomings of perception, memory, cognition and decision-making, resulting in the failure to formulate the right intentions. A refinement of Norman’s cognitive aspects of error came from Jens Rasmussen, a system safety and human factors professor in Denmark. Jens Rasmussen first published his Skill-RuleKnowledge (S-R-K) framework in a technical report from Riso National Laboratory, Roskilde, Denmark (1979). This framework expanded on Norman’s work and defined three types of human performance errors: skill-based, rule-based and knowledge-based. (Frequently referred to as the generic error modeling system or GEMS taxonomy).

In a later publication , Rasmussen further defined the behaviors (1983): •

•

•

Skill-based: The skill-based behavior represents sensory-motor performance during acts or activities which, following a statement of an intention, take place without conscious control as smooth, automated, and highly integrated patterns of behavior. Rule-based: The composition of such a sequence of subroutines in a familiar work situation is typically controlled by a stored rule or procedure which may have been derived empirically during previous occasions, communicated from other persons’ know-how as instruction or a cookbook recipe, or it may be prepared on occasion by conscious problem solving and planning. Knowledge-based: During unfamiliar situations, faced with an environment for which no know-how or rules for control are available from previous encounters, the control of performance must move to a higher conceptual level, in which performance is goal-controlled and knowledge-based.

I translate the above definitions as follows: 6 .1 .1 Sk ill-Ba se d Skill-based performance behaviors are like Norman’s slips, largely errors of execution. A necessary attribute of the skill-based performance behaviors is that the individual is very experienced with the task they were performing. A skill-based error is an action that is not in accordance with intentions. Some examples of skill-based performance behaviors that are automatic: • •

Driving while still paying heed to pedestrians, other vehicles, and traffic lights. Hammering a nail in while holding a board in place.

6 .1 .2 Rule -Ba se d Rule-based performance behaviors are a more advanced error mode than the skill-based performance mode. Rule-based performance errors are typically the result of a failure to recognize or understand the situations or circumstances encountered. Individuals working in this mode are typically somewhat familiar with the task but do not have the in-depth experience needed to perform the task at a subconscious level. Errors made in this performance mode often result from misunderstanding or misclassifying the situation and thus applying an incorrect rule. Rule-based errors can also occur from the failure to recognize a familiar pattern because a situational change masked the normal know-how of the task. For example: •

Helicopter pilots can make a rule-based error when they engage the cyclic lever instead of the rpm controller when a loss of power occurs. This a misunderstanding or misclassification of the situation.

Caution: Implementing a work order or procedure does not mean that the entire task will all be performed in the rule-base performance mode.

6 .1 .3 Know le dge -Ba se d Knowledge-based performance behaviors are Rasmussen’s highest level of performance. Knowledge-based performance errors result from shortcomings in an individual’s knowledge or limitations in their ability to apply existing knowledge to new situations. Individuals operating in this mode can’t rely on the “rules” they have acquired from previous experience or detailed in readily available documents. Operating in this mode requires problem-solving. Some examples: •

•

A student pilot with limited flying experience might be able to land an aircraft with knowledge he already has. However, it will be virtually impossible for the student pilot to perform the landing in similar conditions if he experiences a failure in his navigational equipment. This problem may be because he has neither faced such a problem, nor has he been trained to react to such a situation. A person traveling to a new location who makes a wrong turn, becomes lost, and then tries to find his or her way by instinct alone.

6 .1 .4 Slips, M ist a k e s, a nd V iola t ions In his book, Human Error, Dr. James Reason (1990) further clarified the differences between slips and mistakes by further classifying them into distinctive categories such as capture errors, mode errors, lapses, rule-based mistakes and knowledge-based mistakes. Dr. Reason also distinguished between mistakes and violations and described both as errors of intent resulting from inappropriate intentions or incorrect diagnoses of situations. Dr. Reason did not necessarily consider violations to be negative. While Dr. Reason’s categories can help understand the behaviors behind human errors, they to do not added much to the identification of these errors or the corrective actions for them. However, in this book, Dr. Reason also included a culpability decision tree, which is a decision tree build around the question of “was the behavior intended?” The end states from this decision tree are either, intentional sabotage, system induced violation, possible intentional violation, possible negligent error, and blameless error. While this entire decision tree is widely used in cause investigations in the nuclear power industry, only one question from this decision tree is a signification decision point within the context of the Cause Road Map. That is the “Pass substitution test?” question. “Yes” as an answer to the “Pass substitution test?” question in either a Blameless Error or a System Induced Error means the cause investigation must look beyond blaming the individual for the true cause of the event. This a key decision point for using the Cause Road Map. The following is an adaption of Dr. Reason’s full culpability decision tree frequently used in the nuclear power industry.

Figure 6-1. Culpability Decision Tree

6 .2 T ools & I nt e rve nt ions Support ing t his T he ore t ic a l Fra m e w ork Two reliable investigative tools can be used to determine the reasons for human performance errors. The first tool is one I developed as a simple way to implement the skill-rule-knowledge (S-R-K) framework. The second tool, the Substitution Test, comes from the culpability decision tree contained in Dr. Reason’s Managing the Risks of Organizational Accidents (1997). This tool is used to determine if personnel were exhibiting expected behaviors for the given situation. These tools, when used together, provide a quick and accurate method to determine whether the cause(s) for the event were the result of individual human performance error or of more fundamental organizational problems. 6 .2 .1 Sk ill-Rule -Know le dge (Ba la nc e Be a m ) T ool The Human Performance Modes Balance Beam tool is based on the premise that people switch among different levels of cognitive control when faced with different situations or separate tasks.

It is used to determine the thinking mode that the individual or a group of individuals was working in when the event occurred. Understanding these relationships is beneficial when crafting corrective actions. • Typically, skill-based performance mode tasks involve fewer than 7 complex steps or fewer than 15 simple steps. In the workplace, skill-based mode activities are sometimes referred to as skill-of-the-craft tasks. The error rate for skill-based related tasks is typically 1 per 1,000 attempts. • A simple rule for identification of rule-based errors is answering “yes” to the question “Is there a Rule and the individual knew there was a Rule?” The error rate for rule-based tasks is typically 1 per 100 attempts. Errors made at this level are often referred to as mistakes associated with applying the incorrect rule. • Knowledge-based modes are situations in which the person realizes that there are no rules that apply, and the person eventually switches to pure logical deduction and symbolic manipulation based on theoretical knowledge. In other words: “You don’t know what you don’t know.” Depending on error checking or validation efforts, the error rate for knowledge based tasks is typically between 1 in 10 to 1 in 2 attempts.

Toward this end, Figure 6-1 provides a simple visualization of the relationship between the skill-, rule- and knowledge-based human performance error modes based on the implementer’s familiarity with the task and the degree of attention required to achieve its successful completion.

Figure 6-2.1. Human Performance Modes Balance Beam Figure 6-2 provides another way of looking at the Skill-Rule-Knowledge model. It can be a useful tool to help recognize and manage knowledge-based decisions.

I Know

When you are CERTAIN about the appropriate course of action, you are GOOD TO GO.

We Think

If you find yourself saying, “I think…” you have left the “GOOD TO GO” range and entered a gray area. A decision to proceed should only be made by a team of individuals with supervisory concurrence after careful consideration of potential worst-case consequences.

Let’s Ask Joe

If you find yourself thinking, “We should ask Joe….” you don’t know. DO NOT PROCEED. Stop and get more information.

Nobody Knows When NOBODY KNOWS, then everyone is GUESSING. We NEVER guess about safety! Figure 6-3. Knowledge Based Decision Making Warning Graphic Following are some more complex skill-rule-knowledge assessment examples. Skill-Based Mode Example An instrument technician was performing routine preventive maintenance on a pneumatic pressure-indicating controller. Part of the maintenance task involved removing the mechanical linkage between the sensing element and the process instrument. To remove the linkage, the technician was required to spread the connecting clevis by rotating the plastic spreader so the clevis can be removed from the attachment pins. There is a similar connection at each end of the linkage. There are no detailed steps in the work order or technical manual for this portion of the maintenance activity. After maintenance was finished, the technician placed the linkage on the correct pin at each end of the linkage. He closed the spreader at the sensing element end but failed to close the spreader at the process instrument end of the linkage. The process instrument end of the linkage is partially hidden behind an adjustment knob. The pressure-indicating controller worked properly for a period of time. However, the linkage became disconnected and failed when it was called upon to respond to a rapid pressure transient. Since this task is typically considered a “skill-of-the-craft” task, neither the work order nor the instrument’s technical manual contained information about removing or installing the

linkage. In this case, the skilled technician failed to use the proper level of attention when reconnecting the linkage. Lessons learned from this event have been incorporated into the work order work steps by including a step to check that the spreader is closed.

In this example, the Skill-Based error itself was corrected by simply re-connecting the linkage. Adding steps from the lessons learned from this event to the work order template was a corrective action for a previous Knowledge-Based error. In that case the work order preparer failed to recognize the potential for this type of error given that the process instrument end of the linkage is partially hidden behind an adjustment knob. If the linkage had not been partially hidden, reconnecting the linkage would have been the only appropriate action to take. Conclusions: • The only appropriate corrective error for a true Skill-Based error is to correct to condition itself. • The only truly effective action that can be taken to prevent true Skill-Based errors is to take the human out of the process (i.e. automation). • True Skill-Based errors are not always easy to recognize. Many apparent Skill-Based errors are really an expression of a previous Knowledge-Based error. • Errors made while performing a task requiring little conscious intervention (i.e. from memory − muscle or mental − alone) are reasonably classified as Skill-Based errors. Rule-Based Mode Example A trained, fully qualified mechanic was reassembling a relief valve using a work order. The work order contained steps for valve disassembly and for performing a special test that included the use of a spiral wound gasket. The work order also refers to the technical manual for final reassembly of the valve. The valve was disassembled and the special test successfully performed. The mechanic, who was not present when the test device was installed, was instructed to reassemble the valve. Upon removing the special test device, the mechanic noticed the spiral wound gasket. Neither the work order nor technical manual mentioned the use of the spiral wound gasket. The mechanic assumed, since a gasket had been removed and the parts bag contained a new gasket of the same size, that a gasket was required. He reassembled the valve using the new gasket from the parts bag. When the valve was tested, it failed, causing a loss of power generation. The mechanic was following the steps of the work order. Therefore, he was working to a procedure, which was intended to be a rule-based activity. When he installed the spiral wound gasket during the valve reassembly process, he exited the rule-based mode and entered a knowledge-based mode. He assumed, based on experience, that metal-tometal surface joints normally require a gasket. Since there was a new gasket in the parts

bag, he assumed from base experience and instinct that a gasket was required for this joint, even though no gasket was called for in the work order. Lessons learned from this event included detailing the work steps to make it more clear that no gasket should be installed at that location.

In the above example, the only truly rule-based error was made when the mechanic failed to seek additional guidance when neither the work order nor technical manual mentioned the use of the spiral wound gasket. As soon as he proceeded working, based on his assumption that a gasket was required, he made a separate knowledge-based error driven by the earlier mistake of putting a new gasket in the parts bag. Conclusions: • True rule-based errors are not always easy to recognize because many are really an expression of poorly written work instructions or another previous knowledge-based error. In the above example, correction for the true rule-based errors was accomplished when the mechanic realized that he should have obtain additional guidance before he proceeded. • Frequently, corrective actions for apparent rule-based errors are actually corrections for previous knowledge-based errors. • If an error is understood to be a “slip or lapse,” then it is reasonable to classify this error as a rule-based error. Knowledge-Based Mode Example At 1516 hours on 19 July 1989, United Flight 232, a McDonnell Douglas DC-10-10, was cruising at 37,000 feet when the aircraft suffered a catastrophic engine failure. The uncontained disintegration of the number two engine's fan rotor caused the loss of all three of the aircraft's redundant hydraulic flight control systems and made the aircraft almost uncontrollable. Captain Al Haynes and his crew, augmented by a DC-10 instructor pilot who was aboard as a passenger, were able to navigate to the municipal airport at Sioux City, Iowa, U.S., where the aircraft was crash-landed approximately 45 minutes after the hydraulic failure (Haynes, 1991). On that day, the flight crew had to learn how to fly and land the massive DC-10-10 for which they had no ailerons to control roll, no rudders to coordinate a turn, no elevators to control pitch, no leading-edge devices to help slow down the plane for landing, no trailing edge flaps to be used in landing, no spoilers on the wings to slow the plane in flight or to help braking on the ground, no nose wheel steering, and no brakes. Of the 285 passengers and 11 crewmembers aboard, 111 passengers and 1 crew member perished in the crash, while 174 passengers and 10 crewmembers survived.

Lessons learned from the accident have been incorporated into flight training, procedures, and emergency preparedness plans and involved focused on either driving performance to the rule-based mode, improving task familiarity, or increasing awareness of the task.

This is an example of working in knowledge-based mode. Neither the flight crew nor ground based support personnel from United Airlines nor McDonnell Douglas had prepared for nor developed procedures for coping with a total loss of aircraft hydraulics. In fact, everyone was confident that the complete loss of all flight controls was impossible. By using their combined knowledge and experience, the flight crew, aided by ground support personnel, was able to control the aircraft up to the point of touchdown. That the aircraft was controlled at all and that there were any survivors in these unusual circumstances was recognized by the industry as an example of extraordinary airmanship by the crew. Conclusions: • Errors classified as skill-based and rule-based are frequently an expression of a latent knowledge-based error. • If an “assumption” or “guess” resulted in an error, then it is reasonable to classify this error as a knowledge-based error. The primary reason for taking the time and effort to understand Rasmussen’s skill-ruleknowledge (S-R-K) framework for each human error investigated is that this understanding will lead directly to effective corrective actions. 6 .2 .1 .1 Corre c t ive Ac t ions Che c k list for Sk ill-Ba se d Errors Most of the following corrective actions focus on reducing the effect of various precursors to the error. However, as previously noted, the only truly effective action to prevent true Skill-Based errors is to take the human out of the process (i.e. automation).  Identify critical steps of a task to increase attention.  Increase supervision.  Include additional personnel to peer-check critical steps.  Simplify tasks by procedure simplification, limiting memory requirements and standardizing similar tasks by using signs, procedure format, and forms.  Reduce distractions by enhancing workplace formality, not interrupting critical work, and preparing needed tools and information before work begins.  Reduce job stress through good vertical communication, develop a high degree of trust among organizations, and maintain effective communications.  Provide awareness tools (reminders) such as signs, pre-job briefings, and caution statements in procedures.  Ensure workers maintain an alert mental state through good supervisory techniques including effective pre-job briefings.  Increase personal experience with a task.

 Use self-check checking techniques such as STAR (stop-think-act-review) when performing a task.  Use senses TVAMO (touch, verify/verbalize, anticipate/additional verification, manipulate, observe) when performing a task.  Use visualization techniques before performing a task.  Practice or use a mockup prior to performing a task.  Limit multitasking. No more than two tasks at a time.  Use relaxation, meditation, and exercise to reduce stress.  Install interlocks between similar controls.

6 .2 .1 .2 Corre c t ive Ac t ions Che c k list for Rule -Ba se d Errors Most of the following corrective actions focus on increasing awareness of the guidance for performing the task.  Use training or effective supervision to ensure a verification process. For example, check-off sheets, repeat-backs, etc., can be used as part the normal work task completion.  Train on work fundamentals, including improving knowledge of procedure and work process bases.  Practice on transition between procedures.  Incorporate validation and verification training.  Clarify vague rules, by explaining “how to” perform a required action or explaining the desired outcome (criterion) to determine whether success was achieved.  Promote the practice of having workers verbalize intentions.  Organize work specialization groups: e.g., system engineers, component engineers, technical advisors, planners, etc. 6 .2 .1 .3 Corre c t ive Ac t ions Che c k list for K now le dge -Ba se d Errors Most of the following corrective actions focus on either driving performance to the rule-based mode through providing better guidance, improving task familiarity, or increasing awareness of the task.  Providing more concise task instructions.  In-progress independent reviews.  Critical attribute checklists.  Improve problem-solving skills.  Familiarize workers with the work process.  Provide knowledge oriented training.  Assign the role of “Devil’s Advocate” during key decision-making meetings.  Improve communication skills in all modes (speaking, writing, etc.).  Use work specialization.  Correct workers that are over-confident.

 Use outside experts.  Employ consultation and networking skills.  Assess all options before proceeding. 6 .2 .2 Subst it ut ion T e st T ool

•

•

As mentioned earlier in this chapter, the purpose for performing a substitution test is to determine if the event was caused by individual performance error or if the event was caused by a more fundamental error, such as a program or process error. The substitution test can be summarized with the following question: “Would another similarly trained and similarly experienced individual have made the same error under the same conditions?” In other words, if the answer to the above question is “YES,” then continue looking for more fundamental causal factors. When using the Cause Road Map, this means do not stop on the Grey or Red sections of the map. However, if the error in question was truly a skill-based type error (refer to the following discussion), then the answer to the substitution test question is automatically “NO” and there is no reason to believe that continuing through the Cause Road Map will add any significant value to the investigation. Substitution test example An electrician was checking phase voltage on the line side of a three-phase 600 VAC molded case circuit breaker as part of a preventive maintenance task. The electrician was using a voltmeter to check the voltage. Because the probe was not long enough to reach the wire connections, the electrician used an uninsulated screwdriver to extend the reach of the voltmeter probe. As the electrician was measuring the voltage, the shaft of the screw driver came in contact with a circuit breaker mounting bracket, causing a direct short to ground, destroying the breaker, causing flash burns to the electrician’s face and arms, and causing temporary loss of the 600 VAC bus. When the substitution test was applied, other electricians and the electrician’s immediate supervisor were asked about using screwdrivers as probe extenders. Their response was that using screwdrivers was the normal method for extending a voltmeter’s probe. None of the electricians interviewed was aware that insulated probe extenders were available in the tool storeroom.

In this case, the substitution test answer is “Yes.” This was a normal electrician work practice, indicating that there are more fundamental organizational causes for the event, such as: • Improper use of tools. • Quality of the work procedure. • Inadequate safety standards. Conclusions: • If the error in question was truly a Skill-Based type error (refer to the following discussion), then the answer to the substitution test question is automatically “NO” and there is no reason

•

to believe that continuing through the Cause Road Map will add any significant value to the investigation. Statistics indicate that in most cases the substitution test answers should be answered “Yes” and that most human errors are either Knowledge-Based or Rule-Based.

Sum m a ry In this chapter, we introduced and described:  The theoretical frame for assessing human performance errors including the introduction of Jens Rasmussen Skill-Rule-Knowledge (S-R-K) framework and Dr. Reason’s Culpability Decision Tree.  The Human Performance Modes Balance Beam tool for determining an individual’s S-R-K performance mode based on the individual’s familiarity with the task and the amount of attention required to perform the task.  Examples of corrective actions for Skill-Based errors that focus on reducing the effect of various precursors.  Examples of corrective actions for Rule-Based errors that focus on increasing awareness of the guidance for performing the task.  Examples of corrective actions for Knowledge-Based errors that focus on either driving performance to the rule-based mode through providing better guidance, improving task familiarity, or increasing awareness of the task.  The Substitution Test, which is used to determine if the event was caused by an individual performance error or by a more fundamental error.

Look ing Forw a rd In Chapter 7, we will discuss one of the most important elements of any investigation: interviewing.

Re fe re nc e s Haynes, A. (1991). The Crash of United 232. Retrieved from http://www.clearprop.org/aviation/haynes.html Norman, D. A. (1988). The psychology of everyday things. New York, NY: Basic Books. Rasmussen J. (1979) On the structure of knowledge: A morphology of mental models. Retrieved from http://orbit.dtu.dk/fedora/objects/orbit:91798/datastreams/file_6ef463d4-fda6-46ae-90d710ad449254d2/content Rasmussen, J. (1983). Skills, rules, and knowledge: Signals, signs and symbols and other distinctions in human performance models. IEEE transactions: Systems, man & cybernetics. Retrieved from https://reliability.com/industry/pdf/Skills-rules-knowledge-Rasmussen-seg.pdf Reason, J. (1990). Human error. New York, NY: Cambridge University Press.

Cha pt e r 7 Effe c t ive I nt e rview ing Interviewing is one of the most important elements of any investigation. In your investigation, you not only need to find out what happened, but you will also need to find out how and most importantly why. As with most events, a human element plays a role. The investigator’s job is to unravel the data and uncover the facts that resulted in the event. This chapter is not an allinclusive treatise on interviewing. Instead, it gives the investigator the basic tools for conducting an effective interview.

I m port a nc e of I nt e rvie w ing Dr. Sidney Dekker, in The Field Guide to Understanding Human Error (2014) introduces the concept of looking at an event from the perspective of the personnel involved. In other words, looking at the event from “inside the tunnel” rather than looking at the event from “outside the tunnel.” Looking at the event from “inside the tunnel” for the personnel involved in the event, we acknowledge that errors were made but see the errors as symptoms of deeper organizational troubles. Dr. Dekker explains that by looking at an event from outside the tunnel we are using hindsight to find causes and assign blame. In this manner, we will identify human errors such as: • Loss of situational awareness. • Lack of procedural adherence. • Lack of supervisory oversight. • Lack of training.

By looking at the event from the perspective of the personnel involved, the investigator will be able to understand their behaviors and factors that lead them to do what did. By doing so, you will be able to identity deeper and more fundamental factors that led to the event, such as: • Conflicting organizational goals. • Organizational culture drivers. • Management actions vs. management message. • Organizational structure. (Note: Refer to Cause Road Map, Map 6 and Table 3-6 for additional examples of fundamental factors.) Obtaining accurate data is key to understanding an event from the perspective of the personnel involved. Conducting effective interviews is an important element in obtaining accurate data. As an effective event investigator, you will need to gather information from all personnel involved as well as any personnel who may have witnessed the event. In addition, personnel who supported activities that led up to the event as well as middle and senior management personnel will need to be interviewed. This can often be a monumental task. Without reliable and accurate accounts of what happened, you will be unable to reveal correct underlying causes and implement proper corrective actions to prevent recurrence.

St e ps Ne c e ssa ry t o Pre pa re a nd Conduc t a n Effe c t ive I nt e rvie w It is critical, therefore, to clearly understand the five basic steps to conducting an effective interview: 1. Planning. 2. Opening. 3. Conducting. 4. Closing. 5. Recording. 7 .1 Pla nning t he I nt e rvie w As with any activity, the most important step is good preparation. Effective interviewing is no different. Maximizing the effectiveness of an interview requires good planning and detailed preparation. The interviewer needs to know specifically what he or she is looking for. Caution: Getting the sequence of events wrong will result in invalid conclusions. Step 1. Identify the information that is being sought. In other words, what questions or question “themes” are to be asked? What exactly are you looking for? Things to consider are:  What happened? When did it happen?

 What was the detailed sequence of events leading up to the event?  Who was involved (roles and responsibilities, including supervision and management, as applicable).  Whether the task was performed as a “skill-of-the-craft” task.  What policies, procedures, standards and expectations were: • Applied to the situation (e.g., night notes and job turnover information)? •

Being used at the time?

(Note: See Cause Road Map 4 and Table 3-4 or additional lines of inquiry.) Detail the work situation with respect to the TWIN model: (Note: See Cause Road Map 3 for lines of inquiry.) • Task demands. • Work environment, schedule. • Individual capabilities of personnel assigned to the task. • Human nature/Natural tendencies associated with the job/task.  Who conducted/participated in the task preview (pre-job walk down), if applicable.  The pre-job brief and what was discussed, including: • How, when and where the brief was conducted. • Who spoke during the brief and about what? The intent is to determine if the brief was sufficiently interactive to ensure the team member(s) were engaged. • Who attended the briefing? • What the potential consequences of inappropriate/incorrect actions and equipment response were. • What type of “what-if” scenarios were discussed and evaluated. • What key questions were asked, discussed, and fully explored and answered. For instance, did the team or crew discuss: • What the critical phases or steps of the activity were? Critical steps are those that: o Are irreversible or irrecoverable. o If performed could result in immediate component-, system- or plant-level impacts. o If performed could result in immediate impacts to personal safety. o If performed as direct could result in a mistake. For example: 1. Due to physical or paper changes since the activity was last conducted. 2. Working on or operating the wrong equipment (adjacent equipment). 3. Manipulating the wrong switch, valve or lever. 4. Incorrectly assembling components.

•

5. Making contact with energized or hazardous equipment or material.

What adverse things could happen? For example: 1. 2. 3. 4.

Personal injury. Equipment damage or destruction if not reassembled correctly. Plant transient, including an unplanned power change. Adverse effect on in-service equipment.

Obtain as much background information as possible. Sources for information include:  Logs.  Personnel statements (see Figure 7-1 for a suggested personnel statement form).  Technical manuals.  Procedures.  Training lesson plans.  Drawings.  Photographs.  Walk through of the area.  Maintenance records.  Training records.  Time sheets.  Work schedule. Step 2. Select a location to hold the interview. The interview setting is important. A mechanic who is used to turning wrenches will probably not be very comfortable being interviewed in the executive conference room. By the same token, holding an interview in a cafeteria at lunchtime will introduce too many distractions. It is often beneficial to let the interviewee select the time and place for the interview. Step 3. Who should be interviewed? Suggestions include:  Personnel directly involved with the event.  Eyewitnesses.  Support personnel.  Job Planners.   Job  Schedulers.  Supervisors.   Engineers.  •  Spouse/Family.

Management. Equipment suppliers. Maintenance personnel. Staff medical personnel. Trainers.

7 .2 Ope ning t he I nt e rvie w Set the tone for the rest of interview. Opening should:  Put the interviewee at ease by: • Thank the interviewee for taking time to be interviewed. • Tell the interviewee why you are there. • Be empathetic.  Give an overview of the material to be covered.  Establish your credibility by: • Providing your credentials. • Showing how you prepared (e.g., reviewing the procedure that was being used when the event occurred).  Provide direct answers to interviewee questions.  Acknowledge that interruptions are likely. Anticipate that the interview may be interrupted by phone calls or other distractions, especially if the interviewee is still on duty. 7 .3 Conduc t ing t he I nt e rvie w Another useful technique is to have two people conduct interviews: one to focus on asking questions and one to take notes. It is often difficult to focus on asking questions and to take notes at the same time. Another alternative may be the use of an electronic recording device if allowed by rules and policy. However, the recording process has the potential to make interviewees uneasy. It is courteous to ask the interviewee if he/she minds being recorded to ensure accurate transcription. If the interviewee is uneasy with recording, manual note-taking (typically using your company’s interview template) is recommended. A relaxed interviewee is more likely to recall and share pertinent information. Have a list of questions or question themes ready and use effective questioning techniques. Remember the response to a question usually depends on how the question was asked. 7 .3 .1 Ope n-e nde d Que st ions Use open-ended questions to quickly gather information about the event. Open-ended questions typically invite unrestricted answers in phrases or sentences. Open-ended questions typically start with: What, Where, When, How, Who, Why. For example: If investigating a fire event, start questioning with: “Where were you and what were you doing when you first became aware of the fire?”

Another example is the need to establish a person’s activities as they relate to an event. Ask the interviewee to describe his or her activities. For example:

If investigating the cause of a pump failure following maintenance, start your questioning with: “Please walk me through the pump maintenance activity.”

Even though this example is not specifically worded as a question, it is designed to draw out details as they relate to the event being investigated. Be prepared for a lengthy response to openended questions. 7 .3 .2 Close d-e nde d Que st ions While you will use open-ended questions to confirm or clarify information, you will use closedended questions to elicit a specific response, depending on how the question is phrased. Quite often responses to closed-ended questions are either YES or NO. For example: A person states that he was just entering the room when he became aware of the fire. Ask for clarification. Were you entering through the south door? (The response to this question would most likely be “YES” or “NO”.) For example: An operator states that he was at the control board when the event started. A closedended question would be used to determine which control board: which control board were you at when the event started? (The expected response would be the control board name.)

7 .3 .3 Follow -up Que st ions or Que st ioning t o t he V oid Don’t be reluctant to ask follow-up questions. It is critical to get as much information from each interviewee as possible. Sometimes called questioning to the void, the purpose here is to gather as much specific information as possible by asking a question based on the previous answer in order to elicit all possible information. The “void” will be the point at which no more progress is made. One technique for questioning to the void is the use of the turnaround question. The turnaround builds on the response to the previous question and turns it into a follow-up question. For example, the operator responded to the question, “Where were you when the event started?” by stating that he was at the control panel. Turn the response to the question around by asking: “Which panel were you at?” He states that he was at the secondary control panel. Turn that response around by asking: “Why were you at the secondary control panel?” He would then respond by stating his reason for being at the panel. Keep asking turnaround questions until the interviewee has no more information to offer. If you had stopped questioning after the operator said he was at the control panel, the information as to which panel he was at, and why, would not have been revealed.

Other valuable questions to ask are “What else?” and “Is there anything else?” These two questions give the interviewee an opportunity to provide any additional information, even if it proves to be irrelevant to the investigation. Keep the interview on track. If the interviewee strays from the subject or line of questions, be polite but firm and bring him or her back on track. For example, the interviewee might say that he or she has a concern about the work schedule, but you see that it has no bearing on the event being investigated. Acknowledge his or her concern and explain how to escalate that concern through the appropriate chain of command. Then, refocus the interview by again stating the purpose of the interview. 7 .3 .4 Ot he r T ips t o a Suc c e ssful I nt e rvie w • Keep an open mind and don’t ask leading questions (do not “put words in the subject’s mouth”). Even if you know what happened, do not try to get the interviewee to agree. Let the interviewee describe what he or she thinks happened. • Ask the interviewee what he or she thinks could have been done to prevent the event. • Be cordial and polite, even-tempered, sincere, and interested. • Sit upright, frontally aligned, lean forward on occasion, have your arms open (do not cross arms), and avoid slouching. • Maintain eye contact when asking questions and when the interviewee is answering. Casual breaks of eye contact are essential. • Avoid expressions (mock or otherwise) suggesting an emotional reaction to the interviewee’s comments, such as disbelief, shock, anger, humor, disgust, or skepticism. • Avoid repeating what the interviewee says. Avoid challenges to the interviewee’s statements. • Acknowledge and review any written statements made by the interviewee at some appropriate time during the interview. • Share interview notes with the interviewee or provide a copy of any electronic recordings if applicable. Such sharing not only establishes credibility, but it is also is a way to validate information. Even with the best preparation, a reluctant interviewee is sometimes encountered. The interviewee may be defensive and refuse to answer questions, perhaps taking the position that he or she did nothing wrong or claiming that all procedures were followed without error. In this situation, when the interviewee resists you, calmly restate that the purpose of the interview is not to fix blame but just to gather facts. As stated above, continue to be cordial and maintain an even temper. If progress is stopped, politely wrap up the interview. Again, offer to share any notes or recordings with the interviewee.

7 .4 Closing t he I nt e rvie w When the end of all questions has been reached and the original question themes have been accomplished, ask one last question: “What would you suggest to fix this problem or to prevent it from happening again?” Next, summarize the interview. Review with the interviewee the information gathered. Finally, thank the interviewee for his or her time and let the interviewee know that follow-up questions may be asked once addition information has been gathered. 7 .5 Re c ording Figure 7-1 is a useful template for recording your interviews. This template, or one like it, should always be used. It serves as your reminder of the topics you need to address in the interview and ensures consistency in format when more than one person is being interviewed.

PERSONNEL STATEMENT WORKSHEET Statement of (name): Event Date / Time: ________ / ________ Associated task/evolution in progress:

Date / Time of Statement: ________ / ________

Job position, role and responsibilities during the task/evolution:

Problem description (What is the key issue with this event?):

What happened? What was expected? / What was the objective of the task/evolution?

How was the problem discovered? Knowing what happened, what would you recommend to be done differently?

Signed: _______________________________________ Figure 7-1. Personnel Statement Worksheet

Sum m a ry This Chapter presented the following five basic steps for conducting an effective interview:  Planning.  Opening.  Conducting.  Closing.  Recording.

Look ing Forw a rd Chapter 8 will introduce the following cause investigation tools:  Comparative Event Line.  5 Why’s/Why Staircase.  Causal Factor Trees.  Barrier Analysis.  Change Analysis.  Task Analysis.  Common Cause Analysis.  Failure Modes and Effects Analysis.

Re fe re nc e s

Dekker, S. (2014). The field guide to understanding human error, 3rd ed. Boca Raton, FL: CRC Press.

Cha pt e r 8 Ana lysis T ools a nd T e c hnique s The previous chapter presented five basic steps for conducting an effective interview. Chapter 8 will introduce several derivatives of common cause analysis tools, as well as introducing a tool for classifying these tools by their function. 8 .1 T ool T ype s a nd U se M a t rix Like the tools in your household toolbox, no single tool can be used for all applications. It is hard to loosen the nut on a bolt with a screwdriver. Event analysis tools are the same. Not all analysis tools can show event timing, while other tools are not able to show cause and effect relationship(s). One way to distinguish between analysis tools is to classify them by their primary function or functions. The Tool Types and Use Matrix (Figure 8-1) lists several cause investigation tools and classifies them by their function. A tool may have more than one function (i.e., functional type). These functional types are: D = Data Gathering: Tools that are useful in helping to collect and organize information for later analysis. A = Analysis: Tools that are used to analyze a specific observation. E = Event Modeling: Tools used to display (or “model”) the relationship between a sequence of events and / or conditions. The matrix below shows several frequently used tools and then cross-references each tool with its function(s). Finally, the matrix shows the area(s) where the use of a given tool can be most effective. The following tools listed on this matrix are discussed further in this chapter: • Comparative Event Line. • Failure Modes and Effects Analysis.

• • • • •

Barrier Analysis. Change Analysis. Task Analysis. Common Cause Analysis. 5 Why’s/Why Staircase.

Interviewing was discussed in Chapter 7, and Causal Factor Trees for event modeling and Event and Casual Factor Charting tools will be discussed in Chapter 9. If one or more of the cause investigation tools used frequently in your company are not listed, you may construct a similar matrix to include them.

Tool Description

Tool Type

Failure Modes and Effects Analysis

A

Interviewing

D

Comparative Event Line

D

Barrier Analysis

A

Change Analysis

A

Task Analysis

A

Common Cause Analysis Commonalities Matrix 5 Why’s / Why Staircase Causal Factor Trees Event and Causal Factor Charting

Effective Tool Use

A A A EA EA

Problem Types

ACTIVE ERRORS

ERROR PRECURSORS

PROGRAM & PROCESS FAILURES

ASSESSMENT FAILURES

MANAGEMENT PRACTICE ERRORS

HARDWARE FAILURES

Figure 8-1. Investigation Tool Types and Use Matrix (“LTA” = Less Than Adequate)

First, I begin with tools that are primarily analytical tools typically used iteratively during an investigation. In this chapter, I will also introduce the 5 Why’s/Why Staircase, which is a

questions-asking technique more than a specific tool. The results from using this technique are typically displayed in a Causal Factor Tree, which will be discussed in Chapter 9. 8 .2 Com pa ra t ive Eve nt Line The Comparative Event Line is a unique and simple multipurpose tool that focuses on gathering data and should be the first tool used for every investigation. This tool is based on the Comparative TimeLine™ developed by William R. Corcoran, Ph.D. (2003). This tool is used in conjunction with the Human Performance Evaluation Summary and the Hardware/Material/Design Failure Summary worksheets introduced in Chapter 2 for evaluation of the direct consequences of each delta between Columns 4 and 5 and for analysis of the significance of these differences. The Comparative Event Line can also be used as a shortened Barrier Analysis by describing the adverse consequences or effects of an error in Column 4 and any exiting or missing barriers to preclude the adverse consequences or effects in Column 5. The format of the Event Line is a seven-column table of the important happenings (and omissions) related to the event. Contents of the columns are: Columns

Contents

1

Sequence number for ordering events relative to other events when a specific date of the event(s) cannot be established or is not relevant.

2&3

Date and time of each step in the development of the event, including post-event actions that influenced the consequences one way or the other.

4

Describes what actually happened or (in the case of omission and failures to perform) what actually did not happen. When used to describe failed barriers, this column is used to detail the adverse consequences or effects of a failed barrier.

5

Describes what should have or not have happened. When used to describe failed barriers, this column is used to detail the barriers that failed.

6&7

Whenever there is a consequential disparity between columns 4 & 5, or when these columns describe a failed barrier, Columns 6 or 7 are used to reference the Human Performance Evaluation Summary worksheet or the Hardware / Material / Design Failure Summary worksheet that documents why the disparity existed or the barrier failed.

Table 8-1. Comparative Event Line Template Seq. No.

Date

Time

What Actually Happened / Adverse Consequences or Effect

A sequence number for ordering events relative to other events when a specific date of the event(s) cannot be established or is not relevant.

The date (if known) of each step in the event.

The time (if known) of each step in the event.

The What happened? Column is where an influencing factor is recorded that, as understood, appears to have a significant relationship to the event. What happened can include what didn’t happen if something should have happened, but didn’t. The information in this column should be checked to make sure it is consistent with the best available evidence. Inferences and opinions should be so labeled. The investigators should put in everything that has a chance of being relevant and then removing it before releasing the report.

.

What Should Have Happened / Barriers to Preclude

“What Should Have Happened?” is the corresponding functional influence that reflects organizational standards and values. All supporting programs and processes should be aligned to achieve the organization’s mission. The information in this column should be checked to make sure it is consistent with the best available evidence. Inferences and opinions should be so labeled. When what happens is not what is expected or required there is often a misalignment between stated values and actual values. The misalignment can exist in a single person or in an entire organization. It is therefore important that an investigator understand if an inappropriate consequence influencing factor is aligned to organizational values before expecting to understand what the significance is to the organization. In regulated industries, regulatory requirements should be considered in determining what should have happened.

Observ #

Failure #

Whenever there is a consequential disparity between columns 4 & 5 or when these columns describe a failed barrier, enter the associated Observation number or Failure Number in these columns.

Table 8-2. Comparative Event Line Analysis for the Example Seq. No.

Date

Time

What Actually Happened / Adverse Consequences or Effect

What Should Have Happened / Barriers to Preclude

1

00:00:00 In early 2008, the Ajax Construction awarded contract to build condominium.

OK

2

00:00:00 In Mid-2008, the Ajax Construction develops a safety program covering all aspects of the project.

OK

3

09/01/2008 00:00:00 In September 2008, the Ajax OK Construction hires dump truck drivers.

4

10/04/2008 07:00:00 October 4, 2008 Construction activities begin, dump truck driver starts moving dirt.

As required by the company’s comprehensive safety program, the driver should have attended driver safety class before entering construction site.

5

10/08/2008 00:00:00 Between October 4-8, 2008, construction continues without incident.

Construction management should have been monitoring and correcting deviations from safety rules.

6

10/08/2008 17:00:00 October 8, 2008 Worker parked dump Dump truck should have been truck on hill for the weekend but left it locked with its wheels chocked. unlocked and without the wheels chocked. The dump truck has been left in a position to roll down the hill and is vulnerable to vandalism.

7

10/09/2008 00:00:00 On October 9, 2008, a 9-year-old boy Boy should not have been allowed to walks four blocks from home and leave his yard without permission. enters construction site. Parents unaware of boy’s absences and boy unaware of dangers of wandering off alone.

8

10/09/2008 08:15:00 The boy enters unlocked dump truck and releases the emergency brake. This allowed the truck to start rolling.

9

10/09/2008 08:16:00 The Boy was afraid to jump from truck The boy should have reset the as it rolls down the hill and picks up brake, but does not know how. speed.

10 10/08/2008 08:16:45 The boy was injured when truck hits parked automobile.

Construction vehicles should have been locked with wheels chocked or otherwise placed in a safe condition and the emergency brake should be hard for a young boy to release.

Observ # Failure #

1

2

1

8 .3 Fa ilure M ode s a nd Effe c t s Ana lysis Failure mode and effects analysis (FMEA) was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s with the one of the earliest being described by the US Armed Forces in MIL-P-1629 (US Department of Defense, 1949). Since then, many different types and FMEA worksheet formats have been developed including some that focus on process, procedure or software failures. Hardware failure FMEAs typically involves reviewing as many components, assemblies, and subsystems as possible to identify failure modes, along with their causes and effects. This type of FMEA examines a component or system’s individual subsystems and assemblies to determine the variety of ways each sub component or assembly could fail and the effect(s) of each failure. This section will present a simplified version of a hardware-related FMEA worksheet I developed to be used with the Cause Road Map. The analyst fills in the columns performing the following steps: Component

Perform a component by component examination of the system and list each component on a separate line.

Function

Describe this component’s function.

Failure Modes

Describe failure mechanism or mechanisms for the specified component.

Failure Causes

Describe potential failure causes (e.g., lack of lubrication, overfilled, under filled, etc.). Use the Cause Road Map Maps A & B to help identify these potential causes

Failure Effects

Describe effect of failure (e.g., stop fluid flow, reduce temperature)

Failure Detection

Describe ways to detect failure.

Comments

Provide any needed explanatory comments.

Cause Codes

Select the associated Cause Code category (e.g. MST* for Excessive / Inadequate Temperature).

Calculate an RPN value to provide a relative ranking between components as follows: Potential Severity Probability of Probability of Detecting RPN Rank Occurrence Rank RPN = Severity x Rank 1=Low Impact 1=Very Likely Occurrence x 1=Not Likely 10=High Impact 10=Not Likely Detection 10=Inevitable

Optional

Document the analytical work on the worksheet with comments. Matching the observed facts with the effects analysis will identify the likely causes.

Table 8-3. FMEA Worksheet Template Componen t

List component or sub component

Function

Describe its function (e.g., turn on, turn off, alarm, start, stop, open, close, etc.)

Failure Modes

Describe failure mechanism (e.g., break, stall, stick, short, etc.)

Failure Causes

Failure Effects

Failure Detection

Comments

Cause Codes

Describe failure causes (e.g., lack of lubrication, over filled, under filled, etc.)

Describe effect of failure (e.g., stop fluid flow, reduce temperature)

Describe method to detect failure

This is a good place to describe verification plans

Code from Road Map

Note − a failure mode can have multiple causes

Example: “Slug not free to move”

Failure mode 2 Failure cause 2

Note − a failure cause can have multiple detection methods

Use of the following table is optional. Potential Severity Rank 1=Low Impact 10=High Impact

How severe is the effect of this failure?

Probability of Occurrence Rank 1=Not Likely 10=Inevitable

How frequently is this failure likely to occur?

Probability of Detecting

RPN

Rank 1=Very Likely 10=Not Likely

RPN = Severity x Occurrence x Detection

How easy is it to detect this failure?

Risk Priority Number

Tips for using the FMEA process: • A thorough understanding of the failed equipment is necessary in order to conduct an FMEA. A highly knowledgeable subject matter expert is needed. If the evaluation team does not possess a high level of knowledge, an expert should be recruited from elsewhere, either inside or outside of the organization. • Do not rule out possible failure modes until physical evidence validates that they should be eliminated. In your evaluation, you may also need to look for a lack of evidence to eliminate a particular failure mode.

• •

•

Repeat the process as needed to identify intermediate failure modes until the primary failure mode is determined. You may need to examine physical evidence under laboratory conditions. If that is the case, be sure to get laboratory personnel involved as early as possible. I highly recommend that laboratory personnel visit the location of the failure to understand layout, environmental conditions, history, etc., which may have contributed to the failure. If the component failure was catastrophic, you may find that physical evidence has been lost or destroyed in the failure (for example, electrical insulation is destroyed by fire). If that is the case, examine other similar components. In addition, consider possible corrective actions with methods to capture and preserve physical evidence in future failures.

Table 8-4. FMEA for a Flow Control Valve Example Component Function

Actuator

Strokes valve open and closed

Failure Modes

Failure Causes

Failure Effects Failure Detection

Solenoid actuator failure

Internal contamination of solenoid coil insulation outgassing

Contaminants prevent the free movement of internals

Contact temperature of solenoid is relatively high

Take temp on each division valve

Sealing of slug end face by lubricant outgassing

Slug not free to move

Film of lubricant on the slug end face

Inspect for film

Actuator diaphragm seal leakage

Hydraulic pressure decreases

Requires disassembly and inspection

Diaphragm just replaced – could have been misassembled

Actuator hydraulic fluid Internal leakage

Comments

Cause Codes

MST*

MSM*

MAA*

Use of the following table is optional. Potential Severity Rank 1=Low Impact 10=High Impact

Probability of Occurrence

Probability of Detecting

RPN

RPN = Severity x Occurrence x Detection Rank Rank 1=Not Likely 10=Inevitable 1=Very Likely 10=Not Likely

10

8

1

80

5

5

1

25

5

5

5

125

8 .4 Ba rrie r Ana lysis In its orginal incarnation, this technique was used to assess the effectiveness of physical or flagging / warning barriers. Later versions focused on identifying behavioral deteminants. This technique has been in use since at least 1990 and has also known as a Hazard-Barrier-Target analysis. The following is a redrawn and revised version of a graphic originally created and by Dr William R. Corcoran that he used in many of his training courses.

Figure 8-2. Barrier Analysis Graphic

The Barrier Analysis technique described in this section looks for breakdowns or lack of barriers (both physical and behavioral) which resulted in unwanted conditions. Barriers are designed to prevent or mitigate the consequenses of a problem or event. Unwanted problems or events occur when a barrier(s) breaks down or is not present to prevent it from happening. Performing a Barrier Analysis is an effective method for you to identiy weak or missing barriers that contributed to or exacerbated an event. 8 .4 .1 Cla ssifying Ba rrie r Func t ions • Barriers that promote an appropriate action. Examples include: 1. A step-by-step procedure. 2. Labeling of equipment. 3. Design features such as layout of a control room, training of personnel or supervision.

•

Barriers that prevent an inapprioriate action. Examples include: 1. Interlocks that prevent opening a tank fill and drain valves simultaneously. 2. Fan belt guards. 3. Ignitition interlock that prevents a car from being started unless the transmission is in park or neutural. 4. Parts that can be assembled only one way.

•

Barriers that discourage an inapprioriate action. Examples include: 1. Warning signs; the automatic shutoff nozzle on a gasoline pump, 2. The “deadman” switch on lawn mowers that automatically shuts off the engine if the operator lets go of the handle.

•

Barriers that detect an inappropriate action. Examples include: 1. An alarm that sounds when you open your car door with the keys still in the ignition; supervision of a task. 2. Periodic safety inspections to detect a discharged fire extingusher.

•

Barriers that compensate for an inappropriate action. Examples include: 1. The reverse feature on automatic garage doors that will reverse the door if it detects a person or object passing through the doorway while the door is closing. 2. The overflow drain on a sink; robust crane design that has a built-in margin; automatic shut off on an iron that will turn it off if it is not moved for several minutes.

•

Barriers can serve more than one function. For example: An automatic shut off and a “deadman” switch could also be classified as a barrier that both discourages and prevents an inappropirate action.

8 .4 .2 T ypic a l Ba rrie r Ana lysis Che c k list Que st ions You will need to address many things during the performance of a Barrier Analysis. The questions listed below are designed to assist you in determining what barriers failed, were weak, or were ineffective – resulting in the problem/incident.  If there were barriers, did they perform their function? Why?  Did the presence of any barriers mitigate or increase the problem severity? Why?  Were any barriers not functioning as designed? Why?  Was the barrier design adequate? Why?  Were there any barriers on the condition / situation source(s)? Did they fail? Why?  Were there any barriers on the affected component(s)? Did they fail? Why?  Were the barriers adequately maintained?  Were the barriers inspected prior to expected use?  Why were any unwanted hazards present?  Is the affected target designed to withstand the condition/situation without the barriers?  What changes could have prevented or deflected the unwanted hazards? Why?  Did designers, operators, maintainers, or anyone else foresee the problem?  Is it practical to take steps to reduce the risk of the problem recurring?  Were adequate human factors considered in the design of the barrier?  What would you have done differently to prevent the problem, considering all economic concerns (as regards operation, maintenance and design)? This question helps to identify missing barriers. 8 .4 .3 Ba rrie r Ana lysis Displa y Form a t We have modified the Barrier Analysis worksheet for integration with the Cause Road Map, using a six-column table format to describe barriers (and missing barriers) related to the event. These columns are: Column 1

Contents Identifies the barrier. Barriers can be physical (fence, interlocks, alarms, etc.) or administrative (procedure, signs, communication protocol, etc.).

2

Identifies the target that the barrier is protecting. Targets can be people, equipment, processes, or other entities that need to be protected.

3

Identifies the threat that the barrier is designed to keep away from the target. Threats come in all shapes and sizes. Threats can be: injury to a worker, damage to equipment, economic loss, violation of a regulation, etc. Evaluates the effectiveness of the barrier as it relates to the event. Was the barrier weak, effective or ineffective, missing, not used, or, in the case of an administrative barrier, not followed?

4

5 6

Identifies the type of barrier: prevent, discourage, compensate, promote, or detect. Lists the cause code. If the barrier was effective, no Cause Road Map Cause Code should be listed. Column 6 is unique to this Barrier Analysis Worksheet.

Table 8-5 shows a barrier analysis for the dump truck accident teaching example.

Table 8-5. Barrier Analysis for the Dump Truck Accident− Teaching Example Barrier Fence and/or guard

Wheel chocks

Locked vehicle doors

Target Non-construction workers

Threat

Effectiveness in this Case Not used

Construction workers

Injury to nonconstruction workers, vandalism, loss of equipment Injury to people, property damage from movement of unattended vehicle Vandalism, loss of equipment Injury to people, property damage from movement of unattended vehicle Injury to people, property damage from movement of unattended vehicle Injury to people, property damage from movement of unattended vehicle Injury to people, property damage Injury to people, property damage

Child’s well being

Injury

Not Effective

People and other property

Leave truck in reverse gear when parked

Non-construction workers People and other property

Park vehicle on level ground

People and other property

Safety rules for operating vehicles

Vehicle operators

Safety training program Supervisor / Safety Officer Job observations Parent Supervision

Vehicle operators

Type and Cause Code Prevent / discourage

MDD**

Prevent / compensate

MDD**

Prevent

MDD**

Prevent

MOD**

Prevent / compensate

MOD**

Promote

DPP**

Promote

DPT**

Promote

ASA**

Promote

DOC**

Not used

Not used Apparently slipped out of gear or was not left in reverse Not effective. Vehicle parked on hill

Not followed

Not held Not effective

The following is an example of how the Comparative Event Line tool is used to identify and evaluate failed barriers.

Table 8-6. Comparative Event Line used as Simplified Barrier − Analysis Example Report

Example Failure #

EVENT / ISSUE TIME LINE Seq. No.

Date

Time

What Actually Happened/ Adverse Consequences or Effect

What Should Have Happened/ Barriers to Preclude

Observ #

1

10/08/2008

17:00:00

Dump truck should have been locked with its wheels chocked. (Barrier not used)

1

2

10/09/2008

00

October 8, 2008 Worker parked dump truck on hill for the weekend but left it unlocked and without the wheels chocked. The dump truck has been left in a position to roll down the hill and is vulnerable to vandalism. On October 9, 2008, a 9year-old boy walks four blocks from home and enters construction site.

3

10/09/2008

08:15:00

The boy enters unlocked dump truck and releases the emergency brake. This allowed the truck to start rolling.

The construction site should have been fenced or guarded to prevent entry by unauthorized persons. (Missing Barrier) Construction vehicles should have been locked with wheels chocked or otherwise placed in a safe condition (Barrier not used) and the emergency brake should be hard for a young boy to release. (Ineffective Barrier).

2

1

8 .4 .4 Adva nt a ge s a nd We a k ne sse s of Ba rrie r Ana lysis T ool Barrier Analysis Pros: • Conceptually simple, easy to grasp. • Easy to use and apply, requires minimal resources. The format for a Barrier Analysis is a Microsoft Word table. • Results translate naturally into corrective action recommendations.

Barrier Analysis Cons: • Some barriers (e.g., the organization’s self-assessment capabilities) may be hard to recognize, resulting in an incomplete analysis. • Can be subjective in nature, resulting in overstating or understating weak or ineffective barriers. • Barrier Analysis cannot identify underlying organizational weaknesses or cultural problems. • Barrier Analysis is not a standalone process and should always be used in combination with the Comparative Event Line and other methods. 8 .5 Cha nge Ana lysis Change is often described as the “Mother of trouble” because we often find that a problem or event was caused by some sort of change. An activity is performed hundreds of times but this time it resulted in an undesired outcome. What changed this time to cause the undesired outcome? This technique, also known as a Change Impact Analysis, is used in many different industries and thus has differing forms and definitions in each. The version most relevant to this discussion is the change analysis technique discussed in Charles D. Reese’s book Accident/Incident Prevention Techniques (2011). Reese’s discussion was derived from the US Department of Energy’s Root Cause Analysis Guidance (1992). Neither of these discussions included examples of how to use this technique, and neither discussed how to integrate this technique into the cause investigation process. For clarification, we include a teaching example and a worksheet modified to integrate with the Cause Road Map. Change Analysis is the technique that asks: • “What is different in the situation that resulted in undesired result or performance problems from the other times this activity or process was performed correctly?” • “What change in personnel, equipment, procedures, process, design, supervision and/or communications took place between the successful completion of the activity and the performance problem?”

8 .5 .1 T he Five St e ps in Cha nge Ana lysis The fundamental process of change analysis involves six steps, as depicted in Figure 8-3: 1. Situational determination. Record all facts concerning the occurrence with the undesirable consequences. Caution If there were problems associated with previous successful activity, Change Analysis technique should not be used. 2. Similar but problem-free event determination. Consider a comparable, reference occurrence that did not have undesirable consequences.

3. Comparison. Compare the occurrence with undesirable consequences to the reference occurrence. 4. Document the differences. Establish all known differences whether they appear relevant or not. 5. Analyze the differences. Analyze all the differences for their effects in producing the undesirable result. Be sure to include the obscure and indirect effects. For example: Different paint on a piping system may change the heat transfer characteristics and therefore change the system parameters.

6. Incorporate contributing differences into problem causal factors. Integrate this apparent cause(s) into the investigation for the root cause(s) as shown in Table 8-7.

Figure 8.3 Six Steps Involved in Change Analysis

Table 8.7. Change Analysis Worksheet Template Change Factor

Difference / Change

Effect

WHAT (conditions, activity, equipment) WHEN (occurrence, plant status, schedule) WHERE (physical locations, environmental conditions, step of procedure)

Questions To Answer

Observation Number

Failure Number

Whenever there is a consequential difference described in columns 2, use these columns to enter the associated Observation number or Failure

HOW (work practice, omission, extraneous actions, out of sequence, poor procedure) WHO (personnel involved (by job title, not name)

8 .5 .2 A Ca use Ana lysis Ex a m ple The following is my summary of this details of the airline crash that we will use as a cause analysis example. Comair Flight 5191 Accident Summary ComAir Flight 5191 (Delta Connection Flight 5191), was a domestic passenger flight from Lexington, KY, to Atlanta, GA. On the morning of August 27, 2006, at around 06:07 am, the flight crashed while attempting to take off from Blue Grass Airport on the incorrect runway. All 47 passengers and two of the three crew members on board the flight died.

Both pilots were experienced and well rested. In addition, neither pilot had any accidents, incidents or enforcement history. It is not entirely clear if they have worked together on a routine basis. There were no mechanical problems with the aircraft. Both pilots had arrived in Lexington at least 14 hours prior to the night of the crash and had flown into Blue Grass Airport several times, although not since its runway was repaved.

Blue Grass Airport's main runway, Runway 22, is 7,000 feet long. However, for an unexplained reason, Flight 5191 took off from Runway 26, the 3,500-foot general aviation runway. We need to ask why, since there were clues for the pilot to help him taxi to the correct runway, including signs marking the right way. The 3,500-ft.-long Runway 26 had less lighting than some other runways and contained severely cracked concrete – not the type of surface typically found on runways for commercial routes. The cockpit voice recorder indicates that preflight procedures were normal and that the crew reported no problems. All contacts between the crew and air-traffic control indicated that the crew planned to take off on the airport's main Runway 22, the appropriate one, not the much shorter Runway 26. Two earlier flights had taken off from the correct runway that morning before Flight 5191. When the plane taxied away from the main Runway 22, which had been repaved the previous week, it was dark and a light rain was falling, but the plane moved toward Runway 26. Other information about the conditions include: The two runways in question share the same common run-up area. The extended taxiway to the correct runway, Runway 22, was closed a week earlier due to construction. Pilots now had to make a sharp left turn and cross Runway 26 to reach Runway 22 (see Figure 8-4). One pilot experienced with the Blue Grass Airport’s configuration states that it has always been difficult for pilots to distinguish between the two runways when taxiing out, noting that it would be quite natural to take the wrong one and that the pilot is always tempted to take it.

A Change Analysis can be performed from the information acquired about the accident.

Approach route to Runway 22 in use on 8/27/06

Approach route to Runway 22 prior to construction

Approach route to Runway 26 Runway 26

Airport Terminal Runway 22

Figure 8-4. Blue Grass Airport Layout. (Created from a public domain photograph.)

Table 8-8. Change Analysis Worksheet ComAir Flight 5191 Accident Change Factor

Difference/Change

Effect

WHAT (conditions, activity, equipment)

It was dark and raining and the lights for Runway 26 were off. No change in the plane’s condition. No change on when. This was a daily scheduled flight. Taxi way routing. Weather conditions.

The fact that the lights were off was not relayed to the control tower.

WHEN (occurrence, plant status, schedule) WHERE (physical locations, environmental conditions, step of procedure)

HOW (work practice, omission, extraneous actions, out of sequence, poor procedure)

There is no indication that the pilot work practices had changed.

WHO (personnel involved - by job title, not name)

The pilots while experienced may not have routinely worked together. They did not fly together the previous day. There was no change in the flight controller or staffing.

Questions To Answer Why did the pilots continue with the runway lights off?

The flight crew had not departed the airport using the new taxiway configuration.

Need to interview other pilots to determine the level of confusion with the new configuration.

Effect unknown - but the effect may have impaired communication between the pilots.

Need to interview the survivor to determine the effect on communications.

Observation Number

Failure Number

1

2

3

By completing the change analysis worksheet, the investigator can quickly rule out issues such as pilot work practices and tower staffing. They were not the direct cause of the accident. However, they may have contributed to it. The focus here was not to analyze the complete event but rather to demonstrate how a change can directly lead to an event. In their accident report for this event the National Transportation Safety Board (NTSB) determined that the probable cause for this accident was (2007): …the flight crewmembers’ failure to use available cues and aids to identify the airplane’s location on the airport surface during taxi and their failure to cross-check and verify that the airplane was on the correct runway before takeoff.

And that the contributing to the accident were: …the flight crew’s non-pertinent conversation during taxi, which resulted in a loss of positional awareness, and the Federal Aviation Administration’s failure to require that all runway crossings be authorized only by specific air traffic control clearances. Although the NTSB’s report for this accident did not indicate that their investigation used this technique, the above conclusion answers all the questions in this teaching example. 8 .5 .3 Adva nt a ge s a nd We a k ne sse s of Cha nge Ana lysis T ool Change Analysis Pros: • Conceptually simple, easy to grasp. • Can be quickly performed. • Easy to use and apply, requires minimal resources. The format for a Change Analysis is a Microsoft Word table. Change Analysis Cons: • Depends on previous successful outcome. • Some changes may be hard to recognize resulting in an inaccurate analysis. • Change Analysis cannot identify underlying organizational weaknesses or cultural problems nor is it likely to identify weakness in the organization’s self-assessment capabilities. • Change Analysis is not a standalone process and should be used in combination with other methods. 8 .6 T a sk Ana lysis This technique is used in many different industries and thus has differing forms and definitions in each. The following definition provided in A Guide to Task Analysis (Kirwan & Ainsworth, 1992) is most application to this discussion: Task analysis is the analysis of how a task is accomplished, including a detailed description of both manual and mental activities, task and element durations, task frequency, task allocation, task complexity, environmental conditions, necessary clothing and equipment, and any other unique factors involved in or required for one or more people to perform a given task.

This technique assists the evaluator in learning as much as possible about tasks of interest and establishing a baseline for the evaluation concerning how the task is normally performed. There are many different versions of this technique in use today. The following is my version: • A task analysis performs a step-by-step reenactment of the task. It is useful in determining behavioral and environmental causal factors and is best accomplished using an individual that is fully qualified and proficient in the task. The individual(s) used should not have been involved in the occurrence.

•

A task analysis is most often used to study a situation where a performance problem occurred. A task is broken down into subtasks, steps, sequence of actions, procedures, conditions under which the task is performed, tools, materials controls, information needed and information used.

8 .6 .1 T w o T ype s of T a sk Ana lysis There are two types of task analysis, a table top task analysis (paper and pencil) or a walk through task analysis. • In the table top analysis, the evaluator reviews documents and interviews personnel who are fully qualified and proficient at performing the task. •

In the walk through analysis, the evaluator is led through the step-by-step performance of the task by simulated task performance by someone who is proficient in performing the task. Prior to the walk through analysis, the evaluator should perform all the document reviews described in the table top analysis.

8 .6 .2 St e ps in T a sk Ana lysis Following are the steps used in task analysis. Note: Steps 1-6 are performed for both table top and walk through analysis, while Steps 7-11 are performed for walk through analysis only. 1. Obtain preliminary information so that it is known what events were taking place when the inappropriate action occurred. 2. Decide specifically which tasks will be investigated. 3. Obtain available information about the task requirements. ♦ Study relevant procedures ♦ Gather equipment history ♦ View system drawings ♦ Interview personnel ♦ Review technical manuals 4. Divide the task of interest into component actions or steps and write the step name of action in order of occurrence on the Task Analysis Worksheet in the “Required Actions” column. 5. For each required action, identify who (by job title) performs the action step and the equipment / components / tools used. Write this information on the worksheet. 6. Review the analysis information and formulate any questions for which you need to collect additional data. 7. Produce a guide outlining how the task will be carried out so that it will be apparent, what to look for and how to facilitate timely recording of actions. 8. Become familiar with the guide and decide what information is to be recorded and how it will be recorded. 9. If the problem was related to human performance, select personnel who normally perform the task. If a crew performs the task, crewmembers should play the same role they fulfill when

normally carrying out the task. If a human performance evaluation is needed, enter the associated Human Performance Worksheet number in the appropriate column. 10. Reconstruct the problem step-by-step. Observe personnel walking through the task. Record their actions and use of displays and controls. The walk-through may be done in slow motion or even stopped to facilitate asking questions. (Note: In this step, use of a simulator or mockup may prove useful.) 11. Evaluate environment / safety conditions; e.g., lighting, temperature, noise, work schedule, schedule pressures, other tasks working in area. 12. Summarize all information collected and consolidate any problem areas noted. Identify probable contributors to the inappropriate action. Table 8-9. Task Analysis Worksheet Template Steps Who Required Component/ Tools Actions Equipment

Remarks/ Questions

Observation Number Whenever there is a consequential question described in columns 6, use this column to enter the associated Observation number.

8 .6 .3 T a sk Ana lysis Ex a m ple Check Valves at Nuclear Power Plant A two-unit nuclear power plant has eight check valves, four in each unit, in a safety system that has a history of leakage due to seat failure. All eight check valves perform the same function and are of the same make and model. The four valves in each unit provide isolation separating high pressure systems from redundant low pressure systems during normal operation and are supposed to be leak tight. To provide a leak tight seat the valves are supplied with a “soft” seat. The “soft” seat is made by a rubber like “O” ring inserted into a machined groove in the valve disk. However, the frequency of seat leakage appears to be increasing requiring more valve overhaul maintenance.

From a common cause analysis, we learned that the valve seat “O” rings are failing at the 12 o’clock position shortly after maintenance is performed. The valve manufacturer representative was contracted to assist in the performance of a Task Analysis of the valve maintenance procedure. An excerpt from the analysis is in Table 8-10.

Table 8-10. Task Analysis for Valve Maintenance Teaching Example Steps

Verify “O” ring groove is clean

Who

Required Actions

Mechanic Visual inspection of the groove

Component / Equipment Valve disk

Install “O” ring Mechanic Physically “O” ring and in valve disk install “O” ring valve disk groove in disk Install valve Mechanic The action is disk and hinge to take the arm in valve disk and body hinge, which is one piece, and place it in the valve body Verify disk Mechanic Verify that the contact with valve disk is valve seat making even contact with the valve seat

Tools

None

None

Remarks/ Questions

Observation Number

Skill of the craft. Procedure should include an acceptance criteria for “clean” Procedure should include a step to verify that the groove is free of nicks and burrs that could cut the “O” ring

Valve body, None disk and hinge arm

None

Valve body, None disk and hinge arm

This is a critical step that requires shimming the hinge arm to achieve uniform seat contact. The vendor representative advises using tissue paper between the valve disk and seat as a tool to measure proper closure contact.

1

1

1

From the task analysis, it was noted that there are critical steps needed to properly position the valve disk to achieve both seat tightness and prevent damage to the valve seat “O” ring. Neither the general valve overhaul procedure nor the technical manual provided sufficient instructions for the critical step. 8 .6 .4 Adva nt a ge s a nd We a k ne sse s of T a sk Ana lysis T ool Task Analysis Pros: • Conceptually simple, easy to grasp. • Can be quickly performed. • Easy to use and apply, requires minimal resources. The format for a task analysis is a Microsoft Word table.

Task Analysis Cons: • Task analysis cannot identify underlying organizational weaknesses or cultural problems nor is it likely to identify weakness in the organization’s self-assessment capabilities. • Task analysis is not a standalone process and should be used in combination with other methods. 8 .7 Com m on Ca use Ana lysis This technique is also known as a common cause failure analysis and has taken many different forms, even within the same industries. Therefore, I was unable to find any commonly accepted definition for this technique. Unlike the other analytical tools presented in this chapter, common cause analysis evaluates multiple events. The purpose of performing a common cause analysis is to look for more fundamental causes for an adverse trend or a series of events. Sometimes we have recurring problems despite our best efforts to determine the cause and taking appropriate corrective actions for the individual problems. When should a common cause analysis be used? The following are several suggestions for performing a common cause analysis: • When the rate of human performance errors is increasing. • When the rate of a component failure increases. • To gain further insight into causes for regulatory violations. • To look for commonality for several apparently unrelated events. 8 .7 .1 Using Com m on Ca use Ana lysis: Nuc le a r Pow e r St a t ion Ex a m ple Seven Root Cause Investigations in Three Years A two-unit nuclear power station has had seven root cause investigations over the past three years. Senior management wants to know if there is a common cause for the seven events and asks for a common cause analysis.

8 .7 .1 .1 Approa c h t o t he Se ve n Eve nt s The seven events to be analyzed are: 1. A Unit 2 automatic shutdown while performing a routine surveillance test. 2. A Unit 2 Auxiliary Feed Water Pump motor was found covered with a plastic tarp. 3. Unit 2 automatic shutdown caused by troubleshooting a switchyard breaker. 4. Unit 1 forced shutdown caused by a failed 3/4-inch compression fitting. 5. A Nuclear Regulatory Commission violation for failing to revise drawings to show installed Unit 2 electrical wiring configuration. 6. Radiation Work Permit violation while performing work near the reactor. 7. Unit 1 Safety Injection system pipe hanger was found damaged.

The approach to the analysis was to perform a detailed review of each of the event reports to determine the cause(s) and recommended corrective actions. From the reports, we learned that each event was thoroughly investigated and that the recommended corrective and preventive actions were appropriate. Therefore, at first, there did not seem to be anything common among the seven events. However, each of seven event reports noted that there had been an earlier similar event for which corrective actions had been recommended to prevent recurrence. Several recommended corrective actions for each of the seven events had not been implemented or had been altered in such a way as to have been ineffective in preventing recurrence of the event. Therefore, the more fundamental cause, or common cause, for the seven events was ineffective use of the corrective action process. This analysis only took about an hour to complete. But more importantly it revealed a significant weakness in the organization’s implementation its corrective action program. Management implemented several changes to their corrective action program to ensure corrective actions were not altered or deleted without proper review and approval. There is no right or wrong way to perform a common cause analysis. What the investigator needs is a method to look at information associated with the selected series of events to identify commonalities. 8 .7 .1 .2 Sugge st e d Com m on Ca use Ana lysis M e t hod The following is my suggested Common Cause Analysis method: 1. Establish a study grouping that has similarities (e.g., system, process, event report topic, cause codes, etc.). 2. Review documentation to identify demographic information and enter the data on a spreadsheet like Table 8-11. 3. For example, If the common cause analysis is for a power plant component failure demographic information should include information such as: • Component name. • Failure cause. • Component age. • Maintenance procedure used. • System in which the component is installed. • Physical location. • Last maintenance date. • Work group performing maintenance. • Any other information that may aid the analysis.

Table 8-11. Common Cause Analysis Template Event Name or Event Document Identifier

Demographic Data Componen Last Last System Physical Component t name Maintenance Maintenance Name Location Age (Identifier) Date Performed

Work Group

Activity in Progress Failure Cause at Time of Description Code Failure

Event 1 Event 2 Event 3 Event 4 1. Sort the demographic information looking for dominant patterns. For example, the majority of the failures may be associated with a specific work group, or the majority of the failures may be associated with components over six years old. 2. Investigate the dominant patterns using tools such as interviewing, the Cause Road Map and all available information to identify more fundamental causes or cause categories. Use best judgment (do not reinvestigate each individual event unless there were obvious shortcomings) based on what is known about the issue and what is documented (e.g., supporting documentation). For example: In the component failure case, it may be discovered that the failures occurred shortly after Mechanical Maintenance personnel performed a preventive maintenance inspection. By conducting interviews, to learn how the preventive maintenance is being performed, and by reviewing the maintenance instruction, the Cause Road Map may be used to help determine the underlying cause for the failures. In this case, it may be determined that there is an error in the maintenance instruction, a procedure inadequacy cause. Check Valve Example A two-unit nuclear power plant has eight check valves, four in each unit, in a safety system that has a history of leakage due to seat failure. All eight check valves perform the same function and are of the same make and model. The four valves in each unit provide isolation separating high pressure system from redundant low pressure systems during normal operation and are supposed to be leak tight. To provide a leak tight seat the valves are supplied with a “soft” seat. The “soft” seat is made by a rubber like “O” ring inserted into a machined a groove in the valve disk. However, the frequency of seat leakage appears to be increasing requiring more valve overhaul maintenance. A common cause analysis is requested to determine the cause of the high failure rate. A preliminary review was performed to verify that all eight check valves are installed in a similar manner and location in their respective piping system and are in the same system. Now we need to look at the failure patterns. All valves are of similar age and are original to each unit.

The worksheet in Table 8-12 is set up similar the one described above but with the failure description and cause codes documented separately.

Table 8-12. Common Cause Analysis for Check Valve Maintenance Teaching Example Demographic Data Event Name or Event Document Identifier

Component name (Identifier)

Unit

Last Maintenance Performed

Work Group

Activity in Progress at Time of Failure

9/5/91

Valve overhaul installed new seat “O” ring

Plant Mechanical Maintenance

Plant startup after valve test

7/12/89

MEC, “O” ring found missing Degraded from 10 o’clock to 3 subo’clock position component

5/11/88

Valve overhaul installed new seat “O” ring

Mobile Mechanical Maintenance

Steady state operation

MEC, Degraded subcomponent

3/1/91

Valve overhaul installed new seat “O” ring

Plant Mechanical Maintenance

Steady state operation

Cause Code

Last Maintenance Date

5/11/92

“O” ring found washed away at 12 O’clock position

MEC, Degraded subcomponent

Failure Date

Failure description

WO 92-5345

22A SI tank discharge check valve

WO 89-2275

12A SI tank discharge check valve

WO 91-9989

22A SI tank discharge check valve

2

9/5/91

Light scratches found on valve seat at 6 and 8 o’clock position. Also found damage on “O” ring at 11 o’clock position.

WO 92-4598

11B SI tank discharge check valve

1

8/3/92

Valve disk hinge found binding

MEC, Degraded subcomponent

7/12/90

Valve overhaul installed new seat “O” ring

Plant Mechanical Maintenance

Steady state operation

WO 90-1115

21A SI tank discharge check valve

2

1/25/90

Found “O” ring rolled out of groove in several places

MEC, Degraded subcomponent

6/30/87

Valve overhaul installed new seat “O” ring

Mobile Mechanical Maintenance

Steady state operation

WO 88-6590

22B SI tank discharge check valve

2

11/4/88

“O” ring damaged in several areas and missing at the 1 o’clock position

MEC, Degraded subcomponent

2/25/88

Valve overhaul installed new seat “O” ring

Plant Mechanical Maintenance

Plant startup after valve test

WO 86-9254

11A SI tank discharge check valve

1

10/25/86 No problem found

MEN, Not Repeatable failure

11/3/85

Valve overhaul installed new seat “O” ring

Plant Mechanical Maintenance

Steady state operation

2

1

From a review of this spreadsheet, we observe: • Two maintenance organizations have worked on the valves, plant mechanical maintenance and mobile mechanical maintenance. The results of the maintenance have been similar. • Failures occur with similar frequency on both units. • The valve seats appear to degrade and fail in about one to two years. But, sometimes failure occurs in just a few months. However, some of the eight valves have not experienced a failure, 12B and 21B SI Tank discharge check valves. • Most failures occur during steady state plant operation. • Four out of the seven failures occurred when the valve seat “O” ring was found damaged or missing at or near the 12 o’clock position. • Each of the individual causes was flagged as a degraded component. For each individual event, there was no attempt to determine the cause of the “O” ring failure. Note for most events the cause code was listed as “Degraded Subcomponent.”

From the review of the data it can concluded that there is a dominant failure mode. The valve seat “O” ring is failing at the 12 o’clock position. If the cause for the “O” ring failing at the 12 o’clock position can be identified, it will significantly improve the reliability of the check valve. A new problem statement can now be developed, (SI Tank discharge check valve seat “O” ring is failing at the 12 o’clock position) investigation tools can then be used to identify the cause for the failures. Investigation tools used: • Interviewing. • Task analysis. • Cause Road Map. From interviews (the Cause Road Map was used to help develop question themes) and use of the Cause Road Map as part of the analysis, it was learned that: • A generic check valve maintenance procedure and the valve technical manual are used to overhaul and maintain the valve. • It was also learned that there is no specific training on maintaining this unique style of check valve.

Next, the check valve manufacturer was contacted to have a vendor’s technician perform a tabletop task analysis (see the Task Analysis section, Table 8-9) of the valve overhaul procedure. From the task analysis, it was learned that there are critical steps needed to properly position the valve disk to achieve both seat tightness and prevent damage to the valve seat “O” ring. Next, a Cause Road Map analysis (Table 8-13) was used to identify the underlying causes, and apply trend codes, for which corrective and preventive actions were developed.

Table 8-13 Human Performance Evaluation Summary for Common Cause Analysis Conclusion Report No.

Observation Id:

Human Performance Evaluation – Summary Observation Description Four out of the seven failures occurred when the valve seat “O” ring was found damaged or missing at or near the 12 o’clock position. CF=Safety Culture Flag, DC = Direct Cause; CC = Contributing Cause; PC = Apparent/Proximate Cause; FC = Fundamental/Root Cause; ACT = Corrective Actions Required

Human Errors CODE

Cause Code Text

HTOO

Other team related task performance errors not covered by another criterion were made.

D C

Supporting Information From the task analysis, it was learned that there are critical steps needed to properly position the valve disk to achieve both seat tightness and prevent damage.

C C

P C

F C

AC T

C C

P C

F C

AC T

P C

F C

AC T

X

Error Drivers CODE

EITT

ETGG

Cause Code Text

Supporting Information

The task performer’s knowledge, Training or experience level was inappropriate for the task. The task instructions / Guidance were missordered or described with insufficient clarity for easy use by the actual task performer(s).

From the task analysis, it was learned that the vendor representative advised using tissue paper between the valve disk and seat as a tool to measure proper closure contact. None of the task performers were aware of this critical step and neither the vendor manual nor the site procedure included it.

HE

D C

X

HTOO

HTOO

X

Flawed Organizational or Programmatic Defenses CODE

Cause Code Text

DPT5

Needed training was not given.

DPPI

A procedure / instruction / work document was Inaccurate, unclear, too generic or otherwise did not provide the needed information.

Supporting Information No specific training on maintaining this unique style check valve was ever given. Neither the general valve overhaul procedure nor the technical manual provided sufficient instructions for the critical steps needed to achieve both seat tightness and prevent damage...

ED EITT

ETGG

D C

C C

X

X

X

X

The causes identified were: Neither the general valve overhaul procedure nor the technical manual provided sufficient instructions for the critical steps. Corrective actions implemented: A valve-specific maintenance procedure was written which provided the proper guidance for overhauling the valves. In addition, a training activity was developed for maintaining the valves. Valve performance significantly improved.

8 .8 T he 5 Why’s/Why St a irc a se The 5 Why’s is a questions-asking technique more than a specific tool. The technique is used to explore the cause/effect relationships underlying a problem. The 5 Why’s procedure is intuitive, easily done and simply involves asking “Why?” five times in succession. Figure 8-5 is an illustration of this technique.

Figure 8-5. 5 Why’s Technique Successful use of this technique requires the investigator to avoid assumptions and logic traps, and instead to trace the chain of causality in direct increments from the effect to a fundamental cause with a valid connection to the original problem. Another popular graphical representation of the 5 Why’s approach is the “Why Staircase.” Dr. William R. Corcoran uses Figure 8-6 to illustrate and explain the Why Staircase.

Figure 8-6. Why Staircase The 5 Why’s and Why Staircase techniques are not appropriate for complicated events, primarily because: • Investigators tend to stop at symptoms rather than going on to lower level root causes • Investigator can’t go beyond their current knowledge base • The results are seldom repeatable − different people using 5 Why’s or the Why Staircase will come up with different causes for the same problem • Investigators tend to isolate a single root cause, whereas each question could elicit many different root causes. This technique is typically used with causal factor trees, which will be discussed in Chapter 9. The thought process demonstrated by this technique is also implemented in the Cause Road Maps.

Sum m a ry This chapter described a Tool Types and Use Matrix Tool and the following cause analysis tools:  Failure Modes and Effects Analysis.  Comparative Event Line.  Causal Factor Trees.  Barrier Analysis.  Change Analysis.  Task Analysis.  Common Cause Analysis.  5 Why’s/Why Staircase.

Look ing Forw a rd In the next chapter, we will discuss Event Modeling and Analysis Tools including:  Causal Factor Trees.  Event Modeling & Analysis Tools.

Re fe re nc e s Corcoran, W.R. (2003). The phoenix handbook. Windsor, CT: Nuclear Safety Review Concepts Corporation.

Cha pt e r 9 Eve nt M ode ling a nd Ana lysis T ools

The previous chapter included a discussion of several derivatives of cause analysis tools and a tool for classifying these tools by their function. These tools are typically used iteratively during the cause investigation to generate lines of inquiry and then to pursue them. This chapter will discuss a different type of investigation classified as event modeling and analysis tools. These two tools are popular with leadership because they provide a visual representation of the cause and effect relationship for the event under investigation. These tools are dependent on the use of other tools to gather and analyze data, but the effort needed to construct these models typically suggest other lines of inquiry and thus can also be considered as analytical tools. 9 .1 Ca usa l Fa c t or T re e s A causal factor tree is a cause and effect relationship presentation tool typically used to support techniques like the 5 Why’s, Why staircase and the comparative event line. In the Phoenix Handbook, Dr. William Corcoran (2003) describes a causal factor tree based on his “8 Questions for Insight,” but this technique can result in very complex structures and is much less intuitive; thus, this technique will not be described in this chapter. Another way to understand the causal factor trees is a two-element factor tree based on the concept that each consequential error was driven by either an inappropriate behavior or condition (or both) and that these were in turn driven by more fundamental inappropriate behaviors and/or conditions. As with the 5 Why’s technique, the analyst typically keeps asking why multiple times to uncover deeper and deeper levels of inappropriate behavior and conditions. Again, latent management practice weakness or

cultural problems are typically uncovered at the fifth level of questioning. Figure 9-1 illustrates how a two-element causal factor tree is typically constructed.

Figure 9-1.

Two Element Causal Factor Tree Construction Illustration

Figure 9-2 shows a two-element factor tree (with labels identifying behaviors and conditions) for the dump truck accident that was introduced in Chapter 5. As with the Cause Road Map exercise in Chapter 5, this tree also identifies the fundamental cause as, “safety is not a corporate priority.” For your convenience, we have included the full dump truck example in Section 9.2.5 of this chapter.

Figure 9-2. Factor Tree for the Dump Truck Accident Teaching Example 9 .1 .1 Adva nt a ge s a nd Disa dva nt a ge s of Ca usa l Fa c t or T re e • Advantages o The advantages of this type of a causal factor tree is that it is very visual, easy to follow, and easy to construct. It also clearly displays the investigator’s thinking, shows relationships and interfaces, and provides a display for deductive analysis. (TIP: Using Post-it® notes and a white board makes developing and editing an E&CF chart easier.) • Disadvantages o

Why Trees can be hard to draw in a MS Word document.

o

Does not show the sequence of events; thus, “effects” that are tied to “events” that occurred after the effect are hidden.

o

Can become cluttered and confusing.

o

Takes some practice to use effectively.

9 .2 Eve nt s a nd Ca usa l Fa c t ors (E& CF) c ha rt s Some form of this technique has been in use for many years. The format for this technique used most often used was described in Scientech’s Events and Causal Factors Analysis (Buys & Clark, 1995). Similar to the SCIE version, the following Events and Causal Factors (E&CF) chart version depicts events and causal factors for accident occurrence in a logical sequence. E&CF charts can be used not only to analyze the accident and evaluate the evidence during investigation, but also to help validate the contribution of pre-accident set-up factors. The heart of this technique is the sequence of events plotted on a timeline. Construction of the E&CF chart should begin as soon as the accident investigator begins to gather factual evidence pertinent to the accident sequence and subsequent amelioration. The comparative event line tool described in Chapter 8 is the best tool for this purpose. As the problem time line is established, additional features such as related conditions, secondary problems, and presumptions are added. Probable causal factors become evident as the chart is developed or can be identified as disparities between the “What Happened” and “What Should Have Happened” columns of the comparative event line tool emerge. Often, causal factors that were not obvious at the start of the evaluation become evident through this technique. “Driving” an event or condition through the Cause Road Map will help develop the chart. Ovals tagged with the six human performance areas from the Cause Road Map should be used to display causal factors derived from this effort. Continue with this analysis until: • The primary effect is fully explained. • No other causes that can be found to explain the effect being evaluated. • Further analysis will not provide further benefits in correcting the initial problem.

E&CF charts are particularly useful in complex and complicated situations, and are more effective in communicating what happened and why than long narrative descriptions. The chart provides an excellent method to graphically display functional and non-functional barriers, changes, cause and effect relationships, and how they all interact within the problem. De finit ions Event Primary Event econdary Event erminal Event

An action or happening that occurred during some activity. An action or happening that directly leads up to or follows the inappropriate action. An action or happening that impacted the primary event but is not directly involved in the situation. The end-point of the evaluation.

Conditions Presumptive Event Primary Effect

Causal Factors

Circumstances pertinent to the situation that may have influenced the course of problems. An action or happening that is assumed because it appears logical in the sequence, but cannot be proven. An undesirable condition or event that was critical for the situation being evaluated to occur. It is the major problem (inappropriate action) that resulted in the terminal event. For example, a system malfunction or mechanical failure occurred. Contributing factors which would have precluded or significantly mitigated the problem. These will lead to the root cause of the problem.

9 .2 .1 De fining Proble m s The following criteria should be used when defining problems: • Events should describe an action or happening and not a condition, state, circumstance, issue, conclusion, or a result, e.g., “the pipe ruptured,” not “the pipe had a crack in it.” • Events should describe a single action or happening. • Events should be precisely described by a short phrase with one noun and one action verb, e.g., “operator placed pump switch to start,” not “operator started pump.” • Events should be quantified when possible, e.g., “water level decreased by 36 inches,” not “water level decreased.” • Events should be based upon valid information and each should be derived from the one preceding it, e.g., “operator placed pump switch to start,” then “operator verified normal pump discharge pressure reading of 800 psig,” then “operator placed discharge valve to open.” • Events should be labelled with the approximate time and date they occurred. • Causal factor conditions should be labelled with the appropriate Cause Road Map cause code. 9 .2 .2 Conside ra t ions 1. Begin early. As soon as the accumulation of factual information on problems and conditions related to the problem starts, begin construction of a preliminary problem sequence time line with known primary events / happenings. Collect this information on a Comparative Event Line. TIP: Using Post-it® notes and a White Board makes developing and editing an E&CF chart easier.

2. Proceed logically with available data. Because problems and causal factors do not emerge during the problem review in the sequential order they occurred, there will be initial holes and deficiencies in the chart. Efforts to fill these holes and to accurately track the problem sequences and their contributing conditions will lead to deeper probing by evaluators, who will uncover the true facts involved. In proceeding logically, it is usually easiest to use the last problem as the starting point and reconstruct the pre-event and post-event sequences from that vantage point. 3. Add known and presumptive conditions to the preliminary problem sequence line. Use an easily updated format to facilitate editing the preliminary items. 4. Gather facts using other problem review techniques such as a Comparative TimeLine™.

5. Develop conditions and causal factors to a greater detail. Again, “drive” conditions through the Cause Road Map to get this greater detail and to help ensure that a logical human performance cause and effect relationship is depicted. 6. Validate with results from other reviews. 7. There is no “correct” chart. Use the E&CF charting tool to help discover root/underlying causes and to convey analysis results. This can also be a focusing tool for the investigating team when dealing with complex issues. 9 .2 .3 How t o De ve lop t he E& CF Cha rt As you collect and review information pertinent to the evaluation, write down as much as possible about the sequence of events and the situation. Define the scope of your chart from the initial information. Add key conditions and personnel actions related to the inappropriate action to the chart. For example: • Conditions at the onset of the situation: o Plant or equipment status. o Number of people assigned to the task. o Time of day. o Type of procedure or instruction being used. • Conditions and actions in the course of the inappropriate actions: o Missed step in the procedure. o Tagging placed incorrectly. o Problems with communications equipment. Conditions and actions following the course of the inappropriate action: o Response to the problem. o Consequences of the problem. o Compounding actions taken. o Detection of problem. • Identify conditions and root causes that caused or contributed to the outcome: o Use the Cause Road Map or Why Trees. o Identify and evaluate barriers that failed to prevent the situation from developing (i.e. Barrier Analysis). o Use a change analysis technique to identify differences that may have impacted the situation (i.e., change analysis). o Use a “table top” or “walk through” task analysis to understand how the task was intended to be performed and how it was mis-performed. (i.e., task analysis). • Include factors affecting conditions and actions during the course of the scenario. • Include and evaluate, as necessary, all factors effecting the conditions and actions that followed the inappropriate actions (detection). 9 .2 .4 Form a t t ing Figure 9-3 shows my suggested formatting for EC&F charts when integrated with the Cause Road Map. Standard MS Word or PowerPoint Graphics

Visio Graphic version for use with this book.

Date / Time

1.

Enclose all events (actions or happenings) in rectangles.

2.

Enclose all conditions in ovals.

3.

Presumptive events, causal factors, or conditions are shown by dotted rectangles or ovals.

Enclose causal conditions with tagged ovals.

Format the object’s “Line” properties to a dashed line.

N/A Human Errors

Flawed Defenses

Excellence Engine Visio Graphics

Error Drivers

Flawed Defenses

Flawed Assessments

Latent Weaknesses

4.

Relative time sequence is generally from left to right.

5.

Secondary event sequences, contributing causal factors and causal factors are depicted above or below the primary event line.

Standard MS Word or PowerPoint Graphics

Excellence Engine Visio Graphic

Intact

6.

Broken Failed Barrier:

Barriers

Before

7.

Change.

8.

Primary effects are shown as diamonds.

9.

Flag all causal factors:

After

N/A

Add a blackened oval to the causal factor oval.

Root Cause

Contributing Cause

Terminal Event is 10. shown by a circle. Figure 9-3. Format for Event & Causal Factor Charting

Table 9-1. Criteria for Event Descriptions An event should:

Good example:

Be an occurrence or happening. Be described by a short sentence with one subject and one active verb. Be precisely described.

Pipe wall ruptured Mechanic checked front end alignment. Operator turned Valve 32 to "OPEN" position. Consist of a single discrete occurrence. Pipe wall ruptured. Be quantified when possible. Be derived directly from the event and conditions preceding it.

Plane descended 350 feet. Mechanic adjusted camber on both front wheels. IS PRECEDED BY Mechanic found incorrect camber. IS PRECEDED BY Mechanic checked front end alignment.

Bad example: Pipe wall had a crack in it. Front end alignment was checked and brakes were adjusted. Operator opened valve. Internal pressure rose and pipe wall ruptured. Plane lost altitude. NOTE When an event is not derived directly from the preceding event, this is usually an indication that one or more steps in the sequence have been omitted.

9 .2 .5 T he Dum p T ruc k Ac c ide nt E& CF Cha rt By reviewing the Dump Truck Accident from Section 5.1 of Chapter 5, it is then possible to use this scenario to construct an E&CF Chart. This hypothetical teaching example is repeated below for ease of reading. Dump Truck Accident Example Ajax Construction Company was awarded a contract to build a condominium on a hill overlooking the city. Prior to initiation of the project, the company developed a comprehensive safety program covering all aspects of the project. Construction activities began on Monday, October 4, 1978, and proceeded without incident through Friday, October 8, 1978, at which time the project was shut down for the weekend. At that time, several company vehicles, including a 2 1/2-ton dump truck, were parked at the construction site. On Saturday, October 9, 1978, a nine-year-old boy, who lives four blocks from the construction site, climbed the hill and began exploring the project site. Upon finding the large dump truck unlocked, the boy climbed into the cab and began playing with the vehicle controls. He apparently released the emergency brake, and the truck began to roll down the hill. The truck rapidly picked up speed. The boy was afraid to jump out and did not know how to apply the brakes. The truck crashed into a parked car at the bottom of the hill. The truck remained upright, but the boy suffered serious cuts and a broken leg. The resultant investigation revealed that, although the safety program specified that unattended vehicles would be locked and the wheels chocked, there was no verification that these rules had been communicated to the drivers. An inspection of the emergency brake mechanism revealed that it was worn and easily disengaged. It could be disengaged with only a slight bump. There had been no truck safety inspections or requests to adjust the emergency brake linkage.

Figure 9-4.

Event & Causal Factor Chart for the Dump Truck Example

9 .2 .6 E& CF Cha rt ing Adva nt a ge s a nd Disa dva nt a ge s Advantages • Aids in developing evidence, in detecting all causal factors through sequence development, and in determining the need for in-depth analysis. • Clarifies reasoning. • Illustrates multiple causes. Accidents rarely have a single “cause.” Charting helps illustrate the multiple causal factors involved in the accident sequence, as well as the relationship of proximate, remote, direct, and contributory causes. • Visually portrays interactions and relationships. • Illustrates the chronology of events showing relative sequence in time. • Provides flexibility in interpretation and summarization of collected data. • Conveniently communicates empirical and derived facts in a logical and orderly manner. • Links specific accident factors to organizational and management control factors. Disadvantages • Very time consuming to construct. • Very hard to draw in a Word document. They much easier to draw using MS Visio, but this software is expensive and typically not installed on business computers. • Can become cluttered and confusing. • Takes some practice to use effectively.

Sum m a ry This chapter described two Event Modeling and Analysis Tools, including:  Causal Factor Trees.  Event Modeling & Analysis Tools.

Look ing Forw a rd In Chapter 10, we will discuss a protocol for integrating the tools discussed in this and the previous chapter and a Commonalities Matrix tool to help integrate the results from using these tools.

Re fe re nc e s Buys, J.R. INEL & Clark, J.L. INEL. (1995). Event and causal factors analysis, SCIE-DOE-01TRAC-14-95. Retrieved from https://www.frcc.com/Educational/Shared%20Documents/2014%20Cause%20Analysis% 20Training/Reference%20Material/E%20and%20CF%20Charting.pdf Corcoran, W.R. (2003). The phoenix handbook. Windsor, CT: Nuclear Safety Review Concepts Corporation.

Cha pt e r 1 0 I nt e gra t e t he Tools a nd Re sult s To investigate an event properly, the investigator, like a mechanic repairing a car, typically uses multiple “tools.” Using multiple tools to investigate an event helps to ensure a thorough investigation. The previous chapters presented several event analysis and modeling tools each with its strengths and limitations. 1 0 .1 I nve st iga t ion T ools I nt e gra t ion Prot oc ol The best way to take advantage of each technique and limit the impact of their limitations is to integrate the multiple tools in a complementary manner. Figure 10-1 provides a visual representation of my investigation tools integration protocol.

Figure 10-1. Cause Investigation Tools Integration Protocol As indicated in Figure 10-1, the primary tools for every cause investigation effort are the comparative event line and interviewing. Then the barrier analysis, task analysis, and change analysis techniques are driven by disparities documented on the comparative event line and supported through interviews. The failure modes and effects analysis technique integrates with the comparative event line in a similar manner, but may also need to be supported by some form of hardware or material analysis. Although investigators should start drawing their EC&F charts or causal factor trees based on the evidence available at the start of the investigation, these event modeling tools can only be completed accurately after most of the investigation has been completed. Therefore, I’ve shown these tools as end products. When the tools described in the Chapters 8 and 9 are used together and iteratively, they can reveal additional and/or deeper fundamental causes of an event. The following commonalities matrix tool provides a way to identify and display potential commonalities between different cause and effect sequences and some potentially more fundamental causal factors.

1 0 .2 Com m ona lit ie s M a t rix This chapter will provide the investigator with an easy to use and understand way to integrate the various tools used to model and analyze and event. Using a Microsoft Word table to form a commonalities matrix, the investigator lists the results from each tool used, looks for common themes (as shown in Table 10-1), and then asks: “Are there any additional or more fundamental causes that need correcting?” If the answer is “No,” then the investigation is complete. If the answer is “Yes,” then the investigation is not complete because the cause of the causes needs to be determined. In other words: • An additional “Why” needs to be explored. • More fundamental conditions need to be identified. • The investigator needs to use the Cause Road Map again by taking the identified cause(s) and turning them into observations, and then stepping through the Cause Road Map again to reveal additional causes. The commonalties matrix is a six-column matrix that lists the tool or tools used and plots the observations against the five levels of defenses from human errors to latent management practice weaknesses. After listing the observations, the investigator looks at each column for common threads or themes. For example, for the Human Errors column, the investigator asks if there are similar errors or error types. If there are, then the investigator would list the similar errors as a conclusion. The same process is used for each of the other four levels of defense.

Table 10-1.

Commonalities Matrix COMMONALITIES MATRIX

Investigation tool Human Errors

Problem Statement

Error Drivers

Flawed Organizational or Programmatic Defenses

Flawed Assessment Capability

COMMONALITIES MATRIX CONCLUSIONS Common Common Flawed Flawed Common Common Organizational or Assessment Errors Drivers Programmatic Capability Defenses

Latent Management Practice Weaknesses

Common Latent Management Practice Weaknesses

Again, the dump truck accident from Chapter 5 and Chapter 9 may be used as an example. For this example, we will use only the Cause Road Map tool. From Chapter 5, we had two Cause Road Map observations, the “wheels were not chocked” and “no maintenance or inspections had been performed on the emergency brake mechanism.” Plotting the two human performance observations summaries in the commonalities matrix, we get the result shown in Table 10-2.

Table 10-2. Report #:

Commonalities Matrix Example Example

COMMONALITIES MATRIX Observations

Investigation tool Flawed Organizational or Human Errors Error Drivers Programmatic Defenses Cause Road Neither the driver Neither the driver No Driver Safety Map nor anyone else nor anyone else Training courses was tasked with was tasked with were scheduled or locking and locking the doors given. chocking vehicles. or chocking the This is a team wheels. coordination error. No one was No schedule or No equipment safety tasked with plan implemented inspection program inspecting the to periodically had been truck’s braking inspect the implemented as system condition. condition of the mandated by the site truck’s safety safety manual. systems (i.e., emergency brake). The driver did not Interviewees The government request indicated that road safety sticker maintenance on since the truck inspections only the emergency had a current insure that the brake linkage. government road emergency brakes safety sticker the hold the vehicle. truck was safe to operate.

Flawed Assessment Capability No safety program implementation monitoring.

Latent Management Practice Weaknesses

Management did not make safety a priority.

The company did Management did not not monitor make safety a priority. safety program implementation.

The company did Management did not not monitor make safety a priority. safety program implementation.

COMMONALITIES MATRIX CONCLUSIONS Problem Statement A boy playing in an unsecured dump truck suffered serious cuts and a broken leg.

Common Errors Unsafe work practices are the normal work practices.

Common Drivers Drivers not trained on safety requirements for driving trucks.

Common Flawed Organizational or Programmatic Defenses The work force was frequently required to perform work without sufficient guidance.

Common Flawed Assessment Capability No safety program implementation monitoring.

Common Latent Management Practice Weaknesses Company safety program is ineffective because company has not enforced safety as a priority.

From the conclusions, it becomes clear that the fundamental underlying cause for the event in the dump truck example is an ineffective safety program because safety has not been enforced as a priority. Also, additional contributors to the event that need correction are revealed, including driver training, additional field supervision, and safety program monitoring. We can use a similar approach if multiple investigation tools were used. First, review each of the tools that were used to investigate the event and summarize the results from each tool •

•

•

•

Note: using the Commonalities Matrix for each of the tools such as we did above for the Cause Road Map above may be helpful.

Cause Road Map identified: o Unsafe work practice are the normal work practices. o No safety program implementation monitoring. o Management has not made safety a priority. o No safety training held or planned. o No equipment safety inspections planned or implemented. The barrier analysis technique (or the comparative event line used as a barrier analysis tool) identified numerous barrier deficiencies. Specifically: o The following barriers were not used: – Fence and/or guard to secure the construction site. – Parking truck on level ground. – Wheel chocks and locking truck doors. – Safety training class o The following barriers were not effective: – Supervision of boy. – Leaving truck in reverse gear when parked. – Supervision of truck driver. Note that the company cannot control the boy’s supervision so corrective actions to address the parent’s behavior cannot be proposed. The event and causal factor charting tool identified the following “root causes”: o Company internal communication less than adequate. o Management control less than adequate. o Supervision of the boy less than adequate. Again, note that the company cannot control the boy’s supervision. The causal factor tree event modeling tool determined that: o Safety training not held. o No requirement to secure the construction site. o Safety is not a company priority.

We then plot the observations against the five levels of defense in the commonalities matrix and look for common themes, as shown in Table 10-3.

Table 10-3. Report #:

Commonalities Matrix for Multiple Tools Example

COMMONALITIES MATRIX Observations

Investigation tool Human Errors Cause Road Map

Barrier Analysis

Unsafe work practices are the normal work practices.

Error Drivers No equipment safety inspections scheduled. No schedule or plan implemented to periodically inspect the condition of the truck’s safety systems (i.e., emergency brake).

Flawed Organizational or Programmatic Defenses No Driver Safety Training courses were scheduled or given. No equipment safety inspection program implemented.

No safety program implementation monitoring.

Latent Management Practice Weaknesses Management did not make safety a priority.

Numerous physical and programmatic barriers were either ineffective of not used.

Event and Causal Factor Chart

Causal Factor Tree

Flawed Assessment Capability

Company internal communication is less than adequate. Supervision of the boy is less than adequate. Safety training not held. No requirement to secure the construction site.

Safety is not a company priority.

COMMONALITIES MATRIX CONCLUSIONS Problem Statement

Common Errors

A boy playing in an unsecured dump truck suffered serious cuts and a broken leg.

Unsafe work practices are the normal work practices.

Common Drivers Drivers not trained on safety requirements for driving trucks.

Common Flawed Organizational or Programmatic Defenses The work force was frequently required to perform work without sufficient guidance or training.

Common Flawed Assessment Capability No safety program implementation monitoring.

Common Latent Management Practice Weaknesses Company safety program is ineffective because company has not enforced safety as a priority.

Numerous safety barriers not used or were ineffective.

From the conclusions in Table 10-3, it becomes clear that the fundamental underlying cause for the event is an ineffective safety program because safety has not been enforced as a priority. Also, because additional tools were used to investigate the event, an additional contributor, use of physical and programmatic safety barriers, was revealed as needing correction. In summary, clearly showing the chain of logic from the problem statement to the contributing and fundamental cause(s) makes it easy to develop effective corrective and preventive actions and for management to accept the results of the investigation.

Sum m a ry This chapter described the following:  Tools Integration Protocol.  Commonalities Matrix.

Look ing Forw a rd In Chapter 11, we will discuss another important element of performing a thorough event investigation: assessing the generic implications, also known as extent of condition and extent of cause.

Cha pt e r 1 1 Ex t e nt of Condit ion a nd Ex t e nt of Ca use The previous chapters got you through the investigation. The next step is to build on what was learned by assessing the generic implications of the identified flawed events and causal factors. This is also known as extent of condition and extent of cause. If the management team truly wants to fix the problem, they need to know how big and widespread it is. 1 1 .1 Ex a m ining t he Broa de r I m plic a t ions of a n Eve nt Safety Pump Example A critical safety pump experienced bearing failure due to lack of lubrication. The setup condition, the fill line on the sight glass for the oil reservoir was marked too low.

Just correcting the identified problem, fixing the wiped pump bearing or correcting the errant procedure does not address hidden damages or latent conditions. The revealed problem is like the tip of the iceberg, as in Figure 11-1 − only a small part of the problem is visible. For example, the wiped pump bearing could have damaged other pump bearings or caused impeller damage.

Revealed Problem

Extent of the Problem

Figure 11-1.

Generic Implications Iceberg

Does one need to answer all these questions for every event? The answer is: not always. The investigator needs to identify What should be considered for inspection, then decide if it needs to be inspected. Using a method similar to that of determining the level of effort needed for the investigation, the investigator needs to determine the appropriate level of effort for assessing the extent of condition and cause of the event. The level of effort determination should be based on the risk associated with not correcting the extent of condition and cause issues. 1 1 .1 .1 T hre e -St e p Approa c h t o I m plic a t ions This chapter describes a systematic three-step approach to address the broader implications of an event (see Figure 11-2). • Step 1: (Definition) define other potential instances of the same-similar condition and cause. • Step 2: (Risk) determine the risk of a similar event occurring with the same-similar condition by assigning a probability of occurrence and level of consequence if no corrective or preventive actions are taken. • Step 3: (Action) decide if action is required.

Step 3 Step 1

Definition Matrix

Mitigating Actions Matrix

Application Extent of Condition

Risk

Extent of Cause

Similar Object Similar Cause Same Object Same Object Same Cause Same Cause Same/Similar Same/Similar Same Defect Similar Defect Same Result Similar Result Defect Result

Recurrence

High

Medium

Low

Evaluate & Similar Correct as Same/Similar needed.

Management Discretion

No Action

Same Similar

Evaluate & Correct as needed.

Management Discretion

Management Discretion

Same Same

Evaluate & Correct as needed.

Evaluate & Correct as needed.

Evaluate & Correct as needed.

Consequences Catastrophic

Critical

Marginal

Negligible

Frequent / Probable

High

High

Medium

Medium

Occasional

High

High

Medium

Low

Remote

High

Medium

Low

Low

Improbable

Medium

Medium

Low

Low

Step 2

Application

© 2017 (et ante) - Chester D Rowe

Risk Matrix Figure 11-2. Definition to Risk to Actions Matrix

1 1 .2 St e p 1 : De fine Sa m e a nd Sim ila r Condit ions Using the Definition Matrix, Figure 11-3, define same and similar conditions. These other areas potentially could be affected by the event or have similar conditions leading to a future event.

Figure 11-3. Definition Matrix

To conduct an extent of condition/cause assessment using the risk-based tool, let’s look at an example at the start of the chapter, where a pump bearing failed. First, restate the event/problem (failure due to lack of lubrication) and critical setup conditions (incorrectly marked fill line). 1 1 .2 .1 Ex t e nt of Condit ion Asse ssm e nt Define Same-Same, Same-Similar and Similar-Similar conditions using the following criteria and applying the same safety pump example from Section 11.1. Same-Same Same Program [e.g., CAP] or Process [e.g., Surveillance Scheduling] or System [Auxiliary Steam] with the Same kind of problem/error/failure or defect. In other words, Same Object-Same Defect. For example: Because the pump bearing failed due to lack of lubrication, the investigator needs to determine what other damage the failed bearing caused and to determine if the other bearings on that pump have an incorrect marking on the sight glass and if other pumps of the same make and model have incorrect sight glass level markings.

Same-Similar Same Program [e.g., CAP vs. Self-Assessment] or Process [e.g., Surveillance Scheduling vs. 13Week Scheduling] or System [Auxiliary Steam vs. Heater Drains] with a Similar kind of problem/error/failure or defect. In other words, Same Object-Similar Defect. For example: Because the bearing failed due to lack of lubrication, the investigator needs to determine if other make and model pumps with the same style bearings have incorrect oil reservoir level indicators.

Similar-Similar Similar Program [e.g., CAP vs. Self-Assessment vs. Management Oversight] or Process [e.g., Surveillance Scheduling vs. 13-Week Scheduling vs. WM] or System [Auxiliary Steam vs. Heater Drains vs. Main Steam] with the Same or a Similar kind of problem/error/failure or defect. In other words, Similar Object-Same or Similar Defect. For example: Because the pump bearing failed due to lack of lubrication, the investigator needs to determine if other make and model pumps with the same or similar style bearings have incorrect lubrication monitors.

1 1 .2 .2 Ex t e nt of Ca use Asse ssm e nt To assess the Extent of Cause, we need to know the cause(s) of the event. For example: The cause of the incorrect oil level marking on the sight glass was found to be a new style sight glass was recently installed and the change package did not change the oil level marking reference to compensate for the change in location of the new sight glass. Cause: Incorrect information in the design change package (did not compensate for the new location).

Impact: Bearing failure.

Define Same-Same, Same-Similar and Similar-Same or Similar causal factors using the following criteria. Same-Same Same Cause-Same Impact For example: Where else has this new sight glass been installed using this change package? Same-Similar Same Cause-Similar Impact

For example: What other design change packages has this design engineer made that replaces installed equipment with new models that failed to consider all variables?

Similar-Same or Similar Similar Cause-Same or Similar Impact For example: What other design change packages could result in unexpected consequences due to incorrect or omitted information?

1 1 .3 St e p 2 : De t e rm ine t he Pot e nt ia l Conse que nc e s a nd Risk s Using the Risk Matrix, Figure 11-4, determine the potential consequences and risk exposure for each of the conditions and causes defined in Step 1. This process is somewhat subjective but should be based on the facility’s experience and values.

Figure 11-4. Risk Matrix For the example, a critical safety pump experienced bearing failure due to lack of lubrication. The setup condition, the fill line on the sight glass for the oil reservoir was marked too low.

Using the defined Same-Similar conditions: • Evaluate the chance of occurrence for each defined potential condition if no corrective or preventive action is taken. • Determine the consequences for each defined potential condition. • Identify the Risk for each defined potential condition. Same-Same For example: Because the pump bearing failed due to lack of lubrication, the investigator needs to determine what other damage the failed bearing caused and to determine if the other bearings on that pump have an incorrect marking on the sight glass and if other pumps of the same make and model have incorrect sight glass level markings. There is a high probability that the problem will recur if no action is taken. Since this is a model pump and this style sight glass is used in critical systems, the consequence to the facility of another failure is critical.

Using the Risk Matrix, the risk is determined to be high. Same-Similar For example: Because the bearing failed due to lack of lubrication, the investigator needs to determine if other make and model pumps with the same style bearings have incorrect oil reservoir level indicators. Based on facility knowledge, this style sight glass is being used as a replacement for all bearings of this style. Therefore, there is a high probability that the problem will recur if no action is taken. However, if other pumps with the same style bearings fail the consequence to the facility is marginal.

Using the Risk Matrix, the risk is determined to be medium. Similar-Same or Similar For example: Because the pump bearing failed due to lack of lubrication, the investigator needs to determine if other make and model pumps with the same or similar style bearings have incorrect lubrication monitors. Based on facility knowledge, this style sight glass is being used as a replacement for all bearings of this style. Other similar style bearings use a different style lubrication monitoring system and are not part of the replacement program. Therefore, there is a remote probability that the problem will recur if no action is taken. If other pumps with the same or similar style bearings fail, the consequence to the facility is marginal.

Using the Risk Matrix, the risk is determined to be low. The same process is used for each cause. For the example case, the cause was found to be incorrect information in the design change package (did not compensate for the new location), which led to bearing failure.

Same-Same, Same Cause-Same Impacts For example: Where else has this new sight been installed using this change package? There is a high probability that the problem will recur if no action is taken. Moreover, since this is model pump and this style sight glass is used in critical systems the consequence to the facility of another failure is critical.

Using the Risk Matrix, the risk is determined to be high. Same-Similar, Same Cause-Same or Similar Impact For example: What other design change packages has this design engineer made that replaces installed equipment with new models that failed to consider all variables? There is an “occasional” probability that the problem will recur if no action is taken. Several reviews are performed for each design package before it is issued to the field for installation. The potential consequence to the facility, depending on the package could be critical to catastrophic.

Using the Risk Matrix, the risk is determined to be high. Similar-Similar, Similar Cause-Same or Similar Impact For example: What other design change packages could result in unexpected consequences due to incorrect or omitted information? There is an “occasional” probability that the problem will recur if no action is taken. Several reviews are performed for each design package before it is issued to the field for installation. The potential consequence to the facility, depending on the package could be critical to catastrophic.

Using the Risk Matrix, the risk is determined to be high. CAUTION: It is easy to define potential Similar-Similar potential cause conditions too broadly. For example, one poorly written procedure does not mean that a potential SimilarSimilar condition of “all procedures poorly written” need be considered. In this case, there is simply no valid reason to conclude that all procedures written by all procedure writers were poorly written. In fact, given adequate procedure writing guidance, it is very improbable. It is not even necessarily valid to conclude that every procedure written by the same procedure writer was poorly written because this would imply incompetence or willful wrongdoing.

1 1 .4 St e p 3 : Ac t ions M a t rix For each of the results from step two, use the Mitigating Actions Matrix, Figure 11-5, to determine if action is needed. For example, Figure 11-5 uses Same-Similar conditions.

Figure 11-5. Mitigating Actions Matrix Same-Same From the Risk Matrix, the risk was determined to be high. Applying Same-Same with high risk to the Mitigating Actions Matrix it is determined that an evaluation is required. Same-Similar From the Risk Matrix, the risk was determined to be medium. Applying Same-Similar with medium risk to the Mitigating Actions Matrix, it is determined, that management discretion may be used to determine if evaluation is to be performed. In this case, the sponsoring manager may decide it is not worth the time and effort to inspect other make and model pumps that have the same style bearings to verify their sight glass level markings are proper. The decision of whether or not to assess an area should be based on operating experience and potential consequence to the facility should a failure occur. Similar-Same or Similar From the Risk Matrix, the risk was determined to be low. Applying Similar-Same or Similar with low risk to the Mitigating Actions Matrix, it is determined that no evaluation is recommended. The same action determination process used for potential condition is now used for potential causes. In this case, Same-Same, Same-Similar, and Same or Similar-Similar were all determined to have a high risk. When high risk is applied to the Mitigating Actions Matrix, the result is that an evaluation is required.

The following human performance example uses the Extent of Condition/Cause Assessment Risk to Action Matrix, Figure 11-6 and easy-to-use templates to determine the scope of an Extent of Condition/Cause assessment.

Figure 11-6. Extent of Condition/Cause Assessment Risk to Action Matrix Hydrazine Tank Example A few gallons of hydrazine contaminated with sodium hydroxide waste were pumped from the secondary hydrazine addition tank to the Auxiliary Feedwater Storage Tank, which was being used to feed the Steam Generators while the plant was in Hot Standby.

Table 11-1.

Extent of Condition Contaminated Hydrazine Example

Set up Conditions

Problem / Event

Extent of Condition Potential SAME – SAME Condition What other systems or tanks are contaminated due to transfer from the secondary hydrazine addition tank or other direct transfers from the contaminated waste drum?

RISK Probable & Critical = HIGH RISK MITIGATING ACTIONS Evaluate & Correct

An empty hydrazine drum filled with sodium hydroxide waste was not labeled per station requirements.

A few gallons of hydrazine contaminated with sodium hydroxide waste were pumped from the secondary The waste drum was hydrazine addition later mistaken for a tank to the Auxiliary Feedwater Storage new drum of Tank, which was being hydrazine and its contents were added used to feed the Steam Generators to the secondary hydrazine addition while the plant was in Hot Standby. tank.

Potential SAME – SIMILAR Condition Are there any other mislabeled drums that could mistakenly be added to the same or other plant systems?

RISK

Potential SIMILAR – SAME/SIMILAR Condition

RISK

Probable & Critical = HIGH RISK MITIGATING ACTIONS Evaluate & Correct

Are there any other chemical Remote & Critical = transfer tasks that could be MEDIUM RISK mistakenly performed because of a labeling error? MITIGATING ACTIONS Mgmt. Discretion

Table 11-2.

Extent of Cause Contaminated Hydrazine Example

Problem Statement A few gallons of hydrazine contaminated with sodium hydroxide waste were pumped from the secondary hydrazine addition tank to the Auxiliary Feedwater Storage Tank, which was being used to feed the Steam Generators while the plant was in Hot Standby.

Apparent / Root Cause

Extent of Cause

Willful violation (i.e. Potential SAME – SAME shortcut taken) on the Impacts part of the individual Has the same individual responsible for labeling taken shortcuts when the waste drum. labeling other waste drums?

RISK Probable & Critical = HIGH RISK MITIGATING ACTIONS Evaluate & Correct

Potential SAME – SIMILAR RISK Impacts Has the same individual Probable & Marginal = taken any other safety MEDIUM RISK significant shortcuts? MITIGATING ACTIONS Evaluate & Correct Potential SIMILAR – SAME/SIMILAR Impacts Are other individuals in the same work group willfully violating labeling requirements?

RISK Improbable & Marginal = LOW MITIGATING ACTIONS No Action

Using the above results, for the hydrazine contamination event, the evaluator can quickly develop an extent of condition/cause assessment plan. In this case, the assessment plan should include: 1. An inspection of other systems or tanks that could have been contaminated due to transfer from the secondary hydrazine addition tank or other direct transfers from the contaminated waste drum. 2. An inspection of other drums to verify that they are properly labeled. 3. A review of past work of the individual responsible for labeling to determine if other similar errors were made. Because of low risk and the low potential for adverse consequences, the investigation team does not recommend reviewing other chemical transfer tasks and investigating other personnel for willfully violating labeling requirements. If the risk and or potential consequences had been determined to be higher, these areas should be assessed. Figure 11-7 offers one final illustration of how these Same-Similar Condition and Cause statements collectively provide a 360-degree approach to Extent of Condition and Extent of Cause assessments.

Roofing Nail Example A car in the driveway of a home where the roof is being replaced has a flat tire with a roofing nail in it. It is also observed that the roofing company has not put up any barriers or tarps to catch falling debris.

Figure 11-7. 360-Degree Approach to Extent of Condition/Cause From these observations, the potential extent of condition items includes: other roofing nails in the driveway, roofing nails in the lawn, and other roofing material where it does not belong. The potential extent of cause items include: no or minimal efforts to clean up the nails or other materials and no “housekeeping” program for the job.

Sum m a ry This chapter was devoted to assessing the generic implications of an event, and its causes through use of the Extent of Condition/Cause Assessment Risk to Action Matrix.

Look ing Forw a rd In Chapter 12, we will discuss crafting effective corrective actions that will correct the condition, reduce the chance of its recurrence and improve the organization’s overall performance.

Cha pt e r 1 2 Corre c t ive Ac t ions Now that you have determined what happened and why, it is time to figure out what corrective actions are reasonable and appropriate. In general, corrective actions driven by traditional rigorous (root) cause evaluations and structured (apparent) cause evaluations can be unbudgeted, un-resourced, unscheduled, unplanned work mandates with the potential to upset orderly work process. On the other hand, well-crafted corrective actions can be a powerful tool to improve overall performance. In addition, an assigned action owner will be much more receptive to a proposed action if it is clear what needs to be done and why the action is needed. Important: Major corrective actions may require a Change Management Plan to prevent adverse consequences of the proposed changes. Change Management Plans are not addressed in this book. For information sources, see the list at the end of this chapter.

Before developing specific corrective actions, you will need to establish a judgment of the need (i.e., the intent of) the proposed corrective actions. This will help the investigator to negotiate with the proposed action owner, since there may be different ways to solve the problem at hand. Also, it is very important that you discuss with the action owner the potential adverse consequences of the changes being proposed as corrective actions. While researching industry experience for similar problems can be a helpful starting point in addressing the problem, you will need to tailor your actions to your situation, not simply copy what was done elsewhere. 1 2 .1 Eight Ba sic Ele m e nt s of Corre c t ive Ac t ions To be successful, effective corrective action needs: 1. Clear alignment that the proposed action is addressing the identified cause. This helps explain why the action is needed.

2. Clearly defined scope. It must be easily understood by the reader and ultimately the action owner what needs to be done and what are the action limits. The scope should address the extent of condition and cause to be effective in preventing recurrence. 3. Description containing sufficient detail so that the end state is known. In other words, the action description provides a “blueprint” of what needs to be done. The blueprint must have sufficient detail to minimize the need to interpret the action’s intent. 4. To be stated beginning with action verbs such as: o Develop o Implement o Revise o Provide o Modify o Conduct 5. Descriptions that do not begin with words like these: o Consider o Ensure o Monitor o Assess o Evaluate o Investigate Actions that begin with these words do not result in actions to fix problems. 6. To be realistic and within the organization’s capability to implement. 7. To be assigned to and accepted by personnel who can implement them. 8. Interim compensatory measures as necessary to prevent or mitigate the consequences of a repeat event until the permanent action is implemented. 1 2 .2 Effe c t ive Corre c t ive Ac t ion St ruc t ure Ele m e nt s 1. A statement of how the action addresses the stated cause. 2. Action verb. 3. Action description (i.e. scope and what needs to be done) with a description of the action closure criteria. 4. Action Owner. 5. Due date. Example Statement of how this action addresses the cause: The focus of this action is to address CARB members’ root cause process knowledge weakness by improving CARB members’ knowledge of the root cause process and to provide CARB members Corrective Action Program insights to help them provide effective oversight.

Action Verb including closure criteria: Conduct a workshop for CARB members and Responsible Managers on the findings on this assessment. The focus of the workshop should be on: • • • •

Problem and Scope statements. Reviewing an analysis in Root Cause Investigation Reports looking to ascertain the chain of logic of causal factors. SMART Corrective Actions. Effectiveness Review plans and Effectiveness Review Assessments.

•

Reviewing examples of effective and ineffective Root Cause Investigation reports.

Action owner and due date: Action Owner _Somebody_______ Action Due date 12 Dec. 2016

1 2 .3 T e n Crit e ria for Effe c t ive Corre c t ive Ac t ions Once you have defined the elements of an effective corrective action statement, you need to ensure the actions will be effective. To effectively implement and close corrective actions, make sure that the following criteria are met: 1. Make sure the proposed Corrective Actions address the underlying cause(s) for the event, not the symptoms. 2. Make sure the scope of the proposed Corrective Actions is clearly defined. 3. Make sure the proposed Corrective Actions are reasonable, specific, measurable, doable and don’t create new problems (i.e., are your Corrective Actions SMART?). 4. Make sure the cost of the proposed Corrective Actions in terms of money, time, and people are understood. 5. Make sure stakeholders agree to the proposed Corrective Actions. 6. Make sure the implementation schedule for the proposed Corrective Actions is timely, or there are sufficient compensatory measures in place pending implementation of the proposed Corrective Actions. 7. Make sure the Corrective Actions are implemented as-planned, or any changes to the approved Corrective Actions do not alter their intent or ability to address the stated cause. 8. Make sure the Corrective Action approval authority approves any changes to Corrective Actions prior to close-out of the issue. 9. Make sure there is sufficient Corrective Action close-out documentation to verify the Corrective Actions were properly implemented. 10. Make a Corrective Action effectiveness review to verify the Corrective Actions eliminated the cause(s) of the event are scheduled and conducted.

Following are examples of good and not-so-good actions for addressing the cause of an event. Examples Root Cause Statement: Inadequate software interface.

The following action addresses the cause statement: Change Standby Generator software default value, modify size of offer window, add alarms to indicate when an incorrect offer is prepared/submitted.

The above action is reasonable, specific, measurable, doable and should not create any new problems. Root Cause Statement: The control of temporary plant modifications, via legacy processes and current Procedure OPS-000XX Temp Config Change (TCC), have permitted plant design changes to be made outside the Engineering Change Control Process.

The following action addresses the cause statement: Create a new TCC process procedure at the Station level per Fitzbottom Company benchmark.

The above action directly addresses the TCC process and is reasonable, measurable and doable. Details for this new TCC process are included by reference to the benchmark. Root Cause statement: Inadequate procedure use and adherence by fuel handling operators.

The following action does not address the cause statement: ACTION - OP Memo OPM-08-XXXX − Peer checking required for manual interventions during fuel push. Rollout SEC-FHO-XXXX, OPS-XXXX and OM-XXXX with FH operators and FH FLM, reinforcing expectations to have procedures in hand for all operations. Complete A/R 28047003 assignment #4 to revise refueling procedures NK-29OM-35000 to continuous use with the inclusion of tick/check boxes for place keeping.

Question: How does issuing and changing procedures address problems with personnel following procedures? Root Cause statement: Inadequate post-event inspection program.

The following action does not address the cause statement: ACTION − Ensure lessons learned are incorporated into the next planned revision of the Condenser Life Cycle Management Plan.

Question: What does a Condenser Life Cycle Management Plan have to do with a post event inspection program?

Make sure the scope of the proposed Corrective Action is clearly defined. It is easily understood by the reader and ultimately the action owner what needs to be done and what are the action limits? Following are examples of good and not-so-good actions for defining the action’s proposed scope. Examples Scope clearly defined: Replace all 120VAC Class II receptacles on the Reactivity Mechanism Deck so they are not available for general use on Units 1 and 2.

Note: While the scope is clearly defined, it would have been better to state what type of replacement receptacle to use. Scope not clearly defined: Include in Outage Readiness review a verification that tools will be available and tool modifications will be complete one month prior to outage start to enable management to review, verify and approve.

Question: What tools fall under the readiness review verification (specialized tools, tools supplied by the vendor, all tools on site? Scope not clearly defined: Develop and implement quality assurance checklists for identified critical components Reactors 1 and 2.

Questions: • What are critical components? • Are the components, SCRAM sensitive, Emergency sensitive, Radiation protection sensitive or Industrial safety sensitive? • What should be included in the quality assurance checklist? 1 2 .4 SM ART Corre c t ive Ac t ions Again, it is important that the action owner and action implementer clearly understand what needs to be done to fix the problem at hand. Another way to look at proposed corrective actions is that they need to be SMART. SMART Actions have the following criteria:

Criteria

S M A

Specific Clearly state why the action is needed, what is to be done and the "Desired End State" result or action; do not just restate the condition. Measurable Clearly define the necessary actions so a reviewer can easily determine the completion of the actions. Accountable Identify a specific person/group responsible for the action.

R T

Realistic The action should be within the control of the person / organization assigned to perform the action. Timely Provide reasonable due dates that allows sufficient time to complete the action before more significant consequences occur from repeat events.

Figure 12-1. SMART criteria 1 2 .5 Spe c ific Corre c t ive Ac t ions The Corrective Action should be written so that action owner and reviewers clearly understand the action’s intent and what the end product looks like. Remember, the action description needs to provide sufficient detail so that end state is known. In other words, the action description should provide a blueprint of what needs to be done. The blueprint should have sufficient detail to minimize the need to interpret the action’s intent. Also, be sure to consider problems that might be created by implementation of the proposed action. Also consider the potential effectiveness of your proposed Corrective Action. Figure 12-2 offers insight as to the relative effectiveness of corrective actions on a scale of one to six, with one being the most effective. 1 2 .6 Re la t ive Effe c t ive ne ss of T ype s of Corre c t ive Ac t ions These actions are also known as the Safety Precedence Sequence. 1

Design or design change to reduce the hazard.

2

Install automatic safety devices.

3

Install automatic safety warnings or alarms.

Procedure Change

4

Procedures or procedure changes.

Personnel

5

Personnel (training, knowledge, etc.).

Do Nothing

6

Identify and assume risk.

More Effective Hardware change

Less Effective

Figure 12-2. Relative Effectiveness of Types of Corrective Actions Examples Specific action: Change the Preventive Maintenance frequency from 4 to 1.25 years.

Note: While specific, the action should have stated what PM needs to be revised and why the change is being proposed. What new problems could implementation of this action create?

Specific action: Revise software to +$1999 to avoid potential errors to the default offer of -$2000.

Note: This action is very specific and it tells the action owner why the action is being proposed. However, could implementation of this action create a new problem? Specific action: Revise SM-XXXX to identify that the worker entering the confined space has the responsibility to ensure that the EC is aware of each entry.

Note: While good, this would be better if the action stated what steps of the procedure should be revised. Action that is not specific: Revise the Equipment Causal Analysis / Troubleshooting procedure to ensure that it requires one to identify failure mechanisms for unexpected equipment failures.

Questions: • How should the procedure be revised? • Should the purpose of the procedure be to help the user find the cause for the failure? • What guidance is missing from the current procedure? Action that is not specific: Ensure that there is sufficient guidance on identification and mitigation of the SPVs (Single Point Vulnerably) in PROG-XX.01 and its sub tier procedures.

Questions: • What is meant by sufficient guidance? • How will you know when the action is complete? • If you are asked to inspect that the action is done what will you do to evaluate that the action is complete? 1 2 .7 M e a sura ble Corre c t ive Ac t ions The Corrective Action should be written so that action owner and reviewers clearly understand the action’s intent and what the end product looks like. Remember that the action description needs to provide sufficient detail so that the end state is known. In other words, the action description provides a blueprint of what needs to be done. The blueprint has sufficient detail to minimize the need to interpret the action’s intent. Things you can measure: 1. Revise a procedure – you can go look at the revised procedure. 2. Provide training to all current shift operators – you can inspect the training records and compare them to the current list of shift operators. Things you can’t measure: 1. Reinforce procedure compliance for operators. 2. Revise a procedure to include more detail. How do you measure more detail? 3. Ensure lessons learned are incorporated. 4. Evaluate the adequacy of a procedure.

Examples Action that is measurable: Post operator aids on each System 003 125VDC/120VAC breaker panel stating: “Caution: Breakers in this panel can change position with the slightest of touch. Exercise caution to not touch a breaker when placing, removing or reading tags.” (Note: this action is measurable. You can go inspect each of the panels to verify that the signs have been installed.)

Action that is measurable: Revise SM-XXXX to identify that the worker entering the confined space has the responsibility to ensure that the EC is aware of each entry. (Note, this would be better if a proposed markup of the procedure was included.)

Action that is not measurable: Perform a review of the new exciter system to ensure maintenance and testing programs verify proper performance.

Question: How do you measure “maintenance and testing programs verify proper performance?” Action that is not measurable: Cable condition assessments for neutronic cabling into/out of the SDS rooms, and for any Hi Voltage cabling in close proximity to the instrument rooms will be carried out.

Questions: • How are the condition assessments to be conducted? • What is the specific scope of this action? • What actions are required if degraded cables are discovered? • How do you know when you are finished? Action that is not measurable: Ensure that there is sufficient guidance on identification and mitigation of the SPVs in PROG-XX.XX and its sub tier procedures.

Question: How do you measure sufficient guidance? 1 2 .8 Assigne d a nd Ac c e pt e d Corre c t ive Ac t ions It is important to make sure someone is assigned to complete the proposed corrective action. Ensure the right person, at the right level, in the right organization has agreed to accept responsibility for the proposed action. The best way to accomplish this is through face-to-face communication. One of the best methods to obtain buy in on a Corrective Action is to involve the prospective action owner in the development of the proposed action.

For example, a corrective action to implement a code of conduct program for maintenance cannot be assigned to a first line maintenance supervisor. An action such as this needs to be assigned and accepted at the senior management level. Assigning a training action needs to be assigned and accepted by the appropriate discipline in the training department. A corrective action that involves unbudgeted resources or funding, such as implementing a design change, needs to be worded such that an alternative solution can be opened if the resources are not approved. For example, if the action is to install a new design pump to correct a reliability problem, the action should include an option to upgrade the compensatory measures if the new pump is not approved. 1 2 .9 Re a list ic Corre c t ive Ac t ions In addition to ensuring the right person, at the right level, in the right organization has agreed to accept responsibility for the proposed action the action also need to be realistic. Does the benefit outweigh the cost for the proposed actions? Proposing a multimillion dollar fix for a $25 component that failed for the first time in 30 years is probably not a realistic action. On the other hand, if the failure of the $25 component caused a near miss for a catastrophic event then the million-dollar fix is probably realistic. The evaluator also needs to take into consideration the potential consequence of a repeat event and amount of risk involved when proposing corrective actions. Important: Anyone asked to develop these types of Corrective Actions should be trained in performing Cost-Benefits Analyses. For information sources, see the list at the end of this chapter.

Is the proposed action within the span of control of the organization? Proposing an action to prevent lightning from hitting the station is not realistic. However, proposing actions to minimize the consequences of a lightning strike are. In our dump truck accident teaching example, proposing an action to address the parent’s lack of supervision of the nine-year-old boy would be unrealistic. The construction company has no authority to make the parents go to parenting school. A problem is identified that the construction procedures are inadequate to ensure quality construction. Stopping construction to upgrade all construction procedures is not realistic. Instead, a process can be implemented to upgrade the procedures before their next use.

Examples: Actions that are realistic:

• •

Assign a Work Control individual responsibility to ensure a supply of locking devices are available. Initiate containment sump debris interceptor gate modification to install locks on the gates. The intent is to lock the gates prior to entering mode 4 and having the keys in the Control Room with SM authority required to unlock the gates.

Note: While action 2 above is realistic, it is not SMART. (The action was closed when the modification request was initiated. It was subsequently rejected as not being cost effective. There was no tie from the modification request to the corrective action. What the action should have said: “implement a modification to lock the containment sump debris interceptor gates and revise the key control procedure to include SM authority required to unlock the gates.”) Action that is not realistic:

•

Develop a focused ALARA Dynamic Learning Activity with these requirements: o o o o

Near term session to be held for site population. Requirement to be completed prior to the next scheduled Refuel Outage. The impact of improper field practices shall be demonstrated. Include the supervisor expectation that each worker is responsible for their own dose and for ensuring that their work practices are ALARA.

Questions: • Who is the target audience? • Does this mean all personnel at the site or just personnel who are radiation worker qualified? Not a realistic expectation: As the action is written, all site personnel would have to attend the training even though they are not radiation worker qualified. Action that is not realistic: Discuss standards, expectations and maintenance fundamentals as related to this event and associated with the need to properly perform duties and to understand what it means when you sign that something is performed and second checked.

Questions: • Who is the target audience? • How does discussing standards and expectations correct problematic behaviors? • Does this action to apply to personnel who do not perform maintenance? 1 2 .1 0 T im e ly Corre c t ive Ac t ions Does your proposed implementation schedule minimize the time at risk? Do you have sufficient interim measures in place to prevent or minimize the consequences of a repeat event until the permanent Corrective Actions are implemented?

Interim actions should have sufficient control measure to ensure that they are not undone before the permanent action is implemented and should have sufficient monitoring to assure effectiveness. The criteria for disbanding the interim measure should be clearly defined. Example: The action is “Train Maintenance staff on the Conduct of Maintenance Manual. This is being completed through continuing training.”

Note: It will take time to put all maintenance personnel through the training cycle. An interim measure might be to increase supervisory oversight of maintenance activities screened as medium and high nuclear and industrial safety risk until the training is complete. Monitoring will be done via the management observation program. Example: The action is “Replace turbine driven feed pump governor control valve stem with a stem manufactured with improved materials that will be less susceptible to the corrosions mechanisms that caused valve binding.”

The compensatory measure that is required until the valve stem is replaced is: Increase turbine driven feed pump testing frequency. The turbine driven feed pump is presently being tested every week and management has requested review and approval of any proposed changes to that testing frequency. This compensatory action is recommended to remain in place until the valve stem is replaced. The following actions are being performed in conjunction with this increased test frequency: 1. Gather video footage of governor and linkage response during turbine driven feed pump testing. 2. Gather data to plot pump speed vs. time during turbine driven feed pump testing. 3. Gather UDS or Viper (string potentiometer or similar) data on the control system during turbine driven feed pump testing. 4. After tests, review video to confirm acceptable speed control. 5. After tests, review pump speed vs. time data to verify acceptable speed control. 6. After tests, review UDS or Viper (string potentiometer or similar) data.

Note: Monitoring of data will help to confirm that valve does not suffer from any binding or catching due to corrosion caused by leaking steam admission valves. Important: Without an effective Action Tracking software application to help manage your organization’s corrective actions, it is unlikely that any will be completed effectively. Action Tracking systems are not addressed in this book. All the companies I’m familiar with have software for tracking corrective actions, but these software programs all had weaknesses, so I have no good examples to recommend. I’ve designed and even built some action tracking software applications in my career, and I once had an opportunity to design a very comprehensive one, but none is currently in use.

1 2 .1 1 Com pre he nsive Effe c t ive ne ss Re vie w Pla n The Effectiveness Review Plan should be provided with an outline on the intent of the Corrective Action plan and what the desired end state – success − should look like. Things to consider and include are: • Scope of the effectiveness review to be completed (e.g. what is included, and expected timeframes for the Corrective Actions). • Critical aspects / criteria within the scope address all causes (e.g. what will success look like, and method(s) to measure success). • The need for an interim effectiveness review. The plan should also consider and specify when possible, the earliest opportunity that should exist for when a determination of effectiveness can be made. In these cases, the due date for the Effectiveness Review should correspond to when that opportunity exists. 1 2 .1 1 .1 Assigning Effe c t ive ne ss Re vie w s Determine which Corrective Actions need to be reviewed for effectiveness after they are completed. Specifically describe which activities, processes, behaviors, etc., need to be analyzed and provide specific actions, owners, and due dates. See Figure 12-3 for examples.

Effectiveness Review (Hypothetical) Following completion of all Corrective Actions, perform an effectiveness review per PROCXXXX. State the requirements for an acceptable effectiveness review determination including recommended method(s) to determine effectiveness. Corrective Action Description & Effectiveness Review AR#

Purpose of Effectiveness Review/Success Criteria

Reduce backlog of corrective maintenance to improve equipment availability

Determine reduction in equipment unavailability. Use the WANO Equipment PIs for unplanned unavailability as well as the overall unavailability to judge acceptability.

Review planned unavailability (outage performance against scheduled outage duration taking into account emergent work) and unplanned unavailability (unavailability resulting from equipment failures that could have been corrected during equipment outages). Compare Equipment performance PIs against WANO performance goals. Schedule: 4/29/17 Owner (Planner)

>75% system engineer attendance at Owners Group meeting with documented trip / benchmarking reports.

Assess how many Owners Group meetings have been attended by system engineer.

AR#2008XXXXX

Participate in Owners’ Group meetings to overcome Gaps / Weaknesses in Engineering knowledge of assigned system AR#2008XXXXX

Evidence that applicable recommendations were implemented and currently being used to maintain the system

Method(s) to Determine Effectiveness, Schedule and Owner

Review trip report and interview engineering to determine which OG recommendations have been adopted for the system being maintained. Interview engineers to verify increased knowledge of assigned systems. Schedule: 5/5/17 Owner (Engineer)

Figure 12-3. Effectiveness Review (Hypothetical)

Corrective Action Description & Effectiveness review AR#

Purpose of Effectiveness Review/Success Criteria

Method(s) to Determine Effectiveness, Schedule and Owner

Revise SM-XXXX to identify that the worker entering the confined space has the responsibility to ensure that the EC is aware of each entry

The purpose of this action is to ensure the EC maintains positive control over all personnel entering his/her assigned confined space. The EC needs to know who is in the confined space, and their scope of work to ensure all safety precautions are being met.

Review the revised procedure to verify the new requirements are clear.

AR#2008XXXXX

Six months after the procedure change has been implemented: • Interview personnel assigned responsibility as an EC to verify their understanding of the new requirements. •

•

Interview personnel qualified to enter a confined space to verify their understanding of the new requirements. Conduct observations of 10 different confined space entry activities to verify that the procedure requirements are being properly met. The observations should be performed separately from the interviews.

Schedule: Owner Assign a Work Controls individual the responsibility to ensure a supply of locking devices are available. AR#2008XXXXX

The purpose of this action is to ensure that there is a ready supply of approved locking devices that can be quickly issued to replace devices that are found to be defective or missing.

5/5/17 (Someone Independent)

Six months after the After the action has been closed: • Verify that the Work Controls procedures have been revised to include an assignment for maintaining locking devices. •

• •

Verify that a Work Control person has an assigned responsibility for locking devices. Interview Operators to test their understanding on where to obtain locking devices. Challenge the system by requesting several different styles of locking devices.

Schedule: Owner

Corrective Action Description & Effectiveness review AR# Replace turbine driven feed pump governor control valve stem with a stem manufactured with improved materials that will be less susceptible to the corrosions mechanisms that caused valve binding.

Purpose of Effectiveness Review/Success Criteria The purpose of this ACTION is to correct a

turbine driven feed pump governor control valve stem design deficiency that resulted in binding of the valve stem resulting in turbine over speed trips and, on occasion inability, to achieve design Aux. Feed flow.

5/5/17 (Someone Independent)

Method(s) to Determine Effectiveness, Schedule and Owner

Until the new valve stem is inspected, continue reviewing the CDC traces for evidence of valve binding after each quarterly pump run. Five years after installation of the new valve stem verify that there have been no events, over speed trips or failures to achieve design flow caused by a binding

governor control valve stem. Inspect the new valve stem for signs of corrosion at the refueling outage closest to the five-year anniversary to of operation with the new valve stem. Schedule: Owner

5/5/17 (Engineer)

Sum m a ry This chapter provided details for the following Corrective Action related elements:  Corrective actions basic elements.  Effective corrective action structure elements.  Ten criteria for effective corrective actions.  SMART corrective actions.  Specific corrective actions.  Relative effectiveness of types of corrective actions.  Measurable corrective actions.  Assigned and accepted corrective actions.  Realistic corrective actions.  Timely corrective actions.  Comprehensive effectiveness review plan.

Look ing Forw a rd Chapter 13 provides details for documenting and reporting the result of rigorous (root) and structured (apparent) cause investigation efforts.

Re c omm e nde d Re a ding For more about change management plans, see: Forck, F. (2016). “Plan corrective actions.” Cause analysis manual: Incident investigation method & techniques. Brookfield, CT: Rothstein Publishing, p. 213. For more about cost-benefit analysis, see: Forck, F. (2016). “Plan corrective actions.” Cause analysis manual: Incident investigation method & techniques. Brookfield, CT: Rothstein Publishing, p. 218. Karam, R.A. & Morgan, K.Z., editors. (1976). Energy and the environment cost-benefit analysis. (Conference proceedings, June 23–27, 1975, Sponsored by the School of Nuclear Engineering, Georgia Institute of Technology, Atlanta, GA.). Amsterdam, Netherlands: Pergamon. Layard, R. & Glaister, S. (1994). Cost-benefit analysis, (2nd ed.). Cambridge, UK: Cambridge University Press.

Cha pt e r 1 3 Doc um e nt at ion a nd Re port ing One of the most important steps in any event investigation is to document and “sell” your results. Even the most thorough investigation that identifies the correct causes and recommends the most effective corrective and preventive actions will not have the desired results if no one can understand the investigation report. Your event investigation report must be a call for action. Your intent is to show why a problem needs fixing and recommend a fix. This chapter presents several useful tools for presenting investigation results. 1 3 .1 Point s t o Re m e m be r w he n Re port ing I nve st iga t ion Re sult s Who is the audience? • Will the report be read by senior management? • If the report is technical in nature, will the audience understand the technical details? • Will the report potentially give offense to some person or some organization? • Is the report easy to read? o Remember, a 200-page report with no executive summary will not be read by senior management. o Do not use jargon that the audience may not understand. o Senior management will expect the organization’s report standards. Is your report clearly written and organized? • A well laid out report should lead the reader from the problem statement to the event description, then to the cause analysis and finally to the causes and corrective actions. • Use concrete language, not vague or imprecise language.

•

Use a consistent verb tense throughout the report. If you write in the past tense, use the past tense throughout the report. • Spell out abbreviations at their first use. Also, minimize the use of abbreviations, especially if the audience may not be familiar with the topic.

Does it have an executive summary? If the report will be read by senior management (most are), it needs an executive summary. The executive summary should contain the following: • Why is this report important? • What are the principal causes for the event? • What are the principal corrective and preventive actions? • Were there any positive findings from the investigation? Eight questions to answer in a well-written report. In a well written report, the reader will be able to: see the clear connection between the problem statement and the rest of the report, identify the cause and corrective actions, and pick out the answers to the following eight insightful questions (Corcoran, 2003): 1. What were the setup factors for the event? 2. What triggered the event? 3. What made the event as bad as it was? 4. What kept the event from being any worse? 5. What are the consequences of the event (generic implications)? 6. What is the significance of the event (safety implications)? 7. What are the learnings from the event (causes)? 8. What are the corrective and/or preventing actions for the event? 1 3 .2 Use of Gra phic s a nd Pic t ure s in Re port s Use of graphics and pictures in a causal analysis report helps make the report come alive and can greatly improve its effectiveness. An instrument technician was performing routine preventive maintenance on a pneumatic pressure indicating controller. Part of the maintenance task involved removing the mechanical linkage between the sensing element and the process instrument. To remove the linkage, the technician was required to spread the connecting clevis by rotating the plastic spreader so the clevis can be removed from the attachment pins. There is a similar connection at each end of the linkage. There are no detailed steps in the work order or technical manual for this portion of the maintenance activity. After maintenance was finished, the technician placed the linkage on the correct pin at each end of the linkage. He closed the spreader at the sensing element end but failed to close the spreader at the process instrument end of the linkage. The process instrument end of the linkage is partially hidden behind an adjustment knob.

Pictures in a report can help explain a situation that would be difficult in a paragraph setting. For example, it is hard to visualize how the instrument technician performing maintenance on the pneumatic controller could have made a mistake. Pictures help make the explanation clearer. Let us review the description again, but this time with pictures (see Figure 13-1).

Figure 13-1. Pressure Indicating Controller Pictures. (Photo from the Root Cause Investigation of the March 2006 automatic scram at Nine Mile Point Unit 2 performed by Steve Davis and Chester Rowe).

Use of graphics and tables will also improve the report’s presentation and make it easier to show the analysis and chain of logic. An effective way to include graphics and tables is to include them as attachments. For example, including graphics and tables such as the Human Performance Evaluation Presentation, Comparative Event Line, EC&F Chart, Factor Tree, or other similar presentation tool as a report attachment make it easier to document the analysis. 1 3 .3 Re port Spe c ific s 1 3 .3 .1 Ex e c ut ive Sum m a ry Many cause analysis report formats require an executive summary. However, even if your organization’s report structure does not require one, we advise that you include one, especially if the report is lengthy. The executive summary should be limited to one or two pages at the most. The summary should contain a quick summary of the event, the consequences, significance of the event, event causes,

and the principal corrective and preventive actions. Short paragraphs and bullets will make your executive summary easier to read. Executive Summary for Main Turbine Trip Report Example On March 9, 2006, at 22:14 hours, Unit 2 automatically scrammed from approximately 86.6% core thermal power. The cause of the scram was a Main Turbine Trip due to low condenser vacuum due to failure of both the primary and backup turbine steam sealing systems. This event was important to the station in that multiple equipment failures challenged the operators and the degraded condition of the primary steam stealing system went undiagnosed for seven hours because: • • •

Conflicting indications were not reconciled. Ineffective risk assessment. Poor communication practices.

• •

Reliance on assumptions. Failure to engage technical expertise.

There was also an Operations crew performance issue, in that following the reactor scram the crew did not recognize that entry conditions for entering Emergency Operating Procedures were met. Mitigating the event was effective operator action to start the condenser vacuum pumps and prevent a total loss of vacuum and subsequent closure of the Main Steam Isolation valves that would have resulted in a loss of reactor normal heat removal function. Principal Preventive Actions include: 1. Improve the maintenance procedures for overhauling / calibrating pneumatic controllers by including checks to ensure linkages and other critical components are properly reassembled following maintenance. 2. Correct the Emergency Steam Supply system design deficiency. 3. Conduct Simulator Training scenarios involving appropriate personnel from Engineering Department, Operations Department, and Maintenance Department. This training will include communication exercises in which various role players will communicate critical information about a plant issue, in order to evaluate the effectiveness of “problemstatements,” conservative decision-making, and troubleshooting diagnostic. 4. Conduct Simulator Training scenarios that include Control Room and in-plant “indication / condition” distracters requiring application of problem-solving skills. 5. Conduct training on problem solving for current licensed and non-licensed Operators. Incorporate troubleshooting training into initial Operator training programs to ensure that all Operators receive troubleshooting training prior to shift assignment.

1 3 .3 .2 Proble m St a t e m e nt Both the investigator and the reader need to understand what problem needs to be resolved. The problem statement needs to identify clearly what “object” has the problem and what is its “defect.” Keep the problem statement as short and concise as possible.

The condition report description was: During a control room panel walk down discovered relay chattering on C72-K13B and C72K77B. Simultaneously noted IRM’s F, H, and B downscale and counts lowering on nuclear instruments SRM B and D. With battery chargers, BWS-CHGR3B1 and 3D1, out of service for maintenance, suspected degraded voltage on the batteries 3B and 3D. Measured voltage on battery 3B at 17 volts DC and 3D at 21 volts DC. Subsequently, after discussions with Electrical Maintenance, Operations opened the main feeder breaker from the batteries per procedure N2-OP-73B to the battery loads. Opening the feeder breaker caused the power panel to de-energize and prevent damage to the nuclear instruments.

From the above description, it is difficult to assess what the problem is. Is there something wrong with the batteries? Is the maintenance being performed improperly? Are the loads on the batteries too large? Actually, the problem is Operations had not expected the battery voltage to go below the minimum allowed during the maintenance activity. The problem statement was: The B side nuclear instruments (SRMs and IRMs) were rendered unavailable when terminal voltage on Batteries 3B & 3D unexpectedly fell below the minimum allowed voltage of 21 Volts DC during the performance of scheduled battery charger maintenance.

The object in this problem statement is battery voltage. The defect is unexpectedly fell below minimum allowed. 1 3 .3 .3 I nve st iga t ion Sc ope The investigation scope describes the boundaries of an investigation; it is a statement of the “contract” with the investigation sponsor. If the investigation boundaries are not set, you may encounter difficulties selling the results to reviewers and approvers. They will often say the investigation was not broad enough or did not go deep enough. In the battery case described above, the investigation scope was to evaluate the following: • Station conservative decision making / questioning attitude. • Work planning and scheduling. • Risk management. • Knowledge of design basis. • Use of human performance error prevention tools. • Training of Operations, Maintenance, Engineering personnel. • Roles and responsibilities. • Use of post work/activity critiques. In this example, the investigation team was asked to go well beyond determining why Operations was surprised battery fell below the minimum allowed. • For human performance problems, the investigation scope may be limited to a work group or sub work group. • For equipment related problems, the investigation scope may be limited to a specific equipment type (e.g., make or model), or a particular application.

1 3 .3 .4 Eve nt Disc ussion The event discussion should provide the reader with an understanding of what happened. For example, the battery event description included two paragraphs describing the discovery of the problem and activities that terminated the event. The event description also included several paragraphs describing the activities leading up to the event. Any supporting information needed to help the reader understand the event description should be included as an attachment to the report. For example, if the startup procedure for a pump needs to be explained, the procedure steps should be included as an attachment. In the battery case example, a description of the battery system was included as a report attachment. 1 3 .3 .5 Com pe nsa t ory a nd I m m e dia t e Ac t ions This section of the report is dedicated to describing actions taken to terminate the event or stabilize the condition, and describe any actions taken to minimize the potential for a recurrence of the problem until corrective actions are complete. In the case of the battery, the immediate action taken to stabilize the condition was to open the breakers and remove the loads from the batteries. As with any business, downtime is lost revenue. Operation often needs to be restored before all actions needed to prevent recurrence of the problem are complete. Therefore, interim or compensatory actions must be implemented. Compensatory and interim actions include actions such as: • An extra verification task until an automated system can be installed. • Requiring a fire watch to tour the building hourly until the automatic fire alarm system can be repaired. • Use of contractor services until staff personnel are trained. 1 3 .3 .6 Ex t e nt of Condit ion/Ca use Refer to Chapter 11 for details on documenting Extent of Condition and Cause assessments.

1 3 .3 .7 Eve nt Signific a nc e As an investigator, always ask why this event is important to the organization. Any event investigation requires dedication of resources (people and materials). It also disrupts work and production. This section of the report describes the significance of the event in terms of both actual and potential consequences. Describe the actual consequences in terms of: • Extent of damage to equipment or the facility. • Extent of injury to personnel. • Extent of economic impact to the organization.

o Lost revenue. o Cost of repair. o Cost of investigation. • Extent of regulatory impact. • Extent of public impact. Describe the potential consequences by considering whether this event is a precursor to a much more serious event. Assess the event in terms of the number of barriers that were penetrated, and the number of barriers that remained. Remember the only difference between a nonconsequential event and significant consequential event might be luck. Luck is not a very robust barrier! In the previous battery event, no equipment was damaged. Therefore, in terms of actual consequences the event was minor. However, in terms of potential consequences the event was significant because all the pre-established work control barriers to prevent battery voltage from unexpectedly dipping below the minimum allowed were violated or broken. 1 3 .3 .8 Re vie w of Ope ra t ing Ex pe rie nc e There are rarely any new problems or errors. By performing an operating experience review, you may find it possible to learn from previous, similar events. Therefore, conducting an operating experience review can help determine cause and extent of condition, as well as provide ideas for corrective and preventive actions. When reviewing operating experience, review events that appeared to have similar problems, and events that appeared to have similar causes. We have reviewed reports that bound the operating experience review to the exact problem and cause. They obviously did not learn anything from their review. We have also seen reports that only contained a list of previous similar events. We could not tell if they used the information to confirm the causes, or to help develop corrective actions. The first place to look for operating experience is in-house. Has the organization experienced a similar problem before? Next, talk to other facilities, organizations, vendors, manufactures, and trade associations. Also, use the Internet. It is amazing how much information can be found in a short time. Operating experience can be a valuable tool. Fish Scale Example For example, several years ago there was a problem with an emergency diesel generator. The engine would start to “hunt” at about 25% load. In spite of exhaustive troubleshooting, the team could not locate the problem. They called the engine manufacturer and, as could be expected, they blamed the governor manufacturer. During a conference call with the governor manufacture, one of the governor design engineers asked, “Did you perform a fish scale test?” He went on to explain that the fuel racks, the device that controls the amount of fuel to each cylinder, can stick and cause the governor to hunt. He said that by using a small fish scale while stroking the fuel rack, it is possible to detect sticky spots. Maintenance personnel performed the test and found several sticky spots. A subsequent investigation found dried grease in the fuel rack bearings. The problem was resolved with proper

lubrication. To prevent recurrence of the problem, the fish scale test was added to the maintenance procedure, along with better instructions on lubricating the fuel racks.

1 3 .3 .9 Pre se nt ing Re sult s Now that the investigation is complete, and results have been documented, your final report must be approved by management. For significant events, organizations often require the investigation findings and recommendations to be presented to a board or committee for approval. Here are some tips to ensure success when making a presentation to the board.

1 3 .3 .9 .1 Be fore t he Pre se nt a t ion • • • • •

Understand what type of presentation you will be expected to make. Make sure board members receive copies of the report in advance of the meeting so they can review it ahead of time. Make sure the investigation team and sponsor agree on the investigation findings. Do not present if all issues are not resolved. Invite investigation team members to attend and participate in the board presentation. They can often answer questions and resolve issues on the spot. Make sure corrective action owners have agreed to their assigned actions.

1 3 .3 .9 .2 Pre se nt a t ion Sc e na rios The two main presentation scenarios require different preparation approaches. The first approach requires development of a presentation separate from the report. The trick here is to keep the presentation at a summary level. 1. The lead investigator or event investigation sponsor is required to make a formal presentation to the board summarizing the results of the investigation and corrective actions. After the presentation, the board votes to accept or reject the investigation. • Do not bore the audience with too many details. • Use of a PowerPoint presentation is recommended. o Each slide should include only one or two points. o Use pictures and graphics to help explain major points. • Use team members to present portions of the presentation. 2. The board reviews the report in advance, asks questions, and then votes to accept or reject the investigation. 1 3 .3 .1 0 U se of Cha rt s a nd Gra phic s As the adage says, “sometimes a picture is worth a thousand words.” A one-page summary that shows the chain of logic from the problem to the most fundamental cause(s) can be a powerful tool to gain management acceptance and buy-in of the investigation results.

1 3 .3 .1 1 Com m ona lit ie s M a t rix The commonalities matrix in Chapter 10 is an effective way to present the results of the investigation. If the matrix that was developed as part of the investigation has more than one page, the investigator should develop a one-page summary matrix for use at the management presentation. 1 3 .3 .1 2 Hum a n Pe rform a nc e Pré c is A “Cone of Cause” or “Précis” is a tool that provides an easy visualization of how the investigation team arrived at the fundamental or root causes. Figure 13-2, derived from the dump truck case study Commonalities Matrix in Chapter 10, shows how the results of the investigation can be displayed to the management team. In an actual case where a “Cone of Cause” or “Précis” graphic Figure 13-2 was used, the station manager said that the graphic was all he needed to understand the event causes. The presentation and vote to approve the report took less than fifteen minutes.

Figure 13-2. Dump Truck Human Performance Précis

Sum m a ry This chapter provided details for the reporting the results of a cause investigation:  Points to remember when reporting investigation results.  Use of graphics and pictures in reports.  Report specifics, including • Executive summary. • Problem statement. • Investigation scope. • Event discussion. • Compensatory and immediate actions. • Extent of condition / cause. • Review of operating experience. • Presenting results. • Use of charts and graphics. • Commonalities matrix. • Human performance précis.

Look ing Forw a rd Your licensed, downloadable Interactive Cause Analysis Tool is available with purchase of this book. (See the Registration and Download Instructions provided). This Interactive Cause Analysis Tool is an Excel workbook that contains the following elements: 1. An EVENT/ISSUE SUMMARY and PROBLEM STATEMENT template combined with an EVENT/ISSUE TIME LINE worksheet. (1 TAB) [Refer to section 8.2 of Chapter 8 for completion instructions.] 2. Five HUMAN PERFORMANCE EVALUATION/SUMMARY WORKSHEETS. (5 TABS) {Refer to Chapters 3 and 5 for completion instructions} 3. A FAILURE OBSERVATION / DEFECT/FAILURE EVALUATION WORKSHEET. (1 TAB) [Refer to Chapters 4 and 5 for completion instructions.] 4. A COMMONALITIES MATRIX. (1 TAB) [Refer to Chapter 10 for completion instructions.] 5. A CAUSE ROAD MAP GRAPHIC upon which cause & effect logic lines are automatically drawn and is great for presentation. (1 TAB) [Similar to Figure 5-8 but with straight rather than curved lines and no text boxes.] 6. CAUSE ROAD MAPS 2 through 6 supported by multiple popups. (5 TABS) [Refer to Chapters 2 and 5 for completion instructions.]

7. CAUSE ROAD MAPS A & B supported by multiple dropdown lists. (2 TABS) [Refer to Chapters 3 and 5 for completion instructions.] 8. A page for development, structuring and sequencing, but not tracking, of ACTIONS and supporting TASKS. (1 TAB) [Refer to Chapter 12 for completion instructions.] 9. An investigation TOOL SELECTION page. (1 TAB) Worksheets available from this page include: •

•

•

•

• • • •

•

Extent of Cause [Refer to Chapter 11 for completion instructions.] Extent of Condition [Refer to Chapter 11 for completion instructions.] Barrier Analysis [Refer to section 8.4 from Chapter 8 for completion instructions.] Change Analysis [Refer to section 8.5 from Chapter 8 for completion instructions.] Common Cause Analysis [Refer to section 8.7 from Chapter 8 for completion instructions.] Comparative Event Line [Refer to section 8.2 from Chapter 8 for completion instructions.] Failure Modes and Effects [Refer to section 8.3 from Chapter 8 for completion instructions.] Statements Template [Refer to Chapter 7 for guidance for Effective Interviewing.] Task Analysis [Refer to section 8.6 from Chapter 8 for completion instructions.]

Note: Each of the above worksheets contains downloadable MS Word templates. •

Safety Culture Assessment Selection for this option opens a GENERALIZED SAFETY CULTURE ASSESSMENT TOOL which in turn allows selection of one or more of the following 10 pages of Safety Culture related questions.

1. Personal Accountability

2. Questioning Attitude

4. Leadership Safety Values and Actions 7. Continuous Learning

5. Decision-Making 8. Problem Identification and Resolution 10. Work Processes

3. Effective Safety Communication 6. Respectful Work Environment 9. Environment for Raising Concerns

Re fe re nc e s Corcoran, W.R. (2003). The phoenix handbook. Windsor, CT: Nuclear Safety Review Concepts Corporation.

Cha pt e r 1 4 Fina l T hought s A detail comparative analysis of many current Root Cause investigation methods, tools, and techniques was done in 2011 by the Joint Research Centre – Institute for Energy (ISBN 978-9179-19712-3). This analysis covered the following event investigation tools: • Kepner-Tregoe Problem Solving and Decision • Event and causal factor charting and analysis. Making. • Cause and effect analysis. • Causal factor tree analysis. • Interviewing.

• Event tree analysis.

• Change analysis.

• Current Reality Tree (CRT).

• Task analysis.

• Interrelationship diagram.

• Barrier analysis.

• Human Factors Investigation Tool (HFIT).

• Fault tree analysis.

This document also examined the following commercial all-purpose tools: REASON®

PROACT®

RealityCharting®

TapRooT®

This analysis concluded: 1. “Most of the available recommendations concerning selection and usage of different event investigation methods, tools and techniques are not exhaustive and user-friendly.” 2. “Most of the available recommendations are of little practical value, leaving current and future event investigators (especially newcomers, who are not yet proficient) to make decisions about the selection of event investigation methods and tools, based on their own knowledge or providers’ promotional materials.”

This Joint Research Centre effort did not include a review of the Cause Road Map or the unique method of classifying and integrating tools described in this book. If it had, they may have recognized that classifying tools and techniques as either Data Gathering, Data Analysis or Event Modeling (or combinations thereof) and how this insight, coupled with the structural elements of the Cause Road Map and our unique “Comparative Event/Time Line” tool, provide a userfriendly framework that makes it easier for even a novice to make tool use decisions for any cause investigation. This JRC-IE report also noted that over 56% for all events can be traced to a Latent Organizational Weakness. Unlike most of the previously listed “commercial” tools, the Cause Road Map not only makes it easy to identify the latent organizational weakness behind the majority of significant event, but it also provides insights into why these weaknesses existed. Even newcomers can use this book to make good decisions about the best event investigation methods and tools to use. This book also provides user-friendly guidance and the tools needed to rapidly and accurately investigate the causes for all categories of events. Effective corrective actions derived from causal investigations are key to any organization's performance improvement effort. Implementing the right corrective actions is also an important element in reducing operating costs. However, corrective actions cannot fully resolve problems if they do not address and fix the underlying causes. This book helps the investigator ensure causes related to the event are identified so appropriate corrective actions can be taken. We understand there are multiple tools available for investigating human performance and equipment problems. This book should make your investigation tasks easier because it covers both human performance and hardware failure causes/modes. Constructed as a taxonomy, the graphical structure separates the various cause factors into groups with common characteristics and sub-divides these groups into a hierarchic order based on their relative significance or importance. During more than five years of validation, event investigators found the Cause Road Map to be intuitive and easy to use. They used the Cause Road Map to: • Develop a data gathering checklist. • Develop corrective actions. • Validate chain of cause logic. • Ask the next “Why”. • Develop interview questions. • Present investigation results. I’ve met only a few people in my career who actually enjoyed doing a cause investigation. Most liken the experience to having a “Root Canal without Novocaine.” Reading this book is unlikely to change your attitude about conducting an investigation, but use of my techniques will definitely take some of the “pain” out of your efforts. Unfortunately, if your organization is like most, once you have done one cause investigation, you then become the “go-to guy or gal” for the next one. Therefore, I encourage you get some practice in a training environment.

The mission of my company, The Excellence Engine LLC, is to make cause investigations as easy to do and as simple as possible, while still providing an unmatched level of insight into the fundamental causal factors. In this regard, we provide the following services: • Four-and-one-half-day Basic Cause Evaluation Course for both novice and experienced investigators who want some practice with realistic case studies. • Two-day Basic Cause Evaluation Course, which is a condensed version of the Four-and-onehalf-day course with fewer practical exercises, intended for investigators who perform less rigorous investigations. • Three-hour Cause Analysis Overview Course for Managers. • Short-term mentoring for event investigation efforts. • Perform quality reviews of cause analysis reports. • Support Performance Improvement Initiatives including “Safety Culture” assessments. • Assist with Corrective Action/Trend Analysis Program development or upgrades. (Excellence Engine personnel assisted in the development of the first full featured fully integrate CAP program and software application designed specifically to satisfy the unique needs of a new Nuclear Construction project.) • Provide Project Management and training for Performance Improvement-related projects. If you are interested in any of these services, you can contact me at [email protected] or by phone at +1 (865) 238-1466. Please note that this phone number does not ring; so you will need to leave me a message.

About t he Aut hor Chester D. Rowe’s self-reliance and problem-solving abilities emerged early when, at age 14, he was given some parts for a ham radio receiver. To complete this project, he taught himself calculus and electronics. By the time his father died, when he was 16, Chet had designed the circuits for and then built this receiver using parts from discarded TVs. His father’s death pushed Chet into early adulthood and responsibilities to help his raise his siblings. After overcoming 16 years of financial and other obstacles, Chet earned a BS in Nuclear Engineering and an AS in Physics/General Studies. In addition, Chet has completed training equivalent to a degree in Electrical Engineering. Chet recently retired after over 40 years in the commercial nuclear industry. Chet is the creator of the Cause Road Map© Taxonomy and trained in Kepner-Tregoe, TapRoot®, PII, MORT, and other cause investigation and problem-solving techniques. As a result, he has been involved in more cause investigations than he can remember. Chet has an extensive background in: • Quality Assurance (QA). • Corrective Action Programs (CAP). • Engineering at five Nuclear Power plants. Chet has been a: • QA Engineer. • Lead Auditor. • Level III Inspector. • QA and a QC Supervisor. • Certified Quality Engineer.

• • • • •

Corrective Action Program Manager. Corrective Action Program Coordinator/Engineer. Equipment Performance Monitoring/Plant Engineer. Management Consultant and Root Cause Training Instructor. Co-founder and Vice Chairman of the Central PA Section of the American Nuclear Society (ANS). Prior to his commercial nuclear experience, Chet served in the US Navy aboard three submarines, completing qualification as an Engineering Watch Supervisor (EWS) – equivalent to a Reactor Operator at a commercial nuclear plant. He achieved the rank of First Class Petty Officer (EM1-SS) in under four years, received several Navy commendations, and was recommended for a billet aboard the NR-1 deep submergence research submarine and a limited duty officer (LDO) commission (both of which he declined in favor of returning to civilian life and college).

Cre dit s Kristen Noakes-Fry, ABCI, is Executive Editor at Rothstein Publishing, since 2011. Previously, she was a Research Director, Information Security and Risk Group, for Gartner, Inc.; Associate Editor at Datapro (McGraw-Hill), where she was responsible for Datapro Reports on Information Security; and Associate Professor of English at Atlantic Cape College in New Jersey. She holds an M.A. from New York University and a B.A. from Russell Sage College, and resides in St. Petersburg, Florida. Cover Design & Graphics: eBook Design & Processing: Lead Copy Editor:

Sheila Kwiatek, Flower Grafix Donna Luther, Metadata Prime Nancy M. Warner

Philip Jan Rothstein, FBCI, is President of Rothstein Associates Inc., a management consultancy he founded in 1984 as a pioneer in the disciplines of Business Continuity and Disaster Recovery. He is also the Executive Publisher of Rothstein Publishing. Glyn Davies is Chief Marketing Officer of Rothstein Associates Inc. He has held this position since 2013. Glyn has previously held executive level positions in Sales, Marketing and Editorial at several multinational publishing companies and currently resides in San Rafael, California. Rothstein Publishing is your premier source of books and learning materials about Business Resilience, including Crisis Management, Business Continuity, Disaster Recovery, Emergency Management, Security, and Risk Management. Our industry-leading authors provide current, actionable knowledge, solutions, and tools you can put in practice immediately. Rothstein Publishing remains true to the decades-long commitment of Rothstein Associates, which is to prepare you and your organization to protect, preserve, and recover what is most important: your people, facilities, assets, and reputation.

H ow t o Ge t Your FREE DOWN LOAD of Bonus Re sourc e M a t e ria ls for T his Book You are entitled to a free download of the Interactive Cause Analysis Tool with your purchase of Simplifying Cause Analysis: A Structured Approach, by Chester Rowe. This licensed software is an MS Excel Workbook that provides: 1. An Event/Issue Summary and Problem Statement template combined with an Event/Issue Time Line worksheet from Chapter 8. 2. An automated Human Performance Evaluation Summary Report and the Hardware/Material/Design Failure Evaluation Summary Report from Chapters 3, 4 & 5. 3. Separate worksheets and editable MS Word versions of most of the tools presented in Chapter 7 and 8. 4. A tool to help develop, structure and sequence, but not track, corrective actions developed to guidance provided Chapter 12. 5. The functionality to automatically draw lines between identified causal factors on the Cause Road. This graphic can then be printed and used as a presentation like the Human Performance Précis graphic presented in this chapter. 6. A unique Generalized Culture Assessment tool. To access these materials just login to our website as an existing user or register as a new user, and then register your book by following these simple instructions: I T ’S EASY – LOGI N OR REGI ST ER ON OUR WEBSI T E FIRST, login as an existing user or register as a new user at www.rothstein.com/register. New users will receive an email link to conﬁrm. T HEN REGI ST ER Y OUR BOOK Logging in or registering takes you to our Product Registration page. You’ll see a list of books. Simply select your book by clicking the corresponding link to the left and just follow the instructions. You will need to provide proof of purchase. You will receive a conﬁrming email within a few hours with additional information and download instructions. Your registration will also conﬁrm your eligibility for future updates if applicable.

203.740.7400 or 1-888-ROTHSTEin fax 203.740.7401 4 Arapaho Road, Brookﬁeld, Connecticut 06804-3104 USA [email protected] www.rothstein.com