Functional Safety from Scratch
A Practical Guide to Process Industry Applications
Peter Clarke, xSeriCon
Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-443-15230-6

For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Candice Janco
Acquisitions Editor: Anita Koch
Editorial Project Manager: Dan Egan
Production Project Manager: Sruthi Satheesh
Cover Designer: Matthew Limbert

Typeset by TNQ Technologies
Contents About the author ............................................................................................ xvii Acknowledgements ........................................................................................... xix Abbreviations .................................................................................................. xxi Glossary .......................................................................................................xxvii Introduction ...................................................................................................xxix Chapter 1: Introduction to functional safety ......................................................... 1 1.1 What could possibly go wrong?...................................................................................1 1.2 Hazard and risk .............................................................................................................2 1.2.1 What is a hazard? ............................................................................................... 2 1.2.2 What is harm? ..................................................................................................... 3 1.2.3 What is risk? ....................................................................................................... 3 1.2.4 What is tolerable risk?........................................................................................ 5 1.2.5 Risk management through functional safety...................................................... 6 1.3 Functional safety standards: IEC 61508 and IEC 61511 ............................................7 1.3.1 Purpose of the standards..................................................................................... 7 1.3.2 Scope of IEC 61511 ........................................................................................... 8 1.3.3 Why comply with IEC 61511?........................................................................... 
9 1.4 IEC 61511 key concepts...............................................................................................9 1.4.1 The functional safety lifecycle ........................................................................... 9 1.4.2 Intrinsically safer design................................................................................... 12 1.4.3 The safety requirements specification (SRS)................................................... 13 1.4.4 Assuring that functional safety is achieved ..................................................... 13 1.4.5 Random and systematic failures....................................................................... 14 1.4.6 Competency....................................................................................................... 20 1.5 The structure of IEC 61511........................................................................................22 1.6 The origins of IEC 61511...........................................................................................22 Exercises ............................................................................................................................24 Answers..............................................................................................................................24 Question 1dAnswer ........................................................................................... 24 Question 2dAnswer ........................................................................................... 25 Question 4dAnswer ........................................................................................... 25 References..........................................................................................................................25 Further reading ..................................................................................................................26 v
Contents
Chapter 2: Basic terminology: SIF, SIS and SIL.................................................. 27 2.1 The meaning of SIF, SIS and SIL..............................................................................28 2.1.1 What is a SIF? .................................................................................................. 28 2.1.2 What is a SIS? .................................................................................................. 28 2.1.3 SIL, reliability, and integrity ............................................................................ 29 2.1.4 What is an interlock (or trip)?.......................................................................... 29 2.2 Anatomy of a SIF .......................................................................................................30 2.2.1 The sensor subsystem ....................................................................................... 31 2.2.2 The logic solver subsystem .............................................................................. 34 2.2.3 The final element subsystem ............................................................................ 34 2.2.4 Permissives and inhibit functions..................................................................... 39 2.2.5 Other important aspects of a SIF ..................................................................... 39 2.3 Development of a SIF.................................................................................................41 2.3.1 SIL assessment.................................................................................................. 41 2.3.2 SIL verification ................................................................................................. 
42 2.4 Failure..........................................................................................................................43 2.4.1 Failure modes .................................................................................................... 43 2.4.2 Failure rates....................................................................................................... 45 2.4.3 Hardware fault tolerance .................................................................................. 46 Exercises ............................................................................................................................48 Answers..............................................................................................................................49 Question 1dAnswer ........................................................................................... 49 Question 2dAnswer ........................................................................................... 49 Question 3dAnswer ........................................................................................... 50 Question 4dAnswer ........................................................................................... 50 Question 5dAnswer ........................................................................................... 51 Question 6dAnswer ........................................................................................... 51 Question 7dAnswer ........................................................................................... 51 Question 8dAnswer ........................................................................................... 51 References..........................................................................................................................52
Chapter 3: Risk evaluation ............................................................................... 53 3.1 Identifying hazardous scenarios .................................................................................53 3.2 Expressing risk in numbers ........................................................................................54 3.3 Tolerable risk...............................................................................................................55 Defining a tolerable risk per event ..................................................................... 56 Defining a total tolerable risk per risk receptor ................................................. 57 3.4 How much precision is needed?.................................................................................57 3.5 The ALARP concept...................................................................................................60 Exercises ............................................................................................................................61 Answers..............................................................................................................................61 Question 1dAnswer ........................................................................................... 61 Question 2dAnswer ........................................................................................... 61 Question 3dAnswer ........................................................................................... 61 References..........................................................................................................................62 vi
Contents
Chapter 4: Introduction to SIL assessment ......................................................... 63 4.1 Safety instrumented function (SIF) operating modes................................................63 4.1.1 What are low demand, high demand and continuous modes?........................ 63 4.1.2 Selecting an operating mode ............................................................................ 64 4.1.3 Formal definition of operating modes.............................................................. 65 4.1.4 The significance of operating modes ............................................................... 65 4.1.5 Tips on selecting the operating mode .............................................................. 67 4.2 The objectives of SIL assessment ..............................................................................68 4.2.1 Low demand mode SIFs................................................................................... 68 4.2.2 High demand and continuous mode SIFs ........................................................ 68 4.2.3 Why not use default SIL targets?..................................................................... 70 4.2.4 Prevention or mitigation? ................................................................................. 72 4.3 Identifying and documenting SIFs .............................................................................73 4.3.1 Objective............................................................................................................ 73 4.3.2 Using process control narratives, interlock descriptions ................................. 74 4.3.3 Using cause & effect diagrams (C&EDs)........................................................ 75 4.3.4 Using HAZOP and old SIL assessment study reports .................................... 77 4.3.5 Using binary logic diagrams............................................................................. 
79 4.3.6 Using interlock logic diagrams ........................................................................ 80 4.3.7 Using piping & instrumentation diagrams (P&IDs)........................................ 82 4.4 Separating complex interlocks into SIFs ...................................................................83 4.5 The double jeopardy rule............................................................................................84 4.6 Independent protection layers.....................................................................................85 4.6.1 Pressure relief devices (PRDs) ....................................................................... 86 4.6.2 Alarms with operator response....................................................................... 87 4.6.3 Control loops................................................................................................... 90 4.6.4 Autostart of standby equipment ..................................................................... 91 4.6.5 BPCS interlocks .............................................................................................. 91 4.6.6 Interlocks in other PLCs................................................................................. 92 4.6.7 Check valves ................................................................................................... 92 4.6.8 Other mechanical protective devices.............................................................. 93 4.6.9 Operating procedures ...................................................................................... 93 4.6.10 Spill containment ............................................................................................ 93 4.6.11 Trace heating................................................................................................... 94 4.6.12 Backup utility supplies ................................................................................... 
94 4.6.13 Another SIF..................................................................................................... 94 4.6.14 Typical IPL credit available ........................................................................... 95 4.6.15 Examples of insufficient independence.......................................................... 95 4.7 Critical common element analysis .............................................................................97 Exercises ..........................................................................................................................100 Answers............................................................................................................................101 Question 1dAnswer ......................................................................................... 101 Question 2dAnswer ......................................................................................... 102 Question 3dAnswer ......................................................................................... 102 vii
Contents Question 4dAnswer ......................................................................................... 102 Question 5dAnswer ......................................................................................... 102 Question 6dAnswer ......................................................................................... 102 Question 7dAnswer ......................................................................................... 102 Question 8dAnswer ......................................................................................... 103 Question 9dAnswer ......................................................................................... 103 Question 10dAnswer ....................................................................................... 103 Question 11dAnswer ....................................................................................... 104 Question 12dAnswer ....................................................................................... 104 Question 13dAnswer ....................................................................................... 104 References........................................................................................................................104
Chapter 5: SIL assessment methodology........................................................... 105 5.1 Introduction .............................................................................................................105 5.2 Overview of SIL assessment methods ...................................................................106 Features of SIL assessment common to all methods..................................... 109 5.3 Selecting initiating events.......................................................................................110 Typical initiating events .................................................................................. 111 Determine the initiating event in sufficient detail.......................................... 112 Control loop malfunctions............................................................................... 112 Failure of safeguards as initiating events ....................................................... 113 5.4 Assessing the likelihood of initiating events .........................................................114 5.5 Assessing the consequence severity .......................................................................114 5.6 Documenting the SIL assessment study.................................................................115 5.7 Risk matrix method ................................................................................................116 5.7.1 Method overview ........................................................................................ 116 5.7.2 Likelihood and severity categories............................................................. 116 5.7.3 The risk matrix............................................................................................ 117 5.7.4 Calibration of the risk matrix..................................................................... 118 5.7.5 Handling multiple initiating events ............................................................ 
121 5.7.6 Handling enabling conditions and conditional modifiers .......................... 122 5.7.7 Handling independent protection layers (IPLs) ......................................... 122 5.7.8 Estimating the SIF demand rate................................................................. 122 5.7.9 Risk matrix and ALARP ............................................................................ 123 5.7.10 High demand and continuous mode SIFs .................................................. 124 5.8 Risk Graph method .................................................................................................124 5.8.1 Method overview ........................................................................................ 124 5.8.2 Parameters used in Risk Graph .................................................................. 125 5.8.3 Risk Graph examples .................................................................................. 125 5.8.4 Selecting parameter categories ................................................................... 125 5.8.5 Calibration of the Risk Graph .................................................................... 130 5.8.6 Handling multiple initiating events ............................................................ 131 5.8.7 Handling enabling conditions and conditional modifiers .......................... 131 5.8.8 Handling independent protection layers (IPLs) ......................................... 131
viii
Contents 5.8.9 Estimating the SIF demand rate ................................................................. 131 5.8.10 High demand and continuous mode SIFs .................................................. 131 5.9 Layer of protection analysis (LOPA) .....................................................................131 5.9.1 Method overview .......................................................................................... 131 5.9.2 Enabling conditions ...................................................................................... 132 5.9.3 Conditional modifiers.................................................................................... 133 5.9.4 Handling multiple initiating events .............................................................. 135 5.9.5 Estimating the SIF demand rate ................................................................... 135 5.9.6 Example LOPA worksheet............................................................................ 136 5.9.7 High demand and continuous mode SIFs .................................................... 136 5.10 Fault tree analysis ...................................................................................................138 5.10.1 Method overview ........................................................................................ 138 5.10.2 Documenting Fault Tree analysis............................................................... 140 5.11 Cost/benefit analysis ...............................................................................................141 5.11.1 Introduction ................................................................................................. 141 5.11.2 Calculating the cost of the outcome .......................................................... 141 5.11.3 Calculating the cost of the SIF .................................................................. 
142 5.11.4 Selecting the optimal solution .................................................................... 143 5.12 The SIL assessment workshop ...............................................................................143 5.12.1 The SIL assessment team ........................................................................... 143 5.12.2 Overall objectives of the SIL assessment workshop ................................. 144 Exercises ..........................................................................................................................146 Answers............................................................................................................................151 Question 1dAnswer ....................................................................................... 151 Question 2dAnswer ....................................................................................... 151 Question 3dAnswer ....................................................................................... 151 Question 4dAnswer ....................................................................................... 151 Question 5dAnswer ....................................................................................... 152 Question 6dAnswer ....................................................................................... 152 Question 7dAnswer ....................................................................................... 152 Question 8dAnswer ....................................................................................... 152 Question 9dAnswer ....................................................................................... 152 Question 10dAnswer ..................................................................................... 154 Question 11dAnswer ..................................................................................... 
154 Question 12dAnswer ..................................................................................... 154 Question 13dAnswer ..................................................................................... 154 Question 14dAnswer ..................................................................................... 155 Question 15dAnswer ..................................................................................... 155 Question 16dAnswer ..................................................................................... 155 Question 17dAnswer ..................................................................................... 156 Question 18dAnswer ..................................................................................... 156 References........................................................................................................................156
ix
Contents
Chapter 6: SIL assessment: special topics......................................................... 159 6.1 Redundant initiators ................................................................................................159 Handling redundant initiators.......................................................................... 160 6.2 Redundant safety functions ....................................................................................160 What determines if two SIFs are redundant?................................................. 162 One SIF as backup to another ........................................................................ 162 Redundant SIFs in low risk situations............................................................ 163 6.3 One SIFdtwo hazards............................................................................................163 6.4 The IPLs vary depending on demand case ............................................................163 6.5 The demand case is activation of another SIF ......................................................165 6.6 One SIF cascades to another ..................................................................................166 6.7 Initiating event involves multiple simultaneous failures .......................................167 Example 1........................................................................................................ 167 Example 2........................................................................................................ 169 6.8 Permissives ..............................................................................................................170 Demand frequency........................................................................................... 171 Defining physical initiators and final elements .............................................. 
171 6.9 Multiple sensors distributed across a wide area ....................................................172 6.10 Operator action as initiator.....................................................................................172 6.11 Duty and standby pumps ........................................................................................173 Variable number of pumps running ................................................................ 175 Duty pump switchover .................................................................................... 175 6.12 Alarms from cascade control loops........................................................................176 6.13 Final elements are shared between the basic process control system (BPCS) and the SIS ................................................................................................177 6.14 Selecting primary final elements ............................................................................177 6.14.1 Introduction ................................................................................................. 177 6.14.2 The safe state .............................................................................................. 177 6.14.3 Selecting primary final elements ................................................................ 178 Exercises ..........................................................................................................................180 Answers............................................................................................................................182 Question 1dAnswer ....................................................................................... 182 Question 2dAnswer ....................................................................................... 182 Question 3dAnswer ....................................................................................... 
182 Question 4dAnswer ....................................................................................... 183 Question 5dAnswer ....................................................................................... 183 Question 6dAnswer ....................................................................................... 183 Question 7dAnswer ....................................................................................... 183 Question 8dAnswer ....................................................................................... 184 Question 9dAnswer ....................................................................................... 184 Reference .........................................................................................................................184
x
Contents
Chapter 7: Key functional safety documents .......... 185
7.1 The how and why of documentation .......... 185
7.2 The functional safety management plan .......... 186
7.2.1 Introduction .......... 186
7.2.2 The functional safety lifecycle .......... 186
7.2.3 Management of change and configuration management .......... 188
7.2.4 Management requirements in the FSMP .......... 189
7.2.5 Why the FSMP is important .......... 193
7.3 The Safety Requirements Specification (SRS) .......... 194
7.3.1 Introduction .......... 194
7.3.2 What is the purpose of the SRS? .......... 194
7.3.3 When is the SRS developed? .......... 194
7.3.4 What should the SRS contain? .......... 195
7.3.5 Common cause failures .......... 203
7.3.6 SIF demand rates .......... 204
7.3.7 Selecting a spurious trip rate target .......... 205
7.4 The safety manual .......... 207
7.5 Maximising the effectiveness of documentation .......... 208
Minimise repetition .......... 208
Automate, but be careful .......... 208
Consider the future .......... 209
7.6 Complete overview of functional safety documentation .......... 209
Exercises .......... 214
Essay or discussion question .......... 215
Answers .......... 216
Question 1 – answer .......... 216
Question 2 – answer .......... 216
Question 3 – answer .......... 216
Question 4 – answer .......... 216
Question 5 – answer .......... 216
Question 6 – answer .......... 216
Question 7 – answer .......... 217
Question 8 – answer .......... 217
Question 9 – answer .......... 217
Question 10 – answer .......... 217
Question 11 – answer .......... 217
Question 12 – answer .......... 217
Question 13 – answer .......... 218
Question 14 – answer .......... 218
Question 15 – answer .......... 218
Question 16 – answer .......... 218
Question 17 – answer .......... 218
Question 18 – answer .......... 218
Question 19 – answer .......... 218
Reference .......... 218
Chapter 8: Safety instrumented system design .......... 219
8.1 The goal of SIS basic design .......... 219
8.2 PLC-based logic solvers .......... 220
8.2.1 What is a SIS PLC? .......... 220
8.2.2 PLC redundancy and diagnostics .......... 224
8.2.3 Diagnostics for field devices .......... 225
8.2.4 Setting trip parameters .......... 229
8.2.5 Cybersecurity .......... 230
8.3 Selection of field devices .......... 231
8.3.1 Preferred types of SIF initiator .......... 231
8.3.2 Defining final element architecture .......... 232
8.3.3 SIF architecture .......... 233
8.3.4 Testing and maintainability .......... 234
8.3.5 Partial valve stroke testing .......... 235
8.3.6 Energise and de-energise-to-trip .......... 236
8.3.7 Derating .......... 237
8.3.8 Hard-wiring of field devices .......... 237
8.4 Independence .......... 237
8.4.1 Multiple SIFs in the same SIS .......... 238
8.4.2 Multiple systems tripping a motor via the same MCC .......... 238
8.4.3 Communications between SIS logic solver and BPCS .......... 239
8.4.4 Implementing BPCS and SIS in a single logic solver .......... 240
8.4.5 Implementing non-safety functions in the safety PLC .......... 241
8.5 Non-PLC based logic solvers .......... 242
Susceptibility to spurious trips .......... 244
8.6 What comes next? .......... 244
References .......... 244
Further reading .......... 244
Chapter 9: Meeting SIL requirements: SIL verification .......... 245
9.1 What it takes to achieve a given SIL .......... 245
9.2 Calculating the random hardware failure measure .......... 246
9.2.1 Introduction .......... 246
9.2.2 How the failure measure is calculated: SIL verification .......... 247
9.2.3 High demand and continuous modes .......... 254
9.3 More on proof testing .......... 254
9.3.1 Optimising the proof test interval .......... 254
9.3.2 The effect of human error during proof testing .......... 255
9.4 Architectural constraints .......... 256
9.4.1 Introduction .......... 256
9.4.2 Hardware type A and type B .......... 257
9.4.3 Safe failure fraction .......... 257
9.4.4 HFT requirements in IEC 61508:2000 .......... 258
9.4.5 HFT requirements in IEC 61508:2010 .......... 258
9.4.6 HFT requirements in IEC 61511:2016 .......... 259
9.4.7 How to apply SFF requirements .......... 260
9.5 SIL capability and SIL certification .......... 260
9.5.1 Introduction .......... 260
9.5.2 Assessing the element's performance in the field .......... 261
9.5.3 What is the difference between 'proven in use' and 'prior use'? .......... 262
9.5.4 What is meant by a "SIL 2 shutdown valve"? .......... 263
9.5.5 Software SIL capability .......... 263
9.6 Calculating predicted spurious trip rate .......... 263
9.7 What to do if SIS design targets are not met .......... 264
Exercises .......... 266
Descriptive questions .......... 266
Numerical questions .......... 267
Answers .......... 268
Question 1 – Answer .......... 268
Question 2 – Answer .......... 268
Question 3 – Answer .......... 268
Question 4 – Answer .......... 268
Question 5 – Answer .......... 268
Question 6 – Answer .......... 268
Question 7 – Answer .......... 269
Question 8 – Answer .......... 269
Question 9 – Answer .......... 269
Question 10 – Answer .......... 269
Question 11 – Answer .......... 269
Question 12 – Answer .......... 269
Question 13 – Answer .......... 270
Question 14 – Answer .......... 270
Question 15 – Answer .......... 270
Question 16 – Answer .......... 270
Question 17 – Answer .......... 271
Question 18 – Answer .......... 272
Question 19 – Answer .......... 272
References .......... 272
Further reading .......... 272
Chapter 10: Assurance of functional safety .......... 273
10.1 Introduction .......... 273
10.2 Verification .......... 273
10.2.1 Introduction .......... 273
10.2.2 How verification works in practice .......... 275
10.2.3 Verification checklists .......... 275
10.2.4 Discrepancy handling .......... 276
10.2.5 Competency and independence requirements .......... 277
10.3 Validation .......... 278
10.3.1 Introduction .......... 278
10.3.2 Hardware inspection .......... 278
10.3.3 End-to-end test .......... 279
10.3.4 Specific tests for sensors .......... 280
10.3.5 Specific tests for final elements .......... 281
10.3.6 Test equipment .......... 281
10.3.7 Document inspection .......... 281
10.3.8 Discrepancy handling .......... 281
10.3.9 Restoring the SIS after validation .......... 282
10.3.10 Validation report .......... 282
10.3.11 Revalidation .......... 284
10.4 Functional safety assessment .......... 285
10.4.1 Introduction .......... 285
10.4.2 Which stakeholders need to perform FSA? .......... 287
10.4.3 What sample size needs to be considered in FSA? .......... 287
10.4.4 Independence requirements for FSA .......... 288
10.4.5 How FSA is conducted in practice .......... 288
10.4.6 Assessment tasks .......... 288
10.4.7 Common pitfalls to avoid .......... 290
10.4.8 Example: assessment of SIL verification .......... 290
10.5 Functional safety audit .......... 291
10.5.1 Introduction .......... 291
10.5.2 Typical audit procedure .......... 291
Exercises .......... 293
Answers .......... 295
Question 1 – answer .......... 295
Question 2 – answer .......... 295
Question 3 – answer .......... 295
Question 4 – answer .......... 295
Question 5 – answer .......... 295
Question 6 – answer .......... 296
Question 7 – answer .......... 296
Question 8 – answer .......... 296
Question 9 – answer .......... 296
Question 10 – answer .......... 296
Question 11 – answer .......... 296
Question 12 – answer .......... 297
Question 13 – answer .......... 297
Question 14 – answer .......... 297
Question 15 – answer .......... 297
Question 16 – answer .......... 297
Chapter 11: The SIS operational phase .......... 299
11.1 Introduction .......... 299
11.2 Training requirements .......... 300
11.2.1 Operator training .......... 300
11.2.2 Training for maintenance personnel .......... 300
11.3 Proof testing .......... 301
11.3.1 Introduction .......... 301
11.3.2 Applying more than one test procedure per device .......... 302
11.3.3 Test before performing maintenance .......... 302
11.3.4 Document the duration of testing and repair .......... 302
11.4 Monitoring of SIS performance .......... 303
11.5 SIS modifications and partial decommissioning .......... 304
11.5.1 The Management of Change procedure .......... 304
11.6 Future challenges .......... 306
11.7 Closing thoughts .......... 306
Exercises .......... 307
Answers .......... 307
Question 1 – Answer .......... 307
Question 2 – Answer .......... 308
Question 3 – Answer .......... 308
Question 4 – Answer .......... 308
Question 5 – Answer .......... 308
Reference .......... 309
Appendix A Sample verification checklist .......... 311
Appendix B What is affected by SIL .......... 315
Index .......... 317
About the author

Dr Peter Clarke is a graduate of the University of Oxford and Durham University in the United Kingdom. Originally from a chemistry background, he worked in biotech R&D for 5 years before spending 3 years in the UK fine chemicals and pharmaceutical industry in process and safety management roles. He subsequently moved into safety consultancy, where he has gained extensive experience in process risk management in the oil and gas, petrochemical, semiconductor and energy industries. He has facilitated a considerable number of HAZOP, SIL assessment, LOPA and alarm management studies, in addition to Safety Case, Fault Tree, ALARP and HAZID work. Dr Clarke is the Founder and Managing Director of xSeriCon, a consultancy, software and training firm based in Hong Kong and the United Kingdom. xSeriCon specialises in process hazards analysis and functional safety. He holds the CFSE certificate, as well as a professional qualification in Occupational Safety and Health, and is a Chartered Chemist.
Acknowledgements

The author wishes to express his sincere thanks to Stephane Boily, Tom Cheng, AbdelAziz Hussein Izzeldin, and Jo Wiggers, all of whom reviewed chapters of the manuscript and made detailed and insightful comments. Several anonymous reviewers reviewed the book's outline and statement of purpose at an early stage and made many helpful suggestions. Koen Leekens provided valuable assistance with Chapter 8, and his colleague Steven Elliott supplied the photos of logic solvers therein. Mike Organisciak generously agreed to the inclusion of his cartoon in Chapter 10. The support and guidance of Simon Lucchini is also gratefully acknowledged.
Abbreviations

AC – alternating current
AC – architectural constraints (a requirement for a given amount of hardware fault tolerance in a SIF's subsystems)
ALARP – as low as reasonably practicable (a concept related to risk tolerability)
ANSI – American National Standards Institute, an organisation that publishes standards
ARV – automatic recirculation valve (a self-actuating valve that adjusts recirculation flow from a pump)
Beta (β) – "common cause factor": the fraction of a device's failure rate that is associated with a common cause, whereby a single stressor will result in multiple devices failing simultaneously
BPCS – basic process control system. Similar meaning to DCS.
CCPS – Center for Chemical Process Safety (a division of the American Institute of Chemical Engineers)
CED, C&ED – cause and effect diagram (a table listing SIF initiators on the left side, SIF final elements along the top row, and indicators in the relevant cells showing which final elements are affected by each initiator)
CFD – consequence of (SIF) failure on demand
CHAZOP – control (systems) hazard and operability (study)
COMAH – Control of Major Accident Hazards (UK regulations based on the EU's Seveso Directive)
CPD – continuing (or continuous) professional development
CPU – central processing unit (the heart of a PLC, which executes the software)
DC – direct current
DCS – distributed control system
DTT – de-energise-to-trip (a SIF configuration that causes the SIF to trip on loss of an energy source). The opposite of ETT.
e-HAZOP – electrical (systems) hazard and operability (study)
E/E/PE – electrical, electronic and programmable electronic (the types of SIS technology to which IEC 61508 and its daughter standards apply)
EDV – emergency depressurization valve (also known as blowdown valve)
EMI – electromagnetic interference
EPC – engineering, procurement and construction (role of a contractor in a FS project)
EPCIC – engineering, procurement, construction, installation and commissioning (role of a contractor in a FS project)
ESD – emergency shutdown system (a typical application of a SIS)
ESDV – emergency shutdown valve
ETA – event tree analysis
ETT – energise-to-trip (a SIF configuration that requires a source of energy to achieve the design intent of the SIF). The opposite of DTT.
FAT – factory acceptance test (a detailed test of the SIS logic solver hardware and application software at the vendor's premises, prior to delivery to site)
FC – fail closed (a valve configuration in which loss of pneumatic or hydraulic pressure to the actuator causes the valve to move to the fully closed position)
FEED – front end engineering design (role of a contractor in a FS project)
FIT – failures in time (a unit of failure rate equal to 1/(10^9 hours))
FL – fail last (a valve configuration in which loss of the energizing medium does not drive the valve to the open or closed position). Also known as 'stay put.'
FMEA – failure modes and effects analysis (a technique for calculating the expected failure rate of a device)
FMEDA – failure modes, effects and diagnostics analysis (an extension of FMEA taking the effect of diagnostics into account)
FO – fail open (a valve configuration in which loss of pneumatic or hydraulic pressure to the actuator causes the valve to move to the fully open position)
FR – frequency reduction
FS – functional safety
FSA – functional safety assessment
FSMP – Functional Safety Management Plan
FT – fault tree
FTA – fault tree analysis
HART – Highway Addressable Remote Transducer protocol (a protocol for 'smart' devices to communicate by overlaying digital information on a 4–20 mA analog loop)
HAZOP – hazard and operability (study)
HFT – hardware fault tolerance (the maximum number of hardware faults that can be present in a system without preventing the system from achieving its design intent)
HIPPS – high integrity pressure protection system (a SIL-rated safety function designed to close a shutdown valve on high upstream pressure)
HMI – human-machine interface (the hardware and software required for a computer to communicate with an operator)
HSE – health, safety and the environment (an engineering discipline)
I/O – input/output (e.g. I/O cards on a DCS or safety PLC)
IEC – International Electrotechnical Commission, an organisation that develops and publishes technical standards
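The FIT unit defined above is easiest to grasp numerically. The following is our own minimal sketch, not from the book; the device and its failure rate are hypothetical example values:

```python
# 1 FIT = one failure per 10**9 device-hours (see the FIT entry above).

def fit_to_per_hour(fit: float) -> float:
    """Convert a failure rate expressed in FIT to failures per hour."""
    return fit * 1e-9

# Hypothetical transmitter with a dangerous undetected rate of 250 FIT:
lambda_du = fit_to_per_hour(250)      # 2.5e-7 failures per hour
failures_per_year = lambda_du * 8760  # roughly 2.2e-3 per device-year,
                                      # i.e. about one failure per
                                      # 450 device-years
```

Failure rates of this magnitude are typical of the values quoted in FMEDA reports, which is why FIT is a convenient unit for field devices.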
IPL – independent protection layer (a safety device designed to prevent or reduce harm in a given scenario)
ISO – International Organisation for Standardisation, an organisation that develops and publishes standards
KVM – keyboard, video monitor, mouse (a set of devices collectively forming a typical user interface)
Lambda (λ): λDU, λDD, λSU, λSD, λNE – rate of failure of the following types: dangerous undetected, dangerous detected, safe undetected, safe detected, no-effect
LAN – local area network
LOP – layer of protection (which may not necessarily be an IPL)
LOPA – layer of protection analysis
MCC – motor control circuit (a device that controls the power to a motor, and typically monitors the motor's performance)
MMS – machine monitoring system (a system that monitors parameters of a rotating machine such as vibration, and takes an action if the parameters are outside preset limits)
MoC – management of change
MooN – M out of N (M and N are integers, 1 ≤ M ≤ N). A group of sensors in MooN configuration means that, if any M of the N sensors reach the threshold to trip, the SIF will be tripped. A group of final elements in MooN configuration means that at least M of the N final elements must execute their intended action successfully if the harm is to be avoided.
MOV – motor-operated valve
MTBF – mean time between failures
MTTFS – mean time to fail spurious. A measure of the spurious trip rate of a SIF (longer MTTFS = less frequent spurious trips)
MTTR – mean time to restore (or repair) (the mean time from occurrence, or discovery, of a dangerous fault until it is repaired and the SIF is back in operation)
OS – operating system (underlying standard software providing basic functions required for a computer to operate)
P&ID – piping and instrumentation diagram
PCS – process control system
PCV – pressure control valve (usually self-actuating)
PFD – probability of failure on demand (when specified without the 'avg' suffix, this usually refers to IPLs rather than SIFs)
PFDavg – probability of failure on demand of a SIF, averaged over the lifetime of the SIF
PFH – probability of failure per hour (a failure measure used for SIFs in high demand and continuous modes)
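Several of the quantities defined above (λDU, proof test interval, PFDavg) are tied together, for a simple 1oo1 low-demand SIF, by the well-known simplified formula PFDavg ≈ λDU × PTI / 2. The sketch below is our own illustration with assumed example values, not a calculation from the book:

```python
# Assumed example values for a simple 1oo1 low-demand SIF.
lambda_du = 2.5e-7   # dangerous undetected failure rate, per hour
pti = 8760           # proof test interval: one year, in hours

# Simplified formula: PFDavg ~ lambda_DU * PTI / 2
pfd_avg = lambda_du * pti / 2
rrf = 1 / pfd_avg    # risk reduction factor (see the RRF entry)

# pfd_avg works out to about 1.1e-3; halving the proof test interval
# roughly halves it, which is why PTI matters so much in SIL verification.
```

The formula assumes, among other things, that dangerous undetected failures are only revealed by proof testing and that the test itself is perfect; the book's Chapter 9 treats the full calculation.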
PHA – process hazards analysis (a collective term for several techniques including HAZOP, CHAZOP, e-HAZOP and SIL assessment)
PLC – programmable logic controller
PLL – probable loss of life (the mean number of fatalities arising from a specified incident)
PMC – project management consultant
PRD – pressure relief device
PRV – pressure relief valve
psig – pounds per square inch, gauge pressure
PSSR – pre-startup safety review (an assessment whose scope of work, with relation to a SIS, is similar to FSA stage 3)
PST – partial stroke testing. Same meaning as PVST.
PTC – proof test coverage (the fraction of a device's dangerous failure rate that is covered by a given proof test procedure)
PTD – proof test duration (the mean time during which a device must be taken offline for proof testing)
PTI – proof test interval (the planned duration between proof tests)
PV – process variable (a measurable quantity, such as pressure, at a specific location in the process)
PVST – partial valve stroke testing
QMS – quality management system
QRA – quantitative risk assessment (or analysis) (a calculation of the frequency, severity and location of harm relating to a range of incidents at a specific facility)
RAID – redundant array of inexpensive disks (a high-integrity data storage medium)
RAM – reliability, availability, maintainability (study)
RBD – reliability block diagram (a model of the system architecture)
RRF – risk reduction factor (a measure of the safety performance of a SIF)
SAT – site acceptance test
SFF – safe failure fraction (the fraction of the overall failure rate of a device that constitutes safe failures)
SIF – safety instrumented function. A set of hardware and associated software designed to detect a particular hazardous scenario, and protect against harm arising from it
SIL – safety integrity level. A measure of the integrity (reliability, availability) of a SIF, expressed in categories numbered 1–4
SIS – safety instrumented system (a collection of SIFs in a single system)
SIT – site integration test (a test to demonstrate correct functioning of the SIS loops and communications after all components are connected together in their final configuration)
SOV – solenoid-operated valve (a process-contacting valve driven by a pneumatic or hydraulic actuator, with the air or hydraulic fluid being controlled by a solenoid valve)
SRS – safety requirements specification
TBC – to be confirmed
TCV – temperature control valve (self-actuating)
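The SIL entry above mentions categories numbered 1–4. For low demand mode, each SIL corresponds to a PFDavg band in the standard IEC 61508 table; the lookup below is our own sketch of that table (the function name is hypothetical, not from the book):

```python
def sil_from_pfd_avg(pfd_avg: float) -> int:
    """Return the SIL band met by a low-demand SIF's PFDavg (0 = none)."""
    bands = [
        (1e-5, 1e-4, 4),   # SIL 4: 1e-5 <= PFDavg < 1e-4
        (1e-4, 1e-3, 3),   # SIL 3
        (1e-3, 1e-2, 2),   # SIL 2
        (1e-2, 1e-1, 1),   # SIL 1
    ]
    for lo, hi, sil in bands:
        if lo <= pfd_avg < hi:
            return sil
    return 0  # outside the tabulated SIL 1-4 range

# e.g. a PFDavg of 1.1e-3 falls in the SIL 2 band
```

Note that meeting the PFDavg band is only one requirement for claiming a SIL: architectural constraints and systematic capability must also be satisfied, as the book's Chapter 9 explains.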
Abbreviations UEL UPS USB USD USD
unmitigated event likelihood (expected frequency of an undesired event in the absence of the SIF designed to prevent it) uninterruptible power supply (battery-backed power supply for a PLC, KVM or other component of a SIS) universal serial bus US dollar unit shutdown (a localised shutdown of a single piece of equipment) Symbols appearing in the text in the form - , such as TAHH-102, are examples of equipment tag numbers.
Glossary

Annunciation failure: A failure that prevents the functioning of a diagnostic.
Dangerous failure: A failure that, in the absence of hardware fault tolerance, would render a SIF unable to achieve its design intent.
Detectable failure: A failure that can be revealed by an automatic diagnostic test.
Discoverable failure: In this book, we describe failures that can be revealed by manual proof testing as discoverable.
Functional safety: Risk reduction achieved using active automatic safety systems that employ an electrical, electronic or programmable electronic (E/E/PE) logic solver.
Functional Safety Assessment: The task of confirming that the functional safety lifecycle, defined by the Functional Safety Management Plan (FSMP), has met its objectives.
Functional Safety Audit: The task of assessing whether functional safety-related procedures are being followed.
Hardware fault tolerance: The maximum number of elements in a system that may have a dangerous failure without removing the system's ability to achieve its design intent.
Harm: Unwanted impact on a risk receptor, such as injury to personnel or damage to equipment.
Hazard: Some physical aspect of the process equipment which has the potential to cause harm.
Independent Protection Layer: A device or system designed to prevent the propagation of an incident, and sufficiently independent of the initiating event and other Independent Protection Layers.
No effect failure: A failure that is not a dangerous, safe, or annunciation failure.
Operating mode: Each SIF is designated as operating in Low Demand, High Demand or Continuous Mode, depending on the frequency of demand and other factors.
Period: Several phases of the functional safety lifecycle grouped together. In this book, we refer to the "analysis period", the "design period" and the "operational period".
Phase: One defined section of the functional safety lifecycle containing a specific task, such as hazard and risk analysis or SIS detailed design.
Process safety time: The time between the detection of an event and the last possible moment to begin the corrective action that can prevent the corresponding incident.
Random failure: Mechanical failure of a device due to stress, when the design envelope of the device has not been exceeded.
Risk: A measure combining the likelihood of an unwanted event and the severity of harm arising from that event.
Risk receptor: Anything that could suffer harm, such as personnel, the environment or revenue.
Safe failure: A failure that, in the absence of redundancy, would cause a spurious trip.
Safety instrumented function: A set of hardware and associated software designed to detect a specified hazardous condition, and act to place or maintain a process in a designated safe state.
Safety instrumented system: A system in which a set of safety instrumented functions is implemented.
SIL Assessment (also called SIL Assignment, SIL Determination, SIL Selection or SIL Target Selection): The task of defining a target SIL, and other parameters, for a SIF.
SIL Verification: The task of demonstrating, by calculation, that the target failure measure and hardware fault tolerance of a SIF is achieved by the proposed design.
SIS validation: The task of confirming that the commissioned SIS meets the requirements of the Safety Requirements Specification.
Systematic failure: Failure of a device or system due to a human error.
Verification: The task of demonstrating, for each phase of the functional safety lifecycle, that the phase has generated correct outputs.
Introduction
This book is about functional safety, which concerns automated systems dedicated to protecting against harm arising from specific risks. It covers the 'how, what and why' of functional safety in detail, explaining what needs to be done, in terms of both engineering and management requirements. Although it frequently refers to the applicable international standards, IEC 61508 (general to all industries) and IEC 61511 (specific to the process industry), the aim is to explain how functional safety works in practice, assuming only basic prior knowledge of the principles of risk management. Functional safety follows a lifecycle approach, starting from risk analysis and safety system design, through installation and commissioning, to operation of the process with the safety system in place. This book covers the complete lifecycle, explaining every step with extensive practical details. However, it does not go into detail on safety hardware and software development; for those, please refer to the excellent resources listed in the References section of each chapter.
Which industries are covered?
The book focuses on the process industry, especially oil and gas, chemical processing and non-nuclear power. However, most of the principles and details described are relevant in many other industries where functional safety is applied to manage risk, such as machinery, medical equipment, mining, nuclear, pulp and paper, and rail transport. A rather different approach is taken in the automotive sector, as the risk profile is different.
Who is the book suitable for?
Engineers and managers in the process sector working with functional safety systems will benefit from this book, especially if you have no prior experience of functional safety. The substantial number of exercises in most chapters should be helpful for those preparing for qualifications such as Certified Functional Safety Professional/Expert or TÜV FS.
Who developed this book?
The book's author is the Founder and Managing Director of xSeriCon, a risk management consulting firm providing consultancy and training in process risk assessment, functional safety, alarm management and Safety Case development. xSeriCon also provides software products, SILability™ (for SIL verification) and Vizop (process hazards analysis including Fault Tree Analysis). Details are available at xsericon.world and silability.com.
CHAPTER 1
Introduction to functional safety

Abstract
To manage hazards in the process industries, the associated risk of undesired incidents needs to be evaluated and managed. As an illustration, a fictional incident on a badly managed process plant is narrated, along with proceedings at the subsequent board of enquiry (also fictional). Functional safety, which is safety achieved by means of automatic systems, is one approach available to manage risk. Functional safety standards applicable to a range of industry sectors are available, in particular the process sector standard IEC 61511. Key functional safety concepts, in particular the functional safety lifecycle, random and systematic failures, and competency management, are introduced and explained.
Keywords: Competency; Functional safety; Functional safety lifecycle; Harm; Hazard; IEC 61511; Intrinsically safer design; Random failure; Risk; Systematic failure.
1.1 What could possibly go wrong?
All's quiet in the control room. A routine Sunday evening, and the information screens glow in bright colours; status information on all the tanks and pumps outside to the operator's left, and a half-finished Solitaire game to the right. A flow quantifier ticks over silently on number 1 tank, registering an incoming transfer of flammable solvent via pipeline from another site several kilometres away. The operator flips through the file of open maintenance work orders on the desk. That faulty level sensor again; the work order has been open for three months now. Maybe they'll get round to it eventually. The high level trip bypass warning light has been glowing for so long that nobody even notices any more. Anyway, who cares? We've got backup systems, the operator thinks. This place is safe as a rock. An alarm sounds; a discreet baap-baap from a speaker on top of the console. Irritated by the disturbance, the operator stretches out a lazy finger and stabs the well-worn Acknowledge button. Just that low oil pressure alarm on number 4 cooling pump again, I suppose. Checked it out twice before, false alarm every time. Anyway, half the alarms that come up, I don't even know what they mean. Forget it. Back to the Solitaire game.
Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00013-6 Copyright © 2023 Elsevier Inc. All rights reserved.
Solvent flows into number 1 tank, just like it always does on a Sunday night transfer. The level transmitter registered a fault an hour ago, so the operator switched over to a backup transmitter. The level creeps up, 10 cm/min. The actual level's already too high and the operator should have shut it off by now, but the level is showing only 65% on the screen. The backup transmitter has never been used before and it is miscalibrated, set up for a denser solvent that used to occupy this tank. The operator, satisfied that the transfer still has an hour to go, flips back to the gaudily coloured site overview screen. The level creeps up to the high level alarm sensor. It hasn't worked for years and nobody can test it, because the wiring is on an inaccessible part of the top of the tank. Up, up goes the level to the high level trip point. This time, the last-chance level sensor works but the trip is bypassed; that same work order the operator just flipped through was closed a month ago but nobody reset the bypass. Solvent hits the tank's overflow pipe and starts to pour out into the spill containment bund. A flammable vapour detector picks it up and raises an alarm in the control room, but the operator ignores it because of all the false alarms. Every time the wind blows, the vapour alarm rings. They should fix that someday. It's a warm, still summer's evening. Not a breath of wind. The solvent, gushing out of the tank into the bund, evaporates to form a relentlessly expanding cloud of vapour, an invisible ball of disaster waiting to strike. Spreading outwards, now 50 m, now a hundred metres from the tank, it silently envelops the site and creeps over the fence to the neighbouring facility. Next door's nightwatchman is out on patrol. So many rats around here; what if they chew the cables, he wonders. An unexpected chemical smell catches his attention. Glue? Paint? Who could be painting at this time of night?
He walks across to a storeroom at the back of the warehouse, facing the tank storage site. The smell is really strong just here. Maybe the rats knocked over some can of chemical waste? He pulls his flashlight from his belt and flips it on. There is a spark ...
1.2 Hazard and risk

1.2.1 What is a hazard?
The chairperson pulls her desktop microphone closer and flicks the switch. "Good morning, everybody. This is day four of the board of enquiry into the explosion and fire at ABC Solvents on 16th August last year. Today, we have an expert witness from the National Safety Council, Mr Ben Kim. Welcome, Mr Kim." The witness nods and settles in his seat, looking round the room. "Mr Kim, yesterday, we heard from another witness that the safety features on the tank were hazardous because they were not in proper working order. Can you tell us, in your view, what should have been done to keep them working properly?"
"Thank you, Madam Chairperson. May I offer a correction to your question? The term hazard cannot correctly be applied to the safety features themselves. First, we should understand what a hazard means: it is some physical aspect of our equipment which has the potential to cause something we don't want to happen. In this case, it is best to think of the solvent, not the equipment, as the hazard. To explain what I mean: suppose the tank were filled with water instead of solvent. Could the fire have happened? Of course not, as the hazard arises from the nature of the solvent. "The failure of the safety features is better defined as an initiating event, because it initiates a chain of events made possible by the existence of the hazard. "Actually, the international standard IEC 61511 – to which we will refer later in this enquiry – defines a hazard very concisely as a "potential source of harm." By harm we mean, essentially, any consequence that we don't want to happen." Mr Kim pauses for a sip of water.
1.2.2 What is harm?
"Mr Kim, thank you for the clarification. Can you give an example of what we mean by harm? Is it the overflow of the tank?" Mr Kim continues, "Harm is the final, undesired outcome at the end of a chain of events. The various types of unwanted event we should consider can be grouped according to what suffers the ill effects – for example, people, environment or profits. These are known as risk receptors. If I may, I'd like to show some relevant examples of harm, classified by risk receptor, on the screen." Mr Kim straightens his tie and continues. "When analysing the harm accruing to risk receptors, an operating company will generally select a small subset of these types of harm – usually not more than 5 or 6 items are relevant and significant for their specific situation. The selected harm types need to be quantified (for example, in money terms) or classified (in 3 to 5 severity categories)." (Readers of this book can find more detail in Chapter 5.)
1.2.3 What is risk?
"Mr Kim, thank you, I'm much clearer on hazards and harm now. Another term we have heard from previous witnesses is risk. Now, I understand that risk is the 'combination of the frequency of occurrence of harm and the severity of that harm' (according to IEC 61511). Can you explain why risk is an important concept for our enquiry?" "With pleasure, Madam Chairperson. One of the major advances in safety management in recent decades has been a shift in focus from hazard to risk. The concept of risk says that
we pay more attention to harmful outcomes that are more serious, more likely to occur, or both. This means that we can focus our effort – and expenditure – where it will give the biggest safety return. However, it also means that we need to identify both the frequency and severity of harm that can arise from an incident." (Chapters 3 and 5 of this book cover these points in detail.) "That leads us to the question of deciding how safe a facility should be; or, to put it another way, how much risk can be tolerated." The chairperson reaches for her microphone. "Indeed, Mr Kim, that is one of the points I want to ask. How is the level of risk tolerance generally determined in the process industry?"

Table 1.1: Typical risk receptors and types of harm considered in the process industry.

People
  May be considered in functional safety analysis: injury to personnel; illness of personnel; injury to visitors onsite; injury to persons offsite; illness of persons offsite (as a direct result of a specific incident).
  Not typically considered: psychological harm such as stress, low morale.

Surrounding environment (biological effects)
  May be considered: harm to significant populations of wildlife, especially in the long term, due to release of substances (e.g. harmful gases, hot water effluent); illness of persons offsite (as an indirect result of release of harmful substances, e.g. contamination of watercourses).
  Not typically considered: short term events with no long term impact, e.g. emergency depressurization venting of hydrocarbons; long term impacts that are part of normal operations and addressed in other ways, e.g. CO2 emissions.

Surrounding environment (chemical and physical effects)
  May be considered: damage due to release of chemicals (e.g. corrosion or blackening of nearby structures due to acid gases, soot); physical damage (leading to financial loss, injury or complaints), whether onsite or offsite, e.g. noise, earth tremors from mining or fracking; breach of permit conditions, e.g. excessive flaring.
  Not typically considered: one-time planned impacts such as plant construction spoiling the view of local residents.

Financial
  May be considered: equipment damage (direct and indirect, e.g. due to fire); loss of production capacity (may be calculated in gross or net terms); loss of materials (e.g. destruction of product inventory, damage to catalyst); cost of rework.
  Not typically considered: costs associated with idle time (e.g. personnel salaries, lease or depreciation of equipment); long term loss of business due to inability to supply customers; generation of additional waste; consequential losses such as demurrage of ships in port waiting for loading or unloading.

Legal
  May be considered: fines and compensation as a result of an incident.
  Not typically considered: cost of defending legal cases; jailing of senior staff.

Reputation
  May be considered: adverse publicity; requirement for public notification or evacuations; withdrawal of operating licenses; withdrawal of environmental permits.
  Not typically considered: loss of shareholder value; loss of privilege to operate; loss of public confidence or acceptance.
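For readers who prefer numbers: the "combination of frequency and severity" can be made concrete, for a financial risk receptor, as an expected annual loss. The sketch below is illustrative only – the scenario frequencies and consequence costs are invented, not taken from any real study.

```python
# Illustrative only: risk quantified as expected loss = frequency x severity.
# All figures below are invented for the example.

def expected_annual_loss(frequency_per_year: float, cost_per_event: float) -> float:
    """Return the expected loss per year for one incident scenario."""
    return frequency_per_year * cost_per_event

# A frequent-but-minor scenario versus a rare-but-severe one:
minor = expected_annual_loss(frequency_per_year=0.1, cost_per_event=50_000)        # small spill
severe = expected_annual_loss(frequency_per_year=0.001, cost_per_event=20_000_000)  # tank fire

print(minor)   # 5000.0 per year
print(severe)  # 20000.0 per year: the rare event carries the larger risk
```

The point of the comparison is the one Mr Kim makes: ranking scenarios by frequency alone, or by severity alone, would order these two events differently from ranking them by risk.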
1.2.4 What is tolerable risk?
Mr Kim nods. "Good question, Madam Chairperson. If risks exist in our facility – as they surely will – we must determine whether they are tolerable. At first sight, the idea that any kind of risk can be tolerated is counter-intuitive, and may even seem inhuman if the risk in question could lead to fatalities. However, tolerance of risk is a reasonable and, in fact, entirely necessary part of everyday life. We took a calculated risk, for example, by taking some form of transport to come here today, determining subconsciously that the benefits of getting here outweigh the risk of an accident." (For a detailed discussion of the sociopolitical aspects of the tolerable risk concept, see Ref. [1], p. 29ff.)
"So, by determining the amount of risk the facility is willing to tolerate, we are able to make reasoned judgments on questions like these:
• Do the benefits of operating the facility outweigh the risks?
• How well controlled are the risks?
• Can I justify the safety case of the facility to the government and the general public?
• Am I using my safety resources optimally?
• Do I need to add more risk control measures, and if so, how well must they perform?
“Deciding on a level of tolerable risk is a critical aspect of risk control strategy. Arguably, this is more a question of politics than of engineering, as it touches on sensitive questions like the relative tolerability of human fatalities and lost profits. Fortunately, it is rarely necessary for an individual organization to go through a traumatic decision-making process about tolerable risk. There is now widespread consensus on tolerable risk levels, enshrined in local best practice and, in some places, mandated by law.”
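Once a tolerable risk level has been set, the gap between the unmitigated and tolerable event frequencies fixes the risk reduction factor (RRF) that a protective function must deliver. The sketch below illustrates this arithmetic; the SIL bands used are the standard low-demand bands of IEC 61511, but the event frequencies are invented, and a real SIL assessment considers more parameters than this.

```python
# Sketch: derive the required risk reduction factor (RRF) for a low-demand
# safety function and map it to a target SIL band (IEC 61511, low demand mode).
# The event frequencies below are invented for illustration.

def required_rrf(unmitigated_freq: float, tolerable_freq: float) -> float:
    """RRF = unmitigated event likelihood / tolerable event frequency."""
    return unmitigated_freq / tolerable_freq

def sil_band(rrf: float) -> str:
    """Target SIL for a low-demand SIF, from the required RRF."""
    if rrf <= 10:
        return "no SIL required"
    elif rrf <= 100:
        return "SIL 1"
    elif rrf <= 1000:
        return "SIL 2"
    elif rrf <= 10000:
        return "SIL 3"
    elif rrf <= 100000:
        return "SIL 4"
    return "beyond SIL 4 - consider redesign"

# e.g. an overflow expected once in 20 years, tolerable once in 10,000 years:
rrf = required_rrf(unmitigated_freq=0.05, tolerable_freq=1e-4)
print(rrf, sil_band(rrf))  # 500.0 SIL 2
```

The RRF is simply the reciprocal of the average probability of failure on demand (PFDavg) that the function may have, which is why the bands run in factors of ten.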
1.2.5 Risk management through functional safety
Mr Kim gathers his thoughts and continues. "An operating company needs to perform analysis to determine the current risk levels in their process, and then compare them with the defined tolerable risk levels. If there is a significant gap between the actual and tolerable risk, this may indicate that the risk is not adequately controlled. "At this point, the operating company should consider a hierarchy of risk management measures." Mr Kim's assistant displays a slide on the screen, showing the following series of questions:
• Have we explored feasible options for reducing the inherent risk, such as substituting less hazardous materials, reducing inventories or improving the segregation of hazards and risk receptors?
• Are the risks already as low as we can reasonably make them? (This is the ALARP concept, which is covered further in Chapter 3.)
• Do we need further risk reduction measures? If so, should they be implemented through:
  • design upgrades, e.g. increase in design pressure;
  • passive protective systems, e.g. relief valves;
  • improved operating/maintenance procedures and training;
  • alarms, with defined response from operational personnel;
  • mitigation systems to reduce the severity of an incident, e.g. fire protection systems;
  • active protection systems, which automatically detect a dangerous condition and act to keep the plant in a safe state?
Mr Kim explains further. "Madam Chairperson, as you can see, a series of protective measures are available. The last of these, active protection systems, belongs to the realm of functional safety: that is, risk control measures implemented through an active Safety Instrumented System or SIS. That's what I'm here primarily to discuss with your panel today." The chairperson writes the words Functional Safety on her jotter and circles them. "Mr Kim, are there any generally accepted standards covering the management of such systems?" "Indeed there are. Sound management of functional safety is the objective of the international standards IEC 61508 and IEC 61511, which, with your permission, I'll introduce to the panel now."
1.3 Functional safety standards: IEC 61508 and IEC 61511

1.3.1 Purpose of the standards
As Mr Kim explained in the fictional board of enquiry above, functional safety is the task of achieving risk reduction by means of an automatic system, which is designed to respond automatically to prevent an incident or to maintain safe operation. It covers a range of activities: risk analysis, safety system design, construction, commissioning, testing, operation, maintenance, and modification. A management system is put in place to ensure everything is done correctly throughout the project lifecycle. Achieving functional safety is a complex task, requiring cooperation between numerous parties: design and instrument engineers, equipment designers and vendors, safety consultants, software specialists, and operations and maintenance personnel, to name a few. International standards help to clarify expectations between the various parties, and provide a level playing field throughout the industry and across national boundaries. For this reason, the IEC released the first complete edition of IEC 61508, its framework standard on functional safety, in 2000, with a significantly updated second edition issued in 2010 [2]. IEC 61508 covers the entire spectrum of functional safety in general terms, with particular emphasis on the development of hardware and software for functional safety applications. The intent is that specific industry sectors will develop their own flavours of this standard, couched in terms applicable to their sector, and focusing on the most relevant aspects of
functional safety. The resulting sector-specific standards are listed in Table 1.2. Some countries have implemented their own national standards, which are essentially identical to the IEC standards and can be treated as such. An example is ANSI/ISA-61511:2018, the US implementation of IEC 61511:2016.

Table 1.2: Sector-specific functional safety standards.

Standard | Industry sector | Latest year of issue as of 2021 | Major focus
IEC 61508 | General | 2010 | Hardware and software design, risk analysis
IEC 61511 | Process | 2016 | Risk analysis, SIS design, SRS, FSA*
IEC 61513 | Nuclear power | 2011 | I&C architecture using hardwired and/or computer-based systems
IEC 62061 | Machinery | 2021 | Design, integration and validation of safety-related control systems
ISO 26262 | Automotive | 2018 | Development cycle
IEC 62279 | Rail | 2015 | Software for railway control and protection
ISO 13849 | Machinery | Part 1: 2015; Part 2: 2012 | Design and validation. All safety technologies, not just E/E/PE
IEC 62304 | Medical devices | 2006 + 2015 amendment | Software development
EN 50129 | Rail | 2018 | Hardware and software, design and implementation
ISO 25119 | Machinery for agriculture and forestry | 2018-19 + 2020 amendments | Safety lifecycle

* Refer to the abbreviations list at the beginning of this book.
1.3.2 Scope of IEC 61511
The standard for the process industry sector covers electrical, electronic and programmable electronic (often abbreviated to E/E/PE) safety equipment. Purely mechanical and/or pneumatic systems are, strictly speaking, outside the scope of IEC 61511, but the principles in the standard are often useful in managing such systems. Also out of scope are conventional process control systems (e.g. PCS, DCS, BPCS) unless they
are required to play a part in high-integrity risk control measures (which, usually, they should not). In practice, the standard is normally applied to Safety Instrumented Systems (SISs) implemented using:
• a safety-rated PLC; or
• safety relay logic.
IEC 61511 is intended to protect specific risk receptors: only "protection of personnel, protection of the general public or protection of the environment" are explicitly within its scope. However, it can be – and often is – applied to other risk receptors, as listed in Table 1.1.
1.3.3 Why comply with IEC 61511?
One of the most significant features of the IEC series of functional safety standards is that they are mostly performance-based, rather than prescriptive. That is, they expect entities to set their own safety targets, meet those targets, and demonstrate that the targets are met – without specifying the way in which this is achieved. Older prescriptive standards laid down rules constraining some quite specific aspects of design, such as how many redundant items of hardware were required, irrespective of the actual safety performance achieved thereby. The advantages of the performance-based approach translate directly into benefits for the end user:
• Solutions to risk management problems can be tailored to suit specific situations.
• This often results in better safety performance at less cost.
• Analytical methods can be selected to provide the optimal balance between analysis costs and design costs (this point is covered further in Chapter 5).
• Local and best practice can evolve over time, taking advantage of experience gained in real-world applications.
Compliance with IEC 61511 is not mandatory under law, but is widely regarded as representing best practice. As such, stakeholders such as end users, insurers and holding companies regard IEC 61511 compliance as evidence of “all reasonable measures” being taken to protect health and safety and avoid losses [3].
1.4 IEC 61511 key concepts

1.4.1 The functional safety lifecycle
Developing and implementing a Safety Instrumented System (SIS) is a stepwise process. First, we must identify the hazards within the scope of the project, and determine the risks they generate. Next, risk reduction measures must be developed, and assessed to ensure they are adequate. If the risk reduction measures require a SIS, we design the SIS and
check the design meets the risk reduction needs. Then the SIS is installed and commissioned. During its operational lifetime, it may need to be reassessed and modified according to changing circumstances. Eventually, parts of the SIS will be decommissioned, and we must make sure this does not compromise the safety performance of the remaining systems. Successful execution of each step requires completion of all previous steps. Thus, the standard requires a plan, detailing the steps required, the actions to be performed in each one, and how the sequence as a whole will be executed and managed. The steps of the lifecycle are known as phases. For convenience, in this book we will sometimes group phases together into three periods: the analysis period, the design period and the operational period. Fig. 1.1 shows the periods of the lifecycle, and Fig. 1.2 shows the lifecycle phases typically included in each period. Earlier, we noted that a key aspect of the standard is to demonstrate that safety performance targets are met. To do this, we must measure performance and compare with the goals that were set. If targets are not achieved, we should return to earlier steps and
Figure 1.1 Main periods of the functional safety lifecycle.
Figure 1.2 Phases included in each main period of the functional safety lifecycle.
revise the work that was done. This means that looping back within the sequence of steps is an intrinsic part of managing functional safety. For this reason, the steps are arranged in a functional safety lifecycle. A recommended scheme for a safety lifecycle is set out in IEC 61511 (and a slightly different version in its parent standard, IEC 61508). In keeping with its performance-based philosophy, the standard does not compel us to use its recommended lifecycle; we are free to substitute one of our own, as long as it achieves all
the same objectives. However, in practice, the lifecycle model set out in the standards is almost universally adopted, as it is clear, comprehensive and intuitive. Another important reason for adopting a cyclic, rather than linear, approach to safety design is that operational needs change over time. Processes may need to be altered for a number of reasons, such as:
• Changing process parameters as operational experience is gained (e.g. optimization of yield or manpower utilization, maintenance problems, avoidance of unnecessary tripping)
• Obsolescence or deterioration of equipment
• Adoption of new technology
• Changing product profile to match customer demands
• Changes in aspects of plant management, such as equipment utilization or manning
• Changes in environmental protection requirements
Any process change should prompt a return to early phases of the safety lifecycle, so that the impact on the demands and performance of the SIS can be assessed. We’ll come back to this topic in greater detail in Chapter 11.
1.4.2 Intrinsically safer design
In a typical functional safety project, the hazard identification and risk analysis phases start with a substantially frozen design already embodied in P&IDs and equipment data sheets. However, this tends to squeeze out the opportunity to apply 'intrinsically safer' design principles: the concept that it is generally better to eliminate – or at least reduce – hazards in the design, rather than managing the risks generated by those hazards. During early-stage hazard identification studies such as HAZID and HAZOP, the analysis team should be given the chance to question whether hazards could be better managed by design changes rather than relying on layers of protection. Examples of intrinsically safer design principles include:
• Replacing a hazardous material with a less hazardous one
• Reducing the inventory of hazardous materials
• Applying less hazardous operating conditions (e.g. lower temperatures and pressures)
• Increasing the design pressure of piping and equipment, so that upset conditions are less likely to lead to a loss of containment
• Reducing the opportunity for human errors, e.g. eliminating hose changeovers between items of equipment in a batch process
1.4.3 The safety requirements specification (SRS)
This crucial component of functional safety is a document (or set of documents) spelling out exactly what the SIS must do. It lists the design intent of the SIS, every detail of its design specification, and a slew of information needed during the operational phase, such as maintenance requirements. The SRS is first drafted when the need for the SIS is identified; this takes place immediately after risk analysis is completed. Then, after the full design details of the SIS have been elaborated, the SRS is updated to contain all the information necessary for complete execution of the lifecycle. The purpose of having a centralised document of this type is to provide a single point of reference for all parties responsible for each phase of the lifecycle. Since the people involved are likely to be spread across many departments and organizations, it is critical to have an unambiguous definition of the SIS's function and operation. Indeed, some of them will be performing their duties many years after the SIS is commissioned. Another important function of the SRS is to provide a benchmark, against which the SIS itself can be validated, and its performance assessed. This allows reviewers to confirm or revise assumptions made during the safety analysis and SIS design periods. Extensive coverage of the SRS is provided in Chapter 7.
1.4.4 Assuring that functional safety is achieved

A key aspect of the standard is that we must demonstrate successful control of risk. There are two main aspects to this:

• minimizing the scope for undetected human error, and
• ensuring that each phase of the lifecycle has been completed competently.
The standard identifies four separate activities for assuring this has been achieved, as outlined briefly in Table 1.3. This is one of the more challenging areas of functional safety, and often causes confusion. Areas of misunderstanding typically include:

• The differences between the various activities
• What is involved in each activity
• When they should be performed, and how often
• Whether they can be delegated or outsourced to consultants
• Whether the activities need to be undertaken by independent parties
We’ll cover these topics in detail in Chapter 10.
Table 1.3: Activities for assuring that functional safety is achieved.

Verification: The inputs and outputs required for each lifecycle phase should be defined. Verification involves confirming that the required output has been generated.

Validation: During the analysis and design periods of the lifecycle, a document known as the safety requirements specification (SRS) is generated. Validation confirms that the commissioned SIS (including hardware, software and operating and maintenance procedures) meets the stipulations of the SRS.

Functional Safety Assessment (FSA): FSA is a wide-ranging assessment of how effectively the functional safety lifecycle is followed. It can be executed at up to five stages of the lifecycle, although it is compulsory at only one stage: between commissioning and process startup.

Audit: A review of evidence to demonstrate compliance with site-specific procedures relating to functional safety.
1.4.5 Random and systematic failures

The safety lifecycle approach recognises that there are two fundamentally different ways in which the SIS can fail to perform its intended function. These are known as random and systematic failures. Because this concept underpins every aspect of the safety lifecycle, a clear understanding of failure types is crucial.

Random failures are hardware failures. Every item of equipment has a finite lifetime, during which some component within the equipment may break due to natural wear-and-tear processes caused by fatigue. This is true even if the equipment is installed correctly, operated within specification, and maintained properly. Random failures can never be eliminated entirely, but they can be handled mathematically. Although it is impossible to predict when any individual item of equipment will fail, we can know a great deal about typical failure behaviour, given data from a large enough population of equipment in service. For example, we can determine the item's useful lifetime, and the probability that it will fail during a given period of time. This information is essential during risk analysis, because it allows us to calculate the extent of risk reduction that a particular design of SIS can be expected to provide, and hence whether it is sufficient to meet the tolerable risk target (as we discussed in Section 3.3).
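The mathematical treatment of random failures mentioned above can be illustrated with two standard reliability-engineering relationships (these are textbook formulas, not taken from this chapter, and the numbers are purely illustrative): a constant failure rate λ gives a failure probability of 1 − e^(−λt) over time t, and a single periodically proof-tested device has an average probability of failure on demand of approximately λ·T/2.

```python
import math

def prob_failure_within(lam: float, t_hours: float) -> float:
    """Probability that an item with constant failure rate lam
    (failures per hour) fails within t_hours, assuming the usual
    exponential failure model."""
    return 1.0 - math.exp(-lam * t_hours)

def pfd_avg_1oo1(lam_du: float, proof_test_interval_h: float) -> float:
    """Approximate average probability of failure on demand for a
    single (1oo1) device with dangerous undetected failure rate
    lam_du, proof-tested every proof_test_interval_h hours."""
    return lam_du * proof_test_interval_h / 2.0

# Illustrative numbers: lam = 2e-6 per hour, annual proof test (8760 h)
p_year = prob_failure_within(2e-6, 8760.0)   # roughly 1.7%
pfd = pfd_avg_1oo1(2e-6, 8760.0)             # roughly 0.0088
```

Calculations of this kind, elaborated considerably, are what SIL verification (Chapter 2 onwards) is built on.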
Systematic failures are device failures ultimately caused by human errors. The lifecycle presents numerous possibilities for human errors to occur; a few examples are:

• Incorrect risk analysis (failing to identify hazards, underestimating risks)
• Administrative errors (working from out-of-date versions of documents, incorrect drafting of documents, miscommunication)
• Incorrect design of SIS
• Software bugs
• Incorrect installation of SIS
• Failure to maintain equipment, or errors during maintenance (such as failing to remove overrides after completing the maintenance procedure)
While some of these are under the direct control of the process plant owner or design and construction contractor, others are not. For example, a safety equipment manufacturer may make a design error, which could lie hidden for many months or years until a particular combination of circumstances brings it to light. When the error is finally revealed, severe consequences could occur without warning; for example, an emergency trip may fail to operate on demand, leading to a fire or explosion.

Unlike random failures, systematic failures cannot currently be mathematically modelled. Since it is impossible to test every combination of circumstances and events that could ever arise, we can never know for sure whether errors exist in our SIS, how many, or how serious they are. Statistical treatment is of little value, since error rate data collected in one environment is unlikely to be applicable to another. The only practical way to address systematic failures is to minimise them. The two main ways of doing this are:

• Reduce the number of errors made in the first place, for example by ensuring individuals are competent, providing clear requirements and procedures, and reducing the number of opportunities for error (fewer and simpler operations); and
• Provide opportunities to detect errors, for example by verification and review, and by recording and investigating every unexpected incident involving the SIS.
For this reason, IEC 61511 places great emphasis on software development techniques, management procedures, cross-checking of work completed (as discussed in Section 7.3) and competency of individual safety practitioners. Practical ways of addressing systematic failures are listed in Tables 1.4 and 1.5, while Table 1.6 and Fig. 1.3 suggest ways to distinguish between random and systematic failures.
Table 1.4: Practical methods for reducing errors that can cause systematic failures.

Ensure competency (Chapters 1, 6):
• Define the competency level required for each lifecycle task, including qualifications, experience and knowledge
• Assign individuals to tasks for which they are competent
• Encourage individuals to query any information or instructions they do not understand or agree with

Information availability (Chapter 6):
• Ensure resources are available, e.g. access to up-to-date versions of standards and codes of practice
• Provide and implement a document control system, to ensure everyone works from the latest version of each document. (This is often part of an ISO 9000 quality management system.)
• Use the SRS and other key lifecycle documents as the sole means of transferring information between individuals
• Use adequate labelling (of equipment and wiring) and commenting (of software code)
• Ensure procedures and manuals are available and fit for purpose: clear, unambiguous, complete, and provided in the local language

Simplification (Chapter 9):
• Do not use equipment with more features than actually required
• Make unneeded features (especially software features) unavailable
• Use passwords and other means of access control to limit the number of individuals that can change things (such as documents, wiring and software settings)
• Use restrictive languages for the application program
• Avoid unnecessary diversity. Use the same brand or type of equipment and software for all similar applications where practical.*

Familiarity (Chapter 9):
• Avoid unnecessary novelty. Use well-established and familiar equipment, procedures and methods

Suitability (Chapter 8):
• Use equipment and software only for its intended function. Pay attention to any restrictions listed in the equipment's Safety Manual.
• Use SIL-certified equipment and validated tools (software development tools, analytical software, test equipment). Alternatively, use equipment with a good, documented track record of prior use (see Chapter 9 for detailed coverage)

* However, this can conflict with avoidance of common cause failures. See Chapter 8 for further discussion.
Table 1.5: Practical methods for detecting errors that can cause systematic failures.

Review (Chapter 10): Follow a properly designated review procedure, especially for software development. Ensure an adequate degree of independence between the executing engineer and the reviewer. Record deviations and errors found, not for disciplinary purposes but to allow an assessment of whether systematic failures are properly under control.

Performance comparison (Chapter 11): Compare the expected and actual performance of the SIS, especially in terms of trip rate (real trips and spurious trips). If the actual trip rate is much higher than expected (based on random failure rate calculations), it indicates the presence of systematic failures in the design and/or implementation of the SIS.

Investigation (Chapter 11): Record and investigate all incidents of unexpected SIS behaviour, especially unwanted (spurious) trips, diagnostic alarms, test failures, issues found during maintenance, and events when the SIS is found to be in an abnormal state (e.g. unauthorised bypasses, parameters changed). Most of these will indicate the presence of some kind of systematic failure.

Maintenance (Chapter 11): When maintaining the SIS, always inspect and test it before carrying out any maintenance works such as cleaning and repair. Record the 'as-found' condition of the SIS, since this more accurately represents the 'real' state of the SIS during the majority of its working lifetime. Investigate the root cause of issues such as loose connections, corrosion or other physical damage, unauthorised or unexpected alterations from design (compare back with the SRS), and any other finding that could compromise the functioning of the SIS.
Table 1.6: Guidelines for classifying failures as random or systematic.

Random failures:
• Unconnected to any specific causal event
• Occurs within the design envelope of the SIS
• Not attributable to a specific design or operating error

Systematic failures:
• May be associated with a design error
• May be associated with exceeding the design envelope of the SIS
• Attributable to a specific root cause
• May be avoidable by a design change
• May be controlled by improved training and procedures
Figure 1.3 Decision flow diagram: classifying a failure as random or systematic.
Figure 1.3 Cont’d.
Why is this type of failure known as systematic? The term arises from the idea that the underlying error will systematically lead to a failure when a given set of conditions arises, step by step, with essentially 100% probability. For example, if there is a division-by-zero error in a line of computer code that runs as part of a housekeeping procedure once a month, the program will crash on a specific date. Unfortunately, the term systematic is prone to confusion, because it can also refer directly to a failure in a system (e.g. a management system). The terms deterministic, causative or induced would be preferable.
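A minimal illustration of such a latent division-by-zero bug (hypothetical code, not from any real SIS):

```python
def monthly_average(readings: list) -> float:
    """Hypothetical housekeeping routine: average this month's readings.
    Latent systematic error: if a month ever produces no readings,
    len(readings) is 0 and the division raises ZeroDivisionError.
    The failure is deterministic -- it occurs every time that condition
    arises, which is exactly what makes it systematic rather than
    random."""
    return sum(readings) / len(readings)
```

The routine may run correctly for years; the error only reveals itself the first time an empty month occurs, just as described above.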
1.4.6 Competency

Competency is a core concept of IEC 61511. It requires us to:

• determine the level of competency required to perform each safety lifecycle task, and
• assign individuals only to tasks for which they meet the competency requirements.
The competency requirements should be defined in terms of qualifications, general experience, directly relevant experience, background knowledge (e.g. of functional safety concepts and relevant regulations and codes of practice), and specific knowledge (of the process, equipment and procedures concerned). All this should be documented, to provide an audit trail for verifying that systematic failure controls are effective. Fig. 1.4 shows the core aspects of competency required by IEC 61511.

The standard requires only that each person is competent for the tasks they are performing. It is not necessary for every engineer in the project to have a full in-depth knowledge of every aspect of the SIS. The standard does not make any specific stipulation about what competence actually means in practice: that is up to each individual organization to decide.

One of the reasons for placing so much emphasis on assuring competency is that a great many serious incidents in the past have been traced back to competency failures. One chilling example relates to the collapse of a coal mine spoil heap at Aberfan, Wales, in 1966. According to Trevor Kletz, "responsibility for the siting, management, and inspection of tips was given to mechanical rather than civil engineers. The mechanical engineers were unaware that tips on sloping ground above streams can slide and have often done so." [4] The official report of the board of inquiry described the Aberfan Disaster as
a terrifying tale of bungling ineptitude by many men charged with tasks for which they were totally unfitted, of failure to heed clear warnings, and of total lack of direction from above. Not villains but decent men, led astray by foolishness or by ignorance or by both in combination, are responsible for what happened at Aberfan [4].

Figure 1.4 IEC 61511 competency requirements.
The result was 144 fatalities, 116 of whom were children in a nearby junior school.

Another reason for requiring evidence of competency is that many lifecycle activities are heavily outsourced. An end user will typically delegate a bundle of safety engineering activities to an EPC (engineering, procurement and construction) contractor. The EPC will, in turn, purchase SIS components from manufacturers, and hire consultants to help with safety analysis and verification activities. In each case, responsibility for ensuring competency is effectively being transferred from one entity to another, further and further from the final end user. Unless there are clear definitions of what constitutes competency and how it is controlled, the end user, who is ultimately responsible for safety, has no way of assuring the effectiveness of the safety products and services provided.
Chapter 7 explains how competency management can be achieved in practice.
1.5 The structure of IEC 61511

The IEC 61511 standard itself does not make easy reading, especially if English is not your mother tongue. It is, thankfully, more digestible than its parent standard, IEC 61508, whose readability suffers from having to be comprehensive and cover every kind of industry and situation. It is probably unnecessary for the individual safety engineer to read the standard from cover to cover; however, each user should at least understand the Safety Lifecycle, the documentation and verification requirements, and the aspects of the standard applicable to one's own responsibilities. It may be helpful, then, for us to take a quick tour of the standard here.

First, the standard is in three parts. Part 1 is the core of the standard and addresses the whole lifecycle, explaining the purpose and requirements for each phase. The phases covered in the greatest detail are SIS design and software development. It also contains a brief discussion of management and documentation issues. Importantly, it includes a substantial glossary of abbreviations and definitions. Unlike the similar glossary in Part 4 of IEC 61508, it has the advantage of being arranged, for the most part, in alphabetical order.

Part 2 is a series of Annexes. Annex A contains clause-by-clause guidance on many of the clauses in Part 1, although the guidance is of limited practical value for most users. Annex F is a worked example of the entire functional safety lifecycle. The remaining annexes cover special topics, mainly around application program development.

Part 3 focuses mainly on risk analysis methods. It can be treated as a textbook of background knowledge required for the risk analysis period of the safety lifecycle.

For a first-time reader, it would be most helpful to focus on Part 1, clauses 1 to 7 and 19; the clauses and annexes of Parts 1 and 2 most relevant to your own role; and, if you are involved in risk analysis, the relevant clauses and annexes of Part 3.
1.6 The origins of IEC 61511

One important characteristic of IEC 61511 and its parent standard, IEC 61508, is their strong emphasis on developing reliable software. The need for a focus on software reliability became apparent during the 1980s, as increasingly sophisticated
control hardware became available. While it was easy to write elaborate software to provide safety functions, it proved extremely difficult to prove the software was reliable. The difficulty lay in two separate aspects: getting the specification right, and writing applications that met the specification under all possible conditions. At the same time as these software difficulties were becoming obvious, hardware was rapidly advancing in complexity, to such an extent that it became impossible to demonstrate hardware integrity by testing alone.

Without being able to demonstrate safety in both hardware and software of instrumented safety systems, end users could not have confidence that major hazards were adequately controlled. This problem was further compounded by the ever-growing trend towards automated plants managed remotely by a small number of operators in a control room. Since the operational staff were increasingly dependent on self-contained trip systems to manage major upset conditions, the importance of confirming the dependability of those systems was clear.

The response of the International Electrotechnical Commission (IEC), an independent body based in Geneva with member committees representing the interests of 89 countries plus 84 affiliate members, was to set up separate groups to study the issue for hardware and software. The aim was that each group would develop a standard to assist developers and end users in claiming safety capability in their respective applications. The studies were merged in the early 1990s, giving birth eventually to an umbrella standard, IEC 61508, that covered both hardware and software integrity in detail. The merging of the two aspects of functional safety in a single standard was a recognition that many of the issues are the same: overall safety management, competency, the lifecycle approach, and configuration management are just a few of the aspects pertinent to both.
The major differences between hardware and software lie in the methods used to achieve and demonstrate integrity; this is reflected in the two separate parts (Part 2 and Part 3, respectively) that IEC 61508 dedicates to them. IEC 61511 was then developed as a specialization of IEC 61508 for the process industry, as we described earlier in Section 1.3.1.
Exercises

1. Consider the fictional incident in Section 1.1. Select two of the equipment failures described. Are they likely to be random or systematic failures?
2. Traditionally, a cable trailed across the floor in an office environment has been regarded as a "hazard." How does this fit with the concepts of "hazard" discussed in this chapter?
3. Look up the definition of one of this chapter's safety management concepts (such as hazard, risk, harm and risk receptor) in Wikipedia. Given that Wikipedia aims at a broad, non-specialist readership, how does its discussion compare with the one here? What does this say about society's attitude to safety?
4. Classify the following failures as random or systematic, according to the discussion in Section 1.4.5. For each failure, describe how it should best be addressed (to minimize the chance of it causing harm).
(a) A shutdown valve sticks open when required to close. The valve is suitably designed for its operating environment and process fluid, and is within its usable life.
(b) Same case as (a) except the valve missed its last proof test.
(c) Same case as (a) except the valve has exceeded its useful life.
(d) A pressure transmitter has worked loose on its mountings. As a result, it is vibrating severely. It fails due to a crack in the PCB (printed circuit board).
(e) A software bug causes a safety function in a SIS to receive an incorrect "override" signal. As a result, the safety function does not trip when required.
(f) A manual "override" key switch is faulty and overrides a safety function incorrectly. As a result, the safety function does not trip when required.
(g) A forklift truck strikes an instrument air line. As a result, air supply to a shutdown valve is lost, and the valve closes spuriously.
Answers

Question 1: Answer
The failure of the primary level measurement in the solvent storage tank, and the false vapour alarms triggered by the wind, could be random failures. All the other failures mentioned are systematic, as they are associated with specific errors in design, operations or maintenance. The high level trip in the solvent tank is not really a failure, as it is not faulty, but bypassed. However, the root cause (poor management of bypasses) could be addressed by the same type of solution as systematic failures, so it could usefully be categorized as a systematic failure.
Question 2: Answer
Categorizing a trailing cable as a hazard is a simple concept, but it rather unhelpfully focuses attention on the cable itself, leading us towards imperfect risk management solutions such as "tape the cable to the floor" or "put the cable under the carpet". If we trace the causal chain back to the "equipment with potential to cause harm", it helps us direct our attention to the copying machine attached to the cable. Relocating the machine could be a better solution, and might also draw our attention to other related issues such as noise, dust and ozone emanating from the machine. In other words, treating the machine as the hazard may yield a more effective analysis of the risk. (This can also help turn our attention towards intrinsically safer risk management solutions.)

Another possible approach is to look for the 'reservoir of energy' with the potential to cause harm. The injury resulting from tripping over the cable derives from the potential energy of the person's body. Again, this helps us change our focus to the real issue: the problem is not the cable, but the person having to cross the cable. Can we find a means to separate people from cables? If so, we have removed one route by which the stored energy can cause harm.
Question 4: Answer
Not all practitioners agree on the boundaries between random and systematic failures, so your answers may differ from those suggested here.
(a) Random.
(b) Random. The fault is not related to whether the valve was tested, provided the valve is still within its useful life.
(c) Systematic. The valve should not be used beyond its useful life. It is very likely to fail as a result of using it beyond its useful life.
(d) As the transmitter is probably not designed for a high vibration environment, this would likely count as a systematic failure.
(e) Systematic. All failures arising from software bugs are systematic failures.
(f) Random.
(g) Systematic. However, this does not count as a dangerous failure, because it causes a spurious trip. So there is little purpose in classifying it as a random or systematic failure. (See Chapter 2 for discussion of the term "dangerous failure.")
References

[1] E. Marszal, E. Scharpf, Safety Integrity Level Selection, ISA, Research Triangle Park, 2002.
[2] Anon, An Introduction to Functional Safety and IEC 61508. Application Note AN9025, MTL Instruments Group, 2002. http://www.mtl-inst.com/images/uploads/datasheets/App_Notes/AN9025.pdf (retrieved on 8 January 2022). An admirably clear and readable starting point for readers who are unfamiliar with the area of functional safety.
[3] P. Clarke, Setting the Standard, Control Engineering Asia, May 2011, pp. 12-18. Contact the author for a copy via www.xsericon.world. Focuses on the benefits of compliance for SIS designers and end users.
[4] E. Davies, Report of the Tribunal Appointed to Inquire into the Disaster at Aberfan on October 21st 1966, HL 316, HC 553, HMSO, 1967 (retrieved on 6 January 2022), https://www.dmm.org.uk/ukreport/553-04.htm.
Further reading

The following resources provide wide-ranging coverage of the functional safety lifecycle.
[1] Center for Chemical Process Safety (CCPS), Guidelines for Safe and Reliable Instrumented Protective Systems, Wiley, New York, 2011 (Chapter 1 provides an especially lucid introduction to the role of functional safety in risk management).
[2] P. Gruhn, S. Lucchini, Safety Instrumented Systems: A Life-Cycle Approach, ISA, Research Triangle Park, 2019 (Detailed and extensive coverage of SIS design and implementation, for process applications. Especially detailed on hardware and software design and SIS validation).
[3] K. Kirkcaldy, Exercises in Process Safety, Available from: Amazon, Self-published, Milton Keynes, 2016.
[4] K. Kirkcaldy, D. Chauhan, Functional Safety in the Process Industry: A Handbook of Practical Guidance in the Application of IEC 61511 and ANSI/ISA-84, Available from: Amazon, Self-published, Milton Keynes, 2012 (See especially chapters 13-19).
[5] SINTEF, Guidelines for the Application of IEC 61508 and IEC 61511 in the Petroleum Activities on the Continental Shelf (Guideline GL 070), Offshore Norge, Stavanger, 2018 (Concise coverage of many aspects of functional safety for the oil & gas industry).
[6] D. Smith, K. Simpson, Safety Critical Systems Handbook: A Straightforward Guide to Functional Safety, IEC 61508 (2010 Edition) and Related Standards, Including Process IEC 61511, Machinery IEC 62061 and ISO 13849, third ed., Butterworth-Heinemann, Oxford, 2011 (Chapter 4 on software is particularly useful).
CHAPTER 2
Basic terminology: SIF, SIS and SIL

Abstract

A Safety Instrumented Function (SIF) is a set of equipment designed to reduce the risk of harm from a specified type of incident, by taking an automatic action when required. A number of SIFs may be implemented together in a Safety Instrumented System (SIS), which is the sum of all equipment required to implement the constituent SIFs. The amount of risk reduction required of each SIF is expressed as a Safety Integrity Level (SIL). A SIF is implemented using one or more sensors, a logic solver, and one or more final elements (devices acting directly on the process, such as a shutdown valve). To demonstrate that the SIF achieves its SIL target, a calculation known as SIL verification is executed. This requires an understanding of the failure modes of the hardware used to implement the SIF, the way failure rates are expressed in terms of λ values, and the concept of hardware fault tolerance.
Keywords: Failure mode; Failure rate; Final element; Hardware fault tolerance; Logic solver; Safety instrumented function; Safety instrumented system; Safety integrity level; SIL verification.
For an engineer starting out on the road of IEC 61511, it's vital to have a clear understanding of the terms used: safety instrumented system, safety integrity level, safe failure fraction, systematic capability and all the rest. The learning curve is not made easier by the waves of acronyms such as SIF, SIS and SIL, PFDavg, RRF and IPL; besides, the definitions in the standard are not always that helpful, since they tend to focus on precision rather than clarity. To smooth this journey through functional safety, we will home in on its two main focal points: the Safety Instrumented Function, or SIF; and the Functional Safety Management Plan, or FSMP. Really, the whole of IEC 61511 is oriented around these. The job of the standard is to help us determine:

• whether we need any SIFs,
• what they should do,
• how reliable they must be,
• how they should be designed,
• how we can be sure they meet the safety requirements defined for them, and
• how to ensure the SIFs continue to fulfil their purpose, which is to provide risk reduction, for as long as needed.
Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00011-2 Copyright © 2023 Elsevier Inc. All rights reserved.
2.1 The meaning of SIF, SIS and SIL

2.1.1 What is a SIF?

A Safety Instrumented Function (SIF) is a set of equipment that automatically acts when necessary to prevent some kind of harm. The SIF should take its action under clearly defined circumstances, for example on high level in a knock-out drum. The overall purpose of the SIF is to reduce the risk of a specific harmful event, or in other words, to achieve risk reduction. A SIF is normally implemented using three groups of hardware components known as subsystems. The subsystems are:

• Sensors: devices to measure one or more variables in the process equipment, such as pressure
• Logic solver: a device that analyses the information provided by the sensors, and decides whether the SIF needs to act
• Final elements: elements, such as valves and motor controllers, that act when the logic solver commands them to do so.
The SIF includes a definition of everything needed to achieve its objective of risk reduction. This includes aspects such as:

• identification of the hardware needed
• the software programmed into the hardware to give the required function
• a definition of what the function must do (e.g. close a shutdown valve), and when (e.g. when a specified pressure transmitter sees a high pressure)
• how reliable the function must be; and
• a great deal of other information, such as testing requirements.
2.1.2 What is a SIS?

For a given plant or unit, a collection of SIFs is known as a Safety Instrumented System (SIS). Another common name is an Emergency Shutdown (ESD) system. The SIS is simply the sum of all the SIFs defined for the unit. A common misunderstanding is that the term SIS means only the logic solver, or safety PLC (a specialised type of computer) at the heart of the functional safety system: the part that makes the decisions about whether the SIFs should act. Although the term SIS is often used in this way, it is not in line with how the term is used in IEC 61511.
2.1.3 SIL, reliability, and integrity

We mentioned above that the SIF must have a definition of how reliable it should be (the reason this is important is explained in Chapter 4). Its reliability is defined numerically using two values:

• the SIF's probability of failure (PFDavg or PFH; their exact meaning is explained in Chapter 4), and
• the SIF's mean time to fail spurious (MTTFS): a measure of the likelihood that it will take action at the wrong time, i.e. when there was no need to act.
The full range of possible PFDavg and PFH values is divided into bands, each of which is associated with a safety integrity level (SIL). SILs are numbered from 1 to 4, with 1 the lowest level of integrity (i.e. highest probability of failure). A SIL is assigned individually to each SIF in the system. A typical SIS could contain SIFs at SIL 1, SIL 2 and SIL 3 (SIL 4 is rarely applied in the process industry, as it is hard to achieve in practice). The definition of SIL in terms of PFDavg and PFH bands is shown in Chapter 4, Tables 4.2 and 4.3.

During the risk analysis stage, a target SIL will be selected for each SIF, according to how much risk reduction the SIF is required to provide (the task of selecting a SIL is explained in Chapters 4 to 6). Then, during the design stage, the SIF will be designed to achieve the target SIL. The SIL that is achieved by the SIF, as actually implemented in the plant, must equal or exceed the target SIL.

What is the difference between reliability and integrity? In functional safety, 'reliability' is a mathematical term expressing the probability that the SIF has not suffered a random hardware failure. However, control of systematic failures is also necessary. The term 'integrity' expresses the degree of confidence that both random and systematic failures are adequately controlled. Thus, the term 'SIL' (safety integrity level) implies attention to all aspects of the functional safety lifecycle.
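The band structure can be sketched in a few lines. The low-demand PFDavg bands below are the widely published IEC 61511 values (the same bands given in Tables 4.2 and 4.3 of this book); the function names are ours.

```python
from typing import Optional

# Low-demand-mode PFDavg bands per IEC 61511: (lower bound, upper bound, SIL)
SIL_BANDS = [
    (1e-5, 1e-4, 4),
    (1e-4, 1e-3, 3),
    (1e-3, 1e-2, 2),
    (1e-2, 1e-1, 1),
]

def sil_from_pfd_avg(pfd: float) -> Optional[int]:
    """Return the SIL band met by a low-demand PFDavg, or None if the
    value falls outside the defined bands (PFDavg >= 0.1 earns no SIL;
    SIL 4 is the highest level the standard defines)."""
    for lower, upper, sil in SIL_BANDS:
        if lower <= pfd < upper:
            return sil
    return None

def risk_reduction_factor(pfd: float) -> float:
    """RRF is simply the reciprocal of PFDavg."""
    return 1.0 / pfd
```

For example, a SIF with a calculated PFDavg of 5 × 10⁻³ falls in the SIL 2 band and provides a risk reduction factor of 200.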
2.1.4 What is an interlock (or trip)?

When we look at the representation of SIS functions on a P&ID, Interlock Logic Diagram or Cause & Effect Diagram, we often see "S" or "IS" numbers such as IS-101 (depicted in diamonds on a P&ID), indicating "safety shutdown" functions. These are typically called interlocks or trips.
These interlocks do not simply correspond one-to-one with SIFs. A SIF is defined as providing protection against a specified hazardous event. However, the S/IS number interlocks on a P&ID may be protecting against multiple hazardous events, and therefore could correspond to combinations of SIFs, or parts of SIFs. If you see an interlock that does any of the following, it's likely to be a composite of multiple SIFs:
• triggers another interlock when it trips
• incorporates many separate functions in one S/IS number
• acts on multiple final elements when it trips
At an early stage of the safety lifecycle, it's important to analyse the interlock and divide it into small, manageable SIFs. Each SIF is defined in terms of the hazardous condition it prevents, and the hardware elements (such as sensors and shutdown valves) needed to achieve its goal. The advantages of dividing up the interlock into SIFs include:
• A SIF that targets a smaller number of hazards is likely to have a lower SIL requirement.
• The fewer components there are in a SIF, the easier it is to achieve the required SIL, and the cheaper the components can be.
• If one SIF in the interlock has a higher SIL than another, we avoid using high SIL-rated components for the lower SIL-rated SIF (instead of applying the higher SIL requirement to the whole interlock).
• It is easier to select the target SIL, and calculate the achieved SIL, for simpler SIFs.
Even though we ignore some of the sensors and actions in the interlock when defining the SIF, please be aware that we do not physically eliminate them from the interlock, or change the interlock’s program logic in any way. In summary, it’s essential to make sure that SIFs’ architectures are correctly defined. Don’t assume an “interlock” will translate directly into a single SIF. We’ll address the topic of developing SIFs from interlocks in detail in Chapter 4.
2.2 Anatomy of a SIF
What makes up a SIF? Certain parts of the design must be present for every SIF, as we will explain in this section. The hardware used to implement the SIF is considered to form three subsystems: sensors, logic solver and final elements. We consider these first.
Basic terminology: SIF, SIS and SIL 31
2.2.1 The sensor subsystem
The SIF receives input information from some kind of sensor. This detects the value of a parameter in a specific location. Examples of parameters measured by sensors include:
• conditions inside process equipment such as pressure, temperature, flow, level, density, pH, and conductivity
• signals indicating malfunction of mechanical equipment, such as vibration, temperature, and axial displacement
• flame detection inside fired heaters
• proximity of an object, such as part of a human body (these sensors may be used to prevent injury or damage from moving equipment)
Most sensors will generate an analog value (called a process variable or PV) within a given range (called the span or calibration range). Process variable measurements can be used in several ways to trigger the SIF:
• compare with a defined value (known as the setpoint, trip point or threshold), and trigger the SIF if the PV passes through the threshold
• trigger the SIF if the PV's rate of change is greater than a defined value (this could be either positive or negative)
• trigger the SIF if the PV is invalid. An invalid PV, that is, a PV outside the span of the instrument, may indicate instrument malfunction. Also, many "smart" instruments can be programmed to transmit an invalid PV to indicate that a fault condition has been detected.
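These three triggering mechanisms can be sketched as follows (an illustration only; the span, setpoint and rate limit below are invented values, and real trip logic is implemented in the safety PLC's application program):

```python
def should_trip(pv, prev_pv, dt, span=(0.0, 100.0), setpoint=80.0, max_rate=5.0):
    """Decide whether to trigger the SIF from a single PV reading.

    pv, prev_pv: current and previous process variable samples
    dt: time between samples (seconds); span: calibration range
    setpoint: trip threshold; max_rate: rate-of-change limit per second
    """
    low, high = span
    if not (low <= pv <= high):           # invalid PV: outside the span
        return True, "invalid PV (possible instrument fault)"
    if pv >= setpoint:                    # PV passed through the threshold
        return True, "setpoint exceeded"
    if abs(pv - prev_pv) / dt > max_rate: # rate of change too high
        return True, "rate-of-change limit exceeded"
    return False, "normal"

print(should_trip(85.0, 84.0, 1.0))   # (True, 'setpoint exceeded')
print(should_trip(120.0, 84.0, 1.0))  # out-of-span reading flagged as invalid
```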
In many cases, the signals from multiple sensors are used to determine when the SIF should act. Examples include:
• Voting schemes. For instance, three identical sensors may be provided, and the SIF programmed to trip when at least two of them go past their setpoint. Voting arrangements are intended either to improve the reliability of the SIF, or reduce its tendency to trip spuriously (when not required), or both [1].
• Differential measurements. The signals from two sensors are compared, and the difference is calculated (PVA − PVB, where PVn means the measured value of the process variable from sensor n). The SIF is programmed to trip when the difference passes through a specified value. Note that level and flow are often measured by differential pressure, so this is an example of a differential measurement.
• Deviation measurements. The signals from two sensors are compared, and the SIF is activated when the absolute difference between them exceeds a preset value (for example, 2% of the span of the sensors). Although not common, deviation measurements are occasionally used in situations where, for example, the flows of two different streams of feedstock need to remain in balance for safe and stable operation.

Apart from PV measuring sensors, other signals may also be used to initiate a SIF. These include:
• discrete (on/off) signals indicating equipment status, e.g. a motor trip signal
• discrete signals from limit switches, which are used to indicate the actual position of valves, inspection doors etc.
• flame, heat, smoke or toxic gas leak. Collectively, these are known as "fire and gas sensors" and are often connected to a separate system known as the Fire & Gas System (FGS). However, occasionally a SIF will be defined with one of these sensors.
• a command from another SIF (one SIF may activate another). For example, if a SIF shuts down a process unit, another unit further upstream may need to be tripped or put into a recycle state.
• handswitches (e.g. an emergency stop button). See Chapter 6 for discussion of this special case.
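The differential and deviation comparisons described above can be sketched as follows (an illustration; the thresholds used in the examples are invented values):

```python
def differential_trip(pv_a, pv_b, threshold):
    """Trip when the signed difference PV_A - PV_B passes the threshold
    (e.g. a differential-pressure measurement used for level or flow)."""
    return (pv_a - pv_b) >= threshold

def deviation_trip(pv_a, pv_b, span, max_dev_pct=2.0):
    """Trip when the absolute difference between two sensors exceeds a
    preset fraction of their common span (e.g. 2% of span)."""
    return abs(pv_a - pv_b) > (max_dev_pct / 100.0) * span

print(differential_trip(10.0, 4.0, 5.0))         # True: difference of 6 passes 5
print(deviation_trip(51.0, 48.0, span=100.0))    # True: 3 units > 2% of span
```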
Digital signals can also be compared to generate discrepancy or disagree events. For example, a valve's actual position can be measured by limit switches, and compared with the expected value (a digital signal within the safety PLC). If the values disagree, perhaps after a defined time delay, it would indicate a misoperation or a hardware malfunction. This disagreement can be used to activate a SIF.

Other components of the sensor subsystem
Additional hardware may be required to transfer the signal from the initiator to the logic solver. Examples of such hardware are shown in Table 2.1. There is also likely to be firmware in the sensor, if it includes a "smart transmitter". The firmware may be used to define the span (minimum and maximum value of the sensor's working range, e.g. 20 to 100 °C), and also the required behaviour in the event of an out of range value or detection of a fault. This is covered in more detail in Chapter 7.

The MooN concept for initiators
SIFs are often equipped with multiple initiators. These may be designed so that any one of the initiators is capable of detecting the hazardous situation. This architecture is known as 1 out of N (or 1ooN for short), where N refers to the total number of initiators provided. Another possible architecture is N out of N (NooN), where the SIF is required to act only if all N of the detectors see the hazardous situation. SIFs can also be designed such that the SIF is activated if at least M out of the N detectors sense the situation: this is an M out of N (MooN) architecture, of which a common example is 2oo3.
Table 2.1: Additional hardware that may be included in the sensor subsystem.

Hardware item: Intrinsically safe barrier
Function: Ensure the wiring voltage in areas where explosion hazards may exist is kept low, so that any spark (e.g. caused by a wiring fault) has insufficient energy to cause an explosion.

Hardware item: Wiring
Function: Transmit signals from sensors to PLC. Signals may be either binary (on/off) or analog (4-20 mA loop).
Comments: SIS sensors are normally individually hard-wired to the inputs of the PLC. The routing of SIS wiring must be carefully considered. Any factor that might cause loss of SIS integrity should be taken into account, such as vulnerability to physical or chemical damage, vibration, and electromagnetic interference. In exceptional cases, it might be necessary to have diverse routing for wiring from redundant sensors, to reduce the risk of common-cause failure.

Hardware item: Bypass/test switch
Function: Provide means to bypass the sensor for testing or startup overrides (e.g. on a low pressure trip on a pump).
Comments: Great care must be taken to ensure that it is impossible to leave the sensor inadvertently bypassed; this can be done, for example, using bypass alarms on timers, or through a management system such as Permit-to-Work.
It is extremely important to understand exactly what architecture should be defined for the SIF. Misunderstandings are rife in this area. For example, a reactor's catalyst bed may be provided with 20 temperature sensors; if any two of them detect a high temperature, they will trip the reactor. Is this a 2oo20 architecture? It depends on whether those two detectors can sense trouble in any part of the bed, or, to put it another way, whether the 20 detectors are functionally equivalent (all can be considered to detect exactly the same event). In practice, a hot spot on one side of the bed may only be detectable by, say, four detectors in the immediate vicinity. For that particular hot spot, the other 16 detectors are irrelevant. If this is the case, the initiator architecture is not 2oo20 but 2oo4.
Why is it so important to specify the architecture correctly? The defined architecture has a strong effect on the calculated reliability performance of the SIF. Specifying that the SIF is 2oo20 would imply that 18 of the 20 detectors (90%) could fail, and yet the SIF would still work. On the other hand, a 2oo4 SIF can tolerate no more than two of its detectors (50%) in a failure state. Thus, the 2oo4 SIF places much greater demands on the availability of the sensors. For detailed coverage of this topic, refer to Ref. [2].
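The voting rule itself is simple to express (a sketch; `moon_vote` and `hardware_fault_tolerance` are hypothetical helper names, not part of any standard). The key point is that an MooN group can lose at most N - M sensors to dangerous failure and still trip on demand:

```python
def moon_vote(tripped, m):
    """MooN voting: act when at least M of the N inputs demand a trip.

    tripped: list of booleans, one per functionally equivalent sensor.
    """
    return sum(tripped) >= m

def hardware_fault_tolerance(m, n):
    """Number of dangerous sensor failures an MooN group can tolerate
    while still being able to trip on a real demand."""
    return n - m

# 2oo3 voting: any two of three sensors past their setpoint causes a trip
print(moon_vote([True, True, False], m=2))  # True
# A 2oo4 group tolerates 2 failed sensors; a (mislabelled) 2oo20 would tolerate 18
print(hardware_fault_tolerance(2, 4), hardware_fault_tolerance(2, 20))  # 2 18
```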
2.2.2 The logic solver subsystem
The logic solver is the part of the SIF that decides whether to activate the SIF. Nowadays, it typically consists of a safety-rated PLC (a specialized computer), although other forms of logic solver are available (see Chapter 8).

The logic solver's primary role is to receive data from the initiators, process the data and send appropriate output to the final elements. For this purpose, it normally needs to be programmed to know how to interpret the input data. This programming is specific to each application, and so it is known as the application program. The logic solver may carry out a number of additional functions, as detailed in Table 2.2. Many of these are not essential for achieving the objectives of the SIFs, but help in smooth operation of the plant and detection of faults.

Physically, the safety PLC normally comprises a powerful computer mounted in a cabinet, an Uninterruptible Power Supply (UPS: a battery-backed system for ensuring power remains available even if the mains supply fails), a set of Input/Output (I/O) cards for connecting to sensors and final elements, and means of communication to the basic process control system (BPCS). The PLC may be connected to a network, usually for remote diagnostic purposes; however, since this introduces the potential for security compromise, some operating companies require safety PLCs to be standalone. The computer is normally dedicated to SIS functions alone and runs only the minimum set of software required for reliable SIS operation, such as the software required to implement the SIFs, event logging (historian) and backup, diagnostics and event analytics, and possibly a purpose-built user interface. The underlying OS is a dedicated, high integrity system.
2.2.3 The final element subsystem
Once the safety PLC has determined that the SIF should act, it sends the appropriate commands to the final elements. When these are received, the final elements are required to take positive action to prevent (or reduce) harm. The most common types of final element are discussed in the following sections.
Table 2.2: Additional functions typically performed by a safety PLC.

Function: HMI
Purpose: Human-machine interface: provide information to operators and maintenance personnel about the status of the SIFs and the equipment; provide means to initiate operations such as manual trips, overrides and testing; provide means to inspect and analyse event logs; provide means to program SIS functions.

Function: Event logging
Purpose: Automatic recording of all events such as activation of sensors, diagnostic messages received from any part of the SIS, human-initiated events such as manual trips and overrides, and alarm events.

Function: Alarms
Purpose: Provide alarms to the operator. These fall into two main categories: (1) sensor events such as pre-alarms and trip alarms, and (2) diagnostic events such as hardware or software fault warnings, loss of signal integrity, deviations between multiple sensors, communications failure, UPS fault, and overrides left in place beyond a preset time limit.

Function: Communications with other systems
Purpose: The SIS PLC often has a data link to the BPCS, so that data from SIS sensors can be shown in the BPCS user interface. This also allows SIS alarms to be repeated in the BPCS, and also enables deviation measurements to be made between BPCS and SIS sensors, as a way of detecting sensor failure. Data transfer in the opposite direction, from BPCS to SIS, is minimised, as it creates an opportunity for failure and security compromise. This is one of several reasons why BPCS sensors are normally not used in SIFs.

Function: Diagnostics
Purpose: Detect failures in hardware attached to I/O cards (sensors, final elements). Detect failures in its own hardware and software, and perform functions to prevent loss of integrity (e.g. switch to backup execution channel, raise alarm, and/or activate SIFs to move the process to a safe state).
Actuated valves
Actuated valves are provided, in most cases, to prevent harm by:
• releasing pressure to a safe location (emergency depressurization valves, EDVs, or blowdown valves, BDVs)
• cutting off flow of liquids, gases or solid particles to prevent overfilling or contamination; or
• cutting off flow of liquids or gases to prevent overpressure downstream, especially in gas breakthrough situations (emergency shutdown valves, sometimes known as SDVs or ESDVs).
The valve is operated by either a motor, or by an actuator driven by pneumatic pressure (usually instrument air) or hydraulic pressure. The pneumatic or hydraulic supply is controlled by a solenoid-operated valve (SOV). The SOV, in turn, is controlled by the presence or absence of low voltage power (typically 24 V DC) from the safety PLC.
Actuators come in two types. The most common, in SIL-rated applications, is single-acting or spring return. This means that the actuator needs pneumatic or hydraulic pressure to keep it in one position; when the pressure is released, it will move to the other position by spring force. The other type of actuator is not equipped with a spring and must be positively driven to move in either direction; this type is known as double-acting. Double-acting actuators are generally smaller and lighter (as they do not need to contain a powerful spring), so they may be used in applications where space is at a premium.

The process valve is set up to move to a defined position, open or closed, when the instrument air (or hydraulic fluid) pressure is lost. The valve symbol on the P&ID is normally marked "FO" (fail open) or "FC" (fail closed) accordingly. Occasionally, the valve is required to remain in its present position when the air fails; this is known as fail last ("FL"), fail in position or stay put. The decision about whether the valve assembly should be FO, FC or FL depends on:
• whether the SIF should close or open the valve when it trips. Normally, the selected fail position will be the same as the trip position; and
• how much harm is caused by a spurious trip. In rare cases, a spurious trip can be so undesirable that the design team decides to define the failure position as the opposite of the trip position, or as FL.

Confusion alert: "fail closed"
The term "fail closed" can cause confusion during risk analysis studies (e.g. HAZOP). For example, the HAZOP team should consider what happens if the SIF acts spuriously, or the valve's actuator loses its air supply. This might be described as "valve fails closed". Also, it may be necessary to consider what happens if the SIF does not act when required. This situation might be described as "fail to close". Both of these usages have different meanings from the "fail closed" phrase described in this section. For this reason, it's best to avoid using the terms "fail to close" and "fails closed" during risk analysis; use alternative terms such as "valve closes inadvertently" and "valve does not close when required".
Motor-operated valves (MOVs) are process valves directly moved by an electric motor. They are nearly always fail last type. MOVs are less commonly applied in SIFs than SOVs, because of the need for a power supply, and the fact that 'fail last' is often not the preferred configuration for a final element (as it means the valve cannot automatically move to the tripped position on loss of power). MOVs are generally selected for SIS applications only if the torque requirement is too large for an air- or hydraulically-operated actuator, e.g. for large diameter process valves.
Motor control circuits
After actuated valves, the next most common final element is a motor control circuit. The SIF can act to start or stop motors on pumps, compressors, fans and material handling equipment, to name a few examples. In the case of centrifugal pumps, the purpose of the SIF may be to prevent damage to the pump due to low flow or cavitation. Compressors may be stopped to prevent damage due to surge or liquid ingress. A special case is that a SIF may act to start a firewater pump. Since the action of the SIF is to attempt to put out the fire, some safety practitioners consider this not to be a true SIF, as it is not guaranteed to achieve the safe state (fire out with no damage or injury).

Other final elements
Alarms. SIFs may be defined to activate an alarm, to which an operator is expected to respond. The most common example is fire and gas detection alarms, where the decision to apply firefighting measures rests with the operator. As in the case of the firewater pump, a SIF that acts by sounding an alarm only is not actually a complete SIF, because it does not achieve the safe state by itself. Ideally, the SIF definition should also include the operator's action and the hardware used to achieve that action (e.g. the firewater deluge system); but this is often not included in practice.

Software actions. Cause and effect diagrams often show that the SIF should trigger other actions through software signals. For example, it may act on other control valves through the BPCS, or trigger a further SIF. In such cases, the SIF should be redefined to include the actual valves (or other final elements) acted upon, if they are critical. A soft signal is not a meaningful final element for a SIF, because it cannot achieve a safe state by itself.

Other elements of the final element subsystem
As with the initiator subsystem, the final element subsystem is likely to include additional hardware.
Examples of such hardware are shown in Table 2.3.

The MooN concept for final elements
A single SIF may be equipped with more than one final element. There are several possible reasons for this:
• The SIF may need to take several actions to achieve the process safe state. In this case, all the final elements need to work correctly on demand. This architecture is known as N out of N (NooN). An example is multiple gas wellheads connected to a manifold: when high pressure is detected in the manifold, all online wells must be shut down successfully to ensure no overpressure occurs.
• Multiple final elements are provided to increase reliability. For example, two or more shutdown valves can be provided in series. The architecture is described as 1 out of N (1ooN), because correct functioning of any one of the valves is sufficient to control the risk. An example is a double block valve on the fuel supply to a fired heater.
• Multiple final elements are provided to reduce the rate of interruption to the process. For example, two or more shutdown valves can be provided in parallel. Spurious closure of any one valve does not stop the flow, so it does not disturb the process. However, all the valves must work to control the risk, so the architecture is NooN. This scenario (providing multiple final elements to reduce spurious trip rate) is quite rare.

Table 2.3: Additional hardware that may be included in the final element subsystem.

Hardware item: Intrinsically safe barrier
Function: Ensure the wiring voltage in areas where explosion hazards may exist is kept low, so that any spark (e.g. caused by a wiring fault) has insufficient energy to cause an explosion.

Hardware item: Wiring
Function: Transmit signals from PLC to the active component (e.g. valve motor, solenoid or motor control circuit). Signals are likely to be binary (on/off) and could be low voltage (e.g. signals to MCC) or medium voltage (powering a solenoid or valve motor directly).

Hardware item: Interposing relay
Function: Provide electrical isolation between the PLC and the final elements, e.g. for sending trip signals to motor control circuits.

Hardware item: Trip amplifier
Function: These are small modules that take the place of a safety PLC. They receive an analog input from a sensor, compare the signal with a defined threshold, and send a discrete trip signal to a final element when necessary.

Hardware item: Other utilities (instrument air, hydraulic fluid)
Function: Provide motive force for valve actuators.

Hardware item: Power supply
Function: A separate power supply may be required to drive devices such as motorized valve actuators or emergency exhaust fans.
Combinations of these architectures are also possible. For example, suppose we have a fired heater with 3 separate lines supplying fuel. Each line has a set of 1oo2 shutdown valves (e.g. a 'double block and bleed' arrangement), and all three sets must work to achieve the safe shutdown. This could be described as 3oo(3 × 1oo2). It would not be correct to describe the architecture as 3oo6 (six valves in total, any three must function correctly), as the six valves are not equivalent.

Open-to-trip cases
As mentioned in the text, multiple shutdown valves in series form a 1ooN architecture. If the valves are in parallel, the architecture is NooN.
The opposite is true if the valves are open-to-trip (e.g. depressurization valves). In this situation, valves in series must all function correctly on demand, so the architecture is NooN. If the valves are in parallel, it is necessary to check how many valves must open to allow enough flow to prevent the harm. So the architecture could be 1ooN, 2ooN or any other M in MooN.
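The series/parallel reasoning above can be captured in a small decision function (hypothetical helper, for illustration; real architecture assignment for parallel open-to-trip valves also needs the flow check just described, so M is left undetermined in that case):

```python
def valve_group_architecture(n, arrangement, action):
    """Map a group of n identical valves to its MooN architecture
    for the trip action.

    arrangement: "series" or "parallel"; action: "close" or "open"
    Returns (m, n): M out of N valves that must work on demand.
    For parallel open-to-trip groups, M depends on how much flow is
    needed, so it is returned as None (anywhere from 1 to N).
    """
    if action == "close":
        # series: any one closed valve stops the flow -> 1ooN
        # parallel: every valve must close to stop the flow -> NooN
        return (1, n) if arrangement == "series" else (n, n)
    # open-to-trip: series valves must all open -> NooN;
    # parallel valves: M determined by the required flow
    return (n, n) if arrangement == "series" else (None, n)

print(valve_group_architecture(2, "series", "close"))  # (1, 2): double block, 1oo2
```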
Fig. 2.1 suggests points to consider when selecting final elements.
2.2.4 Permissives and inhibit functions
Some SIFs are intended to prevent, or allow, a sequence to proceed. For example, a SIF description in a burner management system may specify: "Do not allow the burner startup sequence to begin if the fuel gas pressure is low." As it stands, this definition indicates that the final element is a software function: don't allow a programmed sequence to proceed. However, software functions are not meaningful as final elements: we should trace the system all the way through to the critical hardware involved. In this case, it may be that the important action is to inhibit opening of the fuel gas valves to the burners: in other words, the SIF could be specified as "on low fuel gas pressure during startup, close the fuel gas valves to the burners". The fact that they should already be closed at this point (since the burner is not yet lit) might not be important.

The SIF is best defined as taking a positive action on a specified item of hardware. If we don't write it this way, the SIS designers will not know which items of hardware need to be SIL-rated and included in the SIS.

Enable functions (permissives) should be rephrased as inhibit functions. For example, a SIF may be written as "when the pressure and temperature in the stack are low, allow the operator to unlock the inspection door". Instead, it would be better to write "on not-low pressure or not-low temperature in the stack, lock the stack inspection door". This makes it clear that the final element is the lock itself. The point is that if the locking action fails, the risk to the operator is not properly controlled. The SIF should always be defined in such a way that its failure leads to loss of risk control.
2.2.5 Other important aspects of a SIF
So far, we have discussed all the hardware that comprises a SIF. To define a SIF fully, we also need to determine many other aspects of its behaviour and performance. These are covered fully in our discussion of the Safety Requirements Specification (SRS), which you can find in Chapter 7. Briefly, here are a few key aspects of the SIF that need to be defined as early as possible in the design process:
Figure 2.1 Decision flow diagram: Selecting final elements. This diagram provides suggestions relating to the selection of final elements for a range of situations. Hints on potentially achievable SIL targets are also shown; see Chapter 4 for detailed discussion.
• What logic the PLC is required to perform.
• The response time: the SIF must complete its action within this time to fully bring the process to the defined safe state.
• Any conditions or limitations on when the SIF is required to function: for example, "during process startup".
• How the SIF is to be tested and maintained.
• What additional protective measures are required when the SIF is offline for maintenance or repair (for example, the process may be required to run at reduced capacity).
• What diagnostics are required in each of the three subsystems, and how faults are to be handled when detected.
• What SIS-related information the operator needs on the HMI.
• What overrides are required and permitted, who is allowed to override SIFs, and how overrides are to be controlled to prevent misuse.
• What arrangements are provided to reset the SIF after it is activated.
2.3 Development of a SIF
A SIF goes through several distinct stages during the design and development phase of a plant. These work in parallel with the normal engineering design process of the plant as a whole. As we mentioned in Chapter 1, the IEC 61511 standard requires the use of a "lifecycle" approach, which we cover in Chapter 7. At this point in our study of the SIF, we need to highlight a few key milestones in the lifecycle. These represent key activities that are often challenging and may require assistance from experienced practitioners.
2.3.1 SIL assessment
During risk analysis, the need for a SIF is identified. Next, the process design team will outline the initial design of the SIF. At this early stage, the design may be limited to the following parameters:
• the design intent of the SIF (what condition it should detect, what harm it is intended to prevent, and what action it should take)
• the type of hardware to be used for the sensors and final elements, and their location in the process equipment
• what other protective measures are provided against the same harm (e.g. alarms, relief valves)
This information is used as input for the SIL assessment process (also known as SIL analysis, SIL determination, SIL classification or SIL target selection). The objective of SIL assessment is to determine the Safety Integrity Level (SIL), and optionally the Risk Reduction Factor (RRF), required for each individual SIF. This is also a good time to
develop a formal definition of each SIF. SIL assessment is a team exercise led by an experienced (and usually independent) facilitator. The SIL assessment activity is critical and complex. A variety of procedures are available, which are covered in Chapters 4 to 6 and in other texts [1]. The output from SIL assessment is required relatively early in the lifecycle, because the design team must then proceed to design the SIF so that it meets the SIL and RRF requirements. The design task basically comes down to:
• hardware selection (which brands and models of equipment are needed)
• architecture design (how much redundancy is required)
• software design (what functions need to be programmed into the safety PLC)
• development of testing and maintenance protocols
• definition of auxiliary functions such as resets, overrides and screen displays, and
• certain detailed aspects such as physical layout (to reduce the risk of common cause failures).
2.3.2 SIL verification
Once the design of the SIF is complete, it is important to confirm that the design meets the key criteria. Most critically, it must achieve or exceed the SIL specified in the SIL assessment process, and also the target RRF, if any. This step is usually known as SIL verification. The terminology is unfortunate, because it is easily confused with another lifecycle task also known as verification, but entirely different in scope (SIL verification is covered in detail in Chapter 9; we discover the other type of verification in Chapter 10).

SIL verification, the task focused on confirming SIL and RRF targets are met, requires a detailed calculation of the probability or frequency of failure of the SIF. The calculation is sometimes executed by mathematically combining the failure probabilities of each device in the SIF in a Fault Tree. Another method is to develop a Markov model of all the possible states that the SIF can be in, and the rate of transition between those states. Using matrix mathematics, the Markov model can be used to calculate the probability that the SIF is in any particular state and, hence, the overall probability that the SIF will succeed or fail at the moment it is required to act [2]. Incidentally, these methods also provide a calculation of the spurious trip rate of the SIF. This can be compared with the maximum acceptable spurious trip rate selected by the designers, which is usually driven by economic considerations (spurious trips cost money).

The most practical way to perform a SIL verification calculation is to use dedicated software. Since the outcome of the calculation has a real impact on the safe operation of
the plant, the software must be dependable. In IEC 61511 terms, it thus qualifies as a "software tool", and therefore must be assessed to confirm its validity. This rules out the use of tools such as spreadsheets which, however widely used and trusted they may be, are difficult to subject to the rigours of a validation process. Dedicated, dependable SIL verification software is available on the market.

Besides this important milestone, there are three more critical assurance tasks in the life of a SIF: SIS validation, functional safety assessment and functional safety audit. These are major topics and further discussion of them is reserved for Chapter 10.
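To give a feel for the arithmetic involved in a SIL verification calculation, here is a sketch using the widely quoted simplified PFDavg equations for 1oo1 and 1oo2 architectures. Note these neglect common-cause failures, diagnostic coverage and repair times, all of which a real SIL verification tool would model; the failure rate and test interval shown are invented example values:

```python
def pfd_avg_1oo1(lambda_du, ti):
    """Simplified average PFD for a single (1oo1) element.

    lambda_du: dangerous undetected failure rate (per hour)
    ti: proof test interval (hours)
    """
    return lambda_du * ti / 2.0

def pfd_avg_1oo2(lambda_du, ti):
    """Simplified average PFD for a redundant 1oo2 pair,
    ignoring common-cause (beta-factor) failures."""
    return (lambda_du * ti) ** 2 / 3.0

# Example: lambda_du = 1e-6 per hour, annual proof test (8760 h)
print(pfd_avg_1oo1(1e-6, 8760))  # about 4.4e-3, in the SIL 2 band
print(pfd_avg_1oo2(1e-6, 8760))  # about 2.6e-5, in the SIL 4 band
```

The comparison illustrates why redundancy is so effective against random hardware failure, and also why the neglected common-cause term matters: in practice it, rather than the squared term, usually dominates the PFDavg of a redundant pair.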
2.4 Failure
2.4.1 Failure modes
The SIL verification calculation we just discussed can only be performed if we can predict the random hardware failure rate of the devices used to build the SIF. Hardware failure rates of SIF components can be determined by calculation, starting from the failure rates of all the individual components in those devices; and the component failure rates, in turn, are known from statistical analysis of industry failure data [3]. The technique used to obtain the component failure rate is known as failure modes and effects analysis (FMEA), detailed coverage of which can be found elsewhere [4].

First, what do we mean by failure modes? Every distinct way in which a device can fail is a failure mode. Let's take a shutdown gate valve as an example. Here are some of the failure modes it may experience:
• binding of the valve stem (so that the gate cannot be moved up and down)
• seat failure (so that the valve cannot fully stop the flow)
• seal, bush or packing failure (so that the valve leaks process fluid to the outside)
• casing fracture (the valve body breaks, causing loss of control and containment)
The main types of failure mode that we need to distinguish are:

• Dangerous failures: device failures that prevent the SIF from achieving its design intent; that is, the SIF does not put the process into the defined safe state on demand (see Chapter 7 for a discussion of the ‘safe state’)
• Safe failures: device failures that result in, or contribute to, a spurious trip
• Annunciation failures: failures of diagnostic detection systems, so that they can no longer detect faults
• No effect failures: failures that don’t fall into any other category.
Redundancy

In the definition of dangerous failures, we said they prevent the SIF from achieving its design intent. That is correct for subsystems with no redundancy, i.e. NooN architecture. For architectures with redundancy, i.e. MooN where M < N, the dangerous failure of one device does not directly prevent the SIF from acting, but puts it into a “degraded state”. For safe failures, the situation is somewhat similar. A 1ooN subsystem will trip on safe failure of one device, whereas an MooN with M > 1 will not trip on a single safe failure.
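These voting rules can be sketched as a small classification function (an illustrative simplification: the function and state names are ours, faults are assumed independent, diagnostics are ignored, and no channel has both fault types):

```python
def subsystem_effect(m: int, n: int, dangerous_faults: int, safe_faults: int) -> str:
    """Classify the state of an MooN subsystem (m of n channels must vote trip).

    Simplified sketch: a dangerously failed channel can never vote for a
    trip, and a safely failed channel votes for a trip permanently.
    """
    if dangerous_faults + safe_faults > n:
        raise ValueError("more faults than channels")
    if safe_faults >= m:
        return "spurious trip"        # enough false votes to trip the SIF
    healthy = n - dangerous_faults - safe_faults
    if healthy + safe_faults < m:
        return "failed dangerous"     # cannot reach m trip votes on demand
    if dangerous_faults or safe_faults:
        return "degraded"             # still able to act, redundancy reduced
    return "healthy"

# A 1oo2 sensor pair with one dangerous fault is degraded, not failed:
# subsystem_effect(1, 2, 1, 0) -> "degraded"
```

Note how a single safe fault trips a 1ooN subsystem but only degrades a 2oo3 one, matching the prose above.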
“No effect” on what?

When we talk about “no effect failures”, the “no effect” refers to the impact on the behaviour of the SIF only. It doesn’t mean that there is literally no effect of any kind from the failure. For example, suppose a shutdown valve has a leak from the stem seal. The valve may still stroke on demand, and successfully stop the flow to downstream equipment: in other words, the SIF still functions correctly and, in functional safety terms, it “achieves the process safe state”. The leak creates a new hazardous condition (a fire, corrosion or toxic exposure from the leaking material), but this is irrelevant to the SIF, because the SIF is not designed to prevent it. For purposes of determining SIF failure modes, we only need to think in terms of the design intent of the SIF.
At this point, we need to consider the ways in which dangerous and safe failures can be revealed:

• Proof testing: we force the SIF to activate, thus testing the performance of as many of its components as possible. In this book, we describe failures that can be revealed by proof testing as discoverable.
• Partial valve stroke testing (PST or PVST): we move the final element valve through a small part (usually no more than 20%) of its stroke, to confirm that it is not stuck in the ‘normal’ (untripped) position. The reason for performing only a partial stroke is to minimize upset to the process (or losses to the relief system, in the case of an emergency depressurization valve).
• Diagnostics: the component itself (e.g. a smart transmitter), or another component such as the safety PLC or a valve position sensor, may be able to automatically detect a malfunction, and signal this to the safety PLC. Usually the PLC will be programmed to raise an alarm, and possibly also attempt to trip the SIF, perhaps after a time delay to allow for the possibility of repair. Failures that can be revealed by diagnostics are detectable.
Dangerous failures may be detectable or undetectable, depending on the failure mode, so we can speak of dangerous detected (DD) failures and dangerous undetected (DU) failures. The same applies to safe failures, so we have safe detected (SD) failures and safe undetected (SU) failures. When safe detected failures occur, the diagnostics can raise an alarm instead of allowing the SIF to trip, so they reduce the spurious trip rate. For no effect (NE) failures, it makes no difference to the SIF whether they are detected or not.

To summarise, we adopt the following terminology in this book:

• a fault that prevents the SIF from achieving the process safe state when required is a dangerous failure
• a fault that leads to a spurious trip is a safe failure
• a fault that prevents the functioning of a diagnostic is an annunciation failure
• a fault that is not a dangerous, safe or annunciation failure is called a no effect failure
• if a fault can be found by diagnostics, it is detectable
• if a fault can be found by proof testing, it is discoverable
• if a fault cannot be found by proof testing, it is undiscoverable
• when a fault is found by any means (diagnostics, proof testing, spurious trip or failure on demand), it is revealed
• when a fault exists but has not yet been found, it is unrevealed.
2.4.2 Failure rates

Device failure rates are usually reported in terms of λ (the Greek letter lambda), and have the dimensionality of time⁻¹. The normal unit of measure is known as FIT, an abbreviation of failures in time, which is defined as failures per 10⁹ h in service. Databases and manufacturers’ manuals ideally list λ values subdivided by category (λDD, λDU and so on). λ values are mostly quoted as averages for all operating environments, but sometimes separate values are provided for specific conditions such as:

• sensors (transmitters) where some faults result in a false high output (giving ‘λD high’), while others give a false low output (giving ‘λD low’)
• valve assemblies where some faults result in inability to close properly on demand (giving ‘close to trip’ λD), while others result in inability to open on demand (giving ‘open to trip’ λD)
• valves requiring tight shutoff (TSO) to achieve the process safe state
• valves in service with abrasive fluids (this may be termed severe service).

Separate λD values may also be provided for fail-high and fail-low applications, or fail-open and fail-closed valves.
In each case, the analyst needs to ensure they understand which failures are dangerous and which are safe in their specific application. For transmitters, failure rates may be expressed as fail-high and fail-low λD values. These may be treated as either dangerous or safe failures, depending on whether the transmitter is configured to trip on high or low measured value, as detailed in Table 2.4. When λ values for SIF elements have been obtained, we can use them to calculate the predicted reliability of the SIF, and the expected spurious trip rate, often expressed as mean time to fail spurious (MTTFS). Details of the calculations are given in Chapter 9.
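The interpretation of transmitter λ values in Table 2.4 can be expressed compactly in code (a sketch: the function and argument names are invented for illustration, and rates may be in any consistent unit such as FIT):

```python
def transmitter_rates(trip_on: str, ld_high: float, ld_low: float,
                      ld_freeze: float, ls: float) -> dict:
    """Split transmitter failure rates into dangerous and safe totals.

    trip_on is "high" for a high trip application, "low" for a low trip.
    Follows Table 2.4: a false output in the trip direction is safe
    (it causes a spurious trip), a false output away from it is dangerous,
    frozen or fluctuating output is always dangerous, and lambda_S is
    always safe.
    """
    if trip_on == "high":
        dangerous = ld_low + ld_freeze
        safe = ld_high + ls
    elif trip_on == "low":
        dangerous = ld_high + ld_freeze
        safe = ld_low + ls
    else:
        raise ValueError("trip_on must be 'high' or 'low'")
    return {"lambda_D": dangerous, "lambda_S": safe}

# Example with assumed FIT values for a high trip transmitter:
rates = transmitter_rates("high", ld_high=100, ld_low=80, ld_freeze=50, ls=20)
# rates == {"lambda_D": 130, "lambda_S": 120}
```

The same data sheet therefore yields different λD and λS totals depending on the trip direction of the application.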
Table 2.4: Interpretation of λ values for transmitters when ‘λD high/low’ values are provided.

Name of λ value | Meaning | In a high trip application, this contributes to | In a low trip application, this contributes to
‘λD high’ | Failures resulting in false high output | Safe failure rate | Dangerous failure rate
‘λD low’ | Failures resulting in false low output | Dangerous failure rate | Safe failure rate
‘λD freeze’ or just ‘λD’ | Failures resulting in false output that is not guaranteed to be high or low (e.g. frozen or fluctuating output) | Dangerous failure rate | Dangerous failure rate
‘λS’ | Failures that always cause a trip signal | Safe failure rate | Safe failure rate

2.4.3 Hardware fault tolerance

Although the IEC 61508 family of standards is generally not prescriptive (that is, it does not lay down specific requirements for the design of the SIS), there is one aspect in which it does make specific hardware demands. This is hardware fault tolerance (HFT), which is the ability of a subsystem to achieve the design intent of the SIF, even in the presence of hardware faults. HFT is achieved by providing redundant equipment, for example, two sensors in a 1oo2 architecture.

Each subsystem (initiators, logic solver and final elements) is required to have a particular level of HFT. For example, if HFT of 1 is specified, the subsystem must not fail dangerously in the presence of any one element having a hardware fault (random failure). The example just mentioned, two sensors in a 1oo2 architecture, has an HFT of 1. By ‘not fail dangerously’ we mean that the SIF can still detect the dangerous condition, and that the final elements can act to fully achieve the safe state defined in the SIF’s specification. The random failure might lead to a spurious trip of the SIF, but that is a ‘safe failure’ by definition. Detailed coverage of the requirements is provided in Chapter 9.

The required HFT depends on the target SIL of the SIF, and also on which standard is being applied: IEC 61508 or IEC 61511. The requirement even varies between the first and second editions of the standards. A more detailed discussion can be found in Chapter 9.

The purpose of providing an HFT requirement is to prevent over-reliance on low random hardware failure probability to achieve a high SIL. If there were no HFT requirement, a user could claim that a simple 1oo1 SIF can theoretically achieve a very high SIL by intensive testing (or by applying unreasonably low λDU or unfeasibly high proof test coverage (see Section 7.4) in the failure probability calculation; unfortunately, this practice is commonly seen in the industry). The HFT requirement also provides greater safety assurance in high demand and continuous mode functions, where faults may lead to accidents before they can be discovered by proof testing.
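For MooN voting subsystems, HFT follows directly from the architecture: the subsystem can still act as long as M healthy channels remain, so it tolerates N − M dangerous faults. A minimal sketch:

```python
def hft(m: int, n: int) -> int:
    """Hardware fault tolerance of an MooN voting subsystem.

    An MooN subsystem needs m channels to vote for a trip, so it can
    still perform its safety function with up to n - m channels failed
    dangerous. Examples: 1oo1 -> 0, 1oo2 -> 1, 2oo3 -> 1, 1oo3 -> 2.
    """
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    return n - m
```

Exercise 8 below can be checked against this rule, remembering that HFT is assessed per subsystem, not per component.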
Exercises

1. One of the available technologies for large scale production of polyethylene involves circulating ethylene in a loop reactor at 300 bar pressure. The reactor is equipped with a set of emergency depressurization valves (EDV’s). These are used to vent the ethylene gas to atmosphere whenever a major upset condition occurs that could otherwise lead to a runaway reaction. Such a runaway is likely to result in polymer plugging the reactor, necessitating prolonged downtime for cleaning. Discuss whether the EDV’s should be fail open or fail closed type. (Note: The EDV’s are not primarily intended as overpressure protection; other safeguards are provided against overpressure.)
2. A compressor handling natural gas is surrounded with five flammable gas detectors. They are part of a SIF whose function is to detect gas leaks from the compressor piping, and shut off the gas at an upstream shutdown valve. What MooN architecture should be specified for the gas detectors?
3. Steam turbines are typically provided with a “trip and throttle valve” that starts and stops the steam supply, and a “governor valve” that modulates the steam flow to maintain the required speed. A permissive function on a steam turbine is provided to “allow the turbine to be started if the lubrication oil pressure is sufficient”. How should the description of this SIF be phrased? What is the final element?
4. In Section 2.3.2, it was stated that “spurious trips cost money”. Why is this? Name several effects of a spurious trip that have a negative financial effect on the operation of a plant.
5. You are using a simple shutdown valve in a SIF. You obtain a manufacturer’s data sheet quoting failure rate data for the valve only, excluding any other components such as the actuator and solenoid. The data sheet quotes values for λD and λNE only, but not for λS. Why is this?
6. Referring to the same situation as question 5: you are using dedicated software to model the SIF that includes this shutdown valve. The software requires you to enter values for λDU and λDD, but the manufacturer’s data shows only a composite λD value. What values should you assign?
7. A SIF is provided with an emergency depressurization valve (EDV). The valve is required to open on trip, and is fail-closed type. It is operated by a pneumatic spring return actuator. Partial valve stroke testing (PVST) is performed automatically once per month. Refer to the list of failure modes shown below. Classify the failure modes into DU, DD, SU, SD and NE (no effect).
(a) Valve stem stuck
(b) Loss of air supply
(c) Actuator has broken spring
(d) Valve does not shut off fully (letting by)
(e) Valve seal leaking
(f) Broken wire from safety PLC to the solenoid that operates the actuator
8. What is the hardware fault tolerance (HFT) of the following subsystems?
(a) Three temperature switches in a 1oo3 architecture
(b) Three temperature switches in a 2oo3 architecture
(c) Two shutdown valves in series
(d) Two shutdown valves in parallel
(e) Two emergency depressurization valves in parallel
(f) A single shutdown valve and actuator with two solenoids in parallel
Answers

Question 1: Answer

Fail open EDV’s could lead to spurious depressurization events, which disrupt production, cause wear and tear on the equipment (due to the physical shock of sudden large pressure swings) and release material to the environment. These events are also not without risk of their own, as there is the possibility of an explosion in the EDV vent stack if its purge is lost. Moreover, venting may deposit polyethylene product in the depressurization valves and lines, which then need to be cleaned and tested before production can resume. On the other hand, fail closed EDV’s could result in a catastrophic event if the valves are in a failure state when required to open. In practice, the reactor design is likely to specify a substantial number of redundant EDV’s (not just one), and include measures to reduce the likelihood of failure on demand (especially through common cause and common mode mechanisms). Also, the reactor is typically contained within a blast-proof structure and strictly isolated from personnel when in operation. Therefore, fail closed EDV’s are probably an acceptable solution.
Question 2: Answer

The MooN architecture depends on whether the gas detectors are equivalent, that is, whether any one of the five detectors can sense any leak from the equipment under surveillance. In practice, this is unlikely to be the case: it is more likely that any leak of concern may be picked up by, say, two of the detectors. (Modelling studies can be carried out to determine the coverage of each detector.) Thus, the architecture should be specified as 1ooP, where P is the number of detectors that can sense a leak from a specific location. For detailed discussion on this point, refer to Chapter 6. Sometimes, to reduce the occurrence of spurious trips, gas detector trips are configured as 2ooN. In that case, the SIF should be defined as 2ooP.
Question 3: Answer

The SIF description should be phrased such that its failure could allow damage to the turbine: “on low lubrication oil pressure during startup, close the trip and throttle valve”. The final element is thus the trip and throttle valve (1oo1). Sometimes, the SIF is defined as also closing the governor valve (in a 1oo2 architecture with the trip and throttle valve), but some practitioners disagree with this, as the governor valve may not be designed to provide a safety function.
Question 4: Answer

For high throughput operations such as refineries, the major impact is downtime. Spurious trips often result in immediate loss or turndown of production, and take time to recover from (because certain conditions must be met before the SIF can be reset, and even then, throughput and efficiency take some time to return to their equilibrium levels). Financial impact can also result from:

• Lost materials: product or other materials may be dumped to waste or flare during a trip.
• Damage to equipment: trips may result in large swings in physical conditions (e.g. pressure), resulting in shock to equipment. This can reduce the lifetime of vulnerable components such as heat exchangers and catalyst beds (which may suddenly expand or contract, causing the particles to disintegrate).
• Cleanup: trips may result in material being deposited in emergency depressurization or deinventory valves and the downstream piping. This must be decontaminated before restart, and the valves and piping proven to be in good order, so they can respond properly to a real trip.
• Cost of investigation: every trip should be logged and investigated, for several reasons as discussed in Chapter 11. This costs engineering time, and may also result in design changes and/or retraining.
• Fines or other legal repercussions resulting from flaring or other environmental incidents as a result of repeated trips.
Question 5: Answer

Recall that the definition of a ‘safe failure’ is one that leads to an immediate event such as a spurious trip or diagnostic alarm. A simple valve in a typical SIF application cannot move by itself (without being driven by the actuator), therefore there are no possible failure modes leading to safe failure. So, for the valve only, we can set λS = 0.
Question 6: Answer

At a simple level, since there are no diagnostics, you could simply assign λDU = λD and λDD = 0. However, if the owner intends to implement automatic partial stroke testing, this can be treated as a diagnostic. In that case, you will need to find out the test coverage (PSTC) of the partial valve stroke test. This may be available in a certification report, in the manufacturer’s safety manual, or the FMEA report for the valve. Once you have this value, calculate λDD = PSTC × λD and then λDU = λD − λDD.
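As a worked illustration of this split (the λD and PSTC figures below are assumed for the example, not taken from any real data sheet):

```python
# Splitting a composite dangerous failure rate into detected and undetected
# parts when automatic partial valve stroke testing is credited as a
# diagnostic. All rates are in failures per hour.
lambda_d = 2000e-9   # composite dangerous failure rate: 2000 FIT (assumed)
pstc = 0.6           # partial stroke test coverage: 60% (assumed)

lambda_dd = pstc * lambda_d        # 1200 FIT revealed by the monthly PVST
lambda_du = lambda_d - lambda_dd   # 800 FIT remain undetected
```

Only λDU then drives the proof-test-limited part of the failure probability calculation in Chapter 9.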
Question 7: Answer

(a) DD. This failure prevents the SIF from operating, and is detectable by PVST.
(b) DD.
(c) SU or NE depending on the type of valve. Loss of spring force means there is nothing holding the valve closed. The valve may remain in this position (NE) or it may drift open, leading to what amounts to a trip. It is not possible to detect this before the valve opens, so the failure would be Safe Undetected (SU). The actuator would be unable to reset the valve to its normal position after trip, but this does not determine the failure mode.
(d) SU or NE, depending on whether the valve is required to be tight shutoff.
(e) NE. Seal leak probably does not affect the ability of the valve to perform its safety function, although it creates an additional danger due to loss of containment.
(f) DD, provided that either: (1) the PVST is performed automatically by the safety PLC (and not, for example, by a PVST device installed on the valve assembly), or (2) the broken wire can be detected automatically by a loop continuity diagnostic built into the input/output (I/O) card of the logic solver. If neither of these conditions is met, the failure mode is DU.
Question 8: Answer

(a) 2. The subsystem can tolerate faults in two of the switches and still work correctly.
(b) 1.
(c) 1. If one of the valves fails (does not close on demand), the other can still achieve the design intent of the SIF.
(d) 0. Both valves must function correctly.
(e) 1, provided that each valve individually has enough flow capacity to depressurize the plant within the time specified in the safety requirements specification (SRS).
(f) 0. HFT is determined at the subsystem level. Providing redundant solenoids, but not redundant shutdown valves, means that the overall HFT is zero. See Section 9.4.7 in Chapter 9 for more discussion.
References

[1] E. Scharpf, H. Thomas, T. Stauffer, Practical SIL Target Selection: Risk Analysis per the IEC 61511 Safety Lifecycle, third ed., exida, Sellersville, 2021. A useful textbook covering many aspects of functional safety, especially focusing on risk analysis.
[2] W. Goble, H. Cheddie, Safety Instrumented Systems Verification: Practical Probabilistic Calculations, ISA, Research Triangle Park, 2012. The standard textbook on SIL verification calculations.
[3] OREDA, OREDA Handbook, sixth ed., OREDA, Trondheim, 2015. A publication arising from the Offshore and Onshore Reliability Database (OREDA) project. See also failure rate data sources referenced in Chapters 5 and 9 of this book.
[4] B. Skelton, Process Safety Analysis: An Introduction, Institution of Chemical Engineers (IChemE), 1997. See Chapter 5, pages 67–80. An excellent and concise textbook and learning resource, with an especially helpful section on Fault Tree Analysis.
CHAPTER 3
Risk evaluation

Abstract

Risk is expressed in terms of the frequency, or likelihood, and severity of the associated harm. The first step in controlling risk is to identify hazardous scenarios that can result in harm. The purpose of a safety instrumented function (SIF) is to provide a specified amount of risk reduction, expressed in terms of a Risk Reduction Factor, which is the ratio of the frequency of harm occurring without and with the SIF implemented. To determine how much risk reduction is required, a tolerable risk level must be defined for each risk receptor. One type of tolerable risk definition applies the notion that risk should be reduced until it is ‘As Low As Reasonably Practicable’ (ALARP).
Keywords: As low as reasonably practicable (ALARP); HAZOP; Process hazards analysis; Risk reduction; Risk reduction factor; Tolerable risk.
Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00001-X Copyright © 2023 Elsevier Inc. All rights reserved.

3.1 Identifying hazardous scenarios

Safety instrumented functions (SIFs) are designed to prevent harm arising from specific causes. The engineering team must identify the root causes in detail, so that the frequency of those causes can be predicted. These root causes are the ‘initiating events’ that can trigger a chain of events leading to the point where something irreversible occurs, such as a loss of containment of a hazardous material. From this point, often known as an ‘incident’, the situation can further propagate to a dangerous outcome such as a fire or explosion, resulting in harm.

In a typical project in the process industry, the number and complexity of possible initiating events can be considerable. To identify initiating events, a structured approach is needed, of which the most common example is Hazard and Operability (HAZOP) Study [1–4]. In HAZOP, the scope of the project is separated into manageable sections called nodes. A node typically contains one major item of equipment such as a tank, along with its associated ancillary equipment such as pumps, piping, valves and instruments. Working from the piping and instrumentation diagram (P&ID), an analysis team works systematically through the node, looking for potential causes that can credibly lead to upsets from intended operation. As a guide, the team follows a predefined list of deviations to search for, such as no flow, high pressure, and utility failure (e.g. loss of instrument air supply). For example, while searching for ‘no flow’ causes, the team will note that, if a running pump stops inadvertently, flow upstream and downstream of the pump will stop. ‘Pump stop’ would then be identified as an initiating event.

Each cause is then analysed further to determine the credible worst-case consequences arising from it. The engineering design is likely to contain measures and equipment, known as ‘safeguards’, that are intended to avoid or mitigate these consequences. Initially, when assessing the consequences, the analysts assume that all safeguards are unable to function correctly: for example, alarms are missed, pressure relief devices malfunction, and control loops do not respond. By analysing the situation in this way, the team can make a judgement on whether the safeguards provided are adequate, given the potential severity of the outcome. The results of HAZOP analysis are documented in a worksheet laid out as a table, with columns for causes, consequences, safeguards and any recommended follow-up actions.

The output of HAZOP study is an essential resource for safety management throughout the life of the plant. It can be used to

• Provide input to the next stage of the functional safety lifecycle, as described in Chapter 4
• Identify areas where further information or study is needed for proper understanding of the risks associated with plant operations
• Train operators in risk awareness for their plant
• Provide input for alarm management (Section 4.6.2 in Chapter 4)
• Support Management of Change activities (Section 11.5.1 in Chapter 11)
• Support the plant’s Safety Case (a dossier of evidence for adequate process safety management, required by the regulating authority in many countries).
3.2 Expressing risk in numbers

Risk can be evaluated in terms of two parameters, both relating to the harm that could result from the risk: frequency or likelihood (how often harm is expected to occur) and severity (how bad the harm could be). Later in this chapter, we explore how these parameters can be evaluated.

The job of a SIF is to reduce risk arising from a specific hazardous event. In practice, this is usually done by reducing only the frequency of harm from that event. Exceptionally, a SIF may reduce only the severity, but not affect the frequency: an example is a firefighting system (the fire is already in progress, and some harm is inevitable, but the SIF can reduce it by controlling the fire). It is rare for a single SIF to reduce both frequency and severity of harm, as these would normally be achieved through different mechanisms.

The amount of risk reduction provided by a SIF is related to its probability of failure. For example, if the SIF has a probability of failure of 1 in 100, it reduces the frequency of the harm happening by a factor of 100, for example, from 10⁻³/year to 10⁻⁵/year. For most SIFs in the process industry, this frequency reduction ratio (100 in this example) can be called the risk reduction factor (RRF). For example, a SIF that reduces the frequency of a harmful event from 10⁻³/year to 10⁻⁵/year has an RRF of 100, i.e. 10⁻³/10⁻⁵. (Note that some SIFs use a different kind of measure than RRF; we explain more on this point in Chapter 4.)

We also need to distinguish between:

• the target RRF, which is the RRF required to achieve sufficient risk reduction for the specific application; and
• the RRF achieved (sometimes called “theoretical RRF”), which is the RRF predicted by calculation from a given configuration of hardware elements and parameters.
The target RRF of the SIF must be defined, since the SIF’s design varies depending on the RRF requirement. RRF is normally defined during the SIL assessment process, which was introduced in Chapter 2 and enjoys detailed coverage in Chapters 4–6. To define the RRF, we evidently need to know two frequency values:

• the predicted frequency of harm if the SIF did not exist (usually known as the unmitigated event likelihood or UEL), and
• the maximum tolerable frequency of harm for that type of event.

Unfortunately, there’s no industry consensus on the names and mathematical symbols for these quantities; here, we will use UEL for unmitigated event likelihood and Ft for tolerable frequency. Using these symbols, target RRF = UEL/Ft. If the target RRF comes out at less than 1, it should be rounded up to 1, as RRF values below 1 are not meaningful.
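In code, the target RRF calculation is just a ratio with a floor at 1 (the function and variable names are ours, not standard symbols):

```python
def target_rrf(uel: float, ft: float) -> float:
    """Target risk reduction factor = UEL / Ft, floored at 1.

    uel: unmitigated event likelihood, events per year (frequency of harm
         with no SIF in place)
    ft:  maximum tolerable frequency of harm, events per year
    """
    if uel <= 0 or ft <= 0:
        raise ValueError("frequencies must be positive")
    return max(uel / ft, 1.0)

# Example: UEL = 1e-2/year against a tolerable frequency of 1e-4/year
rrf = target_rrf(1e-2, 1e-4)   # a factor of about 100
rrf_low = target_rrf(1e-5, 1e-4)   # floored at 1: already tolerable without a SIF
```

Note that a target RRF of about 100 sits at the boundary of what a single SIL 2 function can deliver, which is why the precision of the UEL estimate matters.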
Severity descriptors need to be clear and sufficiently detailed to allow each case to be unambiguously assigned to a category. In this regard, a common problem is that the scope of financial loss may not be clear. The matrix needs to specify what kind of financial loss is included, such as the following:

• Cost of lost production due to downtime (this is often the dominant cost for upstream oil and gas facilities)
• Whether production is counted as ‘lost’ if it can be made up later, e.g. by using up buffer stocks or working overtime. In this event, the production may be considered ‘deferred’ rather than ‘lost’
• Cost of equipment repair
• Cost of destroyed product (this cost may be dominant for speciality chemicals and pharmaceuticals)
• Financial penalties, e.g. delays to loading and unloading ships, loss of gas supplies to critical consumers
Assumptions about the duration of downtime also need to be specified. For example, what on-site spares and technical capabilities are assumed? If there are multiple production trains, can production continue (maybe at reduced capacity) when one train is unavailable? How long does it take to restart the plant after a trip?
Defining a total tolerable risk per risk receptor

Sometimes, tolerable risk is expressed as a total risk to a specific risk receptor from all possible hazardous events. Most commonly, the risk receptor would be a person onsite. This type of risk tolerability is sometimes known as individual risk, meaning the risk to an individual person, not to be confused with the risk from an individual incident, which we studied in the preceding section. A typical individual risk tolerance statement could be: “The fatality risk to any individual onsite shall not exceed (a defined frequency)”. A reasonable frequency value might be 10⁻³ to 10⁻⁴/year, as the frequency of death from all causes for an average individual of working age normally falls in this range. If we encounter a total tolerable risk statement, we need to find a way of allocating ‘portions’ of risk tolerance to each applicable SIF; this could be done following a procedure as shown in Fig. 3.1.
Figure 3.1 Using a combined frequency to determine tolerable risk for individual SIFs.

3.4 How much precision is needed?

How much precision is reasonable in numerical values of frequency and severity? This is a balancing act. If a low level of precision is used, it is necessary to build in a ‘safety margin’ to ensure that the worst case is covered with reasonable confidence. This means that the SIF might have to be ‘over-designed’ in one or more of the following ways:

• More reliable components
• More diagnostics (e.g. provide smart level transmitters instead of simple mechanical level switches)
• More redundancy
• More frequent or more stringent testing requirements
All of these add to the total cost of the SIF over its lifetime. One may wish to avoid these additional costs by developing a more precise estimate of the RRF, so that over-design is unnecessary. However, this too comes with an associated cost: more detailed risk analysis studies will be required during the project design phase.

For example, consider a case where the expected harm is several fatalities arising from an explosion. One may take the easy, conservative analytical approach: determine the maximum number of people that could possibly be in the general area at any time, and treat that number of fatalities as the expected harm. Alternatively, one could determine a more precise harm in terms of probable loss of life (PLL), which is the mean number of people killed (and is often a fractional number, e.g. 0.7). Determining PLL may require a detailed study known as quantitative risk analysis (QRA), where the effect distance of the explosion and the number of people affected are calculated for a range of different scenarios and then integrated. Such studies are expensive and can cause delays if not carefully planned in the project execution timeline. For more on the concept of PLL, refer to Ref. [6], p. 108.

Besides, the event frequency and severity estimates are usually based on a considerable number of assumptions, such as the number of people expected in a given location or the probability of an operator being unable to respond to an alarm. Some of these assumptions are known to one significant figure of precision at best. Therefore, it would be mathematically invalid to claim high precision for event frequencies or severities derived from them.

The balance point between spending on SIF analysis and SIS implementation may be determined by a number of factors. One common issue is the need to save time and money upfront, when the project is burning cash without generating returns.
A typical major project conforms to the ‘rule of 20’: 20 weeks in the analysis phase, 20 months in design and construction, and 20 years of operation. The obvious corollary is that the cash-driven desire to shave a week or two from the analysis phase will likely result in unnecessary costs year after year during the facility’s operational lifetime. Time and money pared from proper risk analysis is nearly always a false economy in the long run. A reasonable solution is to perform a two-pass risk analysis. First, a basic, qualitative study is performed to determine approximate risk levels and the likely target SILs. Second, a detailed study is performed, but only for risks that may need a high SIL (say, SIL 2 or greater). This should be an optimal way to use design team resources, as the potential returns from detailed analysis of high-integrity SIFs are considerable. For example, a large cost saving could result from reducing the target RRF of a SIF from 6000 to 600, which is a highly credible outcome of a few thousand dollars spent on additional risk studies. Finally, the depth of the risk analysis may depend on which risk receptors are affected. (If you need a reminder of what ‘risk receptors’ are, refer back to Chapter 1.) Some projects
will specify that SIFs need only consider safety and environmental risk receptors, for the following reasons:

• The IEC 61511 standard requires consideration of only these risk receptors; and
• Reduction of design and construction costs is desired, as noted above.
There is, however, a good case for considering financial risk receptors as well. In fact, a proper cost/benefit analysis may show that, even if the SIL selection team specified SIL 1 for a given SIF, the lifetime cost of the SIF may justify increasing this to SIL 2 (for example). Cost/benefit analysis is covered in Chapter 5.
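The kind of cost/benefit comparison meant here can be sketched as follows. Every figure below (event cost, demand rate, PFDavg values, project lifetime and upgrade cost) is an illustrative assumption, not a value from the text:

```python
# Sketch: does upgrading a SIF from SIL 1 to SIL 2 pay for itself?
# Compare the expected financial loss avoided against the extra lifetime
# cost of the higher-SIL design. All figures are invented for illustration.

event_cost = 5e6     # financial loss per incident (equipment + lost production)
demand_freq = 0.05   # demands on the SIF per year
lifetime = 20        # years of operation

def expected_loss(pfd_avg):
    # Incidents occur when a demand coincides with SIF failure.
    return demand_freq * pfd_avg * event_cost * lifetime

benefit = expected_loss(0.05) - expected_loss(0.005)  # SIL 1 -> SIL 2 band
extra_cost = 150_000  # assumed extra lifetime cost of the SIL 2 design

print(benefit)               # ~225000
print(benefit > extra_cost)  # True: in this example the upgrade is justified
```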
3.5 The ALARP concept

In our discussions of tolerable risk so far, we have assumed that any given risk is in a binary state: either tolerable or intolerable, depending on the magnitude of the risk and the measures put in place to control it. In the fuzzy real world, there is always a way to reduce risk; but, beyond a certain point, further risk reduction measures may be unfeasible, or yield an unreasonably low benefit for the money spent. We can therefore assign two risk levels:

• A higher level (sometimes called the risk tolerability level), at which the risk is deemed tolerable only if further risk reduction would be impractical or not cost-effective; and
• A lower level (the risk acceptance level), at which the risk is deemed acceptable and no further risk reduction need be considered.
As a practical example, a risk owner may designate that a single-fatality incident has a tolerability level of 10⁻³/year, and an acceptance level of 10⁻⁵/year (see Fig. 3.2). If the actual incident frequency falls between these values, say 10⁻⁴/year, the risk owner must try to reduce the risk to a level that is “As Low As Reasonably Practicable” (hence, the term ALARP). In Chapter 5, we discuss how the ALARP concept can be applied in practice when determining SIL targets.
Figure 3.2 Tolerable risk levels in relation to ALARP (values are for illustration only).
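The two-threshold scheme can be expressed as a small classification function. The threshold values reuse the illustrative single-fatality figures from the example above:

```python
# Sketch of the two-threshold ALARP test described above.
# Threshold values mirror the single-fatality example (10^-3 and 10^-5 /year).

TOLERABILITY = 1e-3  # /year: above this, the risk is intolerable
ACCEPTANCE = 1e-5    # /year: below this, the risk is broadly acceptable

def alarp_band(frequency_per_year: float) -> str:
    if frequency_per_year > TOLERABILITY:
        return "intolerable"
    if frequency_per_year < ACCEPTANCE:
        return "acceptable"
    return "ALARP region: reduce further unless grossly disproportionate"

print(alarp_band(1e-4))  # falls between the thresholds -> ALARP region
```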
Exercises

1. Give two examples of a risk receptor in the chemical process industry.
2. If we wish to demonstrate that a risk is ALARP, what do we need to prove? What kind of evidence could be used?
3. Why is a tolerable risk target needed?
Answers

Question 1: Answer

Risk receptors typically considered in the chemical process industry include the following:

• Injury or illness of onsite personnel
• Injury or illness of offsite personnel (the public)
• The natural environment
• Financial impact: equipment damage
• Financial impact: lost production
• Damage to reputation
For a more detailed list, see Table 1.1.
Question 2: Answer

The main task is to show that existing measures have brought the risk down to a level where the cost of further risk reduction would be grossly disproportionate to the benefit gained. There are several possible approaches to this. One reasonable approach is to consider the well-known hierarchy of risk management: eliminate the hazard, substitute with a less hazardous material or process, provide engineered controls, provide administrative controls (e.g. training and procedures), and provide personal protective equipment. Consider each step in the hierarchy and show that it has been applied to the maximum feasible extent.
Question 3: Answer

To manage a risk, the engineer needs to know “How much risk is acceptable?” In other words, at what risk level is it permissible to stop applying further risk reduction measures? A tolerable risk target provides an unambiguous answer to this question.
References

[1] T. Kletz, HAZOP and HAZAN, fourth ed., Institution of Chemical Engineers (IChemE), Rugby, 1999. Probably the best available book on the topic.
[2] F. Crawley, B. Tyler, HAZOP: Guide to Best Practice, third ed., Elsevier, 2015. Offers an alternative viewpoint to Kletz.
[3] IEC 61882:2016, Hazard and Operability Studies (HAZOP Studies) - Application Guide, 2016. The international standard for HAZOP, although not widely applied in practice.
[4] G. Wells, Hazard Identification and Risk Assessment, revised ed., Institution of Chemical Engineers (IChemE), Rugby, 2004. Chapter 5.
[5] E. Scharpf, H. Thomas, T. Stauffer, Practical SIL Target Selection: Risk Analysis Per the IEC 61511 Safety Lifecycle, exida, Sellersville, 2021. A useful textbook covering many aspects of functional safety, especially focusing on risk analysis.
[6] E. Marszal, E. Scharpf, Safety Integrity Level Selection, ISA, Research Triangle Park, 2002.
CHAPTER 4
Introduction to SIL assessment Abstract Each safety instrumented function (SIF) must be assigned an operating mode: low demand, high demand or continuous mode. The mode assignment is based on the expected demand rate of the SIF. SIL assessment is the task of determining the risk reduction target of each SIF. This is expressed quantitatively as a failure measure, which is the target risk reduction factor (RRF) for low demand mode SIFs, or the target probability of failure per hour (PFH) for high demand and continuous mode SIFs. The value of the failure measure determines the SIF’s safety integrity level (SIL) target. Before assigning the failure measure target, SIFs must be identified from project documentation such as control narratives, Cause and Effect Diagrams, hazard and operability (HAZOP) study reports, binary logic diagrams, interlock logic diagrams, and piping and instrumentation diagrams (P&IDs). Independent protection layers other than SIFs, and the cumulative effect of a single common element failure, should also be considered during SIL assessment.
Keywords: Continuous mode; Critical common element analysis; Failure measure; High demand mode; Independent protection layer; Low demand mode; Operating mode; Probability of failure per hour; Risk reduction factor.
The objective of SIL assessment is to select the appropriate SIL target for each SIF, and to determine several other parameters required at later stages of the functional safety lifecycle. This chapter covers some concepts relevant to the SIL assessment process, and explains how to prepare a set of SIFs in readiness for a SIL assessment workshop.
4.1 Safety instrumented function (SIF) operating modes

4.1.1 What are low demand, high demand and continuous modes?

In the process industry, most SIFs will remain idle for long periods of time, typically months on end, waiting patiently for the conditions required to trigger them. Such SIFs are designed to protect against abnormal situations, and their activation is an exceptional event. These are defined as low demand mode SIFs. In contrast, a small number of SIFs may be required to act frequently. Activation of these SIFs is a normal part of routine operation. An example is a startup permissive on a machine that is frequently stopped and started. This kind of SIF is said to operate in either high demand or continuous mode.

Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00008-2 Copyright © 2023 Elsevier Inc. All rights reserved.
High demand and continuous mode SIFs are relatively uncommon in the process industry, because most control functions are not the only line of defence against an incident: additional layers of protection, operating in low demand mode, are normally provided. (For this reason, IEC 61511 has little to say on the subject of continuous mode SIFs.) They are, however, much more frequently encountered in other industries, especially where dangerous machinery is involved. For example:

• a guillotine-type cutting machine may have a SIF to lower the blade when - and only when - safety conditions are met (such as proximity switches confirming no hand is inside the cutting zone), and when the two ‘operate’ buttons are pressed simultaneously
• an autonomous vehicle used in a warehouse will have an anti-collision safety function, which may experience multiple demands per day in areas where personnel are working
4.1.2 Selecting an operating mode

The operating mode relates to how any dangerous failure in the SIF will most likely be revealed. There are three possible ways:

• Proof testing discovers the failure, triggering the operations and maintenance teams to take appropriate action such as initiating repairs and possibly taking the process offline.
• Automatic diagnostics detect the failure and take automatic action, such as switching to a redundant channel, tripping the SIF, and/or raising an alarm.
• A real incident occurs, placing a demand on the SIF, but the SIF is unable to achieve the process safe state due to the failure.
Proof testing and automatic diagnostics need to be sufficiently frequent if they are to have a good chance of discovering the fault before it is revealed by a real incident. The requirements of the standards are:

• To take credit for proof testing, the average demand rate on the SIF must be no more than once per year.
• To take credit for proof testing, the proof test interval (time between tests) should be no more than half of the average expected time between demands on the SIF.
• To take credit for automatic diagnostics, they should run to completion at least 100 times per demand on the SIF.

If these requirements are not met, we conservatively assume that proof testing and/or diagnostics have no chance of finding faults, i.e. we disregard proof testing and/or diagnostics when modelling the SIF to calculate the expected failure metrics (see Chapter 9 for details of how this is done). Once we decide whether proof testing and diagnostics meet the frequency requirements, the operating mode can be selected according to Table 4.1.
Table 4.1: Selection of a SIF’s operating mode.

Mode        | Proof testing frequent enough? | Diagnostics frequent enough?
Low demand  | Yes                            | Yes
High demand | No                             | Yes
Continuous  | No                             | No
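One possible coding of this selection logic combines Table 4.1 with the frequency requirements quoted above. This is a sketch under my own simplifying assumptions: the function names are invented, rates are per year, and low demand mode is keyed on proof-test credit alone:

```python
# Sketch of the mode-selection logic of Table 4.1 plus the frequency
# requirements from the standards. All names and the simplification that
# proof-test credit alone implies low demand are assumptions for illustration.

def proof_test_credit(demand_rate: float, proof_test_interval: float) -> bool:
    """Proof testing counts only if demands average <= 1/year AND the test
    interval is no more than half the mean time between demands."""
    return demand_rate <= 1.0 and proof_test_interval <= 0.5 / demand_rate

def diagnostics_credit(diagnostic_rate: float, demand_rate: float) -> bool:
    """Diagnostics count only if they complete >= 100 times per demand."""
    return diagnostic_rate >= 100 * demand_rate

def operating_mode(demand_rate, proof_test_interval, diagnostic_rate):
    if proof_test_credit(demand_rate, proof_test_interval):
        return "low demand"
    if diagnostics_credit(diagnostic_rate, demand_rate):
        return "high demand"
    return "continuous"

# One demand every 10 years, tested yearly, diagnostics running hourly:
print(operating_mode(0.1, 1.0, 8760))  # low demand
```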
4.1.3 Formal definition of operating modes The preceding section gave a useful ‘real world’ way of defining which mode a SIF is operating in. However, the formal definitions given in the IEC 61508 and IEC 61511 standards define operating modes in rather different terms. Formally, the definition of a SIF’s operating mode is very simple: a demand mode SIF operates only on demand (i.e. not as a normal part of process control), at a demand rate of less than once per year for low demand, and more than once per year for high demand. A continuous mode SIF works to keep the equipment in a safe state as part of normal operation: in other words, it needs to act frequently or all the time. Another way of distinguishing between modes is that continuous mode SIFs, being part of normal process control, will cause an incident if they fail, even in the absence of any other initiating event. On the other hand, failure of low and high demand mode SIFs does not cause an incident, but may allow an incident to occur if some other initiating event occurs.
4.1.4 The significance of operating modes

The mode of each SIF should be identified at an early stage in the lifecycle, ideally during the SIL assessment process. Selecting the correct mode for each SIF is important for several reasons, explained below.

Definition of SIL

As we discuss later in this chapter, the SIL of a low demand mode SIF is related to its average probability of failure on demand (PFDavg). For high demand and continuous mode SIFs, a different quantity is used: the probability of failure per hour (PFH). The reason for this is subtly buried in the underlying mathematics. The probability that a SIF fails on demand is related to the length of time since the last test. For high demand and continuous mode SIFs, testing is frequent: the SIF is, in effect, tested every time it is used, even if we don’t consider diagnostics. Therefore, failure has little time to occur between tests. The result is that the probability of failure does not vary much over time: it is reasonable to assume a constant failure rate, and hence a constant measure can be used - the probability of failure per hour (PFH). This is the probability that, if the SIF was working correctly at a specified time, it has failed by the time one hour has elapsed.
The situation is different for low demand mode SIFs. As testing is infrequent (typically no more than once every three months, often much less), there is time for the probability of failure to rise significantly between tests. To a first approximation, the failure probability gradually climbs in a logarithmic curve, jumps down to a lower level when the SIF is tested, and begins climbing again. This cycle repeats throughout the life of the SIF. A typical PFD curve is shown in Fig. 4.1, with a proof test interval of 1 year. This means that the actual probability of failure at the moment of demand varies considerably, depending on exactly when the demand occurs. It becomes necessary, therefore, to use a time-averaged probability measure, PFDavg, which is the probability of failure on demand averaged over the entire lifetime of the SIF. If we attempted to use the PFH (probability of failure per hour) based on component failure rates, it would likely be over-optimistic, as the time from testing to demand will nearly always be much more than one hour. Chapter 9 covers the failure measures PFDavg and PFH, and how they are calculated, in more detail.

Failure rates

Recall that, in Chapter 2, we distinguished between different types of failure - safe and dangerous, detectable and undetectable - and referred to them as failure modes. In continuous mode, there may not be enough time for diagnostics to detect a failure, as the demand on the safety function is frequent. Most of the dangerous failures must therefore be classified as ‘dangerous undetectable’ (DU). If a hardware manufacturer’s data claims
Figure 4.1 Typical behaviour of the instantaneous PFD of a low demand mode SIF over time.
that a significant fraction of failures fall into the DD category, these must be added into the DU failure rate when performing failure rate calculations on continuous mode SIFs. Also, in continuous mode, proof testing cannot be frequent enough to discover faults. Therefore, no credit is taken for proof testing.

For high demand mode SIFs, we take credit for diagnostics, as they should be frequent enough to discover failures between demands. Therefore the dangerous failure rate to use in the dangerous failure probability calculation is λDU only. However, proof testing is assumed not frequent enough, and no credit is taken.

SIL assessment methodology

Some SIL assessment methods, in their standard form, are less suitable for continuous mode SIFs. This is because these methods make the underlying assumption that an incident requires an external initiating event, whose frequency can be estimated. This applies to hazard matrix and Risk Graph methods. These methods could be adapted for continuous mode; however, in practice, a quantitative LOPA approach is normally used instead. Detailed coverage of SIL assessment methods for all operating modes is provided in Chapter 5.
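The low demand PFD behaviour described above, and the common λτ/2 rule of thumb for PFDavg, can be checked numerically. The failure rate and test interval below are illustrative assumptions:

```python
import math

# Numerical sketch of low demand PFD behaviour: the instantaneous
# PFD(t) = 1 - exp(-lambda_DU * t) since the last proof test, averaged
# over one test interval. Values are illustrative, not from the text.

lam = 1e-6    # dangerous undetected failure rate, per hour (assumed)
tau = 8760.0  # proof test interval: 1 year in hours

# Midpoint-rule average of the rising PFD curve over one test cycle.
n = 10000
pfds = [1 - math.exp(-lam * (i + 0.5) * tau / n) for i in range(n)]
pfd_avg = sum(pfds) / n

print(pfd_avg)        # close to the usual approximation lam * tau / 2
print(lam * tau / 2)  # 0.00438
```

The numerical average agrees with λτ/2 to within about 0.3% here, which is why the simple approximation is so widely used for single-channel, low demand SIFs.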
4.1.5 Tips on selecting the operating mode

In the process industry, most SIFs are in low demand mode. In the author’s experience, no more than one in 20 SIFs in the oil and gas and petrochemical industries should be assigned high demand or continuous modes. Here are some tips for identifying these few SIFs:

• Is the expected demand rate, after considering other layers of protection, more than once per year? If so, low demand mode cannot be assigned, by definition.
• Is the expected demand rate more than half of the proof test rate (inverse of the proof test interval)? If so, low demand mode is not valid. This could occur on plants where proof testing is planned only at infrequent major turnrounds.
• Is there only one demand scenario, especially a scenario that is a normal part of operating practice and not an upset condition? For example, suppose the SIF is required to keep an inspection door on a flue stack locked when the process is in operation. If the operator’s inspection of the flue stack is a routine operation, this is likely to be a high demand mode function.
• Is the SIF a permissive or inhibit function for equipment or processes that start or stop frequently - for example, an in-situ catalyst regeneration? If an incorrect permission to proceed can lead to an incident even if no other failure occurs, the SIF is likely to be in high demand mode.
• Is the SIF designed to protect against the consequences of an operator error during a frequent or routine manual operation, such as filling a charge tank? If the operator error leads to an incident with no other failure except the SIF failure, then the SIF is probably in high demand mode.
• Is it hard to identify any particular demand scenario or demand frequency, especially if it seems that the demand scenario is ‘normal operation’? This is a strong indicator that the function is continuous mode.
Fig. 4.2 provides further guidance on selecting a SIF’s operating mode. See Ref. [1], Section 3.5, for further coverage of SIF operating modes.
4.2 The objectives of SIL assessment

4.2.1 Low demand mode SIFs

For low demand mode SIFs, the purpose of the SIF is to reduce the frequency of harm from the unmitigated event likelihood, UEL (the frequency if the SIF were absent), to a ‘mitigated’ level FM. The mitigated level must be no greater than the tolerable risk. The values UEL and FM are determined during the SIL assessment process, and then used to derive a target SIL. Several different SIL assessment methods are available. Some of these require explicit calculation of UEL and FM; these are known as quantitative or semi-quantitative methods, of which the most common example is Layer Of Protection Analysis (LOPA). Other methods implicitly derive UEL and FM without requiring a calculation; these are known as qualitative methods, most commonly Risk Graph. These methods are detailed in Chapter 5.

In low demand mode, the SIF is considered to reduce the frequency of harm by a risk reduction factor RRF. The key relationship is: RRF = UEL / FM. Finding the RRF (or its inverse, the average probability of failure on demand, PFDavg = 1/RRF) gives us the required SIL directly, by reference to a table given in IEC 61508:2010 part 1 and in IEC 61511:2016 part 1. The same information is presented here in Table 4.2.
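The key relationship can be illustrated with numbers; the UEL and tolerable frequency below are invented for the example:

```python
# Sketch of the low demand relationship RRF = UEL / FM, with FM set equal
# to the tolerable frequency. Example values are illustrative assumptions.

uel = 1e-2        # unmitigated event likelihood, per year
tolerable = 1e-5  # tolerable frequency of the harm, per year (= FM target)

rrf = uel / tolerable       # required risk reduction factor
pfd_avg_target = 1.0 / rrf  # equivalent PFDavg target

print(rrf)             # ~1000
print(pfd_avg_target)  # ~0.001
```

An RRF target of 1000 sits at the boundary of the SIL 2/SIL 3 bands in the table that follows, illustrating how directly the frequency estimates drive the SIL target.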
4.2.2 High demand and continuous mode SIFs For high demand and continuous mode SIFs, the objective is to find the maximum SIF failure rate that keeps the overall risk at or below the tolerable risk level. This SIF failure rate is expressed as a target probability of failure per hour (PFH), which can be calculated by a variation of LOPA, as explained in Chapter 5.
Figure 4.2 Decision flow diagram: selecting SIF operating mode.
Table 4.2: SIL target depending on PFDavg and RRF target, for low demand mode functions.

SIL target | PFDavg target          | RRF target
No SIL     | PFDavg ≥ 0.1           | RRF ≤ 10
SIL 1      | 0.1 > PFDavg ≥ 0.01    | 10 < RRF ≤ 100
SIL 2      | 0.01 > PFDavg ≥ 0.001  | 100 < RRF ≤ 1000
SIL 3      | 10⁻³ > PFDavg ≥ 10⁻⁴   | 1000 < RRF ≤ 10,000
SIL 4      | 10⁻⁴ > PFDavg ≥ 10⁻⁵   | 10,000 < RRF ≤ 100,000
Table 4.3: SIL target depending on PFH target, for high demand and continuous mode functions.

SIL target | PFH target
No SIL     | PFH ≥ 10⁻⁵
SIL 1      | 10⁻⁵ > PFH ≥ 10⁻⁶
SIL 2      | 10⁻⁶ > PFH ≥ 10⁻⁷
SIL 3      | 10⁻⁷ > PFH ≥ 10⁻⁸
SIL 4      | 10⁻⁸ > PFH ≥ 10⁻⁹
Once the target PFH is known, the required SIL is obtained by reference to a table given in IEC 61508:2010 part 1 and in IEC 61511:2016 part 1. This is provided here in Table 4.3.
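The table lookups can be sketched as simple threshold functions. This is a convenience sketch only, not a substitute for the tables in the standards:

```python
# Sketch of the SIL target lookups of Tables 4.2 and 4.3 (thresholds as in
# IEC 61508-1 / IEC 61511-1). Function names are my own.

def sil_from_pfdavg(pfd_avg: float) -> str:
    """Low demand mode: SIL target from the PFDavg target."""
    if pfd_avg >= 0.1:
        return "No SIL"
    for sil, lower in ((1, 1e-2), (2, 1e-3), (3, 1e-4), (4, 1e-5)):
        if pfd_avg >= lower:
            return f"SIL {sil}"
    return "beyond SIL 4"

def sil_from_pfh(pfh: float) -> str:
    """High demand / continuous mode: SIL target from the PFH target."""
    if pfh >= 1e-5:
        return "No SIL"
    for sil, lower in ((1, 1e-6), (2, 1e-7), (3, 1e-8), (4, 1e-9)):
        if pfh >= lower:
            return f"SIL {sil}"
    return "beyond SIL 4"

print(sil_from_pfdavg(1 / 600))  # SIL 2 (an RRF target of 600)
print(sil_from_pfh(3e-7))        # SIL 2
```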
4.2.3 Why not use default SIL targets?

Finding UEL and FM and selecting the SIL can be an arduous process. It can easily take a SIL assessment workshop team a whole day to cover 12 to 15 SIFs, or even longer if there are many possible causes of each incident. Why not bypass all this effort and simply apply a pre-determined SIL for all cases, say a conservative SIL 3 for an offshore topsides oil platform, or a blanket SIL 2 for burner management system trips? There are two main factors that justify the trouble of selecting a SIL for each SIF individually:

1. Avoiding over-design. The cost of implementing a SIF increases dramatically as its SIL increases. This is due to the potential need for
   • higher-grade hardware
   • more complex architecture
   • more redundancy
   • more stringent software design and testing requirements
   • greater independence during assessment and audit of the safety lifecycle activities
   • more frequent and more stringent hardware testing and maintenance
   • potentially increased spurious trip rate, resulting in more plant downtime.
Such cost increases typically far outweigh the cost savings that would result from reduced SIL assessment workshop time. Indeed, the workshop study may (and often does) show that some SIFs are not actually necessary at all, or can be downgraded to non-SIL-rated functions.

2. Avoiding under-protection. In certain cases, the SIL assessment process may reveal that the unmitigated risk is too great to be managed by a single SIF, even at SIL 3 (which is the highest SIL that can normally be achieved in practice, for process applications). This would point to a need to reconsider other, non-SIF, protection layers (see Section 4.6) or make design changes to the process, such as changing plant layout or operating conditions.

Two further benefits of completing the SIL assessment process for every SIF are:

3. It gives the engineering team an opportunity to confirm the critical elements of the SIF. As we saw in the example in Chapter 2, not all actions taken by the SIF are necessary for successful prevention of harm. If unnecessary elements are included in the SIF, it adds to the cost of the SIF and makes the SIL verification process difficult (this is covered in Chapter 9).

4. It compels the team to review the risk scenario in detail. This can often yield important findings, such as:
   • initiating causes for which the SIF is unable to mitigate the harm. For example, the SIF may be unable to detect certain specific causes, or unable to respond fast enough, or the SIF’s action may not completely prevent the harm from specific causes.
   • opportunities for design improvement. For example, the team may find that the SIL can be reduced if a pre-alarm is added or redesigned. (There’s more on this topic in Section 4.6.) Also, an opportunity for intrinsically safer design (see Chapter 1) may have been missed.
   • important factors that are critical for successful operation of the SIF, such as specific operator training needs
   • any secondary or ‘knock-on’ effects that can result when the SIF trips, either on demand (when the initiating cause is present) or spuriously (when the cause is absent). In many cases, an additional SIF may be needed to mitigate harm from these secondary effects. This type of ‘secondary function’ is covered in Chapter 6.
4.2.4 Prevention or mitigation?

So far, our discussion has assumed that successful operation of the SIF reduces the harm to zero. In practice, this is rarely the case. Two different SIF designs are relevant here: these are often known as prevention functions and mitigation functions.

In the case of a prevention function, the harm to the risk receptor of interest is reduced effectively to zero. This is, however, at the cost of introducing a smaller degree of harm to another risk receptor. For example, a SIF may prevent overpressure by venting a system to flare. The risk of rupture - hence, injury and damage - is eliminated when the SIF operates, but in its place a flaring incident occurs. This results in (relatively minor) environmental and reputational damage, along with other effects such as loss of product. All in all, a major impact is traded for a minor one, usually on a different risk receptor. This minor impact is usually negligible, because its tolerable frequency is much higher than that of the major impact averted by the SIF - usually by 2 or 3 orders of magnitude.

In exceptional cases, however, it may not be negligible. For example, a trip may prevent equipment damage, but result in a plant shutdown causing substantial loss of production. In such a case, the analyst needs to assess the net effect of the SIF failure, i.e. the difference in level of overall harm (usually expressed in money terms) between SIF failure and SIF success, and then use that as the basis for determining the SIL target.

In other situations, the SIF may create a new hazard, or increase the likelihood of an existing hazard. For example, if multiple depressurisation SIFs act simultaneously (on sitewide loss of power or cooling water, say), the flare may be overloaded, leading to new risks. In such cases, the SIS must be designed to prevent this secondary harm: in this example, the depressurisation rates of the SIFs could be limited, or time delays provided to stagger the trip events.
Alternatively, so-called ‘secondary SIFs’ can be provided to manage the harm, as we discuss further in Chapter 6.

A mitigation function is a function that reduces the severity of the harm to the primary risk receptor, but may not be able to eliminate it entirely. An example is a fire detection system, which senses heat, flame or smoke in a work area and activates a deluge. The deluge cannot completely prevent the risk of damage or injury, since the fire may have already caused some harm before the deluge acts. Such cases should be split into two separate risk contributions:

• a lower frequency, higher severity risk case corresponding to failure of the SIF, and
• a higher frequency, lower severity risk case that applies whether the SIF works or not.
Determining a SIL target for a case like this could be done by Event Tree Analysis or Bowtie Analysis. The analyst can use these methods to calculate a maximum tolerable probability of failure for the SIF, and then apply this directly to find the SIL target from Table 4.2.
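The back-calculation for a mitigation function can be sketched as a two-branch event tree; the demand frequency and tolerable frequency below are invented for the deluge example:

```python
# Sketch: event-tree style split for a mitigation function. The unmitigated
# demand frequency branches on SIF success/failure, and the maximum PFD is
# back-calculated so the high-severity branch meets its tolerable frequency.
# All numbers are illustrative assumptions.

demand_freq = 1e-2              # fire scenarios per year
tolerable_high_severity = 1e-5  # tolerable frequency of the unmitigated fire

# Branch frequencies for a given PFD:
#   high severity (deluge fails):  demand_freq * pfd
#   lower severity (deluge works): demand_freq * (1 - pfd)
max_pfd = tolerable_high_severity / demand_freq
print(max_pfd)  # ~0.001, i.e. the SIL 2/SIL 3 boundary in Table 4.2
```

The lower-severity branch still occurs at close to the full demand frequency, which is why it must be assessed as a separate risk contribution.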
4.3 Identifying and documenting SIFs

4.3.1 Objective

When it is time to begin the SIL assessment workshop, we will ideally be presented with a complete and accurate list of SIFs for analysis. In the real world, this doesn’t often happen. Instead, it is often necessary to develop a SIF list using the information available to us. This is a crucial step for the success of later steps of the lifecycle, and can be challenging if the input information is not in a format that readily lends itself to SIF list development. The following sources of input information are suitable for SIF list development. Those at the top of the list are the most convenient resources for this purpose; the task becomes increasingly troublesome (time-consuming, complex or inefficient) as we go down the list.

• Process control narratives and interlock descriptions
• Cause and effect diagrams (C&EDs)
• HAZOP study reports, and SIL assessment reports from previous versions of the design
• Binary logic diagrams
• Interlock logic diagrams
• P&IDs
Whichever resources are used, the target is to draw up a list of potential SIFs, and to develop a set of minimal input information for each one. This set of information needs to include:

• the sensors and final elements for each SIF. Often, this can conveniently be written in the form of an ‘On’ statement, such as: ‘On high pressure in D-100 from PI-1201 or low level in T-101 from LI-1202: stop pump P-130 and close valve XV-1234.’ The final element group need not, at this stage, be reduced to the ‘critical’ items - the minimum set of actions required to achieve the safe state. It is important, however, to note the intended logic (OR, AND, 2oo3 etc.) for the sensor group. In the example just mentioned, the logic is OR.
• the causes of demand on the SIF. Typically, these are either failure events (e.g. pump trip), an operator error (e.g. misoperation of a valve), a planned event (e.g. operator presses a button to light a burner) or an undesired but credible event (e.g. another SIF acts). Apart from the experience of the process engineer, the HAZOP report is usually the best source of this information. It is essential to capture all possible causes of demand, for reasons we will explore later. For example, if one possible cause is
spurious closure of a shutdown valve, we should note all the reasons the valve could close, e.g. valve failure, loss of air supply, action of an interlock on the valve, or operator action (intentional or not).

Should time be invested developing a SIF list ‘offline’, that is, in the workshop facilitator’s office before the SIL assessment workshop begins? That depends on the source - and quality - of the input information. If the information is:

• good quality (complete, detailed, accurate, up to date, relevant) and
• extensive (e.g. P&IDs or a HAZOP report, which may run to hundreds of pages) and
• understandable without detailed explanation from the design or operations team (this is often not the case for C&EDs and interlock logic diagrams),
then a ‘prefilling’ exercise like this is probably worthwhile and likely to save workshop time overall.
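The minimal SIF record described above might be captured in a structure like the following. The field names are my own, the sensor and final element tags come from the ‘On’ statement in the text, and the demand causes shown are invented placeholders:

```python
from dataclasses import dataclass

# Sketch of the minimal per-SIF input record described in the text.
# Field names are illustrative assumptions.

@dataclass
class SIF:
    name: str
    sensors: list        # sensor tags with trip direction
    sensor_voting: str   # intended logic, e.g. 'OR', 'AND', '2oo3'
    final_elements: list # actions, not yet reduced to the critical items
    demand_causes: list  # every credible cause of demand

sif = SIF(
    name="D-100 high pressure / T-101 low level trip",
    sensors=["PI-1201 high", "LI-1202 low"],
    sensor_voting="OR",
    final_elements=["stop pump P-130", "close valve XV-1234"],
    demand_causes=["upstream control failure", "operator error"],  # placeholders
)
print(sif.sensor_voting)  # OR
```

A record like this makes it easy to spot, during prefilling, which SIFs still lack a captured demand cause before the workshop starts.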
4.3.2 Using process control narratives, interlock descriptions

These take the form of textual descriptions, explaining the inputs and outputs of the control and safety functions required in the design. Both basic process control system (BPCS) and SIS safety functions may be shown. This information can be used as a starting point for a SIF list by adapting the functions one-for-one into the form required.

Example of an interlock description

Interlock I-99-01: Protection of Export Gas Compressor C-1611
Objective: Prevent damage to C-1611 in the event of high discharge temperature or pressure
Functional description:
Initiators: Any one of the following events:
• PT-16097 high high (C-1611 discharge pressure)
• TT-16064 high high (C-1611 discharge temperature)
• TT-16158 high high (Aftercooler E-1613A/B gas outlet temperature)
Actions:
• Close XV-16126 (gas turbine trip valve for turbine of C-1611)
• Close XV-16214 (export gas feed to C-1611)
• Open BDV-16213 (gas turbine feed gas to flare)
Some of the functions may need to be split up into two or more SIFs if they include conditional cases. For example, the description may state that one of the actions - closing valve XV-1200, say - should be done only if the SIF acts during startup. A single SIF
cannot contain a conditional action like this, so we are likely to need two separate SIFs: one for the startup case (with the additional action on XV-1200), and one for all other times.

A number of other special cases may arise, such as interlocks that are activated only by pushbuttons, or those that trigger other interlocks. These cases are discussed in detail in Chapter 6. Some of the interlocks may be highly complex and have large numbers of sensors and final elements. This situation often arises with master unit or plant shutdown interlocks, which may be given labels like ESD-1 and ESD-2 (ESD meaning ‘emergency shutdown’; you may also see PSD for ‘process shutdown’ and USD for ‘unit shutdown’). These cases are likely to need the assistance of the design team to unravel all of the constituent SIFs. A procedure for this is given in Section 4.3.6.
4.3.3 Using cause & effect diagrams (C&EDs) These are large tables showing a list of SIS initiators (on the left side) and final elements (along the top). Symbols are placed in the appropriate row/column intersections to show the final elements (effects) that act when each initiator (cause) detects a dangerous condition. Symbols can be chosen to indicate the specific behaviour of the final element (such as C ¼ close, O ¼ open, E ¼ energise, D ¼ de-energise), or a general symbol can be used (X ¼ do something). A highly simplified example is shown in Table 4.4. Although simple to interpret, C&EDs have their weaknesses. To fit on a normal-sized sheet of paper (A3 or 11 x 17 inches), the typeface often has to be shrunk to an uncomfortably small size. This can make it hard to work with, whether in hardcopy or softcopy. Another issue is that they are frequently out of date and, in practice, out of step with other project documents. If discrepancies are found during the SIL assessment workshop, they must be noted and meticulously corrected afterwards. Developing a SIF list from a C&ED begins with working down the initiators list on the left of the diagram. The initiators often have logical groupings, which may be indicated by Table 4.4: Simplified example of a Cause & Effect Diagram. Causes
Causes: Sensor | High or low | Voting | Response time (end to end)
Effects: XV-47 | XV-48 | XV-49
PIT-1234A | Low | 2oo3
SIL 2 | 10⁻⁶ > PFH ≥ 10⁻⁷
SIL 3 | 10⁻⁷ > PFH ≥ 10⁻⁸
SIL 4 | 10⁻⁸ > PFH ≥ 10⁻⁹

In this table, PFH means the probability of SIF failure per hour.
The IEF is required in events per hour; therefore, we include a conversion factor of 8760 h per year:

IEF = 1 × 10⁻⁴ / (1 × 1 × (0.4 × 0.1) × 8760) ≈ 3 × 10⁻⁷/hr

From this, the SIL target is selected according to Table 5.26, which is taken from IEC 61511-1:2016, Table 5. The target maximum failure rate calculated above is treated as equivalent to the SIF's probability of failure per hour. The result is a SIL 2 target.
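The conversion and band lookup above can be sketched in Python. This is an illustrative sketch (not from the book): the inputs are the worked example's values, and the PFH bands follow the standard IEC 61511-1 continuous/high demand mode table.

```python
# Sketch of the per-hour conversion and SIL band lookup described above.
HOURS_PER_YEAR = 8760

tolerable_freq = 1e-4                 # tolerable outcome frequency, per year
lopa_credits = 1 * 1 * (0.4 * 0.1)    # product of the LOPA reduction factors

# Target maximum failure rate of the SIF, in events per hour
ief = tolerable_freq / (lopa_credits * HOURS_PER_YEAR)
print(f"IEF = {ief:.1e}/hr")          # about 3e-7/hr

def sil_from_pfh(pfh):
    """Map a target failure rate per hour to a SIL band (IEC 61511-1, Table 5)."""
    if 1e-9 <= pfh < 1e-8:
        return "SIL 4"
    if 1e-8 <= pfh < 1e-7:
        return "SIL 3"
    if 1e-7 <= pfh < 1e-6:
        return "SIL 2"
    if 1e-6 <= pfh < 1e-5:
        return "SIL 1"
    return "outside SIL bands"

print(sil_from_pfh(ief))              # SIL 2
```

The calculated rate of about 3 × 10⁻⁷/hr falls in the SIL 2 band, matching the result stated in the text.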
5.10 Fault tree analysis

5.10.1 Method overview

For some complex SIF demand scenarios, LOPA can become cumbersome or confusing. This may be the case when:

• The SIF provides mitigation in some circumstances, but not others (for example, depending on operating conditions such as feedstock composition or whether other units of the plant are online)
• Correlations exist between some elements of the LOPA. For example, suppose a plant is manned only during the daytime. Alarm response is therefore a valid IPL only during the daytime. Also, the probability of operator presence in the effect zone is different between day and night.
In such cases, a good solution is to apply Fault Tree Analysis. This can be used for SIL assessment in two ways: 1. Exclude the SIF under consideration from the analysis; use the Fault Tree to calculate the unmitigated event likelihood (see discussion on LOPA, above); from this, calculate
the risk reduction gap that the SIF needs to fill. This method is appropriate when the SIF can mitigate in all the situations modelled in the Fault Tree.
2. Include the SIF in the Fault Tree as a layer of protection for applicable cases. Treat its PFD as a variable, which is back-calculated by setting the top event frequency equal to the tolerable frequency. This is the method of choice when the SIF does not mitigate in every situation in the Fault Tree.

Method 2 is applied in the following example. A liquid propane tank supplies propane to 'Plant B' for processing. Plant B requires gaseous propane at a temperature above 0 °C. To achieve this, a water bath heater is provided to vaporize and heat the propane. If a fault occurs and the propane is sent to Plant B as cold liquid, it can lead to piping failure, resulting in a propane release, fire and possible operator fatality. A dispersion analysis has shown that vapour cloud explosion is not expected, as any propane gas cloud will disperse due to the small piping diameter and other local factors.

Plant B operates in 'low flow' and 'high flow' modes. In low flow mode, the consumption of propane is low, and a small standby electric heater will start automatically. There is also a low temperature alarm, with sufficient time for operator intervention. In 'high flow' mode, the standby electric heater does not have sufficient capacity, and there is not enough time for operator intervention to the alarm (due to the higher flow velocity). For this case, a low temperature trip is provided to trip the propane pump.

Fig. 5.6 is a schematic of the equipment described, and Fig. 5.7 shows a Fault Tree that can be used to calculate the RRF target of the low temperature SIF. The SIF failure probability value of 0.028 shown at the bottom right of the Fault Tree in Fig. 5.7 is initially unknown, and is back-calculated from the other values in the Fault
Figure 5.6 Propane heater example: schematic.
Figure 5.7 Propane heater example: fault tree.
Tree, by setting the top event frequency to the tolerable frequency (1e-4/year). From this, the RRF target is calculated as 1/0.028 = 36, giving a SIL 1 target.
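The final step of this back-calculation can be sketched as follows. This is an illustrative sketch only: the PFD of 0.028 comes from the Fault Tree in Fig. 5.7, and the RRF-to-SIL mapping assumed here is the standard low demand mode banding.

```python
# Sketch of the final step: the SIF PFD that makes the fault tree's top
# event equal the tolerable frequency was found to be 0.028; the RRF and
# SIL targets follow from it.

pfd_target = 0.028
rrf_target = 1 / pfd_target
print(f"RRF target = {rrf_target:.0f}")    # 36

def sil_from_rrf(rrf):
    """Map a demand-mode RRF target to a SIL band (RRF 10-100 is SIL 1, etc.)."""
    if 10 <= rrf < 100:
        return "SIL 1"
    if 100 <= rrf < 1_000:
        return "SIL 2"
    if 1_000 <= rrf < 10_000:
        return "SIL 3"
    if 10_000 <= rrf < 100_000:
        return "SIL 4"
    return "below SIL 1"

print(sil_from_rrf(rrf_target))            # SIL 1
```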
5.10.2 Documenting Fault Tree analysis

FTA tends to rely on a considerable amount of input data, and the team will likely make several assumptions while developing the Fault Tree. All these inputs and assumptions must be meticulously recorded to provide justification and traceability. A reasonable way to do this is to number each element in the Fault Tree. Then, a separate list is prepared
detailing accompanying notes for each numbered element. The notes list should be prepared during the workshop and agreed by all participants, to minimise the need for further discussions and rework later.
5.11 Cost/benefit analysis

5.11.1 Introduction

When the severity of the undesired outcome can be expressed as a money value, this can be compared with the cost of various potential SIF designs at different SILs. This enables the optimal design to be selected in terms of benefit-to-cost ratio. This method is rarely applied for cases where the outcome is a health or safety impact, as it is ethically challenging to assess the money value of death and injury. It can reasonably be applied to cases where the major impact is financial, e.g. equipment damage, downtime, lost production, fines, and/or contractual penalties.

Two sets of values need to be calculated: the cost of the outcome, and the cost of various potential SIF designs. These are calculated on an annualized basis. As an approximation, the costs could be assumed to be incurred in full on Day 1 of the plant's operational lifetime. Alternatively, the 'net present value' can be calculated: this is the amount of money which, if invested on Day 1 at a specified interest rate, would generate the correct amount of interest to cover the rising price of repairs and labour (due to inflation) over the plant's lifetime.

Exam preparation tip
Functional safety exam questions covering cost/benefit analysis will expect you to calculate costs on a net present value basis. The applicable formula is:

PV = C / (1 + i)^n

where PV is the present value (the value you should calculate for comparison), C is the total value over the plant lifetime, i is the annual interest rate to be applied (as a decimal, e.g. 0.05 for 5%), and n is the plant lifetime in years [8,9].
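The formula can be sketched as a small function. The figures in the usage line are hypothetical, chosen only to illustrate the calculation.

```python
# Minimal sketch of the present value formula PV = C / (1 + i)^n.

def present_value(c, i, n):
    """Discount a total future cost C to Day-1 money at annual rate i over n years."""
    return c / (1 + i) ** n

# Hypothetical: $1M of lifetime cost, 5% annual interest, 30-year lifetime
pv = present_value(1_000_000, 0.05, 30)
print(f"PV = ${pv:,.0f}")    # roughly $231,000
```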
5.11.2 Calculating the cost of the outcome

First, the predicted 'reasonable worst case' incident cost is calculated, including: equipment damage, downtime, lost production, fines, compensation, cost of environmental remediation, contractual penalties, and cost of damage to the company's reputation. The cost estimate should allow for inflation, increasing fines and environmental standards, etc.
over the plant lifetime, and therefore should be higher than the predicted cost should an incident unfortunately occur on Day 1. This is multiplied by the number of times the incident is expected to occur during the plant's lifetime. If no SIF is provided, the event is expected to occur at a frequency equal to the initiating event likelihood, multiplied by the reduction factors we would use for LOPA: enabling conditions, independent protection layers (IPLs), and conditional modifiers (see Section 5.9 in this chapter, and the section on IPLs in Chapter 4). The calculation is done first without the SIF, and then repeated for each SIF design to be evaluated, taking credit for the risk reduction factor (RRF) of the SIF as an IPL. The SIF is normally assumed to reduce the frequency, but not the severity, of the incident.

Example
• Cost per incident (average over plant lifetime): $24M
• Incident frequency without SIF: 0.05/year
• Plant lifetime: 30 years
• SIF designs to be evaluated have target RRFs of 12, 75 and 300 (Table 5.27).

Table 5.27: Example calculation of the cost of an incident.

Case | Cost per incident ($) (A) | Incident frequency (year⁻¹) (B) | Incident cost over plant lifetime ($) (C = A × B)
No SIF | 24,000,000 | 5 × 10⁻² | 480,000
RRF = 12 | 24,000,000 | 4.2 × 10⁻³ | 100,800
RRF = 75 | 24,000,000 | 6.7 × 10⁻⁴ | 16,080
RRF = 300 | 24,000,000 | 1.7 × 10⁻⁴ | 4,080
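The arithmetic behind the mitigated rows of Table 5.27 can be sketched as follows. This is an illustrative sketch: the SIF divides the unmitigated frequency by its RRF, and column C is the product A × B, with B rounded to two significant figures first (as in the printed table).

```python
# Sketch of the Table 5.27 arithmetic for the mitigated (with-SIF) rows.

cost_per_incident = 24_000_000    # $ (column A)
freq_no_sif = 0.05                # per year, without the SIF

for rrf in (12, 75, 300):
    freq = float(f"{freq_no_sif / rrf:.1e}")   # mitigated frequency (column B), 2 sig. figs
    cost = cost_per_incident * freq            # column C = A x B
    print(f"RRF = {rrf}: B = {freq:.1e}/yr, C = ${cost:,.0f}")
```

This reproduces the table's values of $100,800, $16,080 and $4,080 for the three candidate designs.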
5.11.3 Calculating the cost of the SIF

The cost of the SIF is evaluated as the sum of:

• The upfront cost of design, procurement and installation. This is assumed to be fixed at Day 1 costs, as the cost will be incurred at present, not at a future date.
• The cost of replacement or refurbishment at the end of the SIS working life, if this is shorter than the plant lifetime (as it typically is).
• The ongoing cost of proof testing and maintenance. As this will increase over the plant lifetime, an average value should be used.
The ongoing cost should not be ignored, as in practice it can be a significant fraction of the total lifetime cost of the SIF (see example calculation in Table 5.28).
Table 5.28: Example calculation of the cost of a SIF.

SIF design | Upfront cost (D) | Future replacement cost (E) | Maintenance cost over plant lifetime ($) (F) | Total SIF cost ($) (G = D + E + F)
No SIF | 0 | 0 | 0 | 0
RRF = 12 | 30,000 | 72,000 | 30,000 | 200,000
RRF = 75 | 80,000 | 192,000 | 45,000 | 317,000
RRF = 300 | 150,000 | 360,000 | 60,000 | 570,000
5.11.4 Selecting the optimal solution

Finally, the reduced lifetime incident cost achieved by each SIF design is compared with the cost of implementing the SIF. The optimal solution is the one that gives the highest lifetime net cost reduction, as this represents the best return on investment. In the example shown in Table 5.29, the RRF = 12 design is the optimal solution. The RRF = 300 solution is a bad investment, as the costs will not be recovered over the plant lifetime.

Table 5.29: Example calculation of the optimal solution.

SIF design | Incident cost over plant lifetime ($) (C) | Incident cost saving relative to no SIF ($) (H = C_NoSIF - C) | Total SIF cost ($) (G) | Net reduction ($) (H - G)
No SIF | 480,000 | 0 | 0 | 0
RRF = 12 | 100,800 | 379,200 | 200,000 | 179,200
RRF = 75 | 16,080 | 463,920 | 317,000 | 146,920
RRF = 300 | 4,080 | 475,920 | 570,000 | -94,080
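The selection logic of Table 5.29 can be sketched as follows. The values are taken from the table; the loop simply computes the saving H relative to 'No SIF' and picks the design that maximizes H - G.

```python
# Sketch of the optimal-solution selection in Table 5.29.

c_no_sif = 480_000                     # incident cost with no SIF ($)
designs = {                            # design: (incident cost C, SIF cost G)
    "RRF = 12":  (100_800, 200_000),
    "RRF = 75":  (16_080, 317_000),
    "RRF = 300": (4_080, 570_000),
}

best_name, best_net = None, None
for name, (c, g) in designs.items():
    saving = c_no_sif - c              # H: incident cost saving
    net = saving - g                   # H - G: net lifetime cost reduction
    print(f"{name}: H = ${saving:,}, net = ${net:,}")
    if best_net is None or net > best_net:
        best_name, best_net = name, net

print("Optimal:", best_name)           # RRF = 12
```

Note that the RRF = 300 design yields a negative net reduction, reflecting the text's conclusion that its cost is never recovered.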
5.12 The SIL assessment workshop

5.12.1 The SIL assessment team

A team with sufficient knowledge and information for the task is essential for successful completion of the SIL assessment. Experience of SIL assessment is not a must, as the chairman's main role is to guide the team through the task and ask all the right questions. The team needs to be:

• able to answer the technical questions raised, without having to refer to outside authorities for most issues; and
• small enough to proceed efficiently, without getting bogged down in lengthy discussions.
The ideal SIL assessment team consists of around 6 to 8 members and includes knowledgeable representatives of all relevant stakeholders, including the risk owner, the main engineering contractor, the licensor of the process (if any), and the vendor of packaged equipment (if any). In terms of engineering know-how, the key disciplines that need to be represented are detailed in Table 5.30. Some disciplines need not be present full-time, but need to be available 'on call' to answer questions, in person or remotely.

Table 5.30: Engineering disciplines required for SIL assessment workshops.

Discipline | Full time or on call? | Type of input required
Process engineer | Full time | Impact of process upsets, and the possible outcomes. Aware of typical or historical problems such as fouling, runaway reactions. Cost (and other impact) of downtime.
Instrument engineer | Full time | Typical reliability and failure rate of field equipment (sensors and shutdown valves). Proposed design of control system and SIS architecture, to answer questions about redundancy, independence etc.
Utilities engineer | On call | Reliability and failure rates of utilities (power, instrument air, nitrogen etc.)
Electrical and/or rotating equipment engineer | Depends on equipment under study | Design of control systems for electrical equipment, to answer questions about independence of control and trip functions, backup/redundant equipment, autostart functions etc.
HSE engineer | On call | Design of mitigation systems. Design basis of relief systems. Tolerable risk targets. Procedural questions (any issues not covered in written SIL assessment procedure)
Documentation clerk | On call | Provide supporting documents on request
For more coverage of the role of the chairman and team members, see Chapter 14 of Ref. [8].
5.12.2 Overall objectives of the SIL assessment workshop

Clearly, the main objective of SIL assessment is to identify the SIL and failure measure target for each SIF. However, further information about the SIFs will be needed at a later stage, during SRS development, and the SIL assessment workshop is a good opportunity
to determine this information. Thus, it is helpful if the workshop can develop the following 5 key outputs:

• Target SIL (and optional RRF, PFDavg or PFH target)
• SIF operating mode
• The minimum set of elements defining the SIF architecture (often known as the 'critical elements')
• Demand scenarios for the SIF
• Safe process state that the SIF is required to achieve (see Chapter 7 for a discussion)
Table 5.31 lists extra information that should also be captured in the SIL assessment report, for easy extraction when needed later.
Table 5.31: Additional output from the SIL assessment study.

Information | Why this is needed
Predicted SIF demand rate | Must be documented in the SRS (see Chapter 7).
Consequences of spurious trip | This may be needed to assess the spurious trip target when preparing the SRS. Sometimes this is already identified during HAZOP.
Alarms used as IPLs with credit taken | This information will be needed during an alarm rationalization study.
The HAZOP scenario number associated with each demand scenario | This makes it easy to trace backwards during verification (see Chapter 10) and Management of Change (see Chapter 11).
Names of key personnel participating in SIL assessment | Their competency needs to be checked during Functional Safety Assessment (see Chapter 10).
The following are some other issues that are good to confirm during the SIL assessment workshop, as there may be no other point in the lifecycle where they are naturally addressed:

• Ensure that every case identified in HAZOP as needing a SIF (i.e. where a significant risk gap exists in the absence of a SIF) actually has a corresponding SIF in the design.
• Ensure that every significant HAZOP scenario for which a SIF was nominated as a safeguard has been assigned to a SIF in the SIL assessment.
• Confirm the selected set point is appropriate for analog sensors.
Exercises

1. During SIL assessment, which of the following is the typical way to handle multiple risk receptors (e.g. safety impact, environmental impact, financial loss)?
(a) Evaluate the SIL and failure measure target of each risk receptor separately, then select the worst case as the final result.
(b) Combine the risk receptors together to evaluate a cumulative SIL and failure measure target.
(c) Consider safety and environmental impact only, as other risk receptors are outside the scope of IEC 61511.
2. For some projects, the tolerable risk criterion is expressed as a band of values: an upper 'tolerable' limit and a lower 'acceptable' limit, with the region between them known as the 'ALARP' region. Suggest a way to select a tolerable risk and, hence, SIL target in this case.
3. For each of the following events, state whether it should normally be considered a valid initiating event for SIL assessment:
(a) Failure of a level control loop
(b) Flow control loop inadvertently left in manual mode
(c) Check valve failure
(d) Autostart of a standby pump malfunctions and starts the pump when not required
(e) Mechanical 'slam shut' pressure protection valve closes inadvertently
(f) Pipeline scraper (pig) stuck in the pipeline
(g) Spontaneous failure of a pressure vessel
(h) Mistake by operator: skips a step in a manual procedure
4. Prepare, and calibrate, a risk matrix for SIL assessment based on the following categories.

Likelihood categories
Category label | Frequency (year⁻¹)
1 | <0.05
2 | 0.05 to 0.3
3 | 0.3 to 1
4 | >1
Severity categories

Category label | Personnel | Environment | Financial loss | Tolerable frequency
A | First aid injury | Minor release | <USD1M | Once in 10⁵ years
5. Prepare, and calibrate, a Risk Graph for SIL assessment based on the following categories.

Categories for frequency of demand on the SIF

Category label | Frequency (year⁻¹)
W0 | <0.05
W1 | 0.05 to 0.3
W2 | 0.3 to 1

Severity categories

Category label | Severity description | Tolerable frequency
Ca | First aid injury | Once in 10² years
Cb | Up to 3 days of lost time | Once in 10³ years
Cc | Long term injury, permanent disability | Once in 10⁴ years
Cd | One fatality onsite | Once in 10⁵ years

Categories for exposure parameter

Category label | Description
Fa | Operator is rarely in the hazardous zone
Fb | Operator is frequently in the hazardous zone
Categories for avoidance parameter

Category label | Description
Pa | It may be possible for the exposed person to avoid the consequences
Pb | It is almost impossible for the exposed person to avoid the consequences
6. Suppose the consequence of SIF failure is a plant outage, costing USD100k per hour until the plant is restarted. How should the severity category be selected in this case? Why?
7. Suppose a SIF is assigned a target of SIL 'a'. Is it necessary to implement this SIF in a SIL-rated SIS?
8. Give 3 reasons why it is important to document input data (e.g. initiating event frequencies) and assumptions made during the SIL assessment workshop.
9. Explain the difference between 'No SIF' (or 'SIL 0') and SIL 'a'.
10. Suppose your team is performing LOPA on a case where the initiating event can occur only when the plant is using a specific grade of feedstock, which occurs 10% of the time. How should this 10% factor be treated in LOPA? Choose the best answer.
(a) As an enabling condition
(b) As an IPL
(c) As a conditional modifier
(d) As a reduction in SIL target
(e) It should be ignored, as LOPA considers the worst case only
11. Perform LOPA to determine the SIL target for the SIF described in the following scenario. Refer to Fig. 5.8.
Figure 5.8 Amine absorption column: schematic.
An amine absorption column is provided to strip H2S from natural gas. The column operates at a relatively high pressure, while the downstream amine regeneration equipment is a low pressure system. The amine solution level in the bottom of the column provides a seal between the two systems. If the level control malfunctions and the level is lost, gas breakthrough to the downstream system can cause overpressure, failure, fire/explosion and a fatality. A SIF is provided to isolate the low pressure system on low low level in the column. An independent low level alarm is provided in the column. If the level control valve is wide open, the time from alarm to empty is 4 min. A pressure relief valve is provided in the downstream equipment, with the required flow capacity for this case. The downstream relief system is suitable for the sour gas. The tolerable frequency for a single fatality case is once in 5 × 10⁴ years.
12. The preceding LOPA case did not explicitly consider other initiating events such as loss of lean amine supply to the column (e.g. due to pump trip, inlet valve closing inadvertently, etc.). Why do these initiating events not need to be considered?
13. Consider the Fault Tree example in Fig. 5.7 (the propane heater). Credit is taken for the low temperature alarm in the 'low flow' operating case. Does this break the 'Rule of Two' discussed in Section 4.6.5? If so, suggest a possible solution.
14. Consider the Fault Tree example in Fig. 5.7 (the propane heater). Suppose the plant operating profile changes so that the high flow mode is now expected 80% of the time. Recalculate the RRF and SIL target of the low temperature SIF.
15. Can cost/benefit analysis be applied to continuous mode SIFs? Explain your reasoning.
16. What is the relationship between cost/benefit analysis and the ALARP concept? Could they be applied together in the same SIF?
17. A SIF is provided to prevent overflow of a waste acid tank. If the tank overflows, it will cause damage to surrounding equipment valued at $100k. Also, any person nearby could be injured by the corrosive liquid, leading to long term treatment; however, the area has only a 10% chance of occupancy when the incident occurs. Other relevant parameters are as follows:
• Sum of the initiating event frequencies for all causes: 0.5/year
• A high level alarm is provided, with sufficient time for safe operator response
Select an appropriate SIL assessment method, and assign an appropriate SIL for this SIF, using the categories defined in Tables 5.7 and 5.8.
18. A turbo generator is provided with an overspeed prevention SIF. It is intended to protect in two different operating modes:
• Startup: The generator runs with the breaker open (no electrical connection to the plant's power network, i.e. no load)
• Normal operation: The breaker is closed (power is supplied to the network). The overspeed SIF opens the breaker (to prevent damage to equipment on the power network) and de-energises the turbine (shuts down the steam supply).
Other information is provided as follows:
• Overspeed cause: Malfunction of turbine controller (0.1 times/year)
• Overspeed is equally likely during startup and normal operation
• The turbine is in startup less than 1% of the time
• The consequence of overspeed is:
  • Damage to the turbo generator, resulting in downtime valued at $10M
  • Damage to the equipment on the power network, incurring repair costs of $1M but no downtime (all relevant equipment is spared)
  • For this exercise, the possibility of injury due to machine disintegration can be ignored.
• The turbine contains a mechanical overspeed protection system that is independent of the SIF and of the root cause. Its PFD is 0.05.
• A high speed alarm is provided, but there may not be enough time for operator response.
Using a tolerable risk target of once in 10⁴ years for a financial loss of $10M, select an appropriate SIL target for the SIF.
Answers

Question 1: Answer

The correct answer is (a). Regarding answer (c), while it is true that IEC 61511 states only safety and environmental risk receptors must be considered, it also mentions that the same methodology can optionally be applied to other risk receptors.
Question 2: Answer

One solution is to select two SIL targets for each case, corresponding to the upper and lower bounds of the ALARP region. The SIS design team should then aim to achieve the higher SIL target. If it proves unfeasible to achieve the higher target, then an ALARP argument is required to justify implementing a SIL between the upper and lower targets. The lower target would be the absolute limit.
Question 3: Answer

(a) Valid
(b) Not normally valid. 'Failure of control loop' is considered as a composite cause, without distinguishing among the individual root causes that may lead to loop failure.
(c) Not valid: this is failure of a safeguard.
(d) Valid. Although it could be considered failure of a safeguard, it can lead directly to an incident.
(e) Valid, although this event should be rare, so it may make a negligible contribution to the SIL target compared with other initiating events.
(f) Valid
(g) This event is sufficiently rare that it is not normally considered in SIL assessment.
(h) Valid.
Question 4: Answer

In this matrix, because the likelihood categories are not simply 1 order of magnitude apart, each row must be calibrated separately. A target RRF is calculated at one point in the row, and then converted to a SIL target. The other columns in the same row can then be filled by stepping up or down by one SIL. For row 4, as no maximum frequency is defined, we have arbitrarily assigned a SIL target one SIL higher than row 3.

Likelihood category | Severity A | Severity B | Severity C | Severity D
1 | SIL 'a' | SIL 1 | SIL 2 | SIL 3
2 | SIL 1 | SIL 2 | SIL 3 | 'b'
3 | SIL 1 | SIL 2 | SIL 3 | 'b'
4 | SIL 2 | SIL 3 | 'b' | 'b'
This risk matrix is an example of a matrix that would be difficult to work with in practice, due to the high SIL targets defined (partly as a result of the rather stringent tolerable risk criteria) and the lack of any cells with a 'No SIF' target. For example, every case with severity D (fatality) would be assigned at least SIL 3. Problematic risk matrices like this are often seen in practice. To improve it, the number of likelihood categories should be increased so that lower frequencies can be accommodated.
Question 5: Answer

The result is shown in Fig. 5.9. In this calibration, W0 was treated as 0.05/year; W1 as 0.3/year; and W2 as 1/year with SIL targets rounded up to allow for multiple initiating events (as F = 1/year cases are on the boundary between two SILs). You may choose not to round up, in which case SIL targets for W1 and W2 will be the same.
Question 6: Answer

Assuming the plant can be quickly restarted once the outage is noticed, we identify a reasonable time period for the operator to notice the outage and restart the plant, for example, 1 h. Then the severity is selected according to the cost of this length of outage. However, if the restart time is long, then the restart time is used as the basis of the cost calculation.
Question 7: Answer

The SIF need not comply with IEC 61511, and therefore need not be implemented in a SIL-rated SIS. However, if it is moved to the BPCS, it may lose independence from the initiating event or IPLs, or the 'Rule of Two' may be broken. As a result, the SIL assignment may become invalid.
Question 8: Answer

• Traceability
• To enable the assumptions to be checked during the operational phase, when actual operational data is available
• To facilitate any future Management of Change study
Question 9: Answer

SIL 'a' means that a safety function is required to ensure the tolerable risk target is met, but the risk reduction required from the safety function is less than the lower limit of SIL 1, and the safety function therefore does not need to comply with IEC 61511.
Figure 5.9 Calibrated risk graph.
'No SIF' or 'SIL 0' means that the tolerable risk target is met already, and the safety function is not required (from a risk evaluation point of view; it may be required for other reasons, such as prescriptive requirements).
Question 10: Answer

The correct answer is (a), 'As an enabling condition.'
Question 11: Answer

Initiating event: Level control malfunction (F = 0.1/yr)
Enabling condition: None identified
IPLs: Low level alarm (no credit: not enough time for operator response); Pressure relief valve (PFD = 0.01)
Conditional modifiers: None identified
Unmitigated event likelihood (UEL): 1e-3/yr
Tolerable frequency: 2e-5/yr
RRF target of SIF: 1e-3 / 2e-5 = 50
SIL target: SIL 1
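The LOPA arithmetic in this answer can be sketched as follows (an illustrative sketch using the values from the answer above; the variable names are mine):

```python
# Sketch of the LOPA calculation: multiply the initiating event frequency
# by each IPL PFD, then divide by the tolerable frequency to get the RRF target.

ief = 0.1                  # level control malfunction, per year
ipl_pfds = [1.0, 0.01]     # low level alarm (no credit taken), relief valve
tolerable = 2e-5           # per year (once in 5e4 years)

uel = ief
for pfd in ipl_pfds:
    uel *= pfd             # unmitigated event likelihood: 1e-3/yr

rrf_target = uel / tolerable
print(f"UEL = {uel:.0e}/yr, RRF target = {rrf_target:.0f}")
```

The resulting RRF target of 50 falls in the SIL 1 band (RRF 10 to 100).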
Question 12: Answer

Each of the scenarios arising from these initiating events includes failure of the level control as an IPL. Therefore, they can lead to the outcome only if the level control malfunctions. As the level control malfunction is already included as an initiating event, it would be double counting to include scenarios that involve the level control as an IPL.
Question 13: Answer

This case probably breaks the Rule of Two in spirit. The initiating event (loss of heat supply to the water bath, or inadvertent closure of the water bath inlet/outlet valve) could be a BPCS fault, and the 'heater start' function is a BPCS function, therefore no further credit should be taken for the alarm if implemented in the same BPCS. The PFD of the alarm does affect the calculated RRF target of the SIF, and therefore the Rule should probably be applied even though the alarm is in a different Fault Tree branch from the SIF.
A reasonable solution would be to implement the 'heater start' function in the ESD system (or another separate control system). It need not be a SIF, as the target PFD is 0.1, above the SIL 1 range. What if the analysis simply ignores the low temperature alarm? In that case, it is not possible to achieve the tolerable frequency target at the top event of the Fault Tree.
Question 14: Answer

The resulting RRF target is 54 and the SIL target remains SIL 1.
Question 15: Answer

A continuous mode SIF is effectively part of the control system. If it were absent, the incident would be expected to occur within a short period of time, and continue to recur frequently throughout the lifetime of the plant. Thus, it would not be feasible to calculate the cost of the outcome in the absence of the SIF. However, it would be possible to calculate whether upgrading the SIF to a higher SIL would yield an overall financial benefit. Suppose LOPA shows that, in order to reach the tolerable frequency of 10⁻³/year, the SIF's probability of failure per hour (PFH) needs to be 5 × 10⁻⁶/hr, corresponding to a SIL 1 target. If an alternative SIF design is feasible that achieves PFH = 2 × 10⁻⁷/hr (a factor of 25 lower), we could evaluate whether the cost saving of reducing the outcome frequency by this factor (to 4 × 10⁻⁵/year) is greater than the lifetime cost difference of the SIF upgrade.
Question 16: Answer

The answer depends on which risk receptors are affected by the SIF's failure. For financial risk receptors, cost/benefit analysis and ALARP have different goals. Cost/benefit analysis leads the analyst to select the SIL target with the best return on investment, whereas ALARP leads towards the highest SIL that does not incur a large financial loss. In purely financial terms, applying cost/benefit analysis makes more sense than ALARP because ALARP could drive the analyst to select a higher SIL that gives a poorer (possibly even negative) return on investment. For safety and environmental risk receptors, cost/benefit analysis may not be appropriate or ethically acceptable, because it focuses purely on maximizing financial returns without any other consideration.
Question 17: Answer

As we are provided with a combined initiating event frequency for all causes, and there is a conditional modifier for occupancy, Risk Graph may be the most suitable method. The Risk Graph parameters can be selected from Tables 5.18-5.21:

• Personnel impact: Cb or Cc
• Financial impact: Lb or Lc
• Frequency: W1 or W2 (we could select W2, giving us a margin of conservatism and thereby justifying selecting Cb and Lb)
• Exposure parameter: Fa
• Avoidance parameter: Assume Pb (unable to escape) unless there is a clear justification for Pa

Reading from the Risk Graphs in Figs. 5.3 and 5.5, the raw SIL target is SIL 2 for both personnel impact and financial impact. A reduction of 1 SIL unit can then be applied for the high level alarm IPL, giving a final target of SIL 1.
Question 18: Answer

As the consequences of overspeed during startup and normal operation are similar (in cost terms), we can ignore the fact that there are two different modes. LOPA is an appropriate SIL assessment method for this case.

• Initiating event frequency: 0.1/year
• Enabling condition: None (we are not considering startup as a special case)
• IPL: Overspeed protection; PFD = 0.05
• IPL: High speed alarm; PFD = 1 (not enough time for operator response)
• Conditional modifier: None

Unmitigated event likelihood: 0.1 × 0.05 × 1 = 5 × 10⁻³/year
Tolerable frequency: 10⁻⁴/year
Therefore, RRF target = 5 × 10⁻³ / 10⁻⁴ = 50
SIL target: SIL 1, with RRF target of 50

For further exercises by the same author, see Ref. [10].
References

[1] Center for Chemical Process Safety (CCPS), Guidelines for Initiating Events and Independent Protection Layers in Layer of Protection Analysis, Wiley, Hoboken, 2015.
[2] Center for Chemical Process Safety (CCPS), Guidelines for Process Equipment Reliability Data, with Data Tables, American Institute of Chemical Engineers (AIChE), New York, 1989.
[3] OREDA, OREDA Handbook, sixth edition, OREDA, Trondheim, 2015. A publication arising from the Offshore and Onshore Reliability Database (OREDA) project.
[4] exida, Safety/Security Automation Equipment List, 2022. https://www.exida.com/SAEL. (Accessed 6 January 2022).
[5] exida, SILSafe Data, a collection of typical dangerous failure rates for safety-related hardware, 2022. http://silsafedata.com. (Accessed 6 January 2022).
[6] M. Ottermo, S. Hauge, S. Håbrekke, Reliability Data for Safety Equipment, SINTEF, Trondheim, 2021.
[7] E. Marszal, E. Scharpf, Safety Integrity Level Selection, ISA, Research Triangle Park, 2002.
[8] E. Scharpf, H. Thomas, T. Stauffer, Practical SIL Target Selection: Risk Analysis per the IEC 61511 Safety Lifecycle, exida, Sellersville, 2021. A useful textbook covering many aspects of functional safety, especially focusing on risk analysis.
[9] Wikipedia, Present value. https://en.wikipedia.org/wiki/Present_value. (Accessed 6 January 2022).
[10] P. Clarke, Practice Questions for Functional Safety Examinations (2 volumes), xSeriCon, Hong Kong, 2021. Available from https://www.xsericon.world. (Accessed 17 August 2022).
CHAPTER 6
SIL assessment: special topics

Abstract

Commonly encountered errors and misunderstandings in Safety Integrity Level (SIL) assessment include: selection of the correct sensor architecture for cases with multiple sensors; two safety instrumented functions (SIFs) protecting against the same hazardous scenario; SIFs protecting against multiple hazards, whether by design or serendipitously; secondary SIFs (SIFs provided to mitigate a hazardous scenario created by the action of another SIF); cascaded SIFs; initiating events involving multiple simultaneous failures, such as cooling water pump failure; permissive functions; SIFs initiated by the operator; protection of pump sets against dry run; alarms from cascade control loops as potential independent protection layers (IPLs); final elements shared between the control system and the Safety Instrumented System (SIS); defining the process safe state correctly; and selecting primary (critical) final elements.
Keywords: Cascade control loop; Cascade function; Critical final element; Permissive; Primary final element; Redundancy; Safe state; Secondary safety instrumented function; Shared final element.
This chapter focuses on a range of special issues that often arise during SIL assessment. Many of them are widely misunderstood and could lead to dangerous errors, so it is important to know how to address them.
6.1 Redundant initiators

Within a single SIF, there may be multiple sensors, each of which can detect the dangerous situation by itself. The most common example is three equivalent sensors in a 2oo3 configuration (for instance, in a HIPPS). The sensors are equivalent because they are equally able to detect the initiating event. They are also redundant because the safety function can still achieve its design intent even if some of them fail. Redundant initiators are not necessarily equivalent; that is, they need not be the same type, or in the same location. For example, loss of feed to a pump can be detected by low flow or low pressure at either the suction or discharge of the pump (depending on the exact type and configuration of the pump). In order for initiators to be considered redundant, they must meet both of these criteria:
• Each initiator must be able to detect the dangerous condition for all demand cases of the SIF.
• Each initiator (or group) must be able to detect the dangerous condition, even if all the other initiators fail.

Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00017-3 Copyright © 2023 Elsevier Inc. All rights reserved.
If the second of these conditions is not met, it may be helpful to split the SIF into two: one with redundant initiators (for the demand cases that they can all detect by themselves), and one with non-redundant initiators (for the cases that not all initiators can detect by themselves).
Handling redundant initiators

There are two options for handling redundant initiators in a SIF:
1. Treat the initiators as a MooN architecture, where N is the number of redundant initiators or groups. This provides a hardware fault tolerance of N − M, which helps to meet the architectural constraints requirements of the SIF. It also provides a lower overall PFDavg or PFH.
2. Remove some of the redundant initiators from the SIF. The advantage of this approach is that the smaller number of initiators could lead to a decreased spurious trip rate, as well as reduced testing and maintenance requirements.
If initiators are to be removed, the choice of initiators to remove follows the same principles as for redundant SIFs, covered later in this chapter. When removing a redundant initiator from the SIF, the initiator can either be deleted completely from the SIS design (and not physically installed on the plant), or removed from the SIS specification but still retained in the overall design (e.g. as a DCS trip). In the latter case, the spurious trip reduction is not achieved, because the device is still physically present in the plant.
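The fault-tolerance claim in option 1 can be sketched as a one-line rule (the function name is my own, for illustration):

```python
# Hardware fault tolerance of a MooN voting group: the number of dangerous
# failures the group can tolerate while still performing its function (N - M).
def hardware_fault_tolerance(m: int, n: int) -> int:
    if not 1 <= m <= n:
        raise ValueError("a MooN group needs 1 <= M <= N")
    return n - m

print(hardware_fault_tolerance(2, 3))  # 1: a 2oo3 group tolerates one failure
print(hardware_fault_tolerance(2, 2))  # 0: a 2oo2 group tolerates none
```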
6.2 Redundant safety functions

In some high-risk situations, two SIFs may be provided for protection, independently of each other. For example, a large distillation column could be damaged if the reboiler control fails and maximum heating is supplied. This could be detected by high temperature or high pressure in the column. The corresponding two SIFs could be as shown in Table 6.1. For the case of heating control failure, these SIFs could be handled via two different approaches:
1. Keep them separate; use one as the 'main' SIF, and use the other as an IPL to reduce the target SIL of the main SIF.
2. Combine them into a single SIF, defined as in Table 6.2.
Table 6.1: Example of redundant SIFs in a distillation column.

SIF 001:
  Demand cases: Heating control failure
  Safe state: Column temperature not high high
  Sensor: Temperature sensor in liquid phase
  Consequence of failure: Column material failure and/or overpressure
  Primary final element: Close steam shutdown valve
  Other final elements: Activate SIF 002

SIF 002:
  Demand cases: Heating control failure, blocked vapour outlet, overfilling etc.
  Safe state: Column pressure not high high
  Sensor: Pressure sensor at column vapour outlet
  Consequence of failure: Column overpressure
  Primary final element: Open depressurisation valve to flare
  Other final elements: Activate SIF 001
Table 6.2: Combining the SIFs from Table 6.1.
SIF # SIF 003 (combined)
Demand cases
Safe state
Heating control failure
Column pressure not high
Sensor Temperature sensor in liquid phase OR pressure sensor at vapour outlet (1oo2)
Consequence of failure Column overpressure
Primary final element Close steam shutdown valve OR open depressurisation valve (1oo2)
Other final elements None
Approach 1 is generally preferred, as it will lead to a lower SIL for each SIF. Approach 2, however, does have the advantage that it considers common cause failure of the safety PLC, which is assumed negligible in approach 1. Table 6.2 shows an example of approach 2. In this example, notice that SIF 002 has other demand cases, not shared with SIF 001. Therefore, even if SIF 001 and SIF 002 are combined into SIF 003 for the heating control failure case, a separate SIF 002 still needs to be retained for the other demand cases.

Another example is no-flow protection of a large pump (such as a cooling water pump). This could be detected by low flow or low pressure, and the pump stopped in two different ways (send a stop signal to the MCC, or open an independent circuit breaker). These could be configured as two separate, redundant SIFs, or as a single combined SIF with a higher SIL.
What determines if two SIFs are redundant?

Two SIFs can be considered redundant if they meet all these criteria:
• The SIFs share all their demand cases in common.
• The sensors of each SIF can detect the dangerous situation, even if the sensors of the other SIF fail.
• The final elements of each SIF can achieve the safe states of both SIFs, even if the other SIF fails.
• The SIFs have no sensors and no final elements in common. (It is generally acceptable for them to share the same safety PLC, because a safety PLC usually has diagnostics, redundancy and high integrity and, if it fails, will nearly always fail safe.)
One SIF as backup to another

A related case is where the failure of one SIF places a demand on another. For example, a pair of SIFs in a firefighting system may be defined as shown in Table 6.3. The first SIF starts the main firewater pump on demand; if that fails, the second SIF starts a backup pump. SIF 002 is provided to handle the case when SIF 001 is required to act, but its action fails. Its initiator is the standstill detector, which is a device that detects non-rotation of the primary firewater pump on demand. The demand rate of SIF 002 is equal to the failure rate of SIF 001 (assuming that the dominant cause of SIF 001 failure is failure of the pump itself, which is reasonable). This means that we have to work out the demand rate d1 of SIF 001 (due to real demand events, i.e. when firewater is actually required to mitigate harm), and its target PFDavg1. The demand rate d2 of SIF 002 is then d2 = d1 × PFDavg1. d2 is likely to be considerably lower than d1, and therefore the target SIL of SIF 002 will be lower than that of SIF 001. This is a good reason for keeping the two SIFs separate, rather than combining them into one.

Table 6.3: Example of firewater pump activation SIFs.

SIF 001:
  Demand case: Fire detected
  Safe state: Firewater supply available
  Sensor: Heat sensors
  Consequence of failure: No firewater available
  Primary final element: Start main firewater pump

SIF 002:
  Demand case: Main firewater pump does not start on demand
  Safe state: Firewater supply available
  Sensor: Standstill detector on main firewater pump (when SIF 001 was activated)
  Consequence of failure: No firewater available
  Primary final element: Start backup firewater pump
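The demand-rate relationship d2 = d1 × PFDavg1 described above can be made concrete with numbers. The figures below are assumed for illustration; they are not from the text.

```python
# Illustrative (assumed) figures for the firewater backup-SIF example.
d1 = 0.2          # demands/year on SIF 001 from real fire events (assumed)
pfd_avg_1 = 0.05  # target PFDavg of SIF 001, mid-SIL 1 (assumed)

d2 = d1 * pfd_avg_1  # demand rate placed on the backup SIF 002
print(d2)  # about 0.01 demands/year, twenty times lower than d1
```

The much lower demand rate on SIF 002 is what typically allows it a lower SIL target than SIF 001.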
Redundant SIFs in low risk situations

It is not uncommon to find redundant SIFs for relatively low risk situations, too: cases that can be handled by a single SIF rated at SIL 1. In such cases, the design team has the option to remove the SIL target from one of the SIFs, allowing it to be implemented as a trip function in the BPCS. Indeed, a LOPA study may even show that one of the SIFs can be deleted altogether, if the tolerable risk target can be met without it. How should the team decide which SIF to remove? Usually, it makes sense to remove the SIF that is more complex, less reliable, or harder to test and maintain. For example, valve position switches can be prone to failure, and burner flame scanners are hard to test online. So, redundant SIFs based on these sensors would be good candidates for removal.
6.3 One SIF, two hazards

A general principle of SIF design is that a SIF should be designed to protect against only one type of dangerous condition. Occasionally, however, a SIF can protect against more than one dangerous condition, either by design or unintentionally. During the SIL assessment, we only need to consider dangerous conditions that are within the design intent of the SIF. The design intent should be made clear in the SRS, and reflected in the definition of the safe state. Any other protection that the SIF may unintentionally provide in other cases need not be considered, because those other cases should already have enough risk reduction from other layers of protection. If we include them in the SIL assessment study, we are imposing requirements on the SIF that are not needed to achieve the tolerable risk level.

If a SIF appears to be protecting against multiple dangerous conditions, it may have different final elements for each case. In that case, it should be split up: one SIF for each combination of final elements. Hazards can only be combined in a single SIF if all the demand cases share the same sensors and final elements. If a SIF is intended to protect against more than one dangerous condition, each condition should be considered as a separate demand case. These may differ in demand frequency and outcome severity. They may also have different IPLs. Methods for handling multiple demand cases are described in Chapter 5.
6.4 The IPLs vary depending on demand case

Suppose the demand case for a SIF is failure of a control loop, part of which is used in a potential IPL. For example, consider the overhead receiver on a distillation column. This is a vessel used to collect the condensed vapours distilling from the column, and provides feed for the reflux pump (see Fig. 6.1). The reflux flow may be under flow control which, if it malfunctions and fully opens, will empty the receiver, leading to pump cavitation and damage. A SIF may be provided to stop the pump on low low level in the receiver. If the flow controller is provided with a high flow alarm, can we take this as an IPL to reduce the SIL of the protective SIF?

Figure 6.1: Distillation column example (schematic).

The control loop contains several components (a flow sensor, a controller, and a control valve), any one of which may have failed. If the sensor or controller is at fault, then the alarm cannot be relied on, and should not be used as an IPL. But if only the valve has failed and opened fully, the alarm should still work; we can take credit for it, provided there is enough time for the operator to act, and an independent manual action is available to prevent the harm. One solution is to write two separate demand cases: one in which the cause is valve failure, and one for all other causes. These have different demand frequencies and different IPLs. The overall result is likely to be a reduction in the SIL target, because valve failure could be the most likely cause of demand on the SIF. However, many analysts take the more conservative (and simpler) position that, if any element in the control loop has failed, the whole loop is deemed to be unavailable.
6.5 The demand case is activation of another SIF

In Chapter 4, we learned that activation of a SIF may lead to a new risk. In some cases, a further 'secondary' SIF may be required to address this risk. For example, when a power generator trips and runs down, it is necessary to keep it turning slowly at about 3 rpm for some hours (a process known as cranking) until the shaft has cooled. An electric motor and clutch are provided for this purpose. If the rotor is not cranked, uneven cooling will result in bending of the shaft, after which it must be left to cool completely for 24 h before restart, to avoid damage to the bearings. This delay causes loss of production, with significant cost implications. Hence, a secondary SIF may be required to start the cranking motor and engage the associated clutch after the machine has tripped. The initiator of a secondary SIF is the successful activation of another SIF. This has two implications:
• There is no physical initiator, like a sensor or position switch. The initiator is just a software signal from one SIF to another. This needs to be made clear in the definition of the SIF. When the SIF is modelled in the SIL verification phase, a "perfect" initiator should be used, with zero (or very low) failure rate.
• The demand rate of the secondary SIF is equal to the demand rate of the primary SIF from all causes, including a real demand due to process upset, spurious trip due to equipment failure, manual trip (intentional or unintentional), testing (if done while the process is online), and cascade trip from any other SIF. Adding all those terms up at this early stage in the design is probably unfeasible, but we should, at least, try to work out which one is dominant: probably process demand or testing.
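The second implication can be sketched with assumed numbers: the secondary SIF sees the primary SIF's total trip rate from all causes, and identifying the dominant term is usually enough at this stage. All rates below are invented for illustration.

```python
# Assumed, purely illustrative trip rates (per year) for a primary SIF.
primary_trip_causes = {
    "process demand": 0.1,
    "spurious trip": 0.2,
    "manual trip": 0.1,
    "online proof test": 1.0,  # e.g. one test per year with the process running
}

d_secondary = sum(primary_trip_causes.values())  # total demand on the secondary SIF
dominant = max(primary_trip_causes, key=primary_trip_causes.get)
print(round(d_secondary, 3))  # 1.4 demands per year
print(dominant)               # online proof test
```

With these (assumed) figures, online testing dominates the demand on the secondary SIF, which matches the text's observation that testing or process demand is usually the largest term.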
Secondary SIFs usually have lower SIL than the corresponding primary SIF. But this does not have to be the case. Indeed, it is possible for the SIL of a secondary SIF to be higher. A possible example is as follows. Large oil and gas facilities may have several turbogenerator sets to provide electrical power. There is usually at least one spare set, to allow for maintenance. A primary SIF may be provided (typically at SIL 1) to protect an individual set from damage in the event of high vibration (among other causes). This would require a standby set to be started, to avoid load shedding and reduced production from the facility. However, if the standby set is not available, it could be urgently necessary to start a standby power source, such as a diesel generator. Because of the high economic impact of lost production, the secondary SIF to start the diesel generator may require SIL 2 (a higher SIL is unlikely to be achievable in practice, so there is no point in assigning it). These SIFs are summarised in Table 6.4.
Table 6.4: Primary and secondary SIFs for turbo-generator shutdown case.

SIF 001 (primary):
  Demand cases: Fault in turbo-gen machine (e.g. high vibration)
  Safe state: No possibility for faults to cause severe damage to machine
  Sensor: Vibration sensors (etc.)
  Consequence of failure: Machine damage, repair costs (no lost production, because redundant machine is available)
  Primary final element: Machine shutdown (e.g. close fuel gas inlet valve)
  Other final elements: Open circuit breaker

SIF 002 (secondary):
  Demand cases: Trip of any turbo-gen AND standby machine status is "not ready"
  Safe state: Minimised load shedding of critical equipment
  Sensor: Digital machine status signals
  Consequence of failure: Loss of production (and perhaps other impacts such as need to blow down to flare)
  Primary final element: Send start command to emergency power source
  Other final elements: Load shedding of non-critical power consumers
6.6 One SIF cascades to another

The cause & effect diagram (C&ED) for a complex SIS may show that one interlock (call it XA-1000, for example) has another interlock, say XA-2000, listed in its "effects". For instance, large facilities often have a hierarchy of shutdowns, according to the severity of the situation. These may be termed something like "unit shutdown" (USD), "process shutdown" (PSD), and "emergency shutdown" (ESD). You may also see the term 'bar' (e.g. 'wellhead shutdown bar'), meaning a horizontal line in an Interlock Logic Diagram, linking several causes and effects together. A demand for a top-level ESD will often be required to trigger all the actions associated with the lower level shutdowns, USD and PSD. This is most clearly depicted in the interlock logic diagram, as we discussed in Chapter 4.

To handle such cascade situations correctly, we take each demand case in turn. Then we look at all the interlocks that are activated by the demand case, and determine which of all their final elements are able to achieve the safe state. Next, we write a SIF containing only those final elements. This SIF can be combined with any other SIFs sharing the same set of sensors and final elements. When we have decomposed the whole array of sensors and final elements into a set of SIFs, we determine the SIL of each SIF in the usual way. (This decomposition task is discussed in detail in Chapter 4.)
After all the SIFs have been composed, there may be some final elements in the C&ED or Interlock Logic Diagram that have not been assigned to any SIF. They will fall into one of two categories:
1. Final elements provided to avoid consequent harm as a result of the trip. These should be considered as a secondary SIF (see earlier in this chapter).
2. Final elements that don't help to achieve the safe state, and don't prevent any significant secondary effects. They could be provided to help stabilise the plant for faster restart, for example; or they could simply be legacy actions left over from old versions of the design. These need not be considered in SIL assessment. The decision to disregard them should be documented in the SIL assessment report, along with the justification.

Cascaded SIF as a sensor or final element

When interlock XA-1000 cascades to interlock XA-2000, it is not correct to write a final element that says "activate XA-2000". If the SIF is written in this way, without identifying the physical final elements involved, it will be impossible to handle when it comes to SIL verification, for the following reasons:
• The engineer performing the SIL verification would be unable to calculate the correct failure rate (PFDavg or PFH) of the SIF.
• The engineer cannot confirm that its architectural constraints are met.
• The engineer cannot confirm that the final elements are SIL-capable.

Similarly, unless XA-2000 is purely a secondary function, it would be incorrect to write a SIF for XA-2000 having an initiator saying "activation of XA-1000"; the correct initiator(s) should be the actual devices that activate XA-1000 for the demand case under consideration.

Cascade to or from out-of-scope equipment

Occasionally, a remote signal from a third-party system (say, an offsite gas generation facility) may be required as a SIF initiator. In such cases, it may be impossible to work out the actual architecture of the initiator, because it is unknown at the time of the SIL assessment study. The team will then have to accept "remote signal" as an initiator and treat it as a black box. Similarly, a final element may be a "shutdown signal" sent to a separate unit or complex vendor package, and it may not be feasible to identify the exact shutdown actions required. This creates problems for the SIL verification later, which we discuss in Chapter 9. A clear and correct "safe state" definition will help the SIL verification engineer to resolve these issues.
6.7 Initiating event involves multiple simultaneous failures

Example 1

Suppose we have a pressure control loop PIC-200 managing the blanketing pressure on a storage tank. The blanket gas pressure is reduced upstream by a separate pressure control valve PCV-100. If PCV-100 and PIC-200 both fail simultaneously, it could lead to a high pressure in the tank, which might result in a dangerous consequence (e.g. tank overpressure, or high pressure in a delivery truck if there is a vapour return line to the truck). If a SIF is provided to prevent this consequence, how should we calculate its initiating event frequency? It depends on whether the two failures in question, PCV-100 and PIC-200, can be discovered within a short period of time, in the event that they occur separately (without the other failure occurring).

For example, suppose the system is equipped with two pressure gauges, one downstream of PCV-100 (to reveal PCV-100 failure) and one on the tank (to reveal PIC-200 failure), as shown in Fig. 6.2. If the operator is instructed to check these pressure gauges once per week, the mean time from fault to discovery is 0.5 weeks. Adding a further 0.5 day to allow repair, this gives the Mean Time To Restore (MTTR) of each item as 4 days. From this, we can estimate the probability that each device is in a failed state at any moment as P(failed) = MTTR × failure rate (λDU). For example, the probability that PCV-100 is failed could be 96 h × 10⁻⁵ failures per hour ≈ 10⁻³. (If λDU is not known, the reciprocal of the Mean Time Between Failures (1/MTBF) can be used instead.)

If neither of the initiating events has a way to reveal a latent failure (e.g. no pressure gauge, or no operator procedure to check it regularly), and the equipment is never tested, then you can estimate its probability of being in a failed state, averaged over the life of the
Figure 6.2: Pressure blanketing example (schematic).
equipment (also known as the Mission Time, MT), by using Pavg(failed) ≈ λDU × MT/2. (This conservatively assumes that a fault, once present, will not be discovered until the end of the Mission Time.) Then, we can estimate the rate of both failures occurring simultaneously by multiplying the failure rate of one device (say, PIC-200) by the probability that the other is in a failed state:

Rate(2 failures) = λDU(PIC-200) × P(PCV-100 failed)

Taking a typical control loop failure rate of 0.1/year, this gives, for example, an overall result of 0.1/year × 10⁻³ = 10⁻⁴/year.
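Example 1's arithmetic can be reproduced in a few lines, using the figures given in the text:

```python
# Figures from Example 1 above.
lam_du_pcv = 1e-5   # dangerous undetected failure rate of PCV-100, per hour
mttr_hours = 96.0   # 0.5 week to discover + 0.5 day to repair = 4 days = 96 h

p_pcv_failed = lam_du_pcv * mttr_hours  # probability PCV-100 is in a failed state
lam_pic = 0.1                           # typical control loop failure rate, per year

rate_both = lam_pic * p_pcv_failed      # rate of both failures coexisting, per year
print(p_pcv_failed)  # about 1e-3
print(rate_both)     # about 1e-4 per year
```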
Example 2

A cooling water system is equipped with three large circulation pumps. All pumps are normally in service; the plant cannot run with only two pumps. If one pump fails, a SIF (let's call it SIF 001) shuts down the plant in a controlled way, avoiding the risk of an unmanaged shutdown (which could stress the equipment and lead to difficulties in restarting). The difficulty faced by the analyst is that the consequence may be much worse if all three pumps fail simultaneously, for example due to power failure or spurious trip of the low level detection in the supply tank. If a second SIF (SIF 002) is provided to detect failure of multiple pumps, then the 'multiple pump failure' scenario need not be considered in the single pump failure SIF 001. As usual, we should apply the principle that each demand scenario applies to only one SIF.

If there is no SIF 002, then the analyst must decide whether to consider the lesser consequence from one pump failing, or the greater (but less likely) consequence from all pumps failing. First, the team should determine whether the actions of SIF 001 can mitigate the greater consequence; if the actions are designed only to mitigate the lesser consequence, then the question is moot: the 'all pumps fail' scenario need not be considered, as the SIF does not help in this case. If the team decides that SIF 001 does mitigate the greater consequence, the best solution is to apply Event Tree Analysis (ETA) to consider both possible outcomes and evaluate a total risk to be mitigated. However, in practice, teams are often reluctant to embark on an ETA in the middle of a SIL assessment workshop, so the chairman needs to propose an alternative way forward.

Another solution is to evaluate the two scenarios separately, and then select the higher SIL and RRF target of the two. To do this, consider all the possible initiating causes of
pump failure and divide them according to whether they lead to failure of a single pump or all pumps. An example is in Table 6.5. It is quite possible that the 'all pumps fail' scenario could have a lower SIL target, as the initiating events should be less frequent, and there may be additional IPLs protecting vulnerable equipment against total loss of cooling water.

Table 6.5: Example analysis of 'single pump fail' vs. 'all pump fail' scenarios.

One pump fails:
  Initiating events: Machine protection trip; suction/discharge valve closed; circuit breaker failure
  Tolerable frequency: 10⁻³/year
  SIL target: SIL 2

All pumps fail:
  Initiating events: Power failure; spurious trip of feed tank low level SIF
  Tolerable frequency: 10⁻⁴/year
  SIL target: SIL 1

One difficulty in this approach is that the 'all pumps fail' scenario may lead to multiple consequences all over the plant. The analysis team would probably need to consider only the most significant consequence among these. However, equipment items that are vulnerable to cooling water loss may have their own individual protective functions (such as a high temperature trip); this suggests three more options for resolving this case:
• Option 1: Use the high temperature trip in the individual equipment as an IPL to reduce demand on SIF 001.
• Option 2: The other way round: use SIF 001 as an IPL to reduce demand on the high temperature trip.
• Option 3: If the situation is really complex, apply Fault Tree Analysis (see Chapter 5).
If Option 2 is selected, and all vulnerable equipment has its own protective function, then the 'all pumps fail' scenario for SIF 001 need not be considered at all, as the hazard is already mitigated by other SIFs. For another viewpoint on this topic, see Ref. [1].
6.8 Permissives

A permissive is a function that allows something to proceed when conditions are met. For example, fired heaters often have a permissive to start the light-off sequence (start the burners) based on conditions such as "leak test successfully completed" and "fuel gas pressure not low".
Permissives often need to be SIL-rated functions, because they may be expected to provide a significant degree of risk reduction. There are two particular issues with SIL selection of permissive functions to be aware of:
Demand frequency

On a simple level, the demand frequency of the SIF can be taken as the same as the frequency of the associated event. In the example above, this would be the typical frequency of lighting off the burners. This is often in the high or continuous demand range. However, it could be argued that the demand frequency is reduced by the fact that the operator will not always attempt to do the activity that is permitted by the permissive. For example, if the operator knows the leak test failed (from alarms or warning messages on the console), he should not attempt to start the light-off sequence. The SIF would be required to protect only if he attempts to light off anyway. This might be, perhaps,

2 (e.g. 3oo3), the only logical choice is to reduce N by 1, e.g. 3oo3 falls back to 2oo2. For 2oo3 sensor groups, the degraded voting options are 1oo2 or 2oo2. The choice between these depends on the criticality of the sensor(s) involved: 1oo2 is the 'safer'
Figure 8.7 Screenshot: example application logic emulation. Courtesy Schneider Electric.
Figure 8.8 Screenshot: example function block application logic. Courtesy Schneider Electric.
Figure 8.9 Screenshot: example safety certified HMI for bypasses and critical alarms. Courtesy Schneider Electric.
option with a lower PFDavg, but 2oo2 increases plant availability as it has a higher Mean Time To Fail Spuriously (MTTFS), i.e. a lower chance of spurious trip. At lower SIL (e.g. SIL 1 and SIL 2), 2oo2 is generally acceptable, as the additional contribution to the overall PFDavg is unlikely to be significant, given the short fraction of time that the system is degraded. At higher SIL, the PFDavg contribution of a degraded state may become significant, and 1oo2 may be the preferred choice.

It is important to enable the diagnostics for field devices, because they have a big impact on the PFDavg achieved by the SIFs. If diagnostics are not enabled, all dangerous detectable failures (signified by λDD) become dangerous undetectable failures, and therefore λDD must be added back into λDU when doing the SIL verification calculations (see Chapter 9). This often means the SIL target is no longer achieved. Also, λDD failures no longer count as safe failures (because they do not lead to a trip) and therefore the Safe Failure Fraction (SFF) reduces greatly. This means that the architectural constraints may not be met, if you are applying the requirements of IEC 61508 or IEC 61511:2010. (SFF is not used for this purpose in IEC 61511:2016.)
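The effect of disabling diagnostics on the SFF can be sketched numerically, taking SFF = (λS + λDD)/(λS + λDD + λDU). The failure rates below are assumed for illustration only, not taken from any device datasheet.

```python
# Assumed, illustrative failure rates for a field device, per hour.
lam_s = 3.0e-7    # safe failures
lam_dd = 5.0e-7   # dangerous detected (only if diagnostics are enabled)
lam_du = 1.0e-7   # dangerous undetected

total = lam_s + lam_dd + lam_du
sff_with_diagnostics = (lam_s + lam_dd) / total  # DD failures count towards SFF
sff_without_diagnostics = lam_s / total          # DD failures become DU

print(round(sff_with_diagnostics, 2))    # 0.89
print(round(sff_without_diagnostics, 2)) # 0.33
```

With these assumed rates, disabling diagnostics drops the SFF from about 89% to about 33%, which illustrates why the architectural constraints can suddenly fail to be met.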
Table 8.1: Typical I/O functions in a safety PLC.

Input: Analogue input
  Field devices such as pressure transmitters are configured to draw a current in the range 4–20 mA from this input, where 4 mA represents the lower end of the field device's configured range (e.g. 0 psig) and 20 mA is the upper end (e.g. 100 psig). The PLC measures the current and interprets this to decide whether a trip is required. The field device can signal a detected failure by drawing a current outside this range, e.g. 22 mA. The PLC should then interpret this as a diagnostic signal. Field devices can also superimpose digital data on the 4–20 mA line, using a protocol known as HART. This can be used to provide status and diagnostic information to the PLC.

Input: Digital input
  This detects whether DC voltage is present, and represents the status of the field device as a binary digit (e.g. continuity = 1, open circuit = 0). These inputs are used for field devices that provide only a yes/no status, such as position switches, pressure switches, level switches, flow switches, operator handswitches, and some flame and fire detectors. For connection to simple devices such as limit switches, some I/O cards can provide 24 VDC via a built-in supply; more often, a separate external power supply is used.

Output: Analogue output
  These provide a 4–20 mA signal, typically for valve positioners.

Output: Digital output
  These provide a DC voltage, which can be used to control a relay or power a solenoid valve. Typically, presence of voltage is the normal state, and the voltage drops to zero when the safety function trips; this is known as 'de-energise-to-trip'. The PLC can be configured the other way round, so that the signal is 'energise-to-trip' instead (see details later in this chapter).

Output: High power output
  Same as digital output, but with a higher current limit, for driving devices such as DC motors directly.
8.2.4 Setting trip parameters

Setpoints

The trip setpoint for analogue sensors needs to be selected to allow an appropriate trade-off. Too close to the normal operating envelope, and the SIF will be prone to spurious trips. Too far away, and the process safety time may become uncomfortably short, to the point where the SIF may not have enough time to prevent the consequence in the worst case. See Fig. 4.6 in Chapter 4 for an illustration of this point.
Trip delay
For noisy signals or fluctuating process variable (PV) measurements (especially level measurements), a delay time will often be implemented between when the PV initially reaches the setpoint and when the trip is activated. The PV must stay in the trip zone (e.g. above the setpoint, for a high trip) continuously during the delay time. If the PV drops out of the trip zone, the trip is not executed, and the delay timer will be restarted when the PV goes back into the trip zone. The delay time is programmed into, and implemented in, the safety PLC. Typical trip delay values are around 2 s for most PVs, and 10 s for level measurements. An alternative is to base the trip decision on the moving average value of the PV over, say, the last 5 s. While common for control loops, this method is rarely used for SIFs.
Reset
Most SIFs will be configured to latch, that is, to remain tripped even after the PV moves out of the trip zone. Exceptionally, a SIF may be configured to auto-reset; one possible example is a low level trip on a sump pump, so that the pump will resume operation when the sump level starts to rise (to avoid an inadvertent overflow). An auto-resetting SIF effectively becomes a control function, rather than a trip (and may, therefore, become a continuous mode function; see Chapter 4). An auto-reset time delay should normally be configured, to avoid the function 'chattering' between normal and tripped states. For normal SIFs, a manual reset function will be implemented in the HMI. There will usually be a separate reset for each section of the plant, possibly based on trip hierarchy (ESD, PSD, etc.; see Fig. 4.4 in Chapter 4). Some field equipment may also need to be reset locally; this is to ensure the operator visits the location for a visual inspection before restarting the process.
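The delay and latching behaviour described above can be illustrated with a simple scan-based sketch in Python. The class and signal names are hypothetical, and the logic is greatly simplified relative to a real safety PLC implementation; the 2 s delay is the typical value mentioned in the text.

```python
# Illustrative sketch of a high trip with a trip delay and latching reset,
# evaluated once per PLC scan. Names and values are assumptions.

class DelayedLatchingTrip:
    def __init__(self, setpoint, delay_s=2.0):
        self.setpoint = setpoint
        self.delay_s = delay_s
        self.time_in_zone = 0.0
        self.tripped = False          # latches once set

    def scan(self, pv, dt):
        """Call once per scan with the PV and the scan interval dt (s)."""
        if self.tripped:
            return True               # latched: stays tripped
        if pv >= self.setpoint:       # high trip: PV is in the trip zone
            self.time_in_zone += dt
            if self.time_in_zone >= self.delay_s:
                self.tripped = True   # latch the trip
        else:
            self.time_in_zone = 0.0   # timer restarts if PV leaves the zone
        return self.tripped

    def reset(self, pv):
        """Manual reset (e.g. from the HMI); only allowed once PV is healthy."""
        if pv < self.setpoint:
            self.tripped = False
```

Note how a brief excursion above the setpoint that drops back within the delay time does not trip the function, and how the trip remains latched until an explicit reset.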
8.2.5 Cybersecurity While no chapter on SIS design should be without a mention of industrial cybersecurity, this is a complex and specialised area of knowledge that reaches far beyond the scope of functional safety. Enough serious cybersecurity incidents have already occurred in the process industry to mean that every SIS engineer should be aware of the importance of the issue. Nonetheless, cybersecurity engineering is still treated as a separate discipline from other aspects of SIS design and operations. The IEC 61511 standard requires analysis and mitigation of the risk that the SIS may be compromised through a cybersecurity incident. This is usually done as part of a wider cybersecurity assessment, and often shrouded in secrecy to avoid revealing details of countermeasures.
The SIS design engineer needs to ensure the SIS manufacturer will provide sufficient information to support the assessment. It is also important to confirm that appropriate hardening measures are provided and can be implemented, such as:
• Communications firewalls
• Levels of authorization, to enable operators and maintenance personnel to perform routine tasks (such as maintenance overrides) while restricting other changes such as setpoint modification to authorised personnel
• Protection against unauthorised adjustment of parameters, embedded software, and application program
• Detection and alarming of unauthorised modifications.
8.3 Selection of field devices This section discusses field devices, that is, every component of the SIS (other than power feed) that is not located inside or close to the logic solver cabinet.
8.3.1 Preferred types of SIF initiator
Selection of initiator type
A given process hazard can often be detected in more than one way: for example, by detecting an excursion from normal pressure or temperature, or by using an online analyser. This gives the SIS designer a choice of sensor types. The following guidance may help designers to specify a SIS that is reliable and easy to verify:
• For pressure, temperature, flow and level measurement, smart transmitters are generally preferred over switches. This is because smart transmitters have diagnostics, which, if enabled, result in a lower PFDavg and higher Safe Failure Fraction. Switches may also be more difficult to maintain, as they have to be located in a potentially inaccessible position.
• Mechanical switches, such as limit switches on valves and doors, are generally relatively unreliable compared with smart transmitters. This is especially true in low demand applications, where frequent proof testing of switches may be needed.
• Analysers lack failure rate data, due to their complexity. This causes difficulty during SIL verification, and so it is better to avoid analysers as SIS initiators where possible.
• Software signals can be taken as SIS initiators, e.g. a motor running status signal from an MCC, or an internal signal where one SIF triggers another (a secondary SIF; see Chapter 6 for details). The random hardware failure rate of a purely internal signal can be taken as zero, as there is no dedicated hardware involved. An external signal may be assigned a failure rate based on the associated cables, connectors, interposing relays, etc.
Valve limit switches as initiators
Often, critical on-off valves will be equipped with limit switches at both open and closed positions. If we wish to detect that the valve is in the open position, it is often better to test whether it is not in the closed position. This is because the valve may be stuck in an intermediate position, with neither open nor closed signals activated. Alternatively, to reduce spurious trips, the sensor architecture can be set up as 2oo2: 'open position limit switch activated' AND 'closed position limit switch not activated'. However, this architecture will have a higher PFDavg, and will not trip if the valve gets stuck in an intermediate position.
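The two approaches to detecting the open position can be stated as Boolean logic. This is a minimal sketch; the signal names (zso for the open-position switch, zsc for the closed-position switch) are illustrative conventions, not from the text.

```python
# Boolean sketch of the two "valve is open" detection strategies above.
# zso = open-position limit switch activated; zsc = closed-position limit
# switch activated. Names are illustrative.

def valve_open_via_not_closed(zso, zsc):
    # Preferred approach: treat "not closed" as open. A valve stuck
    # mid-stroke (zso=False, zsc=False) is still judged open, which is the
    # safer interpretation for a trip decision.
    return not zsc

def valve_open_2oo2(zso, zsc):
    # 2oo2 alternative: reduces spurious trips, but a valve stuck
    # mid-stroke is NOT judged open, and PFDavg is higher, as noted above.
    return zso and not zsc
```

The mid-stroke case (both switches inactive) is exactly where the two strategies disagree, which is the point made in the text.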
8.3.2 Defining final element architecture
The SIS designer should ensure the final elements are properly defined to include every element whose failure could lead to the failure of the SIS to perform the required safety function. The following guidance may be helpful:
• For hydraulic and pneumatic remote actuated valves, the final element should include the process valve, actuator, and solenoid valve. Also include any special elements such as a volume booster or quick exhaust valve on the pneumatic side. If the function is energise-to-trip, the energy source (e.g. hydraulic fluid or compressed air supply) should be included, as well as the source of power for the solenoid; this is usually from an external power supply, although it can be provided directly from the I/O card in the logic solver.
• If the process valve has a redundant architecture (e.g. 1oo2), it is generally preferred to duplicate all components, including the solenoid, to achieve sufficient hardware fault tolerance. However, occasionally one solenoid will be shared between two process valves in a 1oo2 configuration (to reduce the amount of wiring from the safety PLC, among other reasons). The hardware fault tolerance of the overall final element subsystem will then be 0.
• Motor operated valves can be used as final elements. However, the function will then be energise-to-trip, as the motor needs to be energised to move the process valve to the safe state. This could be acceptable for process shutdown applications, but needs careful consideration for emergency shutdown applications that must remain available in the event of a power loss or fire.
• It is possible (though generally not preferred) for a process control valve to do double duty as a shutdown valve. See Section 6.13 for a discussion.
• When the SIF's action includes an action on a motor (start or stop), the final element should include not only the interposing relay, but also the relevant safety function of the MCC, and the power feed to the motor (if a start action is required). It is common practice to consider only the relay, but this is dangerously optimistic.
• Occasionally, an alarm or beacon light is specified as the final element. This is acceptable, provided it is clearly understood that raising a warning is not, in itself, able to achieve the safe state; an operator response is still required. If the alarm is provided in addition to other final elements acting directly on the process, such as a shutdown valve, then the alarm is not usually counted as part of the SIF.
8.3.3 SIF architecture
While a simple SIF may have only one of each device, it is sometimes necessary or desirable to provide more than one instance of each device. The reasons for providing multiple devices are:
• To increase the availability of the SIF. If multiple devices are provided in a MooN (M < N) configuration, such as 1oo2, 1oo3 or 2oo3, the overall SIF becomes more reliable. To put it more precisely, the SIF's availability increases, and the PFDavg decreases.
• To increase the availability of the process. When the SIF trips spuriously, it may result in a process shutdown or reduced production. Therefore, reducing the spurious trip rate of the SIF, i.e. increasing its Mean Time To Fail Spurious (MTTFS), may increase process availability. This can be achieved by providing multiple field devices in an NooN configuration, such as 2oo2. However, this comes at the cost of increasing the SIF's PFDavg. A good compromise is to use 2oo3 architecture, which provides both a PFDavg reduction similar to 1oo2, and a MTTFS increase similar to 2oo2.
• To meet architectural constraints. Where the standard requires Hardware Fault Tolerance (HFT) > 0, multiple hardware devices are essential. HFT requirements are specified per subsystem, i.e. for the entire sensor subsystem or entire final element subsystem. Therefore, every component of the respective subsystem must have multiple instances. For example, if a final element subsystem containing a shutdown valve is required to comply with HFT = 1, the shutdown valve must be fully duplicated, including the valve itself, its actuator, and solenoid valve. Using a single solenoid valve shared between both redundant channels would not be sufficient, unless it is a redundant solenoid (with intrinsic HFT). For motor shutdown, the usual solution is to apply multiple interposing relays to send a trip command to the Motor Control Circuit (MCC). In principle, even the wiring from the PLC's I/O card to the solenoids or relays should be duplicated, although this is not always done in practice, as the probability of wiring failure is low compared with other components.
Subsystem architectures are easily visualised by Reliability Block Diagrams (RBDs). These depict components as blocks, connected by logical flow lines through which a signal can pass when the block is working correctly. RBDs are an effective way to depict the overall architecture of a SIF. To confirm that each subsystem's HFT = n requirement is met, we check that the signal can still flow from sensors (on the left) to final elements (on the right) when any combination of n blocks within the subsystem is faulty, i.e. obstructs the signal flow. Some example RBDs are shown in Fig. 8.10.
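The RBD-style check described above, that every combination of n faults still leaves the subsystem working, can be sketched for a simple MooN voting group. This is an illustrative sketch for identical, parallel channels only; real RBDs can of course have more complex series-parallel structure.

```python
# Sketch of the HFT check described above, for a MooN voting group of
# identical channels: the group works when at least M of its N channels are
# healthy, and its HFT is the largest n such that EVERY combination of n
# faulty channels is tolerated.
from itertools import combinations

def moon_ok(m, n_channels, faulty):
    """A MooN group works when at least m of n_channels remain healthy."""
    return (n_channels - len(faulty)) >= m

def hft(m, n_channels):
    """Largest number of simultaneous faults always tolerated by MooN."""
    n = 0
    while n < n_channels and all(
            moon_ok(m, n_channels, f)
            for f in combinations(range(n_channels), n + 1)):
        n += 1
    return n
```

This reproduces the figures implied in the text: 1oo2 and 2oo3 both give HFT = 1, while 2oo2 (used to reduce spurious trips) gives HFT = 0.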
8.3.4 Testing and maintainability
Most SIFs require proof testing and maintenance during their lifetime. It is therefore important to design the SIS with testability and maintainability in mind. One of the problems contributing to the Buncefield incident was that high level switches in storage tanks were virtually impossible to maintain, and were not subject to a proper testing regime [2]. Testing may require the provision of overrides, either in hardware or software, so that the function can be tested without disturbing the process. Careful control of such overrides is needed to ensure the function is not inadvertently left in an overridden state after testing; this is typically achieved by:
• Use of a permit to work system for testing
• Provision of a timer for software overrides. The timer will automatically raise an alarm (or even trip the respective SIF) if the override is left in place beyond a time limit.
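The override timer mentioned in the second bullet can be sketched as follows. The class name, the time limit, and the scan-based structure are illustrative assumptions; a real implementation would live in the safety PLC's application program.

```python
# Minimal sketch of a software override with a time limit, as described
# above. The 8 h limit is an illustrative assumption.

class TimedOverride:
    def __init__(self, limit_s=8 * 3600):
        self.limit_s = limit_s
        self.active = False
        self.elapsed = 0.0

    def apply(self):
        self.active = True
        self.elapsed = 0.0

    def remove(self):
        self.active = False

    def scan(self, dt):
        """Call once per scan; returns True (raise alarm) if the override
        has been left in place beyond its time limit."""
        if self.active:
            self.elapsed += dt
            return self.elapsed >= self.limit_s
        return False
```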
Figure 8.10 Examples of Reliability Block Diagrams. (a) 1oo1 sensor, 1oo1 solenoid-operated valve as final element; (b) 2oo3 sensors, 1oo2 motor-operated valves (MOVs) with redundant power supply.
For some functions, overrides are necessary, especially during startup. The most common examples are:
• Low discharge pressure trips on centrifugal pumps
• Low flow trips on pumps
• Vibration trips on rotating machines; there may be transient vibration during startup, which is expected and acceptable if the duration is limited
Where feasible, it would be better to design SIFs not to require routine overrides. For example, a low discharge pressure trip on a centrifugal pump could be changed into a low suction pressure trip, which does not need overriding for startup (a centrifugal pump should not be started with low suction pressure). Reducing the number of routine overrides helps to reduce the likelihood of SIFs being left in an override condition inadvertently.
Are Bypass Lines Allowed on SIS Shutdown Valves?
While the standard does not explicitly forbid them, it is usually considered best practice to avoid installing them if possible. The concern is that, even if a bypass valve is properly managed by permit to work and kept locked closed, it may not seal properly. Loss of seal may not be evident until an incident occurs and the SIS valve closes. So, if a bypass is unavoidable (e.g. for maintenance purposes, or for pressure balancing during startup), the process designer should consider some means to minimise the risk of a passing bypass valve compromising the safety function, e.g. provide a double block valve on the bypass line. Proof testing of the SIF could also include a leak test on the bypass line.
8.3.5 Partial valve stroke testing
SIS valves in low demand mode applications remain in the 'non-safe' position for many months, if not years. When a demand on the SIF occurs, they must immediately move to the safe position after a long idle period. Seizure, the phenomenon of the valve getting stuck in its idle position, is a significant problem, and increases the SIF's probability of failure on demand [3]. A common solution is to exercise the valve periodically by moving it from the non-safe position to an intermediate position (typically around 10-20% of the way through the full stroke), then back again within a few seconds. The valve is not allowed to reach the safe position during this procedure, as this could cause significant process upset and potentially cause other SIFs to trip. During the PFDavg calculation, partial valve stroke testing (PST or PVST) can be credited as an additional proof test, with a somewhat lower proof test coverage than the full proof test (as it does not confirm that the valve seals properly). This can make a useful contribution to achieving the PFDavg target, although its effect on PFDavg is usually not substantial.
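The effect of crediting PVST can be illustrated with a common first-order approximation: the fraction of dangerous undetected failures covered by the partial stroke test is tested at the short PVST interval, and the remainder at the full proof test interval (assuming the full proof test is perfect). All numerical values below are illustrative assumptions, not data from the text.

```python
# Simplified sketch of crediting PVST in a PFDavg calculation. The split by
# partial test coverage is a common first-order approximation; rates and
# intervals below are illustrative assumptions.

HOURS_PER_YEAR = 8760.0

def pfd_avg_with_pst(lam_du, t_full_y, t_pst_y, c_pst):
    """lam_du in failures/h; intervals in years; c_pst = PVST coverage."""
    t_full = t_full_y * HOURS_PER_YEAR
    t_pst = t_pst_y * HOURS_PER_YEAR
    return 0.5 * lam_du * (c_pst * t_pst + (1.0 - c_pst) * t_full)

lam_du = 2.0e-6   # dangerous undetected failure rate, per hour (assumed)

# 3-yearly full proof test only:
no_pst = pfd_avg_with_pst(lam_du, t_full_y=3.0, t_pst_y=3.0, c_pst=0.0)
# Same, plus quarterly PVST with an assumed 60% coverage:
with_pst = pfd_avg_with_pst(lam_du, t_full_y=3.0, t_pst_y=0.25, c_pst=0.6)
```

With these assumed numbers the PFDavg drops by roughly half, not by an order of magnitude, consistent with the text's observation that the PVST contribution is useful but not dramatic.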
Is PVST a diagnostic?
There is some debate in the process safety community about whether PVST can be treated as a diagnostic test. The difference between diagnostics and proof tests, in terms of achieving a SIL target, is that diagnostics convert dangerous failures to safe failures, increasing the Safe Failure Fraction (SFF) of the function and reducing the PFDavg; whereas proof tests allow dangerous failures to be discovered and repaired, giving some improvement in the PFDavg, but do not influence SFF. Increased SFF may allow a higher SIL to be claimed in terms of architectural constraints, as discussed in Chapter 9. To claim that PVST is a diagnostic, it needs to meet the following criteria:
• Testing must be automated, i.e. not dependent on operators or maintenance personnel remembering to perform it. This is seldom implemented in practice, due to concerns about plant disturbance and the possibility of spurious trips.
• Testing must occur at least 10 times as often as the expected demand on the safety function.
• Personnel must be alerted automatically if the PVST fails, and a repair must be completed within the MTTR.
When collecting failure rate data for process valves (and their actuators), engineers may find that the SIL certificate shows a non-zero λDD value. This is valid only if PVST is applied and meets the criteria for a diagnostic. If PVST is not applied, or does not meet the criteria for a diagnostic, then λDD should be added into λDU before running the PFDavg calculation.
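This data-handling rule is simple enough to state in one line of code. The function name and example rates are illustrative.

```python
# Sketch of the rule above: if PVST is not applied, or does not qualify as a
# diagnostic, fold the certificate's lambda_DD into lambda_DU before the
# PFDavg calculation. Rates below are illustrative (per hour).

def effective_lambda_du(lam_du, lam_dd, pvst_is_diagnostic):
    return lam_du if pvst_is_diagnostic else lam_du + lam_dd
```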
8.3.6 Energise and de-energise-to-trip
Most SIFs are designed so that loss of the energy source to any one component, whether electrical power or the pneumatic or hydraulic pressure controlling a valve actuator, results in the final element moving to the 'safe' position. (If the affected component has HFT > 0, the loss of energy would result in degraded operation, rather than immediate transition to the safe position.) This configuration is known as 'de-energise-to-trip' (DTT). Alternatively, SIFs can be designed to require energy to achieve the safety function when required. This is known as 'energise-to-trip' (ETT). A common example is the use of a motor-operated valve as a final element. What are the relative merits of DTT and ETT configurations? DTT is generally preferred as it makes the SIF fail-safe, and generally has a lower PFDavg. However, DTT SIFs are more susceptible to spurious trips, as any interruption of the energy supply may put the process into its safe state. A common solution is to provide backup energy sources for critical elements, e.g. air reservoir bottles for pneumatic actuators on shutdown valves. The standards require that SIFs with ETT configurations apply a means to confirm energy supply availability. For elements requiring electrical power, this can be done by an end-of-line monitor, detecting a continuous 'pilot' current through the circuit and alarming on loss of current.
8.3.7 Derating
The practice of selecting elements with a higher specification than strictly required for a particular application, e.g. a higher design pressure or temperature, is known as derating. Some safety practitioners advocate derating of SIS field elements, especially elements in contact with process fluids. The objective is to avoid operating elements too close to their design limits. This provides greater headroom for the mechanical stresses experienced by the elements, which could reduce the probability of a random hardware failure. Nevertheless, there is no specific requirement for derating in the standards, and so hardware is seldom derated in real-world SIS applications.
8.3.8 Hard-wiring of field devices
SIS final elements are normally individually hard-wired to the outputs of the PLC. However, if there are multiple final elements in an NooN configuration, common wiring could be acceptable, because common cause failure leads to no additional risk in this case. For energise-to-trip SIFs, hard-wiring needs continuity monitoring, so that any break in the connection will be rapidly detected and alarmed. Some vendors offer bus- or wireless-based communication systems for sensors. These have the advantage of reducing the extent of hard-wiring, which could be considerable for a plant spread over a large geographical area. Wireless systems can now achieve a high level of availability by designing each device on the network to act as a relay station for all the other devices, providing multiple routes for signals to reach the logic solver. Cybersecurity issues need to be carefully considered if wireless communications are implemented, as they introduce a new point of vulnerability to the system.
8.4 Independence
To be valid as a layer of protection, a SIF must be independent of the initiating cause of the demand on the SIF, and independent of other layers of protection that are considered to reduce the risk for those initiating causes. But how independent should the SIF be?
Realistically, it is virtually impossible to eliminate all causes of common cause or common mode failure between the SIF and other layers of protection. In practice, designers typically apply some criteria to determine when the degree of independence between these layers is sufficient:
• The layers do not share any items of field hardware in common
• Common sources of energy, such as power feeds, are separated so that they have as few common elements as possible
• Any credible common causes of failure:
  • Have a much lower frequency or probability than other, non-common causes of failure for the layers, and
  • Will be readily evident to the plant operator, i.e. will not lead to latent failures
We already discussed the case of tripping a control valve earlier in this chapter. Let’s consider some other specific cases where independence issues may arise. Other cases relating specifically to non-SIS layers of protection are covered in Chapter 6.
8.4.1 Multiple SIFs in the same SIS If a hazardous scenario is protected simultaneously by two SIFs, with separate sensors and final elements, but sharing the same SIS logic solver, are the SIFs independent? Some safety practitioners consider the SIFs not to be independent, as they share hardware and software. Others, taking the view that the logic solver usually makes only a small contribution to the total PFDavg of each SIF, consider this position to be overconservative. Given that a typical SIL-rated, PLC-based logic solver usually has a very high diagnostic coverage (>99%), it seems plausible to say the chances of an unrevealed common fault affecting both SIFs simultaneously are remote. Moreover, supposing the two SIFs (at SIL 1 and SIL 2, for example) were replaced by a single SIL 3 SIF, it is hard to understand why simultaneous failure of the two lower-rated SIFs should be considered any worse than a single SIL 3 SIF failure. Therefore, provided the field elements of the SIFs are connected to different I/O cards on the PLC (to eliminate one possible common failure cause), and the PLC’s internal diagnostics are enabled, it is reasonable to take full credit for both SIFs.
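The "small contribution" argument above can be made concrete with a back-of-the-envelope calculation. All the PFDavg figures below are illustrative assumptions chosen only to reflect the typical situation the text describes (field devices dominate; the SIL-rated PLC is a tiny slice of the budget).

```python
# Illustrative numbers for the argument above: with typical figures, the
# shared logic solver is a very small slice of each SIF's PFDavg budget.
# All values are assumptions for the sake of the example.

pfd_sensor = 5.0e-3
pfd_logic = 1.0e-5    # SIL-rated PLC with high diagnostic coverage
pfd_final = 1.0e-2

pfd_sif = pfd_sensor + pfd_logic + pfd_final
logic_share = pfd_logic / pfd_sif    # fraction attributable to the PLC
```

With these numbers the logic solver accounts for well under 0.1% of the SIF's PFDavg, which is why many practitioners accept full credit for both SIFs despite the shared PLC.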
8.4.2 Multiple systems tripping a motor via the same MCC The following scenario shown in Fig. 8.11, relating to protection of a pump against dry run, commonly arises:
Figure 8.11 Protection of a pump against dry run: schematic.
• Situation: a drum or tank intermittently containing liquid (e.g. closed drain tank, flare knockout drum) is provided with a pump to remove liquid when required. The pump is controlled by a level control in on/off mode.
• Initiating cause: level control LIC-100 malfunctions, and runs the pump when the tank is empty.
• Layer of protection 1: low low level from LIC-100 stops the pump through the BPCS.
• Layer of protection 2: an independent transmitter (e.g. a second level transmitter, or a pump suction pressure transmitter) trips the pump through the SIS.
• Layer of protection 3: the pump motor is equipped with a machine monitoring system (MMS) that trips the pump on high vibration.
Evidently layer 1 is invalid as an IPL, as it relies on the same transmitter that potentially failed in the initiating cause. But can we take separate credit for layers 2 and 3, given that they both send a trip signal to the pump motor’s MCC? Opinions vary. Some practitioners regard these two trips as sufficiently independent, given that the systems generating the trip signal are fully independent; others are concerned that they both act on the same trip circuitry in the MCC. Also, in some cases the MMS routes its trip signal via the SIS. One solution is to try to send the trip signals directly from the two systems (SIS and MMS) to separate MCC inputs, if available.
8.4.3 Communications between SIS logic solver and BPCS A communication link from the SIS logic solver to the BPCS is clearly very useful, as it allows the BPCS HMI to:
• Display SIS-generated process variable measurements
• Display the status of SIFs, including the 'first out' initiator (i.e. if the SIF has multiple initiators, the display shows which one tripped first)
• Annunciate alarms from the SIS, including 'pre-trip' alarms and diagnostic alarms
• Display the status of bypasses and overrides
• Display health information about the SIS
What if we require communications in the opposite direction, from the BPCS to the SIS? This can be useful, as (for example) it allows SIS bypasses to be initiated from the BPCS screen. However, this opens up the possibility of compromising the correct operation of the SIS in several ways, such as:
• Transfer of data that is incorrect or meaningless (e.g. out of range)
• Overloading of the SIS PLC with a data flood
• Transfer of malware to the SIS PLC, especially if the BPCS PLC is connected to other networks
• Loss of data (on failure of the BPCS PLC or failure of the communications link)
The SIS PLC must be designed so that it is not susceptible to any upset arising from communications issues. This can be achieved by allowing the SIS PLC to 'pull' data from the BPCS, but not allowing the BPCS to 'push' data to the SIS PLC. Also, the SIS PLC should not rely on any specific data being received from the BPCS. For example, suppose the designer plans to use a certain transmitter to provide both a trip input in the SIS, and an alarm in the BPCS. The appropriate configuration is to connect the transmitter directly to the SIS, and then transfer its measured value to the BPCS via a communications link. This is preferable to sending the data in the opposite direction from BPCS to SIS. Alternatively, a specially designed splitter can be used to connect the 4-20 mA loop to both the BPCS and the SIS simultaneously.
8.4.4 Implementing BPCS and SIS in a single logic solver Given all we have stressed about the importance of independence, it may seem strange to consider the possibility of using a single PLC to implement both BPCS and SIS functions. Nevertheless, such integrated systems are now available from most PLC manufacturers, and rapidly gaining market acceptance. In most cases, they achieve the required independence by strict segregation of control and safety software, and extensive use of internal diagnostics to minimise the risk of undetected failures. They may also have separate communications buses (electrical connections between different sections of the internal architecture) for control and safety functions.
8.4.5 Implementing non-safety functions in the safety PLC
Any function, whether discrete (on/off functionality such as a trip) or continuous (such as a control loop), whose SIL requirement is less than SIL 1, is termed a non-safety function. The standards discourage implementing non-safety functions alongside safety functions (with SIL ≥ 1) in a SIS. Reasons for this include:
• The possibility of interference between safety and non-safety functions, especially in fault conditions
• Increased complexity, leading to an increased probability of errors leading to systematic failures
Besides, I/O cards for safety PLCs are more expensive than for control PLCs. As a result, SIS designers often consider transferring non-safety functions from the SIS to the BPCS after the SIL assessment step is complete. This could include non-critical final elements, i.e. any actions shown in the Cause & Effect Diagram that are not listed as critical final elements in the SIF architecture. However, this may raise another independence-related issue. During SIL assessment, some functions that were originally proposed as SIFs, but turned out to have a SIL target

SIL achievable | PFDavg achieved | RRF achieved
SIL 2 | 10⁻² > PFDavg ≥ 10⁻³ | 100 < RRF ≤ 1,000
SIL 3 | 10⁻³ > PFDavg ≥ 10⁻⁴ | 1,000 < RRF ≤ 10,000
SIL 4 | 10⁻⁴ > PFDavg ≥ 10⁻⁵ | 10,000 < RRF ≤ 100,000
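The low demand SIL bands above can be expressed as a small helper function. This is a hedged sketch of the tabulated bands (with RRF simply the reciprocal of PFDavg); the function name is illustrative.

```python
# Sketch of the low demand mode SIL bands tabulated above.
# RRF = 1 / PFDavg, so each PFDavg band corresponds to an RRF band.

def sil_from_pfd_avg(pfd_avg):
    """Maximum SIL achievable for a low demand SIF, per the bands above."""
    bands = [
        (1e-2, 1e-1, 1),   # SIL 1: 1e-1 > PFDavg >= 1e-2
        (1e-3, 1e-2, 2),   # SIL 2: 1e-2 > PFDavg >= 1e-3
        (1e-4, 1e-3, 3),   # SIL 3: 1e-3 > PFDavg >= 1e-4
        (1e-5, 1e-4, 4),   # SIL 4: 1e-4 > PFDavg >= 1e-5
    ]
    for lo, hi, sil in bands:
        if lo <= pfd_avg < hi:
            return sil
    return 0   # outside the tabulated bands: no SIL claimable here
```

For example, a SIF achieving PFDavg = 5 × 10⁻³ (RRF = 200) sits in the SIL 2 band.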
For high demand and continuous mode functions, the situation is completely different, because the SIF is effectively being tested frequently by actual demands (especially in continuous mode). Therefore, any dangerous failure will quickly be revealed and repaired. The effective probability of failure on demand is thus relatively constant and, therefore, we do not need to use an average failure measure. Instead, we use a measure known as probability of failure per hour (PFH). This is closely related to the random dangerous hardware failure rate λD; in fact, for single devices, the PFH is usually considered to be equal to λD. The term RRF is not used in this context. The failure measure MTBF (mean time between failures) is also not customarily used for SIFs, although in principle it could be (if its definition is restricted to dangerous failures). Table 9.3 gives the definition of PFH targets per SIL, as defined in the standards.
9.2.2 How the failure measure is calculated: SIL verification
Several mathematical methods exist for calculating PFDavg or PFH. These are explained in depth in other texts, such as Ref. [1], so we will provide only a brief description here, without going into the mathematical background.
Table 9.3: Maximum SIL achievable depending on PFH achieved, for high demand and continuous mode functions.
SIL achievable | PFH achieved
No SIL | PFH ≥ 10⁻⁵
SIL 1 | 10⁻⁵ > PFH ≥ 10⁻⁶
SIL 2 | 10⁻⁶ > PFH ≥ 10⁻⁷
SIL 3 | 10⁻⁷ > PFH ≥ 10⁻⁸
SIL 4 | 10⁻⁸ > PFH ≥ 10⁻⁹
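A companion helper for Table 9.3 can be sketched in the same way as for the low demand bands. The function name is illustrative; PFH is in failures per hour.

```python
# Sketch of the high demand / continuous mode SIL bands from Table 9.3.
# PFH is in failures per hour.

def sil_from_pfh(pfh):
    """Maximum SIL achievable for a high demand or continuous mode SIF."""
    bands = [
        (1e-6, 1e-5, 1),   # SIL 1: 1e-5 > PFH >= 1e-6
        (1e-7, 1e-6, 2),   # SIL 2: 1e-6 > PFH >= 1e-7
        (1e-8, 1e-7, 3),   # SIL 3: 1e-7 > PFH >= 1e-8
        (1e-9, 1e-8, 4),   # SIL 4: 1e-8 > PFH >= 1e-9
    ]
    for lo, hi, sil in bands:
        if lo <= pfh < hi:
            return sil
    return 0   # "No SIL" when PFH >= 1e-5 (or outside the tabulated bands)
```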
Calculation of probability curves
A relatively intuitive approach is to calculate the instantaneous probability of failure of the SIF at each moment during its mission time, and plot this as a curve of PFD vs. time. To obtain the PFDavg (the average height of the curve), one can calculate the area under the curve and then divide by the length of the curve (the mission time). As the curve is relatively smooth, its area can be estimated by adding up the height of the curve at regular intervals.
Single devices
The SIF's instantaneous PFD is a measure of its unavailability to perform its function at each moment. First, let's focus on a single device, such as a pressure transmitter. The device may be unavailable for any of the following reasons:
• A non-discoverable fault: the device has failed due to a fault that is not discoverable by proof testing. The rate of such failures is λDU · (1 − PTC), where PTC is the proof test coverage. The time duration available for such a fault to develop is the device's entire mission time, MT.
• A discoverable fault that is not detectable by diagnostics: the device has failed due to a fault that will be discovered by the next proof test. The applicable failure rate is λDU · PTC. The time duration during which faults can develop is the time since the last proof test; on average, this will be half the proof test interval, i.e. 0.5 · PTI.
• The device has been bypassed for proof testing: the fraction of time the device is under proof testing is PTD/PTI, where PTD is the proof test duration.
Some faults are detectable by diagnostics. As the diagnostic testing rate is assumed to be fast, we normally assume such faults are detected immediately, and therefore do not count the diagnostic cycle time in the calculation. If diagnostics are available but not implemented for some reason, then λDD faults become the same as λDU. We therefore have to include λDD in the calculations above; so, the rate of non-discoverable faults becomes (λDU + λDD) · (1 − PTC), and the rate of discoverable but not detectable faults becomes (λDU + λDD) · PTC.
Each of these unavailability causes leads to a probability curve. For non-discoverable faults, the shape of the curve is exponential: PFD(t) = 1 − exp(−λDU(1 − PTC)t), where t is the time since the device was new. As λDU is small (since we are using high reliability devices), this can be approximated to PFD(t) ≈ λDU(1 − PTC)t. The curve shape is shown in Fig. 9.1. The height of the curve is exaggerated for clarity.
For discoverable faults, the same principle applies, except that the probability of such faults resets to zero at each proof test (assuming that proof testing is executed correctly). Thus, the t quantity is the time elapsed since the last proof test. The characteristic curve shape is a sawtooth, as shown in Fig. 9.2.
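The "average height of the curve" idea can be demonstrated numerically for the sawtooth (discoverable-fault) contribution: sample PFD(t) at regular intervals over one proof test interval and average the samples. The failure rate, coverage, and interval below are illustrative assumptions.

```python
# Numerical sketch of averaging the sawtooth PFD(t) curve over one proof
# test interval, using the linear approximation PFD(t) = lam_du*PTC*t with
# t measured since the last proof test. Values are illustrative.

lam_du, ptc = 2.0e-6, 0.9      # failures per hour; proof test coverage
pti = 8760.0                    # proof test interval, hours (1 year)

n = 10000
samples = []
for i in range(n):
    t_since_test = (i / n) * pti                  # time since last proof test
    samples.append(lam_du * ptc * t_since_test)   # linear PFD approximation
numeric_avg = sum(samples) / n

# Averaging a ramp from 0 to lam_du*PTC*PTI gives half its peak height,
# which is where the 0.5 * PTI factor in the analytic formula comes from:
analytic_avg = 0.5 * lam_du * ptc * pti
```

The numerical average agrees with the analytic 0.5 · λDU · PTC · PTI term to within the sampling resolution, illustrating why averaging under the sawtooth yields the familiar half-interval factor.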
Meeting SIL requirements: SIL verification 249
Figure 9.1 PFD vs. time curve: non-discoverable faults.
Figure 9.2 PFD vs. time curve: discoverable faults.
The unavailability due to testing is a step function: the PFD from this contribution is 1 when the device is under test, and 0 otherwise. The curve is shown in Fig. 9.3. However, if the process is stopped (offline) during the proof test, then there is no possibility of a demand during proof testing, and this contribution can be ignored. There is one more contribution to include. When a fault is discovered, either by proof testing or by diagnostics, there is a delay time MTTR (the mean time to restore) during which the device is unavailable. The probability of this unavailability occurring is the
Figure 9.3 PFD vs. time curve: unavailability due to testing.
probability of a detectable dangerous fault during the preceding hour, i.e. λDD, plus a contribution λDU from undetectable, discoverable faults. (For λDU faults, the repair would occur only after proof testing; but, taken over the mission time, the average probability of unavailability for repair is the same, no matter when we consider the repair to take place.) Thus, the average probability of unavailability due to repair is (λDU + λDD) · MTTR. This adds a constant contribution to the overall unavailability. The total unavailability of the device, as a function of time since new, is approximately the sum of these contributions. This is depicted in Fig. 9.4, which we saw earlier in Chapter 4. As each of these quantities can be calculated, we can also calculate the average probability of failure PFDavg for the device, as follows. The 0.5 factors arise from averaging under the curves.

PFDavg = 0.5 λDU (1 - PTC) MT     [undiscoverable faults]
       + 0.5 λDU PTC · PTI        [discoverable faults]
       + PTD/PTI                  [testing]
       + (λDU + λDD) · MTTR       [repair]                  (9.1)

Multiple devices
Related devices in a SIF can be logically combined to calculate the PFDavg of the group. For example, consider a shutdown valve assembly comprising a solenoid valve, a pneumatic actuator, and a process-wetted ball valve. For the function to succeed, all of these elements must be available; we can describe this as a 3oo3 logical relationship. Other groups, such as redundant transmitters, may be in a 1ooN or MooN (M < N) group.
Figure 9.4 PFD vs. time curve: total.
To calculate the PFDavg of the group, we need to consider the possibility that a random hardware failure could occur simultaneously in all devices in the group, due to a common cause, for example an external stressor such as vibration. To incorporate this in the calculation, we define a common cause factor β, which is defined as the fraction of λDU and λDD arising from failure modes susceptible to common cause failure. The value of β is typically selected according to Table 9.4. As an approximation, the same β value is normally applied to λS when calculating spurious trip rate. More specific estimates of β can be developed by applying the detailed rule set provided in IEC 61508:2010 part 6, clause D.6. β can also be estimated by Failure Modes and Effects Analysis (FMEA); see Chapter 5 of Ref. [2].
Table 9.4: Typical values of the common cause factor β.

Type of technology applied in    Are the devices in the group spatially separated    Typical β
devices in the group             to reduce exposure to common stressors?
Diverse                          Yes                                                 2%
Diverse                          No                                                  5%
Similar                          Yes                                                 5%
Similar                          No                                                  10%
Alternatives to β

The beta factor is straightforward to estimate and apply. However, some practitioners regard it as an oversimplification. For a more detailed approach, see Ref. [3].
1ooN groups, such as redundant transmitters, can fail in two ways. For illustration purposes, we treat all devices in the group as having the same λ values; this is true if the devices are identical, but generally not the case for groups of dissimilar devices.

• All N elements can fail simultaneously due to a common cause. The applicable failure rate is β·λDU.
• All N elements can fail simultaneously due to non-common causes, i.e. they all happen to fail randomly at the same time. The applicable failure rate is [(1 - β)λDU]^N.
Each of these makes a contribution to the undiscoverable and discoverable fault failure probability. Therefore we split each of the first two terms of Eq. (9.1) into two, as shown in Eq. (9.2).

PFDavg = 0.5 β λDU (1 - PTC) MT              [undiscoverable faults, common cause]
       + 0.5 β λDU PTC · PTI                 [discoverable faults, common cause]
       + 0.5 [(1 - β) λDU (1 - PTC) MT]^N    [undiscoverable faults, non-common cause]
       + 0.5 [(1 - β) λDU PTC · PTI]^N       [discoverable faults, non-common cause]
       + PTD/PTI                             [testing]
       + N (λDU + λDD) MTTR                  [repair]                  (9.2)

Here, we are conservatively assuming that the entire SIF is taken offline if any one sensor is found to be faulty. The factor N now appears in the repair term, because we have N devices that can fail. However, if the SIF is actually kept online in a degraded state during the repair, then the repair term becomes negligibly small and can be omitted. For NooN groups, the group fails if any one of the elements fails. This means we can apply the same formula as for 1oo1, replacing λ with Nλ in Eq. (9.1). It makes no difference whether the element failed due to common or non-common cause, so there is no need to consider the β factor. This calculation slightly overestimates the failure probability, as it double-counts cases where more than one element fails simultaneously due to common cause; however, this has a negligible effect on the overall result. The calculation for MooN groups is more complex and is covered in Ref. [1].
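As a minimal sketch, Eq. (9.2) can be written out directly in code. The function and parameter names below are my own, not from the book; rates are in failures per hour and times in hours.

```python
# Illustrative sketch of Eq. (9.2) for a 1ooN group; example values are invented.
def pfd_avg_1oon(n, beta, l_du, l_dd, ptc, pti, ptd, mt, mttr):
    cc_undisc  = 0.5 * beta * l_du * (1 - ptc) * mt               # common cause, undiscoverable
    cc_disc    = 0.5 * beta * l_du * ptc * pti                    # common cause, discoverable
    ind_undisc = 0.5 * ((1 - beta) * l_du * (1 - ptc) * mt) ** n  # independent, undiscoverable
    ind_disc   = 0.5 * ((1 - beta) * l_du * ptc * pti) ** n       # independent, discoverable
    testing    = ptd / pti                                        # bypassed during online proof test
    repair     = n * (l_du + l_dd) * mttr                         # SIF offline during repair
    return cc_undisc + cc_disc + ind_undisc + ind_disc + testing + repair

# Hypothetical 1oo2 transmitter pair with an annual proof test:
pfd = pfd_avg_1oon(n=2, beta=0.05, l_du=2e-7, l_dd=0.0, ptc=0.7,
                   pti=8760, ptd=4, mt=20 * 8760, mttr=8)
```

Note that with β = 0 and n = 1 the expression reduces term by term to Eq. (9.1).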
The complete SIF
Using these equations, it is sometimes possible to develop a complete analytical solution for the PFDavg of the whole SIF. It is always assumed that the SIF's three subsystems (sensors, logic solver and final element) are independent, i.e. they have no common failure causes. This means the PFDavg for each subsystem can be calculated separately and then added together to obtain PFDavg for the whole SIF. However, the calculations quickly become complex, with a high probability of making a mistake due to the large number of input parameters involved. Also, in general it is hard, if not impossible, to describe the PFDavg analytically for more complex architectures, where multiple β, PTI and mission time values apply. Furthermore, PFDavg values cannot, in general, be simply added up for different groups within a subsystem, as this gives the wrong result for 1ooN and MooN architectures. As a result, most practitioners use dedicated software for PFDavg calculations.

State-based calculations

Another widely used approach is to treat the SIF as a system of devices, each having multiple possible states. For example, a smart transmitter could be in any one of the following states:

1. Working properly (i.e. available)
2. Unavailable due to an undiscoverable fault
3. Unavailable due to a fault that is discoverable by proof testing
4. Unavailable during repair time
5. Unavailable during proof testing
The transmitter is normally in state 1, and can transition from one state to another. For example, when a discoverable fault occurs, it transitions from state 1 to state 3. When the fault is repaired, it spends time in state 4, and is finally restored to state 1. Extending this to a system with multiple devices, the system as a whole can exist in any of the states listed above, plus many additional states if it has any hardware fault tolerance. Here are some examples of additional states:

6. Working in a degraded state (one device failed, other device(s) still able to fulfil the SIF's function)
7. Working in a degraded state (one device under repair)
8. Working in a degraded state (staggered testing of devices in a MooN group)

A mathematical description of the system can be developed by determining the transition rate from each state to every other state. For example, for a single device, the transition rate from state 1 to state 2 is λDU (1 - PTC). This can then be used to calculate the
probability that the system is in any one state. The analyst will calculate the total probability that the system is in any state in which the SIF cannot perform its safety function; this yields the SIF's PFDavg. One method for doing this calculation is Markov modelling. Another method is to use a statistical approach such as Monte Carlo simulation.
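To illustrate the statistical route, here is a minimal Monte Carlo sketch of my own construction (not from the book) that estimates the average unavailability of a single device from its undiscoverable and discoverable fault contributions. Overlapping downtimes are simply added, which is an acceptable simplification at realistic failure rates.

```python
import random

def mc_pfd_avg(l_du, ptc, pti, mt, trials=5000, seed=42):
    """Monte Carlo estimate of average unavailability for one device."""
    rng = random.Random(seed)
    total_frac = 0.0
    n_intervals = int(mt / pti)
    for _ in range(trials):
        down = 0.0
        # Non-discoverable fault: once it occurs, the device stays dead
        # until the end of the mission time.
        t = rng.expovariate(l_du * (1 - ptc))
        if t < mt:
            down += mt - t
        # Discoverable fault: dead until the next proof test resets it.
        for _interval in range(n_intervals):
            t = rng.expovariate(l_du * ptc)
            if t < pti:
                down += pti - t
        total_frac += down / mt
    return total_frac / trials

# Deliberately poor device (l_du = 1e-5/h) so that failures are frequent
# enough to estimate; the Eq. (9.1)-style approximation
# 0.5*l_du*(1-ptc)*mt + 0.5*l_du*ptc*pti gives roughly 0.16 here.
estimate = mc_pfd_avg(l_du=1e-5, ptc=0.7, pti=8760, mt=87600)
```

In practice a full Monte Carlo model would also simulate repair times, testing downtime and multi-device voting; this sketch only demonstrates the principle.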
9.2.3 High demand and continuous modes

Calculating PFH is more straightforward than PFDavg, as no averaging is required. The λD values of individual elements are combined using standard probability rules. For high demand mode, only λDU is considered: we assume that, if a fault is discovered by diagnostics, the process will be stopped while the fault is repaired, and therefore λDD need not be considered. In continuous mode, diagnostics are assumed to be insufficiently frequent to detect faults, so the applicable λD is λDU + λDD. Ensure λ values are expressed in h^-1. For 1oo1 subsystems, simply PFH = λD. For 1ooN subsystems, all the devices must fail (by common cause or independently) to cause subsystem failure, therefore PFH = βλD + [(1 - β)λD]^N. For NooN subsystems, any one device failure leads to subsystem failure, therefore PFH = NλD. To combine subsystems, we simply sum their individual PFH values.
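These rules are simple enough to sketch directly. The helper below and the example failure rates are my own illustrative assumptions, with all rates in h^-1.

```python
# Sketch of the PFH combination rules for high demand / continuous mode.
def pfh_subsystem(arch, l_d, n=1, beta=0.1):
    if arch == '1oo1':
        return l_d                                   # single device
    if arch == '1ooN':                               # all N must fail
        return beta * l_d + ((1 - beta) * l_d) ** n
    if arch == 'NooN':                               # any one failure fails the group
        return n * l_d
    raise ValueError(f'unsupported architecture: {arch}')

# Whole-SIF PFH is the sum over the three subsystems:
pfh_sif = (pfh_subsystem('1ooN', 3e-7, n=2, beta=0.05)  # 1oo2 sensors
           + pfh_subsystem('1oo1', 1e-8)                # logic solver
           + pfh_subsystem('NooN', 5e-7, n=2))          # two final elements in series
```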
9.3 More on proof testing

9.3.1 Optimising the proof test interval

One common project requirement during SIL verification is to optimise the proof test interval (PTI) of each element in the SIF. This may be required when:

• The SIF, in its original configuration, does not meet its PFDavg/RRF target, and the design team decides to try shortening the PTI to meet the target; or
• The SIF comfortably exceeds its PFDavg/RRF target, and the team decides to try a longer PTI to save on maintenance costs during the operational lifetime of the SIS.
While optimising the PTI, there are three important issues to be aware of:

1. Unavailability during testing. If proof testing is performed with the process still running, the hazards are still present, and the risk is not managed by the SIF during the proof test duration (PTD). As PTI decreases, the unavailability due to proof testing (= PTD/PTI) increases and can become a dominant contributor to PFDavg (see Eq. (9.1)).
Figure 9.5 Typical relationship between PFDavg and proof test interval.
In fact, if one plots a curve of calculated PFDavg against PTI for a typical SIF, it passes through a minimum at around PTI = 1 to 2 months, and rises at shorter PTI, as depicted in Fig. 9.5. In other words, over-testing can be just as hazardous as under-testing. Kletz has reported a real-world example of an accident that occurred as a result of over-testing [4].

2. Human error during testing. The more often proof testing is performed, the higher the chance of making an error. This could result in either a false test result, or the SIF being left in a degraded or unworkable state. In addition, every time a system is disturbed, hardware components such as cable connectors and seals become gradually less secure, increasing the chance of failure. This is another reason to avoid over-testing.

3. Minimum PTI requirement. For a SIF in low demand mode, proof testing must be performed frequently relative to the predicted demand rate on the SIF (as a guide, at least twice as often). For example, if the predicted demand rate is once per 2 years, proof testing should be performed at least once per year. If the PTI is extended beyond this limit, dangerous failures are increasingly likely to be discovered by failure on demand rather than by proof testing, and therefore the basic premise of low demand mode is no longer met. In such a case, the SIF should be assigned to high demand mode instead.
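The over-testing trade-off can be seen numerically by sweeping the PTI in a PFDavg expression of the Eq. (9.1) form. The parameter values below are invented for illustration, so the location of the minimum will differ from SIF to SIF.

```python
# Invented example: find the PTI that minimises PFDavg (terms per Eq. 9.1).
def pfd_avg(pti, l_du=5e-7, l_dd=0.0, ptc=0.9, ptd=8.0, mt=175200, mttr=8.0):
    return (0.5 * l_du * (1 - ptc) * mt    # non-discoverable faults
            + 0.5 * l_du * ptc * pti       # discoverable faults
            + ptd / pti                    # bypassed while testing online
            + (l_du + l_dd) * mttr)        # repair

months = (1, 2, 3, 6, 12, 24)
candidates = [720 * m for m in months]     # roughly 720 h per month
best_pti = min(candidates, key=pfd_avg)
```

With these invented numbers the optimum comes out at around 6 months; the 1 to 2 month figure quoted above corresponds to different typical parameters. The shape of the curve, falling then rising as PTI shortens, is the point.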
9.3.2 The effect of human error during proof testing

Eq. (9.2) includes terms taking account of proof test coverage (PTC). This makes the implicit assumption that proof testing is performed perfectly, i.e.:

• The test is performed exactly as specified in the manufacturer's safety manual.
• The test is performed on time, i.e. the duration between tests is no longer than the proof test interval specified in the Safety Requirements Specification (SRS).
• Any dangerous failures found are repaired. The actual mean time to repair (MTTR) is no longer than the MTTR assumed in the SIL verification.
• The elements under test are restored to full operation after test or repair, with bypasses cancelled.
All of these depend on humans developing and executing procedures correctly. In general, it is not reasonable to assume 100% perfection where humans are involved. Therefore, a performance or confidence factor is sometimes included in the PFDavg calculation. This could be expressed by, for example, multiplying the PTC by a performance coefficient p, where typically 0.9 ≤ p ≤ 0.99. Selection of a p value near the top end of this range needs justification by enhanced attention to human factors, using measures such as:

• high quality training of maintenance personnel, with assessment of training effectiveness
• random checks of proof test performance
• cross-checking of each proof test by a competent independent person.
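As a rough illustration with invented numbers, derating the PTC by p has a disproportionate effect, because the failures an imperfect test misses persist for the whole mission time rather than for one proof test interval.

```python
# Invented example: effect of a human-performance coefficient p on the two
# proof-test-related terms of Eq. (9.1).
def pfd_terms(ptc, l_du=5e-7, pti=8760, mt=175200):
    return 0.5 * l_du * (1 - ptc) * mt + 0.5 * l_du * ptc * pti

perfect_testing = pfd_terms(0.95)        # p = 1.0
derated_testing = pfd_terms(0.9 * 0.95)  # p = 0.9, effective PTC = 0.855
```

Here an imperfect tester (p = 0.9) nearly doubles the contribution from these terms.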
9.4 Architectural constraints

9.4.1 Introduction

Both IEC 61508 and IEC 61511 impose requirements on the hardware fault tolerance (HFT) each SIF must possess. This is one of the few prescriptive requirements in the standards. A definition and description of HFT is provided in Chapter 2. The HFT is usually assessed separately for each of the three subsystems comprising the SIF, for reasons explained a few pages later. The subsystems are:

• Sensors, including wiring, signal conditioners and intrinsically safe (IS) barriers, up to the point where the signal enters the I/O card of the safety PLC;
• Logic solver, including its I/O cards, CPU, memory, and power supply, but excluding any components not required for the SIF to succeed, such as hard disk drive and HMI;
• Final elements, including wiring from the I/O card, drivers, power supplies, solenoids, and valve actuators, but excluding components that are not regarded as part of the SIF, such as valve position sensors (which feed information on success or failure of the SIF back to the PLC). If a motor (for a compressor or pump) is a critical final element of the SIF, the motor itself is not usually considered part of the SIF; the SIF ends at the motor control circuit (MCC).
The standards provide several options for assessing the HFT requirement. To understand them, we must first explain two more concepts: hardware type (type A/B) and safe failure fraction.
9.4.2 Hardware type A and type B

IEC 61508 distinguishes between two types of component, identified as type A and type B. Type B is equipment whose failure modes cannot be clearly predicted, which in practice means any equipment containing programmable electronics (because software can fail in too many different ways to identify them all). This includes smart transmitters, because the user can usually program various options such as span and diagnostic functions. Type A is everything else, i.e. simple mechanical devices with no programmable electronics.
9.4.3 Safe failure fraction

Some of the models for assessing the HFT requirement use a quantity known as Safe Failure Fraction (SFF). This is the fraction of all safe and dangerous faults that will result in a spurious trip, unless the trip is suppressed either by diagnostics (λSD faults) or by redundancy (e.g. a safe failure of a single device in a 2oo2 system will not lead to a trip). If automatic diagnostics are applied, λDD faults are counted as safe for this purpose. Eq. (9.3) shows the formula for SFF in IEC 61508:2000, the first edition of IEC 61508 (λNE = no-effect failures; see Chapter 2 for a definition):

With diagnostics applied:
SFF = (λSD + λSU + λDD + λNE) / (λSD + λSU + λDD + λDU + λNE)
                                                              (9.3)
Without diagnostics applied:
SFF = (λSD + λSU + λNE) / (λSD + λSU + λDD + λDU + λNE)
Eq. (9.4) shows the formula for SFF in IEC 61508:2010 (second edition):

With diagnostics applied:
SFF = (λSD + λSU + λDD) / (λSD + λSU + λDD + λDU)
                                                              (9.4)
Without diagnostics applied:
SFF = (λSD + λSU) / (λSD + λSU + λDD + λDU)
As discussed in Chapter 8, when calculating SFF, make sure to count partial valve stroke test (PVST) as a diagnostic (i.e. its detected failures are a contributor to λDD) only if it is frequent and automatic.
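The second-edition formula is easy to apply directly. The function and the failure rates below are my own illustrative assumptions, with rates in h^-1.

```python
# SFF per the IEC 61508:2010 formula (Eq. 9.4).
def sff(l_sd, l_su, l_dd, l_du, diagnostics=True):
    numerator = l_sd + l_su + (l_dd if diagnostics else 0.0)
    return numerator / (l_sd + l_su + l_dd + l_du)

# Hypothetical device: counting PVST-detected failures in l_dd (only valid
# if the PVST is frequent and automatic) lifts the SFF from ~43% to ~86%.
without_pvst = sff(2e-7, 1e-7, 3e-7, 1e-7, diagnostics=False)
with_pvst    = sff(2e-7, 1e-7, 3e-7, 1e-7, diagnostics=True)
```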
9.4.4 HFT requirements in IEC 61508:2000

IEC 61508:2000 provides a pair of tables showing the maximum SIL achievable with equipment of a given HFT and SFF; these are reproduced here as Tables 9.5 and 9.6. Table 9.5 is for subsystems containing only type A devices; Table 9.6 is for subsystems containing one or more type B devices.
9.4.5 HFT requirements in IEC 61508:2010

IEC 61508:2010 offers two options for hardware fault tolerance requirements, known as Route 1H and Route 2H (the 'H' standing for hardware). Route 1H is the same as Tables 9.5 and 9.6. Route 2H follows the much simpler algorithm shown in Table 9.7. It does not apply the SFF concept.

Table 9.5: SIL achievable for subsystems containing only type A devices, as a function of SFF and HFT.

                        Hardware fault tolerance
SFF                     0        1        2
SFF < 60%               SIL 1    SIL 2    SIL 3
60% ≤ SFF < 90%         SIL 2    SIL 3    SIL 4
90% ≤ SFF < 99%         SIL 3    SIL 4    SIL 4
SFF > 99%               SIL 4    SIL 4    SIL 4
Table 9.6: SIL achievable for subsystems containing one or more type B devices, as a function of SFF and HFT.

                        Hardware fault tolerance
SFF                     0              1        2
SFF < 60%               Not allowed    SIL 1    SIL 2
60% ≤ SFF < 90%         SIL 1          SIL 2    SIL 3
90% ≤ SFF < 99%         SIL 2          SIL 3    SIL 4
SFF > 99%               SIL 3          SIL 4    SIL 4
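Tables 9.5 and 9.6 lend themselves to a simple lookup. The sketch below is my own encoding of them; an SFF falling exactly on a band boundary is assigned to the higher band, an assumption the tables themselves leave open.

```python
# Max achievable SIL per Route 1H (Tables 9.5 and 9.6); 0 means 'not allowed'.
TYPE_A = [(1, 2, 3), (2, 3, 4), (3, 4, 4), (4, 4, 4)]   # rows: SFF bands
TYPE_B = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 4)]

def max_sil_route_1h(sff, hft, type_b=False):
    """sff as a fraction (0..1), hft in {0, 1, 2}."""
    band = sum(sff >= threshold for threshold in (0.60, 0.90, 0.99))
    table = TYPE_B if type_b else TYPE_A
    return table[band][hft]
```

For example, a type B subsystem with SFF = 85% and HFT 1 is limited to SIL 2.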
Table 9.7: SIL achievable per subsystem per IEC 61508:2010 Route 2H.

SIL target    Minimum HFT required
1             0
2             0
3             1
4             2
According to the standard, Route 2H can be applied only if the following conditions are met:

• The reliability data (λD values) must be based on field experience for devices in a similar environment to your application;
• The data must be collected in line with applicable standards, such as IEC 60300-3-2 or ISO 14224;
• The failure measure achieved (PFDavg or PFH) shall be based on failure rate data at a 90% confidence level.
In other words, to claim the more lenient HFT requirements in Route 2H, the failure rate data must have robust statistical justification. In practice, this justification should normally be part of the process of confirming each device's SIL capability (e.g. by SIL certification or 'prior use' analysis), so ideally it will not impose an additional burden on the analyst.

Why valve manufacturers panicked when IEC 61508:2010 was issued

The first edition of IEC 61508 included λNE in the total of safe failures. This resulted in higher, sometimes considerably higher, achieved values of SFF than the second edition formula. A peculiar and unintended side effect was that the more no-effect failures experienced by an item of equipment, the higher the SFF, and hence the higher the SIL it was capable of! The second edition closed this loophole, but the transition from first to second edition caused some pain for manufacturers of safety equipment, especially valves (which have a relatively high proportion of no-effect failures), because the maximum SIL theoretically achievable by their devices suddenly decreased. (Valve manufacturers should have rejoiced rather than panicking, as the higher HFT requirement gave them an excuse to sell twice as many valves.)
9.4.6 HFT requirements in IEC 61511:2016

IEC 61511:2016 provides another HFT algorithm, similar to Route 2H, as shown in Table 9.8. As the IEC 61511:2016 algorithm is the simplest to understand and easiest to comply with, the author recommends selecting this option. Like Route 2H, this approach does not apply the SFF concept. Additional requirements are as follows:

• All programmable devices (i.e. 'type B' as defined in IEC 61508) must have a diagnostic coverage greater than 60%. (In practice, this requirement is generally easily met.)
• Failure rate data needs an upper bound statistical confidence of 70% or better. This is similar to, but less stringent than, the Route 2H requirement.
Table 9.8: Architectural constraints per IEC 61511:2016.

SIL target    Minimum HFT required
1             0
2             0 (continuous mode: 1)
3             1
4             2
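Table 9.8 reduces to a one-line lookup; the encoding below is my own.

```python
# Minimum HFT per IEC 61511:2016 (Table 9.8).
def min_hft_61511(sil_target, continuous_mode=False):
    hft = {1: 0, 2: 0, 3: 1, 4: 2}[sil_target]
    if sil_target == 2 and continuous_mode:
        hft = 1   # the continuous-mode exception in the table
    return hft
```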
9.4.7 How to apply SFF requirements

To apply the SFF requirement, we need to determine whether the requirement applies to each individual element, to groups of elements, or to complete subsystems. IEC 61511 makes it clear that SFF can be applied at the subsystem level, which is straightforward and reasonable. However, the IEC 61508 standard does not clearly explain which architectural level the requirement applies at, nor how to calculate SFF for groups of elements or complete subsystems. To resolve this problem, we can reason as follows. Applying the SFF at individual element level does not make sense, as some elements (such as process valves) intrinsically have no safe failure modes, and therefore have SFF = 0. On the other hand, we cannot calculate SFF for an entire subsystem in general, because SFF is defined in terms of λD and λS, which have no meaning once elements are combined in groups with any architecture other than NooN. So the only workable solution is to apply SFF at the level of element systems that are combined in NooN architecture, i.e. logically in an 'AND' relationship. An example is a sub-assembly of solenoid valve, actuator and process valve, plus any additional components such as quick exhaust valve and volume booster. These systems are known as 'legs' in some SIL verification software applications. In a Reliability Block Diagram (RBD), as discussed in Chapter 8, they correspond to linear chains of elements. Within such a group, the SFF is calculated based on the sum of the applicable λD and λS values for all the elements in the group.
9.5 SIL capability and SIL certification

9.5.1 Introduction

The third major requirement needed to meet a SIL target is to demonstrate SIL capability. This means that evidence is available to confirm that the elements in the SIF are
reasonably free of errors in design that could lead to systematic failure. The requirements in the standards are in IEC 61508:2010, parts 2 (hardware) and 3 (software); and IEC 61511-1:2016, clause 11.5. This evidence will usually cover the following main issues:

• Confirmation that the design procedure is robust, and contains proper management control of specifications, configurations, competency of personnel, and validation. This is similar to the management requirements in the Functional Safety Management Plan (FSMP), as we described in Chapter 7, and in fact they can be controlled by the equipment manufacturer's own FSMP. It also relates to quality management; the manufacturer can use evidence of compliance with a quality management program to support claims of SIL capability.
• Assessment of the element's performance in the field. This is discussed in the following section.
If an individual end user were to collect and assess all the data required to determine SIL capability for each element in the SIS, this would clearly be an arduous process, consuming a lot of time and expertise, and duplicating the efforts of other end users of the same equipment. Thus, in practice, the normal approach is for the equipment manufacturer to hire an external consultant, who collects and assesses the data and then issues a "SIL certificate". The standard has no requirement for SIL certification, and does not even define what such certification might entail; however, for convenience, this has become the normal way for end users to address SIL capability requirements. It does not mean, however, that a SIL certificate is compulsory. In principle, an end user can use non-certificated equipment if they are willing to carry out a SIL capability assessment themselves. Also, hardware that has been widely used in the field for many years, with a good track record of reliability, is generally accepted as "SIL capable".
9.5.2 Assessing the element's performance in the field

An important part of the SIL capability assessment is to show that the equipment performs well in practice. This is known as a "proven in use" or "prior use" justification. The following aspects are important:

• There must be a substantial amount of industry experience in using the specific item of equipment. The total accumulated number of hours of all such devices in service generally needs to be in the tens to hundreds of millions. For example, given that 1 year ≈ 10^4 h, 5 years of operating experience with 200 devices would give 10^7 h of experience. Given that a typical field component has λD ≈ 10^-7/h, very few failures may be expected during this period, which is not enough to draw statistically valid conclusions.
• The manufacturer needs to collect data from end users about the accumulated hours, operating conditions and failure events for the devices. This is a very challenging requirement.
• To count towards the operating experience total, operating conditions for the device usage in the field must be relevant to the conditions of use in the SIS, especially with regard to:
  • Environmental conditions (e.g. marine environment, corrosive, subsea, high vibration, extreme temperature or humidity, high EMI)
  • Operating mode (demand or continuous mode); this is especially important for mechanical devices that can be subject to stiction if left in the same position for long periods.
• The observed failure rates, based on field failure data, should be consistent with expectations. Expected values could be generated from Failure Modes and Effects Analysis (FMEA) (see Chapter 5 of [2]), or from experience with similar devices. If the observed failure rate is substantially higher than predicted, it indicates the possible existence of a design fault, which could result in a systematic failure. (IEC 61508-4:2010, clause 3.8.18, indicates that the requirement to use equipment with a good track record is specifically aimed at reducing systematic failures.)
• Data on failure events needs to be sufficiently detailed to allow the root cause of the failure to be assessed. For example, failures due to misoperation, incorrect installation, or unsuitable environmental conditions would not indicate a design fault, although they might indicate the need for better user education or changes to the Safety Manual.
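The operating-hours arithmetic above is worth making explicit; the fleet size and failure rate below are the text's own example figures.

```python
# Accumulated operating hours for a fleet of devices, and the number of
# failures expected at a typical dangerous failure rate.
HOURS_PER_YEAR = 8760                      # the text rounds this to ~1e4
devices, years = 200, 5
accumulated_hours = devices * years * HOURS_PER_YEAR    # ~1e7 h
lambda_d = 1e-7                            # typical dangerous failure rate, 1/h
expected_failures = lambda_d * accumulated_hours        # ~1 failure
```

Roughly one expected failure in 10^7 device-hours is far too few events to estimate a failure rate with statistical confidence, which is the point the text makes.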
The net effect of all these challenges is that failure rates estimated solely from field failure return data are likely to be optimistic. However, some large end user companies keep their own hardware failure records, and use these to estimate in-house failure rates. Over time, such rates can find their way into non-vendor-specific resources such as the OREDA database. They are also useful as initiating event data for LOPA and Fault Tree Analysis.
9.5.3 What is the difference between ‘proven in use’ and ‘prior use’? IEC 61511-1:2016, clause 3.2.51 explains the difference as follows. ‘Proven in use’ applies when a device is used in line with the manufacturer’s design basis (environmental conditions and maintenance procedure). However, in the real world, these conditions are
not always complied with; 'prior use' refers to real world usage conditions that may differ from the manufacturer's specification.
9.5.4 What is meant by a "SIL 2 shutdown valve"?

Sometimes you will hear items of hardware described as "SIL x", for example a "SIL 2 shutdown valve". This is a shorthand way of saying that the valve is SIL 2 certified. It does not mean that every SIF using the valve will automatically achieve SIL 2, because the SIL achieved by a SIF is affected by many other aspects of the SIF's implementation, such as redundancy and testing.
9.5.5 Software SIL capability

While the general principles described in this section mostly apply to both hardware and software, the specific requirements for demonstrating software SIL capability are more detailed. Refer to IEC 61508-3:2010 and IEC 61511-1:2016, clauses 11.5.4 and 11.5.5, for more information.
9.6 Calculating predicted spurious trip rate

The final task required during SIL verification is to confirm that the predicted Mean Time To Fail Spurious (MTTFS) meets the target for each SIF. MTTFS is the inverse of the SIF's overall safe failure rate. Therefore, the calculation (for all SIF operating modes) is similar to the PFH calculation for high demand mode SIFs, substituting λS + λDD for λD. The effective architecture for sensor subsystems is the opposite of that for dangerous failure; for example, 1oo2 sensor voting logic becomes 2oo2 when it comes to spurious trip. Therefore, we can use the following equations to calculate λS,tot, the total safe failure rate for the sensor subsystem:

• For 1oo1 subsystems: the spurious trip logic is still 1oo1, so λS,tot = λS + λDD.
• For 1ooN subsystems: a spurious trip occurs if any one device fails, so λS,tot = N(λS + λDD).
• For 2oo3 subsystems: the system fails if any 2 devices fail independently, or if all devices fail due to common cause, so λS,tot = 3C2·[(1 - β)(λS + λDD)]^2 + β(λS + λDD), where 3C2 = 3 is the number of device pairs. (This assumes the common cause factor β is the same for safe and dangerous failures; this is not necessarily true in general, but it is a necessary simplification, as separate data for βD and βS is not usually available.)
• For NooN subsystems: a spurious trip occurs only if all devices fail simultaneously, so λS,tot = β(λS + λDD) + [(1 - β)(λS + λDD)]^N.
If diagnostics are enabled, take into account any diagnostics that can prevent spurious trip, i.e. do not count λSD as safe failures.
9.7 What to do if SIS design targets are not met Table 9.9 suggests design changes and other actions that can be taken to achieve the SIF’s numerical targets if they are not met. For each target, the actions are listed in descending order of typical effectiveness. Table 9.9: Actions available if numerical targets are not met. Target PFDavg, RRF or PFH
Actions to consider to achieve target Investigate which subsystem of the SIF is contributing most to the failure probability, and focus attention on that subsystem. Typically, the final element subsystem is the main contributor. If the target is driven by protection against financial loss (rather than safety or environmental impact), remember that the functional safety standards do not require any particular measures for financial risk receptors. Consider waiving the SIL requirement if it’s based on the risk owner’s risk matrix, or perform cost/benefit analysis (see Chapter 5) to determine the optimal SIL target. However, if there is also a lesser safety or environmental impact from SIF failure, the target failure measure from this impact should still be met. Confirm that diagnostics are enabled. For elements with high diagnostic coverage, enabling diagnostics has a massive impact on the failure measure achieved. Provide additional redundancy for the element(s) contributing most to the failure measure. (However, this will decrease the MTTFS.) Check whether all the final elements in the SIF are necessary to achieve the process safe state defined in the SRS. Including unnecessary final elements in the SIF will make it harder to achieve the failure measure target. Decrease the proof test interval for some or all of the SIF elements. However, see the caveat earlier in this chapter. (This applies only to low demand mode SIFs.) Increase the proof test coverage, if an enhanced proof test procedure is available. (This applies only to low demand mode SIFs.)
Meeting SIL requirements: SIL verification 265

Table 9.9: Actions available if numerical targets are not met. (continued)

Target: PFDavg, RRF or PFH (continued)
Actions to consider to achieve target:
• Apply partial valve stroke testing. This can help to improve the failure measure if the proof test interval is long. (This applies only to low demand mode SIFs.)
• Substitute some devices with others having lower dangerous failure rates (λDU). However, this can easily become an exercise in "cherry-picking" the most favourable data. There are multiple ways of determining λ values, yielding substantially different values for very similar devices.
• Reduce the mission time. This can have an impact on the failure measure when devices have a relatively low proof test coverage (or will not be subjected to proof testing at all). (This applies only to low demand mode SIFs.)
Target: Architectural constraints
Actions to consider to achieve target:
• As the standards contain multiple "routes" (methods for assigning AC requirements), consider applying a different route to yield a lower requirement. Typically, the route in IEC 61511:2016 gives lower targets than IEC 61508 (Route 1H) and IEC 61511:2003.
• For continuous and high demand mode SIFs, confirm whether the operating mode has been correctly assigned. If the SIF can be reassigned to low demand mode, this may reduce the AC target.
• It may be possible to reduce the AC target by applying allowances provided in the standards. For example, IEC 61511:2003 allowed reduced AC targets if conditions are met such as "the dominant failure mode is to the safe state." IEC 61511:2016 has a different provision for reducing the target, although this is rarely applicable.
• The only remaining option is to provide additional redundancy by adding hardware components, or substituting with devices having intrinsic redundancy (such as redundant solenoids).
Target: MTTFS
Actions to consider to achieve target:
• Check that diagnostics are enabled, as some diagnostics are able to detect safe failures and raise an alarm instead of tripping.
• Use 2oo3 architectures instead of 1oo1 or 1oo2.
• Change from "safer" redundant architectures (with more hardware fault tolerance, e.g. 1oo2) to "higher process availability" architectures (with less hardware fault tolerance, e.g. 2oo2). However, this will affect the failure measure and may miss the AC target.
Exercises

Descriptive questions

1. The dangerous failure measure of a SIF in low demand mode is the average probability of failure on demand (PFDavg). What is the corresponding measure for high demand and continuous mode?
2. What is meant by SIL verification?
3. Why is it incorrect to say that the SIL of a low demand mode SIF decreases during its lifetime?
4. What is the relationship between PFDavg and proof test interval (PTI) for a typical SIF in low demand mode?
5. Suppose the project manager has asked you to optimize the proof test interval (PTI) for low demand mode SIFs that easily meet their failure measure target. What factors do you need to consider?
6. What factors determine the preferred proof test interval (PTI) for a continuous mode SIF?
7. When calculating safe failure fraction (SFF), under what condition should λDD failures be counted as safe failures?
8. Draw a Reliability Block Diagram (RBD) depicting a subsystem containing a double block valve arrangement comprising: a common wire from the logic solver and 2 block valves (each containing 1 solenoid, 1 actuator, and 1 ball valve), with the 2 block valves configured in series.
9. Describe the requirements in the standards relating to SIL certification.
10. The main purpose of SIL certification of a device is to: (select the most appropriate answer)
(a) Confirm that the device is sufficiently free of design errors that can lead to systematic failure.
(b) Confirm that the device's random hardware failure rate is sufficiently low to meet the SIL target.
(c) Confirm that the device is manufactured properly by a reputable company.
(d) Confirm that the device was designed and manufactured under an ISO 9000 quality management system.
11. Why must field failure data be collected and analysed as part of a SIL capability assessment?
12. Suggest 3 design changes that can be made if a low demand mode SIF does not meet its failure measure target (PFDavg).
Numerical questions

13. Consider a SIF subsystem containing a single component, a limit switch with a dangerous failure rate of λD = 5 × 10⁻⁷/h. There are no diagnostics or proof testing. The SIF operates in low demand mode. The mission time is 20 years. What is the PFDavg for this subsystem?
14. You are provided with a SIL certificate that quotes a component's failure measure in terms of PFDavg only. PFDavg is quoted as 6 × 10⁻⁴ for a proof test interval of 2 years and a mission time of 15 years. Back-calculate the component's dangerous failure rate, λD, in FIT. What assumption(s) do you need to make?
15. A device's Mean Time Between Failures (MTBF) is quoted as 450 years. What failure rates (λ values) can be calculated from this? What assumptions do you need to make?
16. A final element subsystem comprises the components shown in the following table. Calculate the PFDavg of the subsystem, assuming:
• The SIF operates in low demand mode
• The mission time is 20 years
• The proof test interval is 2 years
• The expected demand interval on the SIF is once in 10 years
• Partial valve stroke testing is performed automatically once per month.
What other assumptions are necessary?
Component        λDU (FIT)   λDD (FIT)   Proof test coverage (based on λDU + λDD)
Solenoid valve   200         400         80%
Actuator         300         500         90%
Ball valve       400         200         60%
17. Repeat the preceding question, assuming partial valve stroke testing is performed manually once per 6 months.
18. Suppose you have a hypothetical low demand mode SIF containing only one element, with the following parameters: λDU = 100 FIT, λDD = 900 FIT, diagnostics are enabled, proof test coverage = 75%, proof test interval = 2 years, proof test duration = 24 h, process online during testing, mean time to restore = 24 h, mission time = 25 years. What is the PFDavg for this SIF?
19. For the same SIF as in question 18, calculate PFH assuming the SIF is in continuous mode.
Answers

Question 1 - Answer
Probability of failure per hour (PFH).
Question 2 - Answer
SIL verification is the task of demonstrating that each SIF in the SIS meets its performance targets in terms of failure measure (PFDavg or PFH), architectural constraints, and spurious trip rate. Sometimes SIL capability is also assessed in SIL verification.
Question 3 - Answer
By definition, SIL for low demand mode SIFs is related to PFDavg, which is an average measure over the SIF's whole life. As long as there is no change in the design, the PFDavg should not change and therefore the SIL is constant.
Question 4 - Answer
If the PTI is decreased, PFDavg will also decrease. However, at a certain point, PFDavg passes through a minimum and starts to increase again. This is because the unavailability during proof testing starts to dominate.
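This trade-off is easy to see numerically. The Python sketch below is illustrative only: the failure rate and test duration are assumed values, not taken from the text, and it uses the simplified model PFDavg ≈ λDU × PTI/2 + Ttest/PTI for a single element tested while the process stays online.

```python
# Illustrative sketch of PFDavg vs proof test interval (PTI) for one element.
# Model: PFDavg ≈ lambda_DU * PTI / 2 + T_test / PTI, where the second term
# is the unprotected time while the (online) process runs during the test.
# LAMBDA_DU and T_TEST are assumed example values.

LAMBDA_DU = 1e-6   # dangerous undetected failure rate, per hour (assumed)
T_TEST = 24.0      # proof test duration, hours (assumed)

def pfd_avg(pti_hours: float) -> float:
    """Approximate PFDavg for a 1oo1 element including test unavailability."""
    return LAMBDA_DU * pti_hours / 2 + T_TEST / pti_hours

# Scan candidate PTIs from 1 month to 10 years and locate the minimum.
candidates = [months * 8760 / 12 for months in range(1, 121)]
best = min(candidates, key=pfd_avg)
for pti in (8760 / 4, 8760, best, 5 * 8760):
    print(f"PTI = {pti / 8760:5.2f} yr -> PFDavg = {pfd_avg(pti):.2e}")
```

With these assumed numbers, shortening the PTI below the optimum makes PFDavg worse, exactly as described above; where the minimum lands depends strongly on both the failure rate and the test duration.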
Question 5 - Answer
The PTI of each SIF can be extended, provided that both of the following conditions are still met at the extended PTI: (1) The SIF's PFDavg target is still met, and (2) the PTI is no more than half of the expected interval between demands on the SIF.
Question 6 - Answer
This is a bit of a trick question. In principle, PTI is irrelevant for a continuous mode SIF, because any dangerous failures are revealed by demands on the SIF. Proof testing may not even be necessary, although it could still be useful if there are any parts of the safety function that are not exercised during normal operation. However, inspection should still be carried out, to reveal any conditions such as damaged wiring and loose mountings that could lead to a dangerous failure later.
Question 7 - Answer
In low and high demand modes, λDD failures are counted as safe when automatic diagnostics are enabled, and the diagnostic test interval is no more than 1% of the expected demand interval on the SIF. In continuous mode, no credit is taken for diagnostics and λDD is therefore counted as part of the dangerous failure rate.
Question 8 - Answer
Refer to Fig. 9.6.
Figure 9.6 Reliability block diagram: answer to question 8.
Question 9 - Answer
The standards make no mention of SIL certification.
Question 10 - Answer
The correct answer is (a), 'Confirm that the device is sufficiently free of design errors that can lead to systematic failure.'
Question 11 - Answer
The failure rates and modes observed must be assessed to see if they suggest a design fault. Also, failure rates should be checked against predicted values (e.g. those obtained by FMEA). Finally, analysis of failures should be part of the manufacturer's quality management system.
Question 12 - Answer
Enable diagnostics; provide additional redundancy for elements having the highest contribution to PFDavg; decrease proof test interval; apply partial valve stroke testing; replace devices with others having lower λDU; reduce the mission time of elements of the SIF hardware having the highest contribution to PFDavg.
Question 13 - Answer
PFDavg = λD × mission time/2 = 5 × 10⁻⁷ × 8760 (hours per year) × 20/2 = 0.04. The answer is quoted to only one significant figure to match the precision of the input data.
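A quick numerical check of this calculation (a sketch, not code from the book; the variable names are mine):

```python
# Question 13 check: for an untested, undiagnosed element in low demand
# mode, PFDavg = lambda_D * mission time / 2.
HOURS_PER_YEAR = 8760
lambda_d = 5e-7                          # dangerous failure rate, per hour
mission_time_h = 20 * HOURS_PER_YEAR     # 20 years in hours
pfd_avg = lambda_d * mission_time_h / 2  # = 0.0438
print(f"PFDavg = {pfd_avg:.2f}")         # rounds to 0.04
```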
Question 14 - Answer
When PFDavg is quoted on a SIL certificate, the proof test coverage is typically (and optimistically) assumed to be 100% unless otherwise stated. For a system with 100% PTC, PFDavg = λD × PTI/2. Therefore, λD = 2 × PFDavg/PTI = 2 × 6 × 10⁻⁴/2 = 6 × 10⁻⁴/year = 6 × 10⁻⁴/(10⁻⁹ × 8760) FIT ≈ 68 FIT. Notice that, due to the 100% PTC, the mission time is not involved in the calculation.
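The back-calculation can be cross-checked with a few lines of Python (illustrative sketch; variable names are mine). Note that 2 × PFDavg/PTI with PTI = 2 years gives 6 × 10⁻⁴ per year, roughly 68 FIT:

```python
# Question 14 check, assuming (as stated) 100% proof test coverage, so
# PFDavg = lambda_D * PTI / 2 and the mission time drops out.
HOURS_PER_YEAR = 8760
FIT = 1e-9                               # 1 FIT = 1e-9 failures per hour
pfd_avg = 6e-4
pti_years = 2
lambda_d_per_year = 2 * pfd_avg / pti_years          # 6e-4 per year
lambda_d_fit = lambda_d_per_year / (FIT * HOURS_PER_YEAR)
print(f"lambda_D = {lambda_d_fit:.0f} FIT")
```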
Question 15 - Answer
λTOTAL is the reciprocal of MTTF. As MTBF = MTTF + MTTR, we can calculate λTOTAL from MTBF if we assume MTTR is negligibly small compared with MTTF. Unless we know MTBF refers only to dangerous failures, we need to make an assumption about the device's Safe Failure Fraction (SFF). For the sake of illustration, we will assume SFF = 80% and also assume that no-effect failures are not included in MTBF. This enables us to calculate λD and λS as follows:
λD + λS = 1/MTTF ≈ 1/MTBF = (1/450) year⁻¹ = (1/450)/(10⁻⁹ × 8760) FIT ≈ 250 FIT
As SFF = λS/(λD + λS) = 0.8, we can calculate λD = 50 FIT and λS = 200 FIT.
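The same estimate in Python (a sketch; the 80% SFF is the assumption made in the worked answer, and the variable names are mine):

```python
# Question 15 check: treat 1/MTBF ≈ 1/MTTF as the total failure rate,
# then split it using the assumed Safe Failure Fraction of 80%.
HOURS_PER_YEAR = 8760
FIT = 1e-9
mtbf_years = 450
lambda_total_fit = (1 / mtbf_years) / (FIT * HOURS_PER_YEAR)
sff = 0.8                                     # assumed
lambda_s_fit = sff * lambda_total_fit         # safe failure rate
lambda_d_fit = (1 - sff) * lambda_total_fit   # dangerous failure rate
print(f"total = {lambda_total_fit:.0f} FIT, "
      f"dangerous = {lambda_d_fit:.0f} FIT, safe = {lambda_s_fit:.0f} FIT")
```

The unrounded values (about 254, 51 and 203 FIT) match the answer's 250, 50 and 200 FIT at the quoted precision.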
Question 16 - Answer
The proof test coverage (PTC) is given for λDU + λDD combined. As we are performing partial valve stroke testing (PVST), we can assume all faults detectable by PVST will already be detected by the time proof testing comes around. Therefore, we consider proof testing to cover only λDU faults. We need to split λDU into two parts, for faults discoverable and undiscoverable by proof testing.
Considering the solenoid valve: 80% of all λDU + λDD faults (200 + 400 = 600 FIT) are discoverable. This amounts to 480 FIT. Of these, 400 FIT are λDD faults, leaving 80 FIT of discoverable λDU faults. The remaining 120 FIT are undiscoverable λDU faults.
Similar calculations for the actuator give 220 FIT discoverable and 80 FIT undiscoverable; and for the ball valve, 160 FIT discoverable and 240 FIT undiscoverable.
The PFDavg calculation includes 3 terms: faults detectable by PVST, faults discoverable by proof testing (but missed by PVST), and undiscoverable faults.
PFDavg = (λDD × PVST interval/2) + (λDU,discoverable × PTI/2) + (λDU,undiscoverable × MT/2)
Calculating for each component:
Solenoid valve: PFDavg = (400 × 10⁻⁹ × 8760 × (1/12)/2) + (80 × 10⁻⁹ × 8760 × 2/2) + (120 × 10⁻⁹ × 8760 × 20/2) = 1.5e-4 + 7.0e-4 + 1.1e-2 = 1.2e-2.
Actuator: PFDavg = (500 × 10⁻⁹ × 8760 × (1/12)/2) + (220 × 10⁻⁹ × 8760 × 2/2) + (80 × 10⁻⁹ × 8760 × 20/2) = 1.8e-4 + 1.9e-3 + 7.0e-3 = 9.1e-3.
Ball valve: PFDavg = (200 × 10⁻⁹ × 8760 × (1/12)/2) + (160 × 10⁻⁹ × 8760 × 2/2) + (240 × 10⁻⁹ × 8760 × 20/2) = 7.3e-5 + 1.4e-3 + 2.1e-2 = 2.2e-2.
Total for all components: PFDavg = 1.2e-2 + 9.1e-3 + 2.2e-2 = 4.3e-2.
Additional assumptions are: the SIF is de-energise to trip; no common cause failures between components (i.e. failures are independent); failure rates are appropriate for TSO/non-TSO requirements.
Question 17 - Answer
As PVST was explicitly included in the PFDavg calculation above, we can repeat the same method, changing only the PVST interval to 6 months.
Solenoid valve: PFDavg = (400 × 10⁻⁹ × 8760 × (6/12)/2) + (80 × 10⁻⁹ × 8760 × 2/2) + (120 × 10⁻⁹ × 8760 × 20/2) = 8.8e-4 + 7.0e-4 + 1.1e-2 = 1.2e-2.
Actuator: PFDavg = (500 × 10⁻⁹ × 8760 × (6/12)/2) + (220 × 10⁻⁹ × 8760 × 2/2) + (80 × 10⁻⁹ × 8760 × 20/2) = 1.1e-3 + 1.9e-3 + 7.0e-3 = 1.0e-2.
Ball valve: PFDavg = (200 × 10⁻⁹ × 8760 × (6/12)/2) + (160 × 10⁻⁹ × 8760 × 2/2) + (240 × 10⁻⁹ × 8760 × 20/2) = 4.3e-4 + 1.4e-3 + 2.1e-2 = 2.3e-2.
Total for all components: PFDavg = 1.2e-2 + 1.0e-2 + 2.3e-2 = 4.5e-2.
Compared with the previous result of PFDavg = 4.3e-2, this shows the PFDavg is not very sensitive to the PVST interval, as typically found in practice.
Question 18 - Answer
Applying Eq. (9.1) and converting λ into year⁻¹ where required:
• Undiscoverable faults: PFDavgU = 0.5 × 10⁻⁷ × 8760 × (1 − 0.75) × 25 = 0.0027
• Discoverable faults: PFDavgD = 0.5 × 10⁻⁷ × 8760 × 0.75 × 2 = 0.00066
• Testing: PFDavgT = 24/(8760 × 2) = 0.0014
• Repair: PFDavgR = (1 × 10⁻⁷ + 9 × 10⁻⁷) × 24 ≈ 2.4 × 10⁻⁵, negligible
• Total: PFDavg = 0.0048, meeting SIL 2 requirement
Question 19 - Answer
Proof testing is not applicable in continuous mode, and λDD failures are counted as dangerous failures. Therefore PFH = λDU + λDD = 10⁻⁶/h, meeting the SIL 1 requirement.
References

[1] W. Goble, H. Cheddie, Safety Instrumented Systems Verification: Practical Probabilistic Calculations, ISA, Research Triangle Park, 2012. A standard textbook on SIL verification calculations.
[2] B. Skelton, Process Safety Analysis: An Introduction, Institution of Chemical Engineers (IChemE), Rugby, 1997. An excellent and concise textbook and learning resource, with an especially helpful section on Fault Tree Analysis.
[3] S. Hauge, M.A. Lundteigen, P. Hokstad, S. Håbrekke, Reliability Prediction Method for Safety Instrumented Systems, SINTEF, Trondheim, 2013. An alternative approach to calculating failure measures using the "PDS method." PDS is the Norwegian acronym for 'reliability of computer-based safety systems.'
[4] T.A. Kletz, What Went Wrong? Case Histories of Process Plant Disasters and How They Could Have Been Avoided, fifth ed., Butterworth-Heinemann, Oxford, 2009. A must-read treasury of case histories with insightful analysis.
Further reading

[1] Health and Safety Executive (HSE), Failure Rate and Event Data for Use within Risk Assessments, HSE, Bootle, 2012. A source of failure rates for mechanical process equipment (e.g. tanks, flanges, piping, loading arms, hoses, drums, IBCs, cylinders), external events (e.g. aircraft strikes, floods, lightning strikes), and human errors.
[2] I. van Beurden, W. Goble, Safety Instrumented System Design: Techniques and Design Verification, ISA, Research Triangle Park, 2017.
[3] M. Rausand, Reliability of Safety-Critical Systems: Theory and Applications, Wiley, Hoboken, 2014.
CHAPTER 10

Assurance of functional safety

Abstract
The functional safety lifecycle provides four complementary tools to reduce the occurrence of errors that can lead to systematic failures: (1) Verification is the task of ensuring that each lifecycle phase has generated the required outputs. (2) SIS validation is the task of confirming that the as-built SIS, both hardware and software, complies with the Safety Requirements Specification. This is often achieved through a combination of Factory Acceptance Test (FAT) and integration testing and inspection. (3) Functional safety assessment (FSA) is a high-level review by an independent assessor, who will confirm that the functional safety lifecycle has achieved its objective, particularly in terms of risk reduction and control of systematic failures. FSA must be carried out prior to startup, and can usefully be executed at five different points in the lifecycle. (4) Functional safety audit inspects the deliverables generated by each functional safety-related procedure, to confirm that the procedure is being followed.
Keywords: Factory acceptance test; FAT; FSA; Functional safety assessment; Functional safety audit; Functional safety verification; Validation.
10.1 Introduction

A major objective of functional safety is to avoid human errors that can lead to systematic failures. One of the most important methods for achieving this is to check that functional safety tasks have produced the required outcomes. The standards define a set of four approaches, all of which are mandatory; these can be termed assurance activities, as a reflection of the concept of quality assurance. The four assurance activities are listed in Table 10.1 and described in detail in the following sections. They are separate and distinct activities, and should be carefully planned to maximize effectiveness and minimize overlap.
10.2 Verification

10.2.1 Introduction

In functional safety terms, verification is the task of ensuring that each lifecycle phase has generated the required outputs. This is covered in IEC 61511-1:2016, clause 7. The standard does not really explain what verification entails, so there is considerable room for interpretation. In principle, this task needs to be done after each phase. However, verification is not appropriate for certain phases. Table 10.2 outlines typical verification needs for each phase.

Table 10.1: Functional safety assurance activities.

Verification: Confirm that each lifecycle phase has generated the required outputs
Validation: Confirm that the commissioned SIS meets the Safety Requirements Specification (SRS) in every detail
Functional safety assessment: Confirm that the objectives of functional safety management have been achieved
Functional safety audit: Confirm that procedures relevant to functional safety have been followed

Table 10.2: Typical verification needs for each functional safety lifecycle phase (phase: output to be verified).
Hazard identification (HAZOP): HAZOP report; HAZOP worksheet; action items
Allocation of safety layers (SIL assessment): SIL assessment report; SIF list; action items
Development of preliminary SRS: Preliminary SRS
Basic SIS design: Detailed SRS; SIL verification
Detailed SIS design: Detailed design documents; application program SRS (functional analysis)
Procurement, construction and commissioning: Application program (although verification of this is normally integrated into the application program development process); hardware verification is typically achieved as part of validation (partly during FAT), and does not need to be carried out separately
Operation: SIS performance review report; Management of Change (MoC) records
Decommissioning: SIS modification reports
10.2.2 How verification works in practice

The first step is to prepare a verification plan. This could be included in the Functional Safety Management Plan (FSMP), or developed as a separate document and referenced in the FSMP. Verification planning requirements are detailed in IEC 61511:2016 part 1, clause 7.2.1. The important aspects to be included in the verification plan are as follows. Each of these is detailed in the following sections.
• A list of phases requiring verification
• Verification checklists for each phase in the list
• Details of when verification shall be executed for each phase in the list (see discussion below)
• Responsibility for verification activities
• How to handle discrepancies
Verification should be conducted as soon as possible on completion of the outputs from the phase. It may make sense to conduct verification on draft documents prior to review and approval, so that any changes arising from verification do not trigger a second round of approval. Since the whole purpose of verification is to ensure the outputs are complete and reasonably error-free, later lifecycle phases should not proceed until verification is completed.

Since most deliverables (e.g. HAZOP reports) need to be verified, it makes sense to structure the deliverables in a way that makes them easy to verify. For example, HAZOP reports could contain a brief 'Verification' section summarizing all the evidence that the verifier needs. Verification checklists (see below) can be used to guide the report's author in preparing this section. For this reason, the questions listed in the verification checklist should be made available to the person responsible for executing each functional safety task before the task is executed.

Verification is not difficult to achieve properly and efficiently, provided that it is well planned and resources are made available to execute it. However, this is not often the case during real-world projects, resulting in either (1) verification being ignored completely, or (2) a scramble to complete verification retrospectively, when its omission is flagged during Functional Safety Assessment.
10.2.3 Verification checklists

Verification of each phase needs to show that the inputs to the phase have been processed according to the procedure for that phase, resulting in outputs that (1) meet the requirements set out in the procedure, and (2) are complete and correct.
Since the task required in each phase should be clearly defined, it should be possible to develop a checklist of items to review in the outputs. This is not the same as a general document review, which is typically looking for issues such as consistency, conformity with project specifications, clarity and style. As well as the list of items to verify, the checklist should also include:
• Title, document number and revision number of the document(s) reviewed
• Name of the verifier
• Date on which verification was conducted
• Evidence examined for each item verified, and whether the evidence was sufficient
• List of discrepancies found
• Conclusion of the verification activity (pass, pass with follow-up actions, fail)
Verification is also a convenient point to confirm a number of other requirements. These belong more strictly within the domain of Functional Safety Assessment (FSA) rather than verification; however, they fit quite naturally within the verification workflow, and this helps to gather evidence that will facilitate the Functional Safety Assessment later. These are the requirements to consider:
• Confirmation that each item of data in the output is traceable to source
• Confirmation that the outputs are clear, readable, well organized and fit for purpose (see Chapter 7 for a more detailed discussion)
• Confirmation that the phase was executed by competent personnel, with supporting records (e.g. competency checklist) completed
An example of a verification checklist, for the SIL assessment phase, is provided in Appendix A. Sample verification checklists for other lifecycle phases are available from xSeriCon.
10.2.4 Discrepancy handling

The verification procedure needs to explain what happens if an issue is found during verification. A reasonable approach is to document the issue on the verification checklist and assign a failure category depending on the severity of the issue. Suggested failure categories are detailed in Table 10.3. All issues in categories B and C would be treated as action items and recorded on the project's action items register, to ensure closeout. Closeout of category B and C issues should be confirmed during Functional Safety Assessment.
Table 10.3: Suggested failure categories and actions for issues found during verification.

Category A (Minor)
Example (in HAZOP report): Supporting data missing, e.g. list of P&IDs considered
Action: Issue to be resolved within a certain period, e.g. 1 month

Category B (Significant)
Example (in HAZOP report): Missing competency records
Action: Issue to be resolved quickly, but work can proceed to the next lifecycle phase

Category C (Major)
Example (in HAZOP report): Failure to identify demand cases for some or all SIFs
Action: Safety lifecycle paused until the issue is resolved
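For teams tracking verification findings electronically, the category-to-action policy in Table 10.3 reduces to a small lookup. A hypothetical sketch (the function names are mine; the categories and actions follow the table):

```python
# Illustrative mapping of Table 10.3 failure categories to actions.
FAILURE_CATEGORIES = {
    "A": "Minor: resolve within a certain period, e.g. 1 month",
    "B": "Significant: resolve quickly; work can proceed to the next phase",
    "C": "Major: safety lifecycle paused until the issue is resolved",
}

def action_for(category: str) -> str:
    return FAILURE_CATEGORIES[category]

def requires_action_item(category: str) -> bool:
    # Per the text, category B and C issues go on the action items register.
    return category in ("B", "C")

print(action_for("C"))
```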
10.2.5 Competency and independence requirements

As with all functional safety tasks, the verifier needs to be competent for the task of verification. A possible list of suitable criteria to determine a verifier's competency is as follows:
• General knowledge and experience of functional safety management tasks
• Knowledge of the objective of verification in the FSMP
• Awareness of the verification procedure
• Knowledge of the engineering entailed in the task being verified
• Experience of the task being verified, or a closely related task
While verification has no explicit independence requirement in the standards, it makes sense for the verifier not to verify his/her own work (Fig. 10.1).
Figure 10.1 Why verification is important. Credit: Courtesy Mike Organisciak
10.3 Validation

10.3.1 Introduction

Validation is the task of confirming that the commissioned SIS matches the Safety Requirements Specification (SRS) in every detail. This task can be divided into the following sub-tasks:
• Factory Acceptance Test (FAT): this confirms that the SIS logic solver is working correctly, and includes a functional test of the application program.
• Hardware inspection: this confirms that the SIS hardware (the logic solver and the field equipment) has been installed correctly. This is usually done as part of installation and commissioning; if so, there is no need to re-inspect during validation.
• End-to-end test (Site Integration Test, SIT): this confirms that all the SIFs work correctly, including the field hardware and the SIS HMI.
• Document inspection: certain document requirements are specified in the standards; details are given later in this section.

The application program also needs to be validated. This involves inspection of the code's development records, and the code itself, to confirm the development procedure was followed properly. This validation is normally executed by the System Integrator as part of the software development process. Details are outside the scope of this book.

Validation is carried out towards the end of commissioning. It is generally required only once per SIS element; however, the overall validation exercise can be staggered to align with commissioning, especially for large projects. Revalidation may be required under some circumstances; see later in this section.

Because validation is an arduous task, involving hundreds of separate checks potentially spread over several weeks and more than one team, it is advisable to prepare a validation checklist, listing every single test and inspection item required. The test and inspection results can then be recorded in the checklist. A sample validation checklist is available from xSeriCon. A detailed example can be found in IEC 61511:2016 part 2, clause F.28.
10.3.2 Hardware inspection

It is important to confirm that the SIS hardware has been installed and set up correctly. A list of inspection points is given below. Most, if not all, of these points are typically covered as part of the normal commissioning process; there is no need to repeat the inspection for validation, as long as adequate inspection records are available.
Field equipment inspection

Each item of field equipment must be inspected to confirm that it is:
1. The correct item of equipment, including make and model
2. The correct tag number
3. The correct number of devices (e.g. 3 sensors present for a 2 out of 3 sensor group)
4. Located correctly with respect to other process equipment
5. In the correct orientation (e.g. upwards/downwards, flow in the correct direction)
6. Securely mounted, with all clamps and process connections tight, covers properly fitted, and cable glands properly assembled
7. Energised
8. Set up ready for use, calibrated and programmed (if applicable)
9. Clear of all packing materials, transit stops etc.
10. Password (or otherwise) protected to prevent unauthorised change of settings
11. Accessible for maintenance and testing
SIS logic solver inspection

The SIS logic solver must be inspected to confirm that it is:
1. The same equipment covered in the FAT
2. Installed in the correct location
3. Energised
4. Connected to all external devices such as HMI and DCS
5. Running the same version of the application program that was covered in the software validation. (If the software has changed, the change should be handled according to the configuration management procedure defined in the FSMP.)
6. Running at the expected operating temperature, and cabinet fans are set up correctly
7. Equipped with the required physical security, e.g. lockable cabinets
10.3.3 End-to-end test

Each SIF should preferably be tested end-to-end, i.e. all the way from sensors to final elements, as a whole. If this is not feasible, the test can be split up into segments, ensuring every element of the SIS is tested. Devices that are shared by more than one SIF do not need to be tested repeatedly. The following tests should be performed on the SIF as a whole (not necessarily in the sequence shown):
1. Confirm correct condition (all SIFs tripped or not tripped) at power on
2. Confirm trip response time meets target
280 Chapter 10 3. 4. 5. 6. 7. 8.
Confirm trip delay, if any, is correct Confirm trip indication and alarm, if any, is correctly shown in the HMI Confirm auto-reset, if any, works correctly, including any time delay Confirm manual reset works correctly, including any time delay Confirm manual trip works correctly, including any time delay Confirm overrides work, including password (or other) protection. If override timers are provided, confirm correct behaviour on time-out 9. For SIFs with different behaviour depending on the plant operating mode (e.g. startup, normal operation, turndown operation), confirm the operating mode is correctly detected and implemented in the SIF 10. Confirm measures required to minimise common cause failure have been implemented, e.g. do not install multiple sensors on the same mounting bracket. Refer to the list of identified common failure causes and preventative measures in the SRS. 11. Confirm measures needed to protect against electromagnetic interference (EMI) have been implemented
10.3.4 Specific tests for sensors

The following tests should be performed on SIF sensors (not necessarily in the sequence shown).
1. Confirm not tripped when the process variable (PV) is outside the trip range
2. Simulate or drive the PV to the trip range and confirm the SIF trips at the correct set point, subject to the accuracy defined in the SRS
3. If possible, drive the sensor out of range (over and under range) and confirm correct behaviour
4. Confirm alarm or trip (as required) under sensor fault conditions
5. Confirm indication on HMI is correct under tripped, non-tripped, fault and out of range conditions
6. Confirm correct behaviour on loss of energy source (power)
7. For discrete (on/off) sensors such as limit switches, simulate chattering/bouncing contacts if possible, and confirm correct SIS behaviour
8. For multiple sensors (M out of N groups where N > 1), confirm correct logic, i.e. trip occurs only when M sensors are inside the trip range. (The logic itself should have been tested during FAT, but it is advisable to confirm correct operation after integration of the sensors with the logic solver.)
9. For multiple sensors, simulate degraded condition with one or more sensors in fault condition, and confirm the voting reverts to the degraded algorithm specified in the SRS (e.g. 2oo3 degrades to 1oo2)
10.3.5 Specific tests for final elements

The following tests should be performed on SIF final elements (not necessarily in the sequence shown).
1. Confirm correct feedback to the logic solver in the non-tripped position (e.g. valve position feedback)
2. Confirm trip behaviour is correct
3. Check valve closing and re-opening time is acceptable, in terms of meeting the response time target and avoiding hydraulic issues such as hammering
4. Confirm correct behaviour on loss of energy source (power, instrument air, hydraulic pressure, etc.)
5. Confirm indication on HMI is correct under tripped, non-tripped and fault conditions
6. Confirm partial valve stroke test (PVST) local controls, if any, work correctly
7. Confirm local indicators show correct status (e.g. valve open/closed)
10.3.6 Test equipment

For compliance with IEC 61511:2016 part 1 clause 15.2.3, any test equipment used for measuring SIS performance needs to be calibrated to an external standard, where applicable. This calibration needs to be performed before testing starts, and repeated during testing if the testing period is longer than the calibration validity period. The calibration results should be recorded on the validation checklist.
10.3.7 Document inspection

IEC 61511:2016 part 1 specifically requires certain document checks during validation. These checks more naturally belong with other assurance tasks (verification or Functional Safety Assessment), and so the validation procedure can indicate that these inspections will be done during those assurance tasks. The required inspections are listed in Table 10.4. It is advisable to include at least a mention of these in the validation report, so that the Functional Safety Assessor can easily confirm the validation is fully standards-compliant.
10.3.8 Discrepancy handling

The validation procedure should specify the actions to be taken if any test or inspection fails. A reasonable approach is to define a set of failure categories, each of which determines the action to be taken. Suggested failure categories are shown in Table 10.5. Each failure should be assigned a failure category and recorded on the validation checklist (except for category A).
282 Chapter 10

Table 10.4: Document inspections required by IEC 61511:2016 during validation.

Document: Proof test procedures
Inspection required:
• Confirm these procedures exist for all SIS devices that are subject to proof testing
• Confirm a detailed proof test procedure is specified
• Confirm the proof test interval is specified (normally in the SRS)
• Include partial valve stroke testing, if required
• State pass/fail criteria for proof testing
Better covered during: Functional safety assessment

Document: All functional safety related documents
Inspection required: Inspect documents for accuracy, consistency and traceability
Better covered during: Verification of each lifecycle phase
It is important to keep track of discrepancies found, and provide clear evidence that they have been addressed. This evidence is necessary for Functional Safety Assessment. Where appropriate, the project’s action item register can be used to ensure closeout of non-trivial issues.
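To illustrate the bookkeeping (not any specific tool), the sketch below models a discrepancy register in Python. The field names are invented, and the assumption that categories D to G are escalated to the project's action item register is one possible policy, not a requirement of the standard.

```python
# Hypothetical discrepancy register kept during validation: every
# failure of category B or above gets an entry (category A is fixed
# on the spot and not documented).
discrepancies = [
    {"id": 1, "sif": "SIF-01", "category": "C", "closed": True},
    {"id": 2, "sif": "SIF-02", "category": "E", "closed": False},
    {"id": 3, "sif": "SIF-02", "category": "B", "closed": True},
]

# Assumed policy: non-trivial categories are tracked in the project's
# action item register for formal closeout.
NEEDS_ACTION_ITEM = set("DEFG")

open_items = [d for d in discrepancies if not d["closed"]]
action_items = [d for d in discrepancies if d["category"] in NEEDS_ACTION_ITEM]

print("Open at end of validation:", [d["id"] for d in open_items])    # [2]
print("Raised as action items:", [d["id"] for d in action_items])     # [2]
```

The list of open items feeds directly into the validation report, and the action items list provides the closeout evidence needed for Functional Safety Assessment.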
10.3.9 Restoring the SIS after validation

Because the SIS needs to be disturbed during validation (e.g. covers removed, forces and bypasses applied), it is critically important to ensure the SIS is restored to its normal state after validation. It is advisable to keep a register of forces and bypasses applied, so that they can be checked off one by one during restoration. The checked-off register can then be included in the validation report. Most SIS logic solvers can maintain a list of software forces automatically. Restoration checks should include the following:

1. All disabled alarms returned to their normal setting
2. All process isolation valves set according to the process start-up requirements
3. All test equipment and test fluids removed
4. All overrides and forces, both hardware and software, removed
5. All equipment covers reinstated correctly
6. SIS cabinets locked, and HMIs logged out of supervisor mode if used for testing
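As an illustration of the register idea, the sketch below models a force/bypass register as a simple list of entries with a 'removed' flag. The tag names and fields are hypothetical, and a real register would normally be a paper or electronic checklist (or the logic solver's own force list) rather than code.

```python
# Hypothetical force/bypass register: each entry records an item
# applied during validation and whether it has been checked off
# as removed during restoration.
register = [
    {"tag": "PT-101", "item": "software force", "removed": True},
    {"tag": "XV-201", "item": "bypass",         "removed": True},
    {"tag": "LT-301", "item": "alarm disabled", "removed": False},
]

# Restoration check: list anything still outstanding.
outstanding = [e for e in register if not e["removed"]]
for entry in outstanding:
    print(f"NOT RESTORED: {entry['tag']} ({entry['item']})")

# The SIS is only ready for service when nothing is outstanding.
ready_for_service = not outstanding
print("Ready for service:", ready_for_service)  # prints: Ready for service: False
```

The completed (fully checked-off) register is what gets attached to the validation report as restoration evidence.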
10.3.10 Validation report

On completion of hardware inspection, end-to-end testing, and document inspection, a validation report should be prepared. This should include:

• Names of the persons carrying out the validation, and possibly their competency records (unless documented separately)
• A reference to all input documents used, with version numbers, especially the SRS and validation plan
• Details of the scope of validation (i.e. which SIS and SIFs were included)
• Description of any tests waived because they were already executed during FAT, and reference to the FAT test procedure and report
• Summary of validation results
• A list of any open items arising from test failures (e.g. the discrepancy list)
• Conclusion stating whether the lifecycle can proceed to the next phase
• The complete validation checklists
• Calibration records (or references to such records) for test equipment used
• Restoration records
• Evidence that the required document inspections were completed, or a note stating that they are covered in other assurance activities (see earlier in this section)

Table 10.5: Suggested failure categories for validation.

Category A (Trivial)
Examples: Mislabelled wires; loose covers.
Action: Fix immediately (or file a work requisition) and proceed with validation. No documentation necessary.

Category B (Minor, hardware)
Examples: Handswitch in wrong location (if it does not affect safety or operability).
Action: Note in validation report; update as-built specs.

Category C (Minor, software)
Examples: Wrong setpoint; problem with reset or override; HMI display incorrect.
Action: Note in validation report; report to relevant engineer for immediate fix; proceed with validation.

Category D (Documentation)
Examples: Missing or out-of-date input documentation.
Action: Stop validation and notify management to resolve documentation issues. Validation cannot proceed until it is clear what specification you are validating against.

Category E (Significant, hardware)
Examples: Transmitter in wrong location or does not meet an important element of the specification; missing hardware; significant risk of common cause failure; uncalibrated test equipment.
Action: Note as 'fail' in validation report. Launch engineering review procedure, which will determine whether re-engineering is required to achieve functional safety. After any re-engineering is completed, update documentation and repeat relevant testing and validation. Meanwhile, other aspects of validation (not affected by the problem) can continue.

Category F (Significant, software)
Examples: Wrong MooN logic; wrong version of software; unexpected behaviour such as crash, freeze.
Action: Note as 'fail' in validation report. Launch software amendment procedure, which will include re-testing and increment of the application program version number. Repeat all validation steps that could be affected by the change. Meanwhile, other aspects of hardware validation (not affected by the problem) can continue.

Category G (Drastic)
Examples: Wrong model of safety PLC, sensor or final element; multiple software problems found.
Action: Note as 'fail' in validation report and discontinue validation of the affected SIF. Validation of unrelated SIFs can continue. Report to management to invoke necessary re-engineering or re-purchasing procedures.
The approved validation report is an important input to Functional Safety Assessment.
10.3.11 Revalidation

After first startup, any significant changes to the SIS may require some revalidation. The criteria used to decide whether to revalidate should be included in the Management of Change (MoC) procedure. In general, revalidation would be required whenever:

• an item of SIS hardware is altered, repaired or replaced, even like-for-like replacement
• the SIS application program is updated (although this would normally be revalidated under a FAT-style procedure)
• other SIS operating parameters, such as delay times, are changed.
The following activities would not generally trigger revalidation:

• Normal proof testing, maintenance, and calibration without repair
• Embedded software replacement (e.g. firmware updates); however, the provider of the software should provide evidence that the update has been validated
The scope of revalidation can be restricted to the SIFs, and specific tests and inspections, that are related to the changes made. However, stay alert to the possibility that a change of one aspect of the SIS could inadvertently affect others, for instance if it results in an increase of the load on the logic solver, communications network or power supply.
10.4 Functional safety assessment

10.4.1 Introduction

Functional safety assessment (FSA) is an assessment of whether the functional safety lifecycle, determined by the Functional Safety Management Plan (FSMP), has met its objectives. It is a relatively high-level assessment that takes a broad overview of the FSMP and looks at whether the plan as a whole is working properly. The tasks required for FSA are not clearly defined in the standards, and opinions vary in the industry on how to execute it. Some practitioners treat it as a re-verification of every detail in the entire lifecycle. Others regard it as an audit of compliance with the standards, and seek evidence clause by clause to confirm compliance. Still others see this approach as too legalistic, and prefer a more system-oriented approach that focuses on the FSMP, aiming to answer a set of questions such as:

• Is the FSMP sufficiently complete?
• Is the FSMP being implemented in practice?
• Has all the required documentation been generated?
• Have all the action items in the actions register been addressed?
• Is the lifecycle ready to proceed to the next step? (For example, are the necessary procedures in place, and training completed?)
Normal practice is to carry out FSA in stages, at several points in the safety lifecycle. The recommended stages are shown in Table 10.6. Stages 1 and 2 are optional (but recommended), while stages 3 and 4 FSA are compulsory. Stage 3 FSA is essentially the same as pre-startup safety review (PSSR) for the SIS. FSA at each stage starts from the preceding stage, if any; it is not necessary to review back to the start of the lifecycle each time. If FSA at preceding stages is omitted, each FSA will need to perform a ‘catch-up’ review of the stages missed since the preceding FSA.
Table 10.6: Lifecycle stages at which Functional Safety Assessment is recommended.

FSA stage 1 (on completion of the risk analysis and preliminary SRS). Typical specific activities to be assessed:
• Initial version of FSMP
• Verification plan
• HAZOP report
• SIL assessment report
• Preliminary SRS

FSA stage 2 (on completion of the SIS design). Typical specific activities to be assessed:
• Complete version of FSMP
• Detailed hardware SRS
• Software SRS (Functional Analysis)
• Complete SIS design
• Configuration management procedure
• Validation plan
• Management of Change procedure

FSA stage 3 (on completion of commissioning and validation). Typical specific activities to be assessed:
• FAT report
• Validation report
• Closeout records for all action items
• Operations and maintenance procedure
• Proof test procedures
• Training records for operational staff

FSA stage 4 (after a period of operation). Typical specific activities to be assessed:
• Testing and maintenance records
• Training records for new operational staff
• Review of SIS performance

FSA stage 5 (after SIS modification). Typical specific activities to be assessed:
• Management of Change records
• Training records relating to the modification
• Updated SRS
• Updates to any other safety lifecycle documents, e.g. HAZOP report, SIL assessment report

Note: In addition to the specific activities listed for each stage, other activities need to be assessed at every stage, such as:
• Competency assessment is carried out
• Document management procedures are complied with
• Verification of each lifecycle phase is completed
• Procedures are ready to proceed to the next lifecycle phase
• The action items register is maintained, and action items are closed out
10.4.2 Which stakeholders need to perform FSA?

Every activity in the functional safety lifecycle needs to be subject to FSA, no matter who performs the activity. Thus, in principle, every stakeholder needs to perform FSA. The scope of the FSA for each stakeholder should match their scope within the safety lifecycle, so stakeholders with a limited scope of functional safety responsibility need only perform FSA on their areas of responsibility. For example, a System Integrator with responsibility to supply, install and commission the SIS logic solver, and develop the SIS application program, need only perform FSA on those activities (see Chapter 7 for other examples of stakeholders with limited scope). However, in practice, responsibility for FSA for almost the entire project, up to stage 3, is often delegated to the EPCIC contractor, and performed by a single assessment team. The exception is that embedded software and tools (e.g. SIL verification software, the application program development environment) that are not specifically designed for the project are usually subjected to FSA by the vendor of the software or tool.
10.4.3 What sample size needs to be considered in FSA?

In a typical audit, a certain fraction of documents, transactions, etc. are reviewed in detail, to get an overall picture of the level of compliance. However, FSA is not like this: it is necessary to demonstrate that every SIF has been properly designed and implemented, so the assessor must consider evidence relating to all SIFs and SIS hardware. However, bear in mind that FSA follows on naturally from verification (see earlier in this chapter). Verification should have already considered every SIF in detail. If adequate verification records are not available for all lifecycle phases completed since the last FSA, the assessor may need to halt the FSA and request proper verification before proceeding. Besides the verification records, the task of FSA is greatly assisted by having complete records of the following tasks available:

• Competency checks
• Information from the document management system, such as lists of approved documents
• SIS validation report (for FSA stage 3)
• Management of Change assessments
If proper records of each functional safety task have been kept, full review of the whole scope of functional safety should not be too onerous.
10.4.4 Independence requirements for FSA

In contrast to verification, the standards specify a clear requirement for the FS assessor to be independent of the work being assessed. IEC 61511:2016 requires at least a lead assessor who is independent of the project design team or operations team, but does not preclude the assessor being from the same company, or even the same office. IEC 61508:2010 provides more detailed requirements, where the level of independence depends on the maximum SIL target. In practice, given that FSA is best done by an experienced practitioner, it is often performed by the HSE department or an external consultant.
10.4.5 How FSA is conducted in practice

The first step is to develop an FSA procedure. This is usually provided by the assessor, and should be provided at an early stage so that it can be referenced in the Functional Safety Management Plan. The FSA procedure should also help to guide the work of the functional safety engineering team, as they will know what the assessor is expecting. This is especially true at FSA stage 3, where the assessment, working hand in hand with commissioning and validation records, can proceed very smoothly if well coordinated. The FSA procedure should include the following:

• The project scope and stage(s) to be assessed
• Requirements for selecting the assessor, including independence requirements
• The responsibilities of all relevant stakeholders for providing data to the assessor
• A list of input information expected for each FSA stage
• A list of specific points to be assessed at each FSA stage
• Details of what will be included in the FSA report
Suitable lists of typical input information and points to be assessed can be obtained from xSeriCon. Issues found during FSA should be assigned a criticality category, which determines the action to be taken. Suggested criticality categories are shown in Table 10.7.
10.4.6 Assessment tasks

The main tasks involved in assessment are:

• Check the documentary evidence provided to support the points to be assessed. The assessor should not need to check every detail of the outputs from each phase, as this should have already been done during verification.
• Prepare a punchlist of issues to be discussed with the stakeholders, for any issues where the evidence appears insufficient
• Hold a workshop with stakeholders to discuss and close out as many issues as possible, and present preliminary findings
• Prepare an FSA report, including a list of any open action items generated during the FSA

Table 10.7: Suggested criticality categories for issues found during FSA.

Class A: Recommendation for improvement. An issue that does not directly breach a procedure or requirement of the standard, but points to a weakness that could compromise functional safety.
Examples: Minor issues in competency management. Missing data on a minor design aspect such as testing of diagnostic alarms.

Class B: Minor documentation issue. Needs addressing to ensure documentation is in order, but need not delay project progress.
Examples: Document reviewed but not approved. Minor inconsistency between documents. Problems with document naming or numbering. Problems with document clarity.

Class C: Missing evidence on a relatively minor issue. Likely to require a documentation update. Need not delay project progress, but should be addressed by FSA stage 3.
Examples: Some non-critical fields in project documents such as the SRS are empty. Missing data in competency records. Missing evidence from suppliers, such as safety manuals and quality management systems. Missing traceability.

Class D: Significant item of evidence is missing. Should be addressed within 30 days.
Examples: Missing procedure needed for the next lifecycle phase. Missing evidence in SIL verification. Relevant HAZOP action items not addressed. Missing evidence from application program development records. Significant discrepancies between project documents, e.g. SRS and SIL verification report.

Class E: Major item of evidence is missing. Project should not proceed to the next lifecycle phase until the issue is addressed.
Examples: Missing or incomplete Functional Safety Management Plan. Verification not executed. Significant amounts of key data missing from SRS. Significant issues with SIS validation.
Some assessors also conduct personnel interviews to confirm the level of understanding and awareness of the functional safety lifecycle in general, and each individual’s tasks and responsibilities relating to the lifecycle. However, interviews are not a mandatory part of FSA, and assessors may regard competency assessment records as a sufficient alternative.
10.4.7 Common pitfalls to avoid

FSA is not a general review of all functional safety documentation. A common misunderstanding on the part of stakeholders such as EPCIC contractors is to dump a massive stack of documents on the assessor's desk and walk away, expecting an FSA report to magically appear at the end of a black-box assessment process. FSA is a focused study looking for specific evidence on clearly defined issues, and it can be both efficient and valuable if the assessor and stakeholders work closely together to compile the evidence. In fact, if the assessor indicates his/her expectations at an early stage, the lifecycle workflow can be structured so as to meet those expectations clearly, leading to a much easier assessment process.

One potential problem with FSA is that the assessor may overreach the required scope at each stage. This is perhaps more likely to happen if the assessment follows the 'clause by clause' approach rather than the system-led approach. In one example experienced by the author, the stage 1 FSA assessor raised an action item because a cyber security assessment had not been performed: this is not relevant at stage 1, and should have been held back until stage 2 or 3. The resulting action item had to remain open for several months until the project reached the appropriate level of maturity, resulting in a long delay in achieving closure of the FSA.
10.4.8 Example: assessment of SIL verification

As an illustration, here are some sample questions the assessor may ask when reviewing the SIL verification at FSA stage 2. This is not intended as a complete list.

• Is the SIL verification software SIL-capable?
• Is there evidence of competency for the lead SIL verification engineer?
• Has a verification been carried out to confirm that the SIL verification input data matches the data in the SRS?
• Did all SIFs meet their SIL targets? If not, how were failures resolved, and have project documents such as the SRS been updated?
• Was the SIL capability of all SIS hardware assessed?
• Is the data used in the SIL verification reasonable? For example, are the component failure rates, beta values, proof test coverages, and mission times plausible?
• Did the SIL verification include all necessary hardware in each SIF?
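On the plausibility point, a rough sanity check the assessor can make uses the well-known simplified approximation for a single (1oo1) channel in low-demand mode, PFDavg ≈ λDU × TI / 2, together with the IEC 61508/61511 low-demand SIL bands. The Python sketch below uses illustrative numbers only; real SIL verification uses fuller equations (common cause, proof test coverage, mission time) and certified tools.

```python
def achieved_sil(pfd_avg):
    """Map average probability of failure on demand to a SIL band
    (low-demand mode, per IEC 61508/61511)."""
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    return 0  # outside the SIL 1-4 bands

# Simplified 1oo1 approximation: PFDavg ~ lambda_DU * TI / 2
lambda_du = 2.0e-7   # dangerous undetected failure rate, per hour (illustrative)
ti = 8760.0          # proof test interval: 1 year, in hours

pfd = lambda_du * ti / 2
print(f"PFDavg = {pfd:.2e}, achieved SIL {achieved_sil(pfd)}")
# prints: PFDavg = 8.76e-04, achieved SIL 3
```

If a SIL verification report claims a result wildly different from this kind of order-of-magnitude estimate, the assessor has grounds to question the input data.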
10.5 Functional safety audit

10.5.1 Introduction

While Functional Safety Assessment confirms that adequate procedures are in place, it is the task of Functional Safety Audit (FS Audit) to confirm those procedures are working in practice. FS Audit is a similar process to any other quality system audit, and in practice it is often executed by ISO 9000 quality management system (QMS) auditors or, at least, under the control of a QMS audit procedure. The standards specify that all procedures related to functional safety shall be audited, but impose no requirements on the timing or frequency of audit. In practice, it would probably make sense to audit once during the implementation phase of the lifecycle, i.e. during FSA stage 2 or 3 or between these stages. Auditing would also be beneficial during the operational phase, either periodically (every few years) or in conjunction with FSA stages 4 and/or 5. See IEC 61511-1:2016, clause 5.2.6.2 for details of the requirements. According to the standards, the auditor needs to be independent of the design and implementation of the SIS. This is a more stringent independence requirement than for other assurance tasks. An audit procedure needs to be developed, which should be referenced in the Functional Safety Management Plan. This could be the same as the audit procedure used for the QMS; there is no requirement for a separate procedure for FS Audit. Auditing will likely result in a set of discrepancies and issues for improvement, which need to be managed to completion. These can be handled using the project's action item register. Closeout of audit action items should be confirmed in the following FSA. Table 10.8 lists a possible set of functional safety-related procedures that may be subject to audit. Not all of these procedures will necessarily exist as separate documents.
Table 10.8: Functional safety-related procedures typically subject to audit.

• Functional safety management plan
• SIS training procedure
• Validation procedure
• SIL assessment procedure
• Configuration management procedure
• Commissioning procedure
• Maintenance procedures for individual SIS elements
• Functional Safety Assessment procedure
• SIS equipment supplier management procedure
• Tools and equipment calibration and assessment procedure
• Cyber security management procedure
• Competency management procedure
• Verification procedure
• HAZOP procedure
• SIL verification procedure
• Management of Change procedure
• FAT procedure
• Operations and maintenance strategy
• SIS operation procedure
• Action items register
• Contractor management procedure
• SIS-related emergency procedure

10.5.2 Typical audit procedure

The auditor works through the procedure under audit methodically, clause by clause. He/she will ask for information about the occasions when the clause is invoked. Then, he/she will select a sample of those occasions at random and ask to see supporting evidence that the requirement was followed.

For example, suppose the audit is covering the Functional Safety Management Plan, which contains a clause as follows: "Every person with responsibility for executing a safety lifecycle task shall be competent for the task." The auditor will first ask what safety lifecycle tasks have been executed since the last audit, and will select a sample of tasks from the list (e.g. preliminary SIS design). The auditor will then ask how competency is determined, and will be shown the Competency Management Procedure. This procedure will list a set of criteria for determining competency, and will specify how competency is documented. The auditor will ask to see the evidence relating to the personnel completing the task, and will confirm:

• The persons executing the task are the same as the persons whose competency records are provided
• The date on the competency assessment record makes sense compared with the date when the task was executed
• The persons' competency assessment was fully completed, and the result was a pass
• If there were any conditions, such as a requirement to be supervised by a more experienced engineer, that the supervision actually took place
The same process of asking for instances when a clause was applied, selecting some instances, and reviewing supporting evidence continues clause by clause until the entire procedure has been audited. Discrepancies and points for improvement are identified, recorded and discussed with the team; some issues may get closed out during the discussion. Finally, an audit report will be produced, detailing the name of the auditor, scope of the audit, audit procedure, findings, open issues, and recommendations. Open issues will be added to the project’s action item register.
Exercises

Exam tip: The following exercises are typical of the style of question you can expect in the CFSP/CFSE examination.
1. What is the overall objective of assurance activities including verification, validation, Functional Safety Assessment, and auditing?
2. Name 3 lifecycle phases that should be subject to verification.
3. Which of the following statements is the best description of the objective of verification?
(a) To confirm that each phase has been conducted by competent personnel.
(b) To confirm the outputs from each phase are fit for purpose.
(c) To confirm that each phase has been executed according to procedure.
(d) To confirm the scope of work in each phase is correct.
4. When is the appropriate time to carry out verification?
5. List 3 problems that could arise if verification is not properly conducted after each phase.
6. Which of the following statements is the best description of SIS validation?
(a) To confirm the SIS meets its reliability target.
(b) To confirm the SIFs are correctly installed and commissioned.
(c) To confirm the SIS, as commissioned, fully conforms with the Safety Requirements Specification.
(d) To confirm the plant can operate safely, under the protection of the SIS.
7. If test equipment is used during validation, what are the requirements relating to test equipment?
8. Give 3 examples of checks that should be performed after validation to confirm the SIS has been restored to normal operating condition.
9. Under what conditions might the SIS need to be revalidated?
10. Functional Safety Assessment can be carried out at several stages, but only two stages are compulsory; which stages?
11. Give 3 examples of checks that need to be made during Functional Safety Assessment at all stages.
12. Suppose part of the safety lifecycle is executed by a third party (e.g. a SIS equipment supplier). Regarding Functional Safety Assessment (FSA) of the third party work, which of the following statements is most correct?
(a) FSA is not required. Only the EPCIC's responsibilities are subject to FSA.
(b) The EPCIC is responsible for executing FSA for all parties.
(c) FSA of third party work is not required if a SIL certificate is provided.
(d) FSA is required; the Functional Safety Management Plan should define who is responsible for FSA of third party work.
13. Is it permissible for verification and Functional Safety Assessment to be conducted by the same person?
14. Give 3 examples of project work scope that may be subject to Functional Safety Audit.
15. Consider the following statements regarding Functional Safety Audit (FS Audit). Based on functional safety standards, is each statement true or false?
(a) FS Audit is required each time Functional Safety Assessment is carried out.
(b) FS Audit is optional.
(c) FS Audit may be carried out as part of Quality Management System audit.
(d) FS Audit aims to confirm compliance with functional safety procedures.
(e) FS Audit is the same as Functional Safety Assessment.
(f) The FS Auditor must be independent of the SIS design or operations team (depending on the lifecycle phases included in the audit scope).
(g) FS Audit must be conducted at least once every 2 years.
16. Refer to Fig. 1.3, which shows a decision tree suggesting a way of selecting whether an observed SIS failure is random or systematic. The figure identifies various types of systematic failure (grey hexagons). For each type of systematic failure, which of the assurance strategies described in this chapter is best able to identify the underlying cause before it leads to an accident?
Answers

Question 1 – answer
To identify human errors that could lead to systematic failure of the SIS.
Question 2 – answer
While this varies depending on the structure of the lifecycle, the following list is typical: hazard identification, allocation of safety layers, basic SIS design, detailed SIS design, construction and commissioning, operation, decommissioning.
Question 3 – answer
The correct answer is (b), 'To confirm the outputs from each phase are fit for purpose.' Answer (c) is more related to Functional Safety Audit.
Question 4 – answer
As soon as possible after completion of each lifecycle phase.
Question 5 – answer
• Errors made during the phase may not be discovered until much later, or even when an accident occurs.
• Part of the scope may be missing, leading to delays and increased costs at a later stage.
• Functional Safety Assessment will be more difficult to complete, because it relies on evidence from verification checks.
• There may not be sufficient evidence of personnel competency.
• Lack of verification should lead to a discrepancy being raised during functional safety audit. This, in turn, may mean that the Quality Management System is not complied with.
Question 6 – answer
The correct answer is (c), 'To confirm the SIS, as commissioned, fully conforms with the Safety Requirements Specification.' The author has seen all of (a), (b) and (d) stated as the objective of SIS validation in project documents, but this is not the intent of SIS validation as stated in the standards.
Question 7 – answer
The equipment must be calibrated to an external standard, and calibration results should be documented. If the calibration expires during validation, the equipment must be recalibrated. The equipment should also be assessed as suitable for use with a SIS.
Question 8 – answer
• Reactivate any disabled alarms
• Remove all bypasses and forces
• Restore correct lineup of manual valves
• Remove test equipment and test fluids
• Replace equipment covers
• Lock SIS cabinets and log out of supervisor mode on HMIs
Question 9 – answer
When the SIS is changed in a way that could affect the original validation results; for example:
• An item of SIS hardware is altered, repaired or replaced
• SIS application software is updated
• SIS operating parameters, such as delay times, are changed
Question 10 – answer
Stage 3, after commissioning but before the process hazards are present; and Stage 4, after the plant has been in operation for some time.
Question 11 – answer
• Competency assessment is carried out
• Document management procedures are complied with
• Verification of each lifecycle phase is completed
• Procedures are ready to proceed to the next lifecycle phase
• Action items register is maintained, and action items are closed out
Question 12 – answer
The correct answer is (d), 'FSA is required.'
Question 13 – answer
Yes, provided the person meets the independence requirements in the standard for FSA.
Question 14 – answer
Any functional safety-related procedure may be subject to audit, such as the Functional Safety Management Plan, verification procedure, validation procedure, HAZOP procedure and Management of Change procedure. Further examples are given in the main text.
Question 15 – Answer
(a) False. There is no requirement to synchronise FS Audit and FSA.
(b) False
(c) True
(d) True
(e) False
(f) True
(g) False. The standard makes no requirement on the frequency of FS Audit.
Question 16 – Answer
(a) Software issue: Safety software verification, FAT, SIS validation
(b) Incorrect device specification or selection: Verification of the preliminary SIS design
(c) Failure to replace the device after its useable life: Functional Safety Assessment, Stage 3 (to confirm that a replacement procedure is in place) and Functional Safety Audit (to confirm that the replacement procedure is being followed)
(d) Maintenance error: Functional Safety Assessment, Stage 3 (to confirm the maintenance procedure exists, and there is a competency requirement for maintenance personnel) and Functional Safety Audit (to confirm that the procedure and competency requirements are being observed)
(e) Operator error: Similar answer to (d)
(f) Error in Process Hazards Analysis, specification, or design: Verification of the respective lifecycle phases
(g) Error in procurement, construction or commissioning: SIS validation
CHAPTER 11
The SIS operational phase

Abstract
During the plant’s operational lifetime, certain tasks relating to the Safety Instrumented System (SIS) must be performed: (1) Operators must be trained to recognise and respond to SIS alarms, to use overrides and resets correctly, and to know when to trip Safety Instrumented Functions manually. (2) Maintenance personnel must be trained in proof testing, maintenance and refurbishment procedures. (3) Proof testing, maintenance and refurbishment must be carried out when required. (4) The actual performance of the SIS must be measured, in terms of observed failure rates, spurious trips, Mean Time To Restore and other parameters. The performance parameters must be compared with values assumed during SIS design. Resolution of discrepancies between assumed and observed values may require reversion to earlier phases of the functional safety lifecycle. (5) SIS modifications and partial SIS decommissioning must be subject to a Management of Change process, to ensure the process risk continues to be managed at a tolerable level.
Keywords: Alarm; Decommissioning; Management of change; Operational phase; Operator; Performance measurement; Proof testing; Training.
11.1 Introduction
After the process goes live, with the SIS in place, tasks still remain for the operations and maintenance teams [1]. The objective is to ensure that the SIS continues to provide sufficient risk reduction, even if the risk profile of the process changes. The specific tasks, most of which will be covered in detail in this chapter, are:
• Continued training of operators and maintenance personnel
• Proof testing
• Maintenance and repair
• Assessment of SIS performance
• Functional Safety Assessment and audit (see Chapter 10)
• Management of Change for any modifications
Normally, all these aspects are summarised in an ‘Operations and maintenance strategy,’ which could be part of the Functional Safety Management Plan or a standalone document.
Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00007-0 Copyright © 2023 Elsevier Inc. All rights reserved.
11.2 Training requirements

11.2.1 Operator training
In a typical process, demands on the SIS should be rare. Over time, operators may forget details about their SIS training. Also, new operators may be assigned to the process. Therefore, it is important to maintain a regular schedule of SIS-related training. Evidence of training completion should be kept, as it will be checked during subsequent Functional Safety Assessments and Audits. The following topics should be included in the training program:
• Normal operation of the SIS, including the sources of information about SIFs (Cause & Effect Diagrams, Interlock Logic Diagrams), and the information presented on the SIS HMI. Operators should be familiar with relevant parts of the SIS operating procedure.
• Instructions not to alter or override any part of the SIS, including sensors, logic solvers and final elements, without a Permit to Work, except for special cases such as start-up overrides
• Cyber security issues relating to the SIS
• What to do when a SIF trips
• What to do when a SIS fault is discovered, whether by diagnostics or proof testing
• How and when to trip SIFs manually. (The ‘when’ is a particularly important point. Operators may be understandably reluctant to shut down a process on their own initiative, and they should be given clear information on when it is acceptable for them to do so.)
• How to interpret and respond to alarms originating in the SIS, including diagnostic alarms and system-related alarms such as fault alarms and communication failure alarms.
All of this information should also be readily and clearly available in operating manuals, which the operators can quickly access in an emergency. The operators should be shown how to access the information when needed.
11.2.2 Training for maintenance personnel
Electrical, control and instrument personnel are likely to be familiar with the general types of equipment used in the SIS. However, the SIS may differ from other instrumentation in terms of the high level of documentation and attention to system integrity required. Thus, specific training may be appropriate. Specialised topics may include:
• Awareness of the Functional Safety Management Plan, especially the requirements for Management of Change, Configuration Management, and avoidance of systematic failures
• Proof testing procedures, including partial valve stroke test
• Procedures for applying and managing bypasses
• The importance of not revealing supervisor passwords to others (to prevent unauthorised bypass)
• The importance of returning the SIS to its normal state after testing and maintenance, and procedures for ensuring this is done
• The potential consequences of disabling, bypassing or inadvertently tripping SIFs
• How to respond to SIS diagnostic alarms and discovered SIS failures
• Modification procedures; in particular, the importance of ensuring that any application program change is validated
• Analysis of SIF trip and failure events (see later in this chapter)
• Cyber security considerations
As with operator training, records should be kept for review during Functional Safety Assessment and Audit. As mentioned in Section 9.3.2, it may also be advisable to conduct assessments to prove the training is effective.
11.3 Proof testing

11.3.1 Introduction
In this section, everything said about proof testing also applies to partial valve stroke testing. During SIS basic design, decisions were already made about which elements of the SIS shall be subject to proof testing, and the required test interval. Proof testing procedures need to be defined; the basic procedure can usually be obtained from device manufacturers, but may need to be supplemented with in-house aspects such as:
• Whether the process needs to be stopped or adjusted (e.g. bypass lines opened) for testing
• Permit to Work requirements
• Bypass and override procedures
• The use of calibrated test tools
• For redundant devices (e.g. in a 1ooN or MooN architecture), whether testing shall be staggered so that the SIF is still available during testing
• Procedure for ensuring the SIS is restored after testing
• Requirements for cross-checking between maintenance personnel (these are not specifically mentioned in the standards; but, given the criticality of the SIS, cross-checking could be regarded as a good verification practice)
• Record-keeping requirements
Where calibrated tools are used for testing, the same issues as for validation apply (see Chapter 10). In summary:
• Tools must be calibrated before use, although if a previous calibration is still valid, they do not need to be recalibrated
• Calibration shall be to a suitable external standard
• Calibration records must be kept, for review during Functional Safety Assessment and Audit.
11.3.2 Applying more than one test procedure per device
Manufacturers of some devices may provide alternative proof test procedures; for example, a “basic” test with 60% proof test coverage, performed once per year, and a “rigorous” test with 95% coverage, performed once every 5 years. Interleaving both test procedures can give significant benefits: a sufficiently low PFDavg may be achievable with longer proof test intervals than if only a single test procedure is applied, resulting in less process disruption and reduced maintenance costs. Some SIL verification software cannot model more than one proof test per device. A possible workaround is to model the device in the final elements subsystem (even if it is actually a sensor, for example) and then treat the basic test as a partial valve stroke test. This should give the correct overall result for PFDavg of the complete SIF.
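The effect of interleaved testing on PFDavg can be estimated with a short calculation. The sketch below is illustrative only: the failure rate, coverages and intervals are assumed values, the formula is the common simplified approximation (faults covered by each test accumulate over that test’s interval; uncovered faults persist for the mission time), and it assumes the rigorous test detects everything the basic test does.

```python
HOURS_PER_YEAR = 8760

def pfd_avg_interleaved(lam_du, c_basic, t_basic, c_rigorous, t_rigorous, mission):
    """Simplified PFDavg for a device with two interleaved proof tests.

    lam_du     : dangerous undetected failure rate (per hour)
    c_basic    : coverage of the basic test, performed every t_basic hours
    c_rigorous : coverage of the rigorous test (assumed to include the
                 basic test's coverage), performed every t_rigorous hours
    mission    : mission time (hours) for faults neither test can reveal
    """
    # Faults found by the basic test are exposed for at most t_basic.
    basic_term = c_basic * lam_du * t_basic / 2
    # Faults found only by the rigorous test are exposed for up to t_rigorous.
    rigorous_term = (c_rigorous - c_basic) * lam_du * t_rigorous / 2
    # Faults no proof test reveals persist until overhaul or replacement.
    residual_term = (1 - c_rigorous) * lam_du * mission / 2
    return basic_term + rigorous_term + residual_term

# Assumed example: 60% test yearly, 95% test every 5 years, 20-year mission
pfd = pfd_avg_interleaved(1.0e-6, 0.60, 1 * HOURS_PER_YEAR,
                          0.95, 5 * HOURS_PER_YEAR, 20 * HOURS_PER_YEAR)
print(f"PFDavg = {pfd:.2e}")
```

With these assumed numbers the result falls in the SIL 1 band; varying the rigorous test interval or coverage shows immediately how much margin each option buys.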
11.3.3 Test before performing maintenance
Proof testing and maintenance are typically performed together as a single task. However, it is important to perform the proof test on the device as found, without performing any maintenance beforehand. This is because we need to know whether the device has developed any fault since the last proof test, so that the real-world performance of the SIS can be measured accurately.

Exam Preparation Tip
The purpose of proof testing is not ‘to prove the device is working’ but ‘to reveal any λDU (dangerous undetected) faults.’ Watch out for this ‘gotcha’ in multiple choice questions.
11.3.4 Document the duration of testing and repair
The duration of the proof test needs to be recorded, to confirm the assumed Proof Test Duration (PTD) is reasonable. If repair is needed, the time taken to complete the repair also needs to be recorded, counting from the time the fault was discovered, whether by proof testing, diagnostic alarm, or SIF failure on demand. This is needed for comparison with the assumed Mean Time To Restore (MTTR). If the process is left online and unprotected for much longer than the PTD (for testing) or MTTR (for repair), the assumptions made during SIL verification are no longer valid and would need to be reassessed.
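A simple comparison of recorded durations against the assumed values can flag when the SIL verification assumptions need revisiting. In the sketch below, the assumed PTD and MTTR values and the tolerance factor are hypothetical; in practice, the thresholds should come from the project’s SIL verification report.

```python
def duration_exceeds_assumption(observed_hours, assumed_hours, tolerance=1.5):
    """Return True if a recorded proof test or repair duration exceeds the
    SIL verification assumption by more than the given tolerance factor."""
    return observed_hours > tolerance * assumed_hours

ASSUMED_PTD_HOURS = 4.0   # assumed Proof Test Duration (illustrative value)
ASSUMED_MTTR_HOURS = 8.0  # assumed Mean Time To Restore (illustrative value)

# A proof test that took 3.5 h: within assumption, no action needed.
print(duration_exceeds_assumption(3.5, ASSUMED_PTD_HOURS))    # False
# A repair that took 26 h from fault discovery: reassess SIL verification.
print(duration_exceeds_assumption(26.0, ASSUMED_MTTR_HOURS))  # True
```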
11.4 Monitoring of SIS performance
To assign SIL and RRF targets to the SIFs, a substantial number of assumptions had to be made during the HAZOP, SIL assessment and SIL verification phases. A typical list of assumptions is shown in Table 11.1. Most of these were discussed in detail in earlier chapters.

Table 11.1: Typical assumptions made during SIS-related risk analysis and SIS design.

Lifecycle phase: Hazard identification (HAZOP)
Typical assumptions:
• All significant demand cases for SIFs have been identified
• Non-SIS layers of protection have been identified
• The consequences of SIF failure have been correctly identified
• Assumptions are made about future trends, such as early and late life operating conditions. (For example, upstream oil and gas facilities typically make assumptions about feedstock composition changes over the plant lifetime.)

Lifecycle phase: Allocation of safety layers (SIL assessment)
Typical assumptions:
• Initiating event frequencies are assumed
• Assumed probability of failure on demand (PFD) values are assigned to non-SIS layers of protection
• Probability of failure of an alarm as a layer of protection is assumed
• Enabling conditions are assumed valid and are assigned reasonable probabilities of occurrence
• Conditional modifiers are assumed valid and are assigned reasonable probabilities of occurrence
• Costs associated with equipment repair and replacement, downtime, and loss of product are assumed
• Assumptions are made during cost/benefit analysis, such as the cost of maintenance and future value
• An operating mode is assigned to each SIF (low demand, high demand or continuous mode) based on an assumed SIF demand rate

Lifecycle phase: SIL verification
Typical assumptions:
• Random hardware failure rates (λD, λS) are assumed based on statistical data
• Proof test duration (PTD) is assumed
• Mean time to restore (MTTR) is assumed
All of these assumptions need to be checked during the operational lifetime of the SIS. This depends on the collection of accurate performance data, including:
• Installed hours of experience for each SIS device
• Records of SIS device replacement (this improves the achieved PFDavg by reducing the mission time of the device)
• Faults discovered by proof testing
• Faults detected by diagnostics
• Faults revealed by SIS failure on demand
• Faults revealed by spurious trip
• Analysis of the root cause of any faults discovered (in sufficient detail to determine whether the cause is a random or systematic failure)
• Trip events caused by actual demands on the SIF (i.e. when the hazardous events occur and preceding layers of protection fail)
Ideally, the operations team will also collect data on the occurrence of initiating events. However, this information could be hard to collect: if the incident is successfully managed by another layer of protection such as a control loop or alarm with response, it may not be feasible to require record-keeping for every such event. The requirements and procedure for assumptions review and performance monitoring, and the frequency of such review, can be detailed in the Operations and Maintenance Strategy. A SIS performance report needs to be written based on this. The report will indicate whether the assumptions need to be re-examined, potentially leading to a change of SIS design. The whole process of reviewing SIS performance should be assessed as part of Functional Safety Assessment (stage 4) and Audit.
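As a simple illustration of how collected data feeds back into the design assumptions, the sketch below computes an observed dangerous failure rate from installed hours and fault counts and compares it with the assumed λDU. The figures and the factor-of-two trigger are assumptions chosen for illustration; a rigorous comparison would use statistical confidence bounds rather than a point estimate.

```python
def observed_rate(n_faults, n_devices, service_hours_per_device):
    """Point estimate of the failure rate (per hour) from field experience."""
    return n_faults / (n_devices * service_hours_per_device)

# Assumed example: 3 dangerous faults found (by proof test or on demand)
# across 40 transmitters over 5 years of service.
lam_observed = observed_rate(3, 40, 5 * 8760)
lam_assumed = 5.0e-7  # lambda-DU used during SIL verification (illustrative)

# Flag for reassessment if field experience is well outside the assumption.
if lam_observed > 2 * lam_assumed:
    print(f"Observed {lam_observed:.2e}/h exceeds 2x assumed "
          f"{lam_assumed:.2e}/h: revisit SIL verification")
```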
11.5 SIS modifications and partial decommissioning
Any change in the SIS design, including partial decommissioning of the SIS, needs to be assessed to determine whether the SIS’s risk reduction objectives are still met. The procedure for such assessment should be described, or at least referenced, in the Functional Safety Management Plan (see Chapter 7).
11.5.1 The Management of Change procedure
When an operational change is proposed, the Management of Change (MoC) procedure sets out the basis for determining whether a detailed review of SIS design is required. Relevant changes are not only to the SIS itself, such as replacement of hardware with a different brand, type, or architecture, but also to the risk profile of the process.
Here are some examples of changes in the risk profile that would need to be assessed:
• Change in the fraction of time that the risk is present; for example, change in batch duration or frequency, or change in product slate
• Change in raw material composition (e.g. increasing H2S content in gas from a wellhead)
• Change in manning (e.g. increased or reduced number of operators, or switch between manned and unmanned operation)
• Changed initiating events for SIF demand (e.g. addition or deletion of valves, change between manual and automatic control of a sequence)
• Change in operating conditions (pressure, temperature, level, flow)
• Change in non-SIS layers of protection, such as alarms, check valves, control loops and relief valves
• Change in plant operating basis (e.g. standby equipment changes from hot standby to cold standby; change in sparing philosophy, for example where 2 out of 3 pumps were previously running, now 3 out of 3 pumps are required)
• Change in spare parts management (as incident severity and MTTR are often predicated on the availability of onsite spares), or other factors affecting downtime such as availability of on-call maintenance personnel
• Unforeseen deterioration of plant performance (e.g. fouling of heat exchangers, or plugging of pressure sensing lines, occurring more severely than expected)
• Bringing new sections of plant online, changing equipment connectivity, or disconnecting equipment (which might, for example, have provided a buffer against overpressure incidents)
Any of these might lead to an increased process risk (increased frequency or severity of a potential dangerous event), in which case the team may decide to increase the SIL, RRF or testing requirements of SIFs in the SIS. A review should be triggered when:
• A non-trivial change to the SIS is proposed
• The level of risk managed by the SIS changes
• The risk reduction expected to be provided by the SIS changes (e.g. due to new information or changed assumptions)
The MoC procedure should also indicate:
• How the review (or the decision not to review) and its outcome should be documented, bearing in mind that it will be examined during Functional Safety Assessment and Audit
• Who is responsible for deciding whether to review, and the extent of review required.
MoC assessment can be made easier by clear documentation during the original SIS lifecycle assessment; in particular, by taking care to record assumptions made, with justifications. The defined process safe state for each SIF is another valuable piece of information for MoC; this is another reason why the SIS designer should specify the safe state of the process, not that of the SIF (see the discussion in Section 7.3 in Chapter 7). Negative assumptions should be captured as well; for example, demand cases that are not considered during HAZOP or SIL assessment, for specific reasons. Finally, the MoC procedure needs to ensure that all relevant documents are updated, and the resulting changes are communicated where required (e.g. in refresher training).
11.6 Future challenges
Much progress has been made in functional safety. Competency levels across the process industry are improving, aided by the near-global adoption of international standards and the increasingly wide uptake and acceptance of functional safety qualifications. The quality of failure rate data, the knowledge base among practitioners and consultants, and the availability of SIL-certified hardware are all improving year by year. Progress is still needed, especially in the area of systematic failure management. Software tools are still too fragmented and mutually incompatible, resulting in error-prone manual data transfer from one to another. Industry awareness of assurance tools such as safety lifecycle verification and SIS validation is still weak. The industry would also benefit from increased harmonisation and transparency in many parts of the functional safety lifecycle, including HAZOP, SIL certification, application program development and verification, and Functional Safety Assessment. Building functional safety planning into the overall project workflow from the very beginning would help to improve efficiency, lower costs, shorten project timelines and improve systematic failure management. It is alarming that some project engineering managers still regard verification as an optional extra or a simple tick-box exercise. This could easily be remedied, with considerable benefits for overall risk management. For example, each functional safety lifecycle phase can be planned and executed with verification in mind, with workflow and output documentation structured so as to make verification straightforward and effective.
11.7 Closing thoughts
Starting from scratch, our journey through functional safety has taken us all the way through the lifecycle, from first concept to operational maturity. Overall, I hope the pathway to success is clear: know the requirements, plan carefully, document meticulously, spread awareness among stakeholders, and, most importantly for every engineer, assume nothing without justification. Stay alert, stay aware and always stay in a state of chronic unease. That is the key to getting everyone home safely, day after day.
Exercises
1. Changes in the risk profile of the process managed by the SIS should be assessed by a Management of Change procedure. Ultimately, this may necessitate a change in the SIS design. Explain why a change in manning levels (e.g. reduced number of operators, or changing from manned to unmanned operation or vice versa) may need to be assessed in this way.
2. Suppose you are developing a SIS training plan for operators. Suggest three topics to include in the training materials.
3. What are the main objectives of proof testing a SIF? Select the TWO best options below:
(a) To confirm that the field equipment is in good physical condition, with covers and cable glands properly secured, electrical and utility connections tight, and no corrosion or impact damage.
(b) To discover any dangerous undetected failures that may have occurred since the last proof test.
(c) To prove that the SIS is free from systematic failures.
(d) To comply with the proof testing requirement of IEC 61511.
(e) To comply with the proof test regime specified in the manufacturer’s Safety Manual.
4. Give three examples of SIS-related assumptions that should be checked during the operational phase by collecting and analysing data.
5. XYZ Chemicals Limited collects data on the ESD trip events experienced on its plant. The total number of trips from all SIFs (NT) over a total number of plant operating hours (time online TO multiplied by the number of SIFs in service, NS) is used to calculate a value A using the formula A = (TO × NS)/NT. The value A is reported to management as the Mean Time Between Failures (MTBF) of the equipment. Is this correct?
Answers

Question 1 – Answer
A change in manning levels could result in:
• Increased or decreased operator responsiveness to alarms, which could affect assumptions made about non-SIS independent layers of protection (IPLs)
• A change in the likelihood that an operator is present in the effect zone if an incident occurs
• A change in the Mean Time To Restore (e.g. if fewer maintenance personnel are available, it may take longer to complete a repair)
Also, if fewer maintenance personnel are available, it may no longer be possible to comply with the proof test regime.
Question 2 – Answer
• How to find information about the SIFs, including architecture, using P&IDs, Cause & Effect Diagrams, and the Safety Requirements Specification
• Status information about the SIFs on the HMI
• How and when to operate auxiliary SIS functions such as resets, overrides and manual trips
• What to do in the event of a trip (real or spurious)
• What information must be collected about each SIS-related incident
• SIS-related alarms (e.g. communications failure, power supply failure, I/O card fault, transmitter fault) and how to respond to them
• Cyber security
Question 3 – Answer
Answers (b) and (e) are the best answers. Answer (a) is not the main purpose of proof testing. Answer (c) is important, but proof testing is not the main strategy for achieving it. Answer (d) is incorrect because proof testing is not compulsory according to IEC 61511. Answer (e) must be selected because compliance with the safety manual is compulsory.
Question 4 – Answer
• Initiating event frequencies
• Spurious trip rates
• SIS random hardware failure rates
• Probabilities assigned to enabling conditions and conditional modifiers (e.g. percentage of time that a danger zone is occupied)
• SIF demand rate, and hence the operating mode assigned to each SIF
• Proof test duration
• Mean time to restore
Question 5 – Answer
This is not correct. The MTBF should be calculated on a per-SIF basis; if all SIFs are clubbed together, the quoted MTBF will be an average value across all SIFs, which would not give any indication of which are the “problem” SIFs to investigate first.
Also, the trip data does not distinguish between real and spurious trips. Management will be interested in knowing the frequency of real and spurious trips separately, as they have different policy implications.
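The point of this answer can be shown with a short calculation. The trip records below are invented for illustration: grouping by SIF tag and by trip type gives management far more actionable figures than the single pooled average described in the question.

```python
from collections import Counter

HOURS_ONLINE = 4 * 8760  # plant time online, TO (illustrative)
N_SIFS = 25              # SIFs in service, NS (illustrative)

# (SIF tag, trip type) records collected during the operational phase
trips = [("SIF-101", "spurious"), ("SIF-101", "spurious"),
         ("SIF-101", "real"), ("SIF-102", "spurious")]

# Pooled figure, as in the question: averages away the problem SIF
pooled = (HOURS_ONLINE * N_SIFS) / len(trips)

# Per-SIF breakdown: shows immediately that SIF-101 is the one to investigate
by_sif = Counter(tag for tag, _ in trips)
mtbf_per_sif = {tag: HOURS_ONLINE / n for tag, n in by_sif.items()}

# Real vs spurious split: the two have different policy implications
spurious = sum(1 for _, kind in trips if kind == "spurious")
```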
Reference
[1] S. Håbrekke, S. Hauge, M.A. Lundteigen, Guideline for Follow-Up of Safety Instrumented Systems (SIS) in the Operating Phase, second ed., SINTEF, Trondheim, 2021 (accessed 14 July 2022), https://www.sintef.no/globalassets/project/pds/reports/h3_guideline-for-follow-up-of-sis-in-the-operating-phase_2021edition.pdf.
APPENDIX A
Sample verification checklist

Abstract
Verification is the task of ensuring each functional safety lifecycle phase has generated the correct outputs. This appendix provides a detailed checklist as a model of how verification can be executed for the Safety Integrity Level assessment phase.
Keywords: Checklist; Safety instrumented system; Safety Integrity Level assessment; Verification.
While there are many ways to achieve verification in compliance with IEC 61511, a good approach is to develop a checklist similar to the example below. It’s better if the questions are as specific as possible. For each question, the checklist should provide a pass/fail indication and space to enter brief supporting details (e.g. supporting evidence from documents checked). If the checklist is prepared in advance, it can be used as a reference during execution of the related task. This will help the engineer to ensure all necessary aspects of the task are completed and make it easier for the verification assessor to find the evidence. The example is for verification of the SIL assessment phase.
Verification checklist: SIL assessment
Name of assessor:
Date of assessment:
Section 1. Scope of work executed
1. Was an SIL assessment workshop executed?
2. Was the workshop attended by all necessary personnel, as defined in the SIL assessment procedure?
3. Was the full scope of the project (containing all proposed SIFs) considered in the SIL assessment?
Section 2. Inputs
(a) SIL assessment procedure:
4. Defines the scope of the SIL assessment study?
5. Approved?
6. Correct revision used?
7. Defines the rules for risk reduction from non-SIS Independent Protection Layers?
8. Defines the rules for determining SIL (and, optionally, Risk Reduction Factor (RRF) or Probability of Failure per Hour (PFH)) targets?
9. Does the tolerable risk guideline match the project tolerable risk criteria? (If risk graph or risk matrix are applied, show evidence that the graph/matrix is calibrated to match the project tolerable risk criteria.)
(b) C&ED or interlock logic diagrams:
10. What documents were used to identify the SIFs?
11. Input documents approved?
12. Correct revision used?
13. Covering full scope of the SIL assessment?
Section 3. Outputs
(a) SIL assessment report contains:
14. Attendance records?
15. Scope of work executed?
16. Evidence of the SIL assessment methodology used?
17. References to input documents including HAZOP report, SIL assessment procedure, C&ED or interlock logic diagrams and tolerable risk guidelines?
18. Action items for follow-up and closeout?
19. Report review and approval record?
(b) Is there an SIL assessment worksheet for each SIF identified?
(c) Is there a table of results showing, for each SIF assessed (items marked * are required only if the SIL target is 1)?
20. All significant demand cases found in the HAZOP worksheet
21. Any other demand cases considered?
22. Worst-case consequence of SIF failure
23. SIL target
24. RRF or PFH target (optional)*
25. Expected demand interval
26. Identification of critical final elements*
27. SIF operating mode (low demand, high demand or continuous)*
28. Process safe state achieved by the SIF*
(d) Has the SIL assessment complied with the rules defined in the SIL assessment procedure?
(e) Is there sufficient evidence of independence between initiating events, enabling conditions, IPLs, conditional modifiers and the SIF under consideration?
(f) Does the assessment comply with the ‘Rule of Two’ (regarding credits for BPCS IPLs)?
Section 4. Quality
29. Are the outputs traceable to corresponding input data?
30. Are the inputs (e.g. tolerable risk criteria, initiating event frequency, etc.) traceable to source?
31. Are the outputs readable, numbered, clear, concise, complete and fit for purpose?
Section 5. Personnel
32. Are the key personnel (facilitator, process engineer, I&C engineer) identified?
33. Is there evidence to confirm competency of key personnel?
Section 6. Verification results
• List any findings for closeout:
• Result: Verification pass/fail
• Points for future improvement:
APPENDIX B
What is affected by SIL

Abstract
The Safety Integrity Level (SIL) assigned to a Safety Instrumented Function (SIF) affects several aspects of the SIF, besides the target random hardware failure measure and the hardware fault tolerance requirement. This appendix provides a list of SIF-related aspects affected by SIL.
Keywords: Safety instrumented function; Safety integrity level; SIL.
The standards require that the SIL target of the relevant SIF, or the highest SIL target among all SIFs in the SIS, is considered when addressing the following matters:
• PFDavg or PFH target
• Hardware fault tolerance target
• Degree of confidence required during SIL capability assessment
• Degree of independence required during Functional Safety Assessment (see IEC 61508:2010 part 1, clause 8.2.18, Table 5)
• Competency level (see IEC 61511:2016 part 1, clause 5.2.2.2h)
• Whether special measures are required (for SIL 4)
• Whether prior use justification is accepted for the application program
• Selection of PE logic solver (see IEC 61511:2016 part 1, clauses 11.5.4, 11.5.5)
Index

Note: Page numbers followed by “f” indicate figures and “t” indicate tables.

A
Action item management, 191-192
Actuated valves, 35-36
Actuators, 36
Alarms, 37, 300
  management, 88-90
  with operator response, 87-90
  sounds, 1
Annunciation failures, 43
Application program, 34
Architectural constraints, 256-260
  apply SFF requirements, 260
  hardware type A and B, 257
  HFT requirements in IEC 61508:2000, 258
  IEC 61508:2010, 258-259
  IEC 61511:2016, 259
  safe failure fraction, 257
As Low As Reasonably Practicable (ALARP), 60
  risk matrix and, 123
  tolerable risk levels in relation to, 60f
Automatic recirculation valves (ARVs), 93, 111
Automatic system, 7
Autostart of standby equipment, 91
Average probability of failure on demand (PFDavg), 65, 245-246
B
Basic process control system (BPCS), 34, 91-92, 177, 220-221
  final elements are shared between SIS and, 177
Binary logic diagrams, 79-80, 79f
Bowtie analysis, 72

C
Calibration
  range, 31
  records, 302
  of risk matrix, 118-121
  empty risk matrix for calibration, 118t
  risk matrix with first point calibrated, 119t
  SIL target as function of required, 119t
Cause and effect diagrams (C&EDs), 73, 75-77
  simplified example, 75t
Checklist, 311
Common cause
  factor, 251
  failures, 203-204
Competency, 20-22
  management, 191
Conditional modifiers, 131, 133-135
  used in LOPA, 133t
Configuration management, 189
  management of change, 188-189
Consequence of failure on demand (CFD), 115
Continuing professional development (CPD), 191
Contractor management, 192
Control loops, 90-91
  malfunctions, 112-113
Cost/benefit analysis, 141-143
Cranking, 165
Critical common element analysis, 97-99
Critical final element, 177
Cyber security, 230-231, 300

D
Dangerous detected failures (DDs), 45
Dangerous failures, 43
Dangerous undetected failures (DUs), 45, 66-67
De-energise-to-trip (DTT), 236-237
Demand frequency, 125-129, 171
Derating, 237
Deviation measurements, 31-32
Diagnostics, 44, 224-225
  for field devices, 225-228
Differential measurements, 31
Digital signals, 32
Discrepancy handling, 276, 281-282
  failure categories and actions for issues found during verification, 277t
Document inspection, 281
Document management, 190
Documentation, 185-186
  maximising effectiveness of, 208-209
Double jeopardy rule, 84-85
Double-acting, 36
Duty and standby pumps, 173-176
  dry run protection of pump groups, 173t
Duty pump switchover, 175-176
E Electrical, electronic and programmable electronic safety equipment (E/E/PE safety equipment), 8e9 Electromagnetic interference (EMI), 203, 280 Elements of final element subsystem, 37 Emergency depressurization valves (EDV), 35, 48e49 Emergency shutdown (ESD), 28, 166 Enabling conditions, 132e133 End-to-end test, 279e280 Energise-to-trip (ETT), 236e237 Engineering, procurement and construction (EPC), 187 Event Tree Analysis (ETAs), 72, 109, 169
F
F parameter, 129
Factory Acceptance Test (FAT), 278
Fail closed (FC), 36
Fail last (FL), 36
Fail open (FO), 36
Failure measure, 66
Failure modes, 43–45
Failure modes and effects analysis (FMEAs), 43, 251–252, 262
Failure rates, 45–46
Failures in time (FIT), 45–46
Fatality risk, 57
Fault tree analysis (FTA), 106, 138–141
  documenting, 140–141
Field equipment inspection, 279
Final elements, 28. See also Primary final elements
  subsystem, 34–39
    actuated valves, 35–36
    elements of final element subsystem, 37
    MooN concept for final elements, 37–39
    motor control circuits, 37
    safety PLC, 35t
Fire & Gas System (FGS), 32
Fire detection system, 72
Fire sensors, 32
Functional safety, 7
  answers, 24–25, 216–218
  audit, 291–293
    answers, 295–297
    exercises, 293–295
    typical audit procedure, 291–293
  complete set of functional safety documentation, 209–213
    documents normally required for complete functional safety lifecycle, 210t–213t
  documentation, 185–186
  exercises, 24, 214–216
  functional safety assessment, 285–291
    lifecycle stages, 286t
  functional safety audit, 291–293
    functional safety-related procedures subject to audit, 292t
  functional safety management plan, 186–194
  hazard and risk, 2–7
    harm, 3
    hazard, 2–3
    risk, 3–5
    risk management through, 6–7
    tolerable risk, 5–6
  IEC 61511 key concepts, 9–22
  lifecycle, 9–12, 186–188
    information needed for lifecycle phase, 188
  main sections in hardware SRS, 195t
  maximising effectiveness of documentation, 208–209
    automate carefully, 208–209
    minimise repetition, 208
  safety requirements specification, 194
  standards, 7–9
    comply with IEC 61511, 9
    purpose of, 7–8
    scope of IEC 61511, 8–9
  validation, 278–285
  verification, 273–277
    functional safety assurance activities, 274t
    needs for functional safety lifecycle phase, 274t
Functional Safety Assessment (FSA), 186, 192, 276, 285–291
  assessment of SIL verification, 290–291
  tasks, 288–290
  common pitfalls to avoid, 290
  conducted in practice, 288
  independence requirements for, 288
  sample size needs to be considered in, 287
  stakeholders need to perform, 287
Functional Safety Management Plan (FSMP), 27, 186–194, 214, 261, 275, 285
  functional safety lifecycle, 186–188
    typical coverage of, 187t
  importance of, 193–194
  management of change and configuration management, 188–189
  management requirements in, 189–193
    action item management, 191–192
    assurance planning, 193
    competency management, 191
    contractor management, 192
    document management, 190
    overall planning, 190
    SIL capability management, 193
G
Gas sensors, 32
H
Handling redundant initiators, 160
Hardware fault tolerance (HFT), 46–47, 49, 233–234, 256
  requirements in IEC 61508:2000, 258
  requirements in IEC 61508:2010, 258–259
    SIL achievable per subsystem per IEC 61508:2010 route 2H, 258t
  requirements in IEC 61511:2016, 259
    architectural constraints per IEC 61511:2016, 260t
Hardware inspection, 278–279
  field equipment inspection, 279
  SIS logic solver inspection, 279
Harm, 3
Hazard, 2–3
  matrix methods, 67
Hazard and Operability (HAZOP), 53
  BPCS trips, 78–79
  and old SIL assessment study reports, 77–79
Heater start function, 154
High demand mode, 63–64
  SIFs, 68
Human-Machine Interface (HMI), 219, 221
I
IEC 61508 standard, 7–9
IEC 61511 standard
  assuring functional safety, 13
  competency, 20–22
  comply with, 9
  functional safety lifecycle, 9–12
  key concepts, 9–22
  origins of, 22–23
  random and systematic failures, 14–20
  scope of, 8–9
  SRS, 13
  standard, 7–9
  structure of, 22
Independence, 237–242
  communications between SIS logic solver and BPCS, 239–240
  implementing BPCS and SIS in single logic solver, 240
  implementing non-safety functions in safety PLC, 241–242
  multiple SIFs in same SIS, 238
  multiple systems tripping motor via same MCC, 238–239
  requirements for FSA, 288
Independent protection function (IPF), 86
Independent protection layers (IPLs), 85–95, 115, 122, 132, 241
  alarms with operator response, 87–90
  autostart of standby equipment, 91
  backup utility supplies, 94
  BPCS interlocks, 91–92
  check valves, 92–93
  control loops, 90–91
  depending on demand case, 163–164
    distillation column example, 164f
  examples of insufficient independence, 95
  interlocks in PLCs, 92
  IPL credit available, 95
    typical PFD for IPLs in LOPA, 96t–97t
  mechanical protective devices, 93
  operating procedures, 93
  pressure relief devices, 86–87
  SIF, 94–95
  spill containment, 93
  trace heating, 94
Individual risk, 57
Initiating events, failure of safeguards as, 113, 113t
Input/Output cards (I/O cards), 34
Interlocks, 29–30
  logic diagrams, 80–82
    for offshore oil facility, 81f
International Electrotechnical Commission (IEC), 23
  IEC 61508 standard, 7–9
  IEC 61511
    assuring functional safety, 13
    competency, 20–22
    comply with, 9
    functional safety lifecycle, 9–12
    key concepts, 9–22
    origins of, 22–23
    random and systematic failures, 14–20
    scope of, 8–9
    SRS, 13
    standard, 7–9
    structure of, 22
Intrinsically safer design, 12, 256
K
Knock-on effects, 71
L
Layer of protection (LOP), 85, 97
Layer of protection analysis (LOPA), 67–68, 106, 124–131
  conditional modifiers, 133–135
  enabling conditions, 132–133
  estimating SIF demand rate, 135–136
  example LOPA worksheet, 136
  high demand and continuous mode SIFs, 136–138
  handling multiple initiating events, 135
  method, 124–131
Lifecycle approach, 41
Local area network (LAN), 221
Logic solver, 28, 238
  subsystem, 34
Low demand modes, 63–64
M
M out of N (MooN), 32
  for final elements, 37–39
  for initiators, 32–34
Machine monitoring system (MMS), 239
Management of change (MoC), 185, 188, 191, 284, 304
  assessment, 305–306
  and configuration management, 188–189
Markov modelling, 253–254
Mathematical methods, 247
Mean time between failures (MTBF), 307
Mean time to fail spurious (MTTFS), 29, 46, 226–228, 233, 263
Mean time to repair (MTTR), 256
Mean Time To Restore (MTTR), 168, 220, 226, 302–303
Mission Time (MT), 168–169
Mitigation, 100
  functions, 72
Motor control circuits (MCC), 37, 88, 233–234, 256
Motor-operated valves (MOVs), 36, 232
N
N out of N (NooN), 32, 37
No effect failures (NEs), 43, 45
Nodes, 53–54
O
1 out of N (1ooN), 32, 37–38
Operating mode, 64
  selection of SIF’s operating mode, 65t
Operating System (OS), 220–221
Operations and maintenance strategy, 299
Operator training, 300
P
P parameter, 129–130
Partial decommissioning, 304–306
Partial valve stroke testing (PVST), 44, 48–49, 235–236, 257, 270, 281
  diagnostic, 236
Performance-based approach, 9
Permissives, 170–172
  functions, 39
Phases, 10
Physical initiators and final elements, 171–172
Piping and instrumentation diagram (P&ID), 53–54, 82–83
  depiction of SIS interlock in, 83f
Plant’s hot oil circulation system, 181
Pre-startup safety review (PSSR), 285
Pressure control valves (PCVs), 93, 111
Pressure relief devices (PRDs), 86–87
Pressure relief valves (PRVs), 86–87
Prevention functions, 72
Primary final elements, 177–180
  answers, 182–184
  exercises, 180–181
  safe state, 177–178
Printed circuit board (PCB), 24
Prior use, 262–263
Probability of failure on demand (PFD), 125–129
Probability of failure per hour (PFH), 65, 68, 245, 312
Probable loss of life (PLL), 59
Process hazards analysis (PHA), 117
Process shutdown (PSD), 166
Process variable (PV), 31, 230, 280
Programmable logic controller (PLC), 220–221
  interlocks in, 92
  PLC-based logic solvers, 220–231
    redundant power supplies for system and field power, 223f
  SIS PLC, 220–222
    TMR SIS logic solver, 222f
  redundancy, 224–225
Project Management Consulting companies (PMCs), 192
Proof test coverage (PTC), 255–256, 270
Proof test duration (PTD), 254–255, 302–303
Proof test interval (PTI), 254–255
Proof testing, 44, 64, 254–256, 301–303
  applying more than one test procedure per device, 302
  document duration of testing and repair, 302–303
  effect of human error during proof testing, 255–256
  optimising proof test interval, 254–255
  test before performing maintenance, 302
Proven in use, 262–263
Q
Quality management system (QMS), 291
Quantitative methods, 68
Quantitative risk analysis (QRA), 59, 109
R
Random failures, 14–20, 43, 46–47, 105t, 171–172, 207, 246–254
Reciprocal of Mean Time Between Failures (1/MTBF), 168
Redundancy, 162
Redundant array of inexpensive disks (RAID), 221
Redundant initiators, 84, 159–160
  handling, 160
Redundant safety functions, 160–163
  one SIF backup to another, 162
  redundant SIFs in low risk situations, 163
  two SIFs redundant, 162
Reliability Block Diagrams (RBDs), 234, 260, 266
Restoring SIS after validation, 282
Revalidation, 278, 284–285
Risk analysis, 77, 186
Risk evaluation
  expressing risk in numbers, 54–55
  tolerable risk, 55–57
Risk Graph methods, 67, 109, 124–131
  calibration of, 124–131
  estimating SIF demand rate, 131
  examples, 125
  handling enabling conditions and conditional modifiers, 131
  handling independent protection layers, 131
  handling multiple initiating events, 131
  high demand and continuous mode SIFs, 131
  parameters used in, 125
  selecting parameter categories, 125–130
    avoidance, 129
    demand frequency, 125–129
    exposure, 125–129
Risk management through functional safety, 6–7
Risk matrix, 117–118
  and ALARP, 123
  calibration of, 118–121
  estimating the SIF demand rate, 122
  example risk matrix for SIL assessment, 117t
  handling enabling conditions and conditional modifiers, 122
  handling independent protection layers, 122
  handling multiple initiating events, 121
  high demand and continuous mode SIFs, 124
  likelihood and severity categories, 116
  methods, 109, 116–124
Risk receptors, 3–5, 4t–5t
Risk reduction, 55
Risk Reduction Factor (RRF), 41–42, 55, 132, 142, 246, 312
Risk tolerability level, 60
S
Safe detected failures (SDs), 45
Safe Failure Fraction (SFF), 228, 236, 257, 266
  requirements, 260
Safe failures, 43, 51
Safe state, 177–178
Safe undetected failures (SUs), 45, 51
Safeguards, 54
Safety Instrumented Function (SIF), 27–28, 53, 105, 186, 315
  anatomy of, 30–41
    final element subsystem, 34–39
    important aspects of, 39–41
    logic solver subsystem, 34
    permissives and inhibit functions, 39
    sensor subsystem, 31–34
  answers, 49–52
  architecture, 233–234
  development of, 41–43
    SIL assessment, 41–42
    SIL verification, 42–43
  example wording for SIF logic description, 202–203
    complete requirements for hardware SRS, 196t–202t
  exercises, 48–49
  failure, 43–47
    failure modes, 43–45
    failure rates, 45–46
    hardware fault tolerance, 46–47
  formal definition of operating modes, 64
  low demand, high demand and continuous modes, 63–64
  meaning of, 28–30
    interlock, 29–30
    SIL, reliability, and integrity, 29
    SIS, 28
  operating modes, 63–68
  preferred types of SIF initiator, 231–232
    selection of initiator type, 231–232
    valve limit switches as initiators, 232
  selecting operating mode, 64
  separating complex interlocks into, 83–84
  significance of operating modes, 65
    definition of SIL, 65–66
    failure rates, 66–67
    SIL assessment methodology, 67
  tips on selecting operating mode, 67–68
Safety Instrumented System (SIS), 7–10, 28, 186, 219, 315
  answers, 49–52
  design targets, 264
  exercises, 48–49, 266–267
    answers, 268–272
    descriptive questions, 266
    numerical questions, 267
  goal of SIS basic design, 219–220
  independence, 237–242
  logic solver inspection, 279
  meaning of, 28–30
    interlock, 29–30
    SIF, 28
    SIL, reliability, and integrity, 29
  modifications, 304–306
  monitoring of SIS performance, 303–304
    typical assumptions made in SIS-related risk analysis and SIS design, 303t
  non-PLC based logic solvers, 242–244
    susceptibility to spurious trips, 244
  operational phase
    answers, 307–309
    closing thoughts, 306
    exercises, 307
    management of change procedure, 304–306
    monitoring of SIS performance, 303–304
    proof testing, 301–303
    SIS modifications and partial decommissioning, 304–306
    training requirements, 300–301
  PLC, 220–222
    cyber security, 230–231
    diagnostics for field devices, 225–228
    PLC-based logic solvers, 220–231
    redundancy and diagnostics, 224–225
    reset, 230
    setpoints, 225–228
    setting trip parameters, 229–230
    trip delay, 230
  selection of field devices, 231–237
    defining final element architecture, 232–233
    derating, 237
    energise and de-energise-to-trip, 236–237
    hard-wiring of field devices, 237
    partial valve stroke testing, 235–236
    preferred types of SIF initiator, 231–232
    SIF architecture, 233–234
    testing and maintainability, 234–235
Safety integrity level (SIL), 29, 41–42, 315
  alarms from cascade control loops, 176
  answers, 49–52, 101–104
    binary logic diagram for simple SIF, 103f
  architectural constraints, 256–260
  assessment methods, 41–42, 311
    assessing consequence severity, 114–115
    assessing likelihood of initiating events, 114
    calculating cost of outcome, 141–142
    calculating cost of SIF, 142
    control loop malfunctions, 112–113
    cost/benefit analysis, 141–143
    decision flow diagram, 107f
    determine initiating event in sufficient detail, 112
    documenting SIL assessment study, 115–116
    example, 110f, 110t, 142
    failure measures for SIFs, 105t
    failure of safeguards as initiating events, 113
    fault tree analysis, 138–141
    features of SIL selection common to all methods, 109
    layer of protection analysis, 124–131
    overview of, 106–109, 106t
    overview of SIL assessment methods, 106–109, 106t
    risk graph method, 124–131
    risk matrix method, 116–124
    selecting initiating events, 110–113
    selecting optimal solution, 143
    sources of likelihood data for initiating events, 114t
    typical initiating events, 111–112
  assessment workshop, 143–145
    answers, 146–150
      calibrated risk graph, 149f
    overall objectives of SIL assessment workshop, 144–145
    SIL assessment team, 143–144
  calculating predicted spurious trip rate, 263–264
  calculating random hardware failure measure, 246–254
    SIL achievable depending on PFH achieved, 247t
  capability, 260–263
    management, 193
  capability and certification, 260–263
    assessing element’s performance in field, 261–262
    difference between proven in use and prior use, 262–263
    SIL 2 shutdown valve, 263
    software SIL capability, 263
  certification, 260–263
  critical common element analysis, 97–99
  defining physical initiators and final elements, 171–172
  demand case
    activation of another SIF, 165
      primary and secondary SIFs for turbo-generator shutdown case, 166t
  demand frequency, 171
  double jeopardy rule, 84–85
  duty and standby pumps, 173–176
    duty pump switchover, 175–176
  exercises, 48–49, 100–101
  failure measure, 247–254
    calculation of probability curves, 248–253
    single devices, 248–250
  final elements shared between basic process control system and SIS, 177
  high demand and continuous modes, 254
  identifying and documenting SIFs, 73–83
    using binary logic diagrams, 79–80
    using cause & effect diagrams, 75–77
    using HAZOP and old SIL assessment study reports, 77–79
    using interlock logic diagrams, 80–82
    objective, 73–74
    using piping & instrumentation diagrams, 82–83
    using process control narratives, interlock descriptions, 74–75
  independent protection layers, 85–95
  initiating event involves multiple simultaneous failures, 167–170
    pressure blanketing example schematic, 168f
  IPLs on demand case, 163–164
  meaning of, 28–30
    interlock, 29–30
    reliability, and integrity, 29
    SIF, 28
    SIS, 28
  multiple devices, 250–252
    values of common cause factor, 251t
  multiple sensors distributed across a wide area, 172
  objectives of, 68–72
    high demand and continuous mode SIFs, 68
    low demand mode SIFs, 68
    not using default SIL targets, 70–71
    prevention or mitigation, 72
  one SIF cascades to another, 166–167
  one SIF, two hazards, 163
  operator action as initiator, 172–173
  permissives, 170–172
  proof testing, 254–256
  redundant initiators, 159–160
  redundant safety functions, 160–163
    combining SIFs, 161t
    redundant SIFs in distillation column, 161t
  selecting primary final elements, 177–180
    pump transfer example schematic, 179f
  separating complex interlocks into SIFs, 83–84
  SIF operating modes, 63–68
  SIL 2 shutdown valve, 263
  SIS design targets, 264
  state-based calculations, 253–254
  what it takes to achieve, 245–246
    main requirements for achieving SIL, 246t
  variable number of pumps running, 175
  verification, 42–43
    assessment of, 290–291
Safety management plan, 299
Safety manual, 193
Safety relay, 242, 244
Safety requirements, 27
Safety Requirements Specification (SRS), 13, 39, 52, 186, 194–203, 215, 225–226, 244, 256, 278
  common cause failures, 203–204
  developed, 194–195
  example wording for SIF logic description, 202–203, 202t
  information to consider adding to, 203
  purpose of, 194
  safety manual, 207–208
  selecting spurious trip rate target, 205–207
  SIF demand rates, 204–205
    calculation of SIF demand rate, 205t
Semi-quantitative methods, 68
Sensors, 28, 256
  specific tests for, 280
  subsystem, 31–34
    components of, 32
    MooN concept for initiators, 32–34
Set points, 31, 225–228
Severity descriptors, 57
Single acting, 36
Site integration test (SIT), 278
Software actions, 37
Software function, 39
Software SIL capability, 263
Solenoid-operated valve (SOV), 35
Solvent, 2–3
Span range, 31
Spill containment systems, 93
Spring return, 36
Spurious trip rate, 263–264
  target, 205–207
Stakeholders need to perform FSA, 287
Standards, 8t
Subsystems, 28
Systematic failures, 14–20, 171–172, 186, 189, 193, 215–216
T
Target SIL, 29
Temperature control valves (TCVs), 93, 111
Temperature sensors, 172
Test equipment, 281
Testing and maintainability, 234–235
  bypass lines allowed on SIS shutdown valves, 235–236
Theoretical RRF, 55
Threshold, 31
Tight shutoff (TSO), 45, 178
Tolerable frequency, 56
Tolerable risk, 55–57, 118
  ALARP concept, 60
  answers, 61
  defining tolerable risk per event, 56–57
    typical tolerable risk matrix for single events, 56t
  defining total tolerable risk per risk receptor, 57
    combined frequency to determine tolerable risk for individual SIFs, 58f
  exercises, 61
  functional safety, 5–6
  precision, 57–60
Tools, 302
Trace heating, 94
Training requirements, 300–301
  operator training, 300
  training for maintenance personnel, 300–301
Trips, 29–30
  delay, 230
  point, 31
Type A and B hardware, 257
U
Uninterruptible Power Supply (UPS), 34, 221
Unit shutdown (USD), 166
Unmitigated event likelihood (UEL), 55, 132
V
Validation, 278–285
  discrepancy handling, 281–282
  document inspection, 281
  end-to-end test, 279–280
  hardware inspection, 278–279
  report, 282–284
  restoring SIS after validation, 282
  revalidation, 284–285
  specific tests for final elements, 281
  specific tests for sensors, 280
  test equipment, 281
  validation report, 282–284
Verification, 188, 273–277
  checklists, 275–276, 311
    inputs, 312
    outputs, 312–313
    personnel, 313
    quality, 313
    scope of work executed, 311
  discrepancy handling, 276
  results, 313
  works in practice, 275
Voting schemes, 31
Vulnerability factor, 134–135
W
W parameter, 125–129
Worksheet, 116