Reliability Assessment of Safety and Production Systems: Analysis, Modelling, Calculations and Case Studies (Springer Series in Reliability Engineering) 3030647072, 9783030647070

This book provides, as simply as possible, sound foundations for an in-depth understanding of reliability engineering wi

English Pages 912 [887] Year 2021

Table of contents:
Preface
Acknowledgments
Contents
Abbreviations and Notations
Part I Introduction, Background and Overview
1 Introduction
1.1 Human Enterprises Involve Risks
1.2 Philosophy to Master the Risks
2 Background
2.1 A Short Story of Reliability Analysis
2.1.1 Premises
2.1.2 The Beginning
2.1.3 A Step Forward of the Reliability Approach
2.1.4 Consolidation of the Reliability Approach
2.1.5 Dissemination in All the Industry Sectors
2.2 Why, When and How to Implement Reliability Studies
2.2.1 Why
2.2.2 When
2.2.3 How
2.3 Name for the New Discipline
2.4 Notion of Risk
2.4.1 Etymology. Danger Versus Peril, Risk and Hazard
2.4.2 Safety Versus Risk Management Definitions
2.4.3 Risk Overview in Industrial Context
References
3 Reliability Study Overview
3.1 Overview
3.2 Goal and System Definition
3.3 How It Works (Functional Analysis)
3.4 How It Fails (Dysfunctional Analysis)
3.4.1 Point About Terminology
3.4.2 Issue Identification
3.4.3 System Modelling
3.4.4 Reliability and Operational Data Selection
3.4.5 Qualitative Analysis
3.4.6 Quantitative Analysis
3.5 Comparisons and Decision
3.6 Prevention and Risk Mitigation
References
4 Introduction of Basic Core Concepts
4.1 Preamble
4.2 Item Definition
4.3 States of an Item
4.3.1 Up and Down States
4.3.2 Operating and Non-operating States
4.3.3 Restoration States
4.3.4 Degraded and Critical States
4.4 Failure and Fault Concept
4.4.1 Failure Definition
4.4.2 Fault Definition
4.4.3 Failure and Fault Classification
4.4.4 Failure Cause, Failure Mode
4.4.5 Common Cause, Common Mode and Single Failures
4.4.6 Critical Failures and Repairs/Restorations
4.5 Maintenance Related Concepts
4.5.1 Maintenance, Restoration and Repair Definitions
4.5.2 Repairable Versus Repaired Items
4.6 Acronyms and Operational Concepts
4.6.1 General Considerations
4.6.2 MUT and MDT
4.6.3 MTTF and Related Acronyms
4.6.4 MTBF
4.6.5 Maintenance Related Acronyms (MTTR, MRT, MFDT…)
4.7 Probabilistic Concepts
4.7.1 Introduction to Random Processes
4.7.2 Basic Random Process
4.7.3 (Un)Reliability Versus (Un)Availability
4.7.4 Failure Distribution and Link with MTTF
4.7.5 Average and Asymptotic Availability/Unavailability
4.7.6 Failure Rate and Failure Intensities
4.7.7 Restoration/Repair Rate
4.8 Conclusion About the Reliability Concepts
References
5 Dependent and Common Cause Failures
5.1 Introduction to Dependent and Common Cause Failures
5.1.1 Identification of the Problem
5.1.2 Definition
5.1.3 Dependency Classifications
5.2 Examples of CCFs Observed in Real Life
5.2.1 Examples of Typical Accidents Due to CCFs
5.2.2 Examples of Typical CCFs Detected from Field Feedback
5.3 Dependent Failures Identification
5.4 CCF Data Collection
5.5 CCF Modelling
5.5.1 Introduction
5.5.2 The Beta-Factor Model
5.5.3 The Shock Model
5.5.4 Other Modelling Methods
References
6 Extensions to Production Availability and Functional Safety Analyses
6.1 From Availability to Efficiency
6.1.1 Binary Items and Introduction of the Efficiency Concept
6.1.2 Extension to Multistate Systems
6.1.3 Generalization of the Efficiency Concept
6.2 From Conventional Safety to Functional Safety
6.2.1 Generalities About Protection Layers and Safety Systems
6.2.2 Classification of Safety Systems and Impact of Faults
6.2.3 Safety Instrumented Systems
6.3 Overview of Probabilistic Models
References
Part II Risk Identification and Qualitative Analyses
7 The Inductive Approaches
7.1 Need of the Inductive Approach
7.2 Objectives of Inductive Methods
7.3 Overview of the Main Inductive Methods
7.3.1 Similar Approaches
7.3.2 Area of Implementation
7.3.3 Study Team
7.3.4 Use Within System Life Cycle
References
8 Preliminary Hazard Analysis (PHA)
8.1 Description of the Method
8.1.1 Presentation of the Method
8.1.2 Purposes of the Method
8.1.3 PHA Procedure
8.1.4 Resources for the Method
8.1.5 Comments
8.2 Other Related Approaches
8.2.1 Gross Hazard Analysis
8.2.2 Chemical Industry
8.2.3 Preliminary Hazard Analysis with Frequencies
8.3 Use with Other Methods
8.4 Worked Example 8.1
References
9 Hazard and Operability Study (HAZOP)
9.1 Description of the Method
9.1.1 Presentation of the Method
9.1.2 Purposes of the Method
9.1.3 HAZOP Procedure
9.1.4 Resources for the Method
9.1.5 Comments
9.2 Quantified HAZOP
9.3 HACCP
9.4 Worked Example 9.1
9.5 Use with Other Methods
References
10 Failure Mode, Effects (and Criticality) Analysis, FME(C)A
10.1 Description of the Method
10.1.1 Presentation of the Method
10.1.2 Purposes of the Method
10.1.3 FMEA Procedure
10.1.4 Resources for the Method
10.1.5 Comments
10.2 FMEA/FMECA Worksheets
10.3 FMECA
10.3.1 Criticality Analysis
10.3.2 Use of Criticality Matrix
10.3.3 Use of Risk Priority Number
10.4 Worked Example 10.1
10.5 Use with Other Methods
References
11 Other Inductive Methods
11.1 Checklists
11.2 What-If?
11.3 HAZID
11.4 Additional Methods
References
12 Comparison of Inductive Approaches
12.1 Strengths and Weaknesses of Inductive Approaches
12.1.1 PHA
12.1.2 HAZOP
12.1.3 FMEA/FMECA
12.1.4 Checklists
12.1.5 What-If?
12.1.6 HAZID
12.2 Synthesis
References
Part III Modelling of Static Systems. Boolean Approaches
13 The Family of Boolean Approaches
Reference
14 Mathematical Framework
14.1 Notion of Events and Boolean Algebra
14.2 Bases for Time-Independent Probabilistic Calculations
14.2.1 Probability of the Disjunction (Union) of Events
14.2.2 Probability of the Conjunction (Intersection) of Events
14.3 Introduction to Time-Dependent Calculations
References
15 Reliability Block Diagrams (RBDs)
15.1 History and Introduction to Reliability Block Diagrams
15.2 Graphical Symbols and Basic RBD Structures
15.3 Building an RBD from Simple Examples
15.4 Tie and Cut Set Identification
15.4.1 Electrical Analogy
15.4.2 Concept of Minimal Cut and Tie Sets
15.5 RBD Representation by Tie and Cut Sets
15.6 Associated Exercises
References
16 Fault Tree Analysis (FTA)
16.1 History and Introduction to Fault Tree Analysis
16.2 Graphical Symbols and Basic FT Symbols
16.3 Building an FT of Simple Examples
16.4 Cut and Tie Set Identification, FTs Versus Success Trees
16.5 Associated Exercises
References
17 Qualitative Analysis from RBDs or FTs
17.1 Single Failure Criterion and Ranking Cut Sets by Order
17.2 Identification of Potential Common Cause Failures
17.3 Associated Exercises
References
18 Extension to Non-Coherent RBDs and FTs
18.1 Notion of Non-Coherent Systems
18.2 Prime Implicants
References
19 Probabilistic Calculations of Elementary Boolean Models
19.1 Calculation of Basic Logic Structures
19.1.1 Series Structures/OR Gates
19.1.2 Parallel Structures/AND Gates
19.1.3 Extension to Combinations of Series and Parallel Structures
19.1.4 NOT, NOR and NAND Logic Gates
19.2 m out of n (m/n) Structures/Gates
19.3 Sylvester-Poincaré Formula
References
20 Semi-Quantitative Analysis from RBDs or FTs
20.1 Ranking Minimal Cut Sets by Probabilities
20.2 Link with Sylvester-Poincaré Formula
20.3 Link with Vesely-Fussell Importance Factor
20.4 Associated Exercises
References
21 Probabilistic Calculations for Large Boolean Models
21.1 Overcoming the Sylvester-Poincaré Shortcomings
21.1.1 Issue Identification
21.1.2 A Step Forward to the Solution
21.1.3 Shannon Decomposition
21.1.4 Binary Decision Diagrams (BDDs)
21.1.5 BDDs of RBDs and FTs
21.2 BDD Calculations
21.2.1 System Failure and Success Probabilities
21.2.2 Conditional Probabilities
21.2.3 Cut and Tie Sets
21.3 Conclusions on BDDs
21.4 Associated Exercises
References
22 Time-Dependent Probabilistic Calculations
22.1 Introduction of Time and Generalities
22.2 Availability/Unavailability Calculations
22.2.1 General Case
22.2.2 RBD and FT-Driven Markov Processes
22.3 Average Availability/Unavailability Calculations
22.3.1 Average Over a Given Interval [0, T]
22.3.2 Asymptotic Availability or Unavailability
22.4 Failure Frequency and Derived Parameters
22.4.1 Average Failure Frequency, Number of Failures and MTBF
22.4.2 Instantaneous Failure Frequency/Birnbaum Importance Factor
22.4.3 Combination of Sub-FTs for Unavailability and Frequency Calculations
22.5 Reliability Calculations
22.5.1 General Case
22.5.2 Systems Made of Non-repaired Items
22.5.3 Systems Made of Repaired Items
22.6 Dynamic Fault Trees
22.7 Associated Exercises
References
23 CCF Modelling with FTs and RBDs
23.1 Introduction
23.2 Modelling Tangible CCFs
23.2.1 Introduction of Tangible CCFs in RBD and FT Models
23.3 Modelling Non-tangible CCFs
23.3.1 Beta-Factor Model
23.3.2 Shock Model
23.4 Considerations with Regards to Item Repairs
23.5 Lineage CCFs
23.6 Use of Minimal Cut Sets
23.7 Associated Exercises
References
24 Critical States and Importance Factors
24.1 Critical and Non-critical States
24.1.1 Minterms and Exclusive and Inclusive Cofactors
24.1.2 Critical States
24.1.3 Non-critical States
24.1.4 Link Between Critical and Non-critical States
24.1.5 Graphical Synthesis of the Concepts
24.2 Importance Factors
24.2.1 Generalities About Importance Factors
24.2.2 Vesely-Fussell Importance Factor
24.2.3 Birnbaum Importance Factor (MIF)
24.2.4 Lambert Importance Factor (CIF)
24.2.5 Diagnostic Importance Factor (DIF)
24.2.6 Risk Achievement Worth (RAW), Risk Reduction Worth (RRW)
24.2.7 Differential Importance Measure (DIM)
24.2.8 Barlow-Proschan Importance Factor (BPIF)
24.2.9 Application and Remarks About Importance Factors
24.3 Associated Exercise
References
25 Uncertainty Handling with RBDs and FTs
25.1 Introduction
25.2 Principle and Application to Non-correlated Events
25.3 Application to Correlated Events
25.4 Considerations About the Pseudo Error Factor
25.5 Conclusions About Uncertainty Propagation
25.6 Associated Exercise
References
26 Sequential Analysis Methods
26.1 Introduction
26.2 Cause-Consequence Diagram
26.2.1 Presentation of the Method
26.2.2 CCD Procedure
26.2.3 Graphical Symbols
26.2.4 Cause-Consequence Diagram Analysis
26.2.5 Worked Example 26.1
26.2.6 Strengths and Weaknesses
26.2.7 Use with Other Methods
26.3 Event Tree
26.3.1 Presentation of the Method
26.3.2 ETA Procedure
26.3.3 Graphical Symbols
26.3.4 Event Tree Analysis
26.3.5 Worked Example 26.2
26.3.6 Dynamic Event Trees
26.3.7 Strengths and Weaknesses
26.3.8 Use with Other Methods
26.4 Bowtie Method
26.4.1 Presentation of the Method
26.4.2 Bowtie Procedure
26.4.3 Worked Example 26.3
26.4.4 Strengths and Weaknesses
26.5 LOPA
26.5.1 Presentation of the Method
26.5.2 LOPA Procedure
26.5.3 Resources for the Method
26.5.4 Worked Example 26.4
26.5.5 Strengths and Weaknesses
26.5.6 Use with Other Methods
26.6 Comparison of the Sequential Methods and Conclusions
References
27 Combinations or Links of Boolean Models with Other Techniques
27.1 Introduction
27.2 Combination with FMEA/FMECA
27.3 Combination RBD/FT and Vice Versa
27.4 Combination with Cause-Consequence, Event Tree or Bowtie Analyses
27.5 Combination with Markov Processes
27.6 Combination with Petri Nets
27.7 Link with Root Cause Analysis
27.8 Link with Belief Networks
27.8.1 Principle of Belief Networks
27.8.2 Description of Belief Networks
27.8.3 Construction of Belief Networks
27.8.4 Utilisation of Belief Networks
References
28 Automated Fault Tree Building
References
29 Boolean Family Exercises
29.1 Description of the Overpressure Protection System (OPPS)
29.2 Reliability Data
29.3 Description of the Exercises Related to the OPPS
29.4 Solutions of the Exercises Related to the OPPS
29.4.1 Exercise 15.1: RBD Building
29.4.2 Exercise 15.2: Tie Set Identification
29.4.3 Exercise 16.1: FT Building
29.4.4 Exercise 16.2: Cut Set Identification
29.4.5 Exercise 20.1: Semi-quantitative Analysis (Basic)
29.4.6 Exercise 20.2: Semi-quantitative Analysis with Partial and Full Stroking Tests
29.4.7 Exercise 20.3: Vesely-Fussell Importance Factor
29.4.8 Exercise 20.4: Semi-quantitative Analysis with CCF Analysis
29.4.9 Exercise 21.1: BDD Building
29.4.10 Exercise 21.2: Comparison of Probabilistic Results (Disjoint Paths Versus Minimal Cut Sets)
29.4.11 Exercise 22.1: Unavailability, Failure Frequency and Unreliability Calculations
29.4.12 Exercise 22.2: Unavailability Calculation with Partial and Full Stroking Tests
29.4.13 Exercise 22.3: Unavailability Calculation with Common Cause Failures
29.4.14 Exercise 22.4: Unavailability Calculation with Test Staggering
29.4.15 Exercise 24.1: Importance Factor Calculations
29.4.16 Exercise 25.1: Uncertainty Propagation
Reference
Part IV Dynamic Systems and Stochastic Processes
30 Introduction to Dynamic Systems and Stochastic Processes
30.1 Miscellaneous Dynamic Aspects
30.1.1 Dynamic Aspect Linked to System Operation
30.1.2 Dynamic Aspect Linked to System Maintenance
30.2 Notion of Stochastic (Random) Processes
30.3 Dynamic Methods and Tools
30.4 Systems Typology to Select a Relevant Approach
References
31 Markovian Modelling
31.1 Basis of the Classical Markov Approach
31.1.1 Introduction and Overview of the Markovian Approach
31.1.2 Graphical Representation of Markov Process
31.2 Mathematical Foundations
31.2.1 Basic Formula for Time-Dependent Calculations
31.2.2 Basic Formula for Asymptotic Calculations
31.3 Link with Basic Definition
31.3.1 Preamble
31.3.2 Availability
31.3.3 Reliability
31.3.4 Vesely Failure Rate and Failure Frequency
31.3.5 Failure Rate and Failure Density
31.3.6 Comparison λ(t) Versus λV(t) and f(t) Versus w(t)
31.3.7 Repair Intensities
31.3.8 MUT, MDT, MTBF and MTTF
31.4 Analytical Calculations of Markov Processes
31.4.1 Classical Calculation Techniques
31.4.2 Matrix Exponentiation
31.5 Advanced Modelling
31.5.1 Failure on Demand and Zero-Duration State
31.5.2 Sequence Modelling
31.5.3 Multistate Modelling and Production Availability
31.5.4 Multiphase Modelling
31.6 Reducing the Size of the Markov Models
31.6.1 Aggregation of States
31.6.2 FT and RBD-Driven Markov Processes
31.7 Specific Modelling
31.7.1 CCF Modelling
31.7.2 Maintenance Modelling
31.7.3 Cold, Hot and Mixed Redundancy
31.8 Limitation and Conclusions
31.9 Associated Exercises
References
32 Monte Carlo Simulation
32.1 Introduction to Monte Carlo Simulation
32.2 History and Principle
32.3 Generation of Probabilistic Laws
32.3.1 General Principle for Generating Random Delays
32.3.2 Random Number Generation
32.3.3 Simulation of Typical Probabilistic Laws
32.4 Accuracy of Results
32.4.1 Accuracy Related to Monte Carlo Itself
32.4.2 Qualitative Appreciation of the Accuracy
32.5 Uncertainty Propagation
32.6 Parameters Changing When Conditions Change
32.6.1 Introduction and Context
32.6.2 Updating Occurrence Dates (Principle)
32.6.3 Various Approaches to Manage the Distribution Changes
32.6.4 General Approach to Update Failure Dates
32.6.5 Generalities About the Application to Weibull Distributions
32.6.6 Detailed Application to Weibull Distributions
32.6.7 Examples of Application
32.7 Comparison Between Analytic and Monte Carlo Calculations
32.8 Associated Exercises
References
33 Petri Net Modelling
33.1 Quest for Complex Behaviour Modelling
33.2 History
33.3 Petri Net Use Within Automation and Dependability Fields
33.4 Basic Principles
33.4.1 Graphical Elements
33.4.2 Validation of Transitions and Firing Rules
33.4.3 Managing Conflicts
33.4.4 Introduction of Delays
33.4.5 Simple Examples
33.5 Extensions of the Basic PNs
33.5.1 Weighted Arcs, Inhibitor Arcs and Repeated Places
33.5.2 Predicates and Assertions/Messages
33.5.3 New Validation of Transitions and Firing Rules
33.6 Other Extensions
33.6.1 Priority of the Transitions
33.6.2 Suspended Events (Transition with Memory)
33.6.3 Probabilistic Switches
33.6.4 Dynamic Transitions
33.7 Miscellaneous Modelling Techniques
33.7.1 Common Cause Failure Modelling
33.7.2 Modelling Maintenance and Maintenance Supports
33.8 Undertaking System Modelling
33.8.1 Modelling of the System
33.8.2 Monte Carlo Simulation of the Model
33.8.3 Timetable
33.8.4 Pre-Processing and Table of Impacted Transitions
33.8.5 Preventing Endless Loops
33.8.6 Markov Graph Generation
33.9 Undertaking System Calculations
33.9.1 Availability and Unavailability
33.9.2 MTBF, MUT and MDT
33.9.3 Reliability and MTTF
33.9.4 Token Counting Related Results
33.9.5 Production Availability Calculation
33.10 Accuracy of Results and Data Uncertainty Handling
33.11 Building PNs Related to Large Systems
33.11.1 Main Drawback: Legibility Problem
33.11.2 Increasing Legibility of Large PNs
33.11.3 Modularization of Large PNs
33.11.4 Modelling of Binary Systems
33.11.5 Modelling of Multistate Systems
33.12 Coloured Petri Nets
33.13 Conclusion About PNs
33.14 Associated Exercises
References
34 Dynamic Modelling Exercises
34.1 Markovian Approach Exercises
34.1.1 Example: Pumping System
34.1.2 Description of the Exercises Related to the Pumping System
34.1.3 Solutions of the Exercises Related to the Pumping System
34.2 Petri Net Approach Exercises
34.2.1 Example: Service Station
34.2.2 Description of the Exercises Related to the Service Station
34.2.3 Solutions of the Exercise Related to the Service Station
Reference
Part V Production Availability and Functional Safety (SIL) Modelling and Calculations
35 Production Availability Related Modelling and Calculations
35.1 Characteristics of Production Systems
35.1.1 Size and Complexity of the Systems
35.1.2 Multistate and Multiphase Systems
35.1.3 Multiple Product Systems
35.1.4 Multiple Information Sources
35.2 Classification of Failure and Restoration Events
35.2.1 Failure Events
35.2.2 Restoration Events
35.2.3 Planned Maintenance
35.3 Characteristics of Production Availability Studies
35.3.1 Economic Calculations
35.3.2 Rare Events
35.4 Case Study for Comparison of Production Availability Models
35.4.1 Description of the Production System
35.4.2 Modelling with Flow Diagrams
35.4.3 Modelling with Reliability Block Diagrams
35.4.4 Modelling with Markov Graphs
35.4.5 Modelling with Petri Nets
References
36 Functional Safety Related Modelling and Calculations
36.1 Introduction and Standardization
36.2 Safety Integrity Concepts
36.2.1 Establishing the Safety Integrity Levels (SIL) Requirements
36.2.2 Low Demand Versus High Demand Mode of Operation
36.2.3 Probabilistic Requirements: PFDavg and PFH
36.2.4 Failure Classification
36.2.5 Loss-of-Power Versus Emission-of-Power Safety Systems
36.2.6 Safe Failure Fraction: The False Good Idea
36.2.7 Fault Tolerance (Architectural Constraints)
36.2.8 Use of k Out of n Logic
36.3 Probabilistic Calculations
36.3.1 Input Data Needs and Conservativeness
36.3.2 Simplified Analytical Approach
36.3.3 Markovian Approach
36.3.4 Boolean Approach
36.3.5 Petri Net Approach
36.3.6 Uncertainty Handling in SIL Calculations
36.4 Conclusions
36.5 Associated Exercises
References
Part VI Standardization, Data Collection and Uncertainties
37 Standardization
37.1 Introduction to Standardization
37.2 Standardization Versus Regulation and Certification
37.3 Standardization Organization Overview
37.3.1 Standardization Bodies
37.3.2 Development of a Standard
37.3.3 Type and Content of Standards
37.4 Safety and Dependability Related Standardization
37.5 Concluding Remarks About Standardization
References
38 Data Collection and Uncertainties
38.1 Introduction
38.2 The Bare Necessity of Input Data
38.3 Data Collection Standards and Databases
38.3.1 IEC and ISO Data Collection Related Standards
38.3.2 Databases
38.4 Reliability Data Estimation
38.5 Data Uncertainty Modelling
38.5.1 Data Accuracy Versus Field Feedback
38.5.2 Uniform and Triangular Distributions: Expert Judgment
38.5.3 Chi-Square Distribution: Statistics from Field Feedback
38.5.4 Bayesian Approach and Gamma Distribution
38.5.5 Log-Normal Distribution: Practical Approach
References
Index

Springer Series in Reliability Engineering

Jean-Pierre Signoret and Alain Leroy

Reliability Assessment of Safety and Production Systems: Analysis, Modelling, Calculations and Case Studies

Series Editor: Hoang Pham, Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ, USA

More information about this series at http://www.springer.com/series/6917

Jean-Pierre Signoret, Total Professeurs Associés, Sedzère, France

Alain Leroy, Montreuil, France

ISSN 1614-7839
ISSN 2196-999X (electronic)
Springer Series in Reliability Engineering
ISBN 978-3-030-64707-0
ISBN 978-3-030-64708-7 (eBook)
https://doi.org/10.1007/978-3-030-64708-7

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To Odile, Vincent and Saïda, Denis and Florence, Arnaud, Rachel and Anna
Jean-Pierre Signoret

To Danièle
Alain Leroy

Preface

In March 2019, two major accidents (the crash of an Ethiopian Airlines Boeing 737 Max 8 and the oil spill off the French coast after the container ship “Grande America” capsized) prompted the authors to finally undertake a project they had long cherished: passing on to the younger generations, in a book, their experience in the fields of safety and dependability modelling. Now retired, both have worked in reliability engineering since the seventies, when the subject was just beginning to emerge as a new field of human knowledge. Together, they have spent over 70 years studying industrial systems (modelling and probabilistic calculations), researching and developing methods and tools, collecting reliability data and contributing to standardization work relating to industrial safety and dependability (including such economic aspects as production availability). They also deliver courses in schools and universities and are still members of several reliability societies.

For the past 50 years, they have been contributing to the ongoing development and improvement of the main approaches proposed in this book. They have first-hand knowledge of the many challenges that had to be faced to achieve today’s state of the art. They also have hands-on experience of how that state of the art can be used to carry out safety and dependability studies of simple or complex, small or large industrial systems, from a technological point of view.

To share this extensive experience, the book pursues a triple objective of pedagogy, pragmatism and scientific rigour. It sets out to bridge the gap between the theoretical aspects presented in academic works and the practical approaches described in engineering books. Illustrations, explanations, examples and case studies are provided throughout the book to give readers an in-depth grasp of how to achieve accurate and relevant studies and a full understanding of the underlying assumptions, mathematics and limitations.

This book aims to be useful for engineers, systems designers, standards developers, professors and students. It is split into six parts:

Part I describes the background of reliability studies, explains how to handle such a study and defines the basic core concepts. It gives an overview of common cause failures (CCF), which are the prime potential weak points leading to system failures. It highlights the two main aspects of technological risk, safety and dependability (e.g. production availability), and explains how to extend the scope of reliability studies to cover these topics.

Part II is devoted to the starting point of any study: risk identification and simple qualitative analyses. It describes the inductive (bottom-up) approaches such as preliminary hazard analysis (PHA), hazard and operability study (HAZOP) and failure mode, effects and criticality analysis (FMECA), which are designed to identify the impact of single events on the system under study as a whole.

Part III moves on to the modelling of static systems. It describes reliability block diagrams (RBD) and fault trees (FT), which share the same mathematical background (Boolean algebra). FTs are very important because they represent the alternative deductive (top-down) approach. Qualitative as well as semi-quantitative and quantitative analyses are presented for time-independent/dependent and small/large systems. The application of binary decision diagrams (BDD) is introduced, and the modelling of common cause failures (CCF), importance factors and uncertainty propagation is described. Extensions to sequential models like cause-consequence diagrams, event trees, LOPA (layer of protection analysis) or bowties, and also to dependent event models such as belief networks, are introduced. Their use in conjunction with Markov models (FT-driven Markov models) or Petri nets (RBD-driven PNs) is also covered. Related exercises are provided at the end of this part, together with their solutions.

Part IV takes another step forward, addressing dynamic systems and stochastic processes. It describes the Markov and Petri net approaches. Markov models are useful for defining basic core concepts (reliability/availability/failure frequency/failure rate/failure intensities) and also for modelling multiphase and multistate (i.e. with more than two states) systems. Petri nets are helpful for modelling interdependencies that cannot be modelled with Boolean approaches, and probabilistic distributions that cannot be modelled using Markov graphs. The downside is that analytical calculations have to be discarded and replaced by Monte Carlo simulation, which is also covered in this part. Related exercises are provided at the end of this part, together with their solutions.

Part V is dedicated to the dilemma of industrial system designers: reconciling safety and dependability in order to design systems that will operate both safely (safety) and economically (dependability). These two facets of industrial risk (the production availability of production systems and the functional safety of safety systems) are analysed in detail at the end of the book, as specific applications of the general subjects developed in the first parts.

Part VI rounds off the book with important topics relating to standardization by international bodies such as IEC or ISO, feedback from the field and reliability data collection.

As pointed out at the start, the book proposes only mature techniques that are drawn from the authors’ long experience and have proven to be enduringly effective in dealing with technological risk. It does not broach other important and useful subjects such as software reliability, the human factor or security, which are outside its scope.

Jean-Pierre Signoret, Sedzère, France
Alain Leroy, Montreuil, France

Acknowledgments

Such a project could not be carried out without the continuous support and involvement of some benevolent persons, and we wish to warmly thank them for the help they provided throughout these past two years.

We especially want to thank Yves Dutuit (Professor Emeritus, University of Bordeaux), who performed thorough reviews of all chapters to consolidate the theoretical matters and helped us to make the necessary trade-offs when challenging questions arose. He also continuously stimulated us with relevant remarks.

We wish to extend our appreciative thanks to Odile Signoret (former technical English/French translator), who carefully and comprehensively reviewed the text to correct and improve it, with occasional help from Jacquie Wade. The insights, comments and suggestions of both of them have made an invaluable contribution to the content of this book.

We also want to thank our colleagues Stéphane Collas and Nicolas Clavé (from TOTAL), Cyrille Folleau and Philippe Thomas (from SATODEV), who provided effective help in handling the GRIF-Workshop software package used for probabilistic calculations throughout the book.

And lastly, we are very grateful to our families, who have unfailingly supported us and accepted the side effects and constraints of this work all the time.

Jean-Pierre Signoret, Sedzère, France
Alain Leroy, Montreuil, France

xi

Contents

Part I

Introduction, Background and Overview 3 3 4

1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Human Enterprises Involve Risks . . . . . . . . . . . . . . . . . . . . . . 1.2 Philosophy to Master the Risks . . . . . . . . . . . . . . . . . . . . . . .

2

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 A Short Story of Reliability Analysis . . . . . . . . . . . . . . 2.1.1 Premises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 The Beginning . . . . . . . . . . . . . . . . . . . . . . . 2.1.3 A Step Forward of the Reliability Approach . . 2.1.4 Consolidation of the Reliability Approach . . . 2.1.5 Dissemination in All the Industry Sectors . . . . 2.2 Why, When and How to Implement Reliability Studies . 2.2.1 Why . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 When . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.3 How . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Name for the New Discipline . . . . . . . . . . . . . . . . . . . 2.4 Notion of Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Etymology. Danger Versus Peril, Risk and Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Safety Versus Risk Management Definitions . 2.4.3 Risk Overview in Industrial Context . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

7 7 7 8 9 10 12 14 14 14 16 17 18

. . . .

. . . .

. . . .

. . . .

. . . .

18 19 21 25

3 Reliability Study Overview
3.1 Overview
3.2 Goal and System Definition
3.3 How It Works (Functional Analysis)
3.4 How It Fails (Dysfunctional Analysis)
3.4.1 Point About Terminology

3.4.2 Issue Identification
3.4.3 System Modelling
3.4.4 Reliability and Operational Data Selection
3.4.5 Qualitative Analysis
3.4.6 Quantitative Analysis
3.5 Comparisons and Decision
3.6 Prevention and Risk Mitigation
References
4 Introduction of Basic Core Concepts
4.1 Preamble
4.2 Item Definition
4.3 States of an Item
4.3.1 Up and Down States
4.3.2 Operating and Non-operating States
4.3.3 Restoration States
4.3.4 Degraded and Critical States
4.4 Failure and Fault Concept
4.4.1 Failure Definition
4.4.2 Fault Definition
4.4.3 Failure and Fault Classification
4.4.4 Failure Cause, Failure Mode
4.4.5 Common Cause, Common Mode and Single Failures
4.4.6 Critical Failures and Repairs/Restorations
4.5 Maintenance Related Concepts
4.5.1 Maintenance, Restoration and Repair Definitions
4.5.2 Repairable Versus Repaired Items
4.6 Acronyms and Operational Concepts
4.6.1 General Considerations
4.6.2 MUT and MDT
4.6.3 MTTF and Related Acronyms
4.6.4 MTBF
4.6.5 Maintenance Related Acronyms (MTTR, MRT, MFDT…)
4.7 Probabilistic Concepts
4.7.1 Introduction to Random Processes
4.7.2 Basic Random Process
4.7.3 (Un)Reliability Versus (Un)Availability
4.7.4 Failure Distribution and Link with MTTF
4.7.5 Average and Asymptotic Availability/Unavailability
4.7.6 Failure Rate and Failure Intensities
4.7.7 Restoration/Repair Rate

4.8 Conclusion About the Reliability Concepts
References
5 Dependent and Common Cause Failures
5.1 Introduction to Dependent and Common Cause Failures
5.1.1 Identification of the Problem
5.1.2 Definition
5.1.3 Dependency Classifications
5.2 Examples of CCFs Observed in Real Life
5.2.1 Examples of Typical Accidents Due to CCFs
5.2.2 Examples of Typical CCFs Detected from Field Feedback
5.3 Dependent Failures Identification
5.4 CCF Data Collection
5.5 CCF Modelling
5.5.1 Introduction
5.5.2 The Beta-Factor Model
5.5.3 The Shock Model
5.5.4 Other Modelling Methods
References
6 Extensions to Production Availability and Functional Safety Analyses
6.1 From Availability to Efficiency
6.1.1 Binary Items and Introduction of the Efficiency Concept
6.1.2 Extension to Multistate Systems
6.1.3 Generalization of the Efficiency Concept
6.2 From Conventional Safety to Functional Safety
6.2.1 Generalities About Protection Layers and Safety Systems
6.2.2 Classification of Safety Systems and Impact of Faults
6.2.3 Safety Instrumented Systems
6.3 Overview of Probabilistic Models
References

Part II Risk Identification and Qualitative Analyses
7 The Inductive Approaches
7.1 Need of the Inductive Approach
7.2 Objectives of Inductive Methods
7.3 Overview of the Main Inductive Methods
7.3.1 Similar Approaches
7.3.2 Area of Implementation

7.3.3 Study Team
7.3.4 Use Within System Life Cycle
References
8 Preliminary Hazard Analysis (PHA)
8.1 Description of the Method
8.1.1 Presentation of the Method
8.1.2 Purposes of the Method
8.1.3 PHA Procedure
8.1.4 Resources for the Method
8.1.5 Comments
8.2 Other Related Approaches
8.2.1 Gross Hazard Analysis
8.2.2 Chemical Industry
8.2.3 Preliminary Hazard Analysis with Frequencies
8.3 Use with Other Methods
8.4 Worked Example 8.1
References

9 Hazard and Operability Study (HAZOP)
9.1 Description of the Method
9.1.1 Presentation of the Method
9.1.2 Purposes of the Method
9.1.3 HAZOP Procedure
9.1.4 Resources for the Method
9.1.5 Comments
9.2 Quantified HAZOP
9.3 HACCP
9.4 Worked Example 9.1
9.5 Use with Other Methods
References
10 Failure Mode, Effects (and Criticality) Analysis, FME(C)A
10.1 Description of the Method
10.1.1 Presentation of the Method
10.1.2 Purposes of the Method
10.1.3 FMEA Procedure
10.1.4 Resources for the Method
10.1.5 Comments
10.2 FMEA/FMECA Worksheets
10.3 FMECA
10.3.1 Criticality Analysis
10.3.2 Use of Criticality Matrix
10.3.3 Use of Risk Priority Number
10.4 Worked Example 10.1
10.5 Use with Other Methods
References

11 Other Inductive Methods
11.1 Checklists
11.2 What-If?
11.3 HAZID
11.4 Additional Methods
References
12 Comparison of Inductive Approaches
12.1 Strengths and Weaknesses of Inductive Approaches
12.1.1 PHA
12.1.2 HAZOP
12.1.3 FMEA/FMECA
12.1.4 Checklists
12.1.5 What-If?
12.1.6 HAZID
12.2 Synthesis
References
Part III Modelling of Static Systems. Boolean Approaches

13 The Family of Boolean Approaches
Reference
14 Mathematical Framework
14.1 Notion of Events and Boolean Algebra
14.2 Bases for Time-Independent Probabilistic Calculations
14.2.1 Probability of the Disjunction (Union) of Events
14.2.2 Probability of the Conjunction (Intersection) of Events
14.3 Introduction to Time-Dependent Calculations
References

15 Reliability Block Diagrams (RBDs)
15.1 History and Introduction to Reliability Block Diagrams
15.2 Graphical Symbols and Basic RBD Structures
15.3 Building an RBD from Simple Examples
15.4 Tie and Cut Set Identification
15.4.1 Electrical Analogy
15.4.2 Concept of Minimal Cut and Tie Sets
15.5 RBD Representation by Tie and Cut Sets
15.6 Associated Exercises
References

16 Fault Tree Analysis (FTA)
16.1 History and Introduction to Fault Tree Analysis
16.2 Graphical Symbols and Basic FT Symbols
16.3 Building an FT of Simple Examples
16.4 Cut and Tie Set Identification, FTs Versus Success Trees
16.5 Associated Exercises
References

17 Qualitative Analysis from RBDs or FTs
17.1 Single Failure Criterion and Ranking Cut Sets by Order
17.2 Identification of Potential Common Cause Failures
17.3 Associated Exercises
References

18 Extension to Non-Coherent RBDs and FTs
18.1 Notion of Non-Coherent Systems
18.2 Prime Implicants
References

19 Probabilistic Calculations of Elementary Boolean Models
19.1 Calculation of Basic Logic Structures
19.1.1 Series Structures/OR Gates
19.1.2 Parallel Structures/AND Gates
19.1.3 Extension to Combinations of Series and Parallel Structures
19.1.4 NOT, NOR and NAND Logic Gates
19.2 m out of n (m/n) Structures/Gates
19.3 Sylvester-Poincaré Formula
References

20 Semi-Quantitative Analysis from RBDs or FTs
20.1 Ranking Minimal Cut Sets by Probabilities
20.2 Link with Sylvester-Poincaré Formula
20.3 Link with Vesely-Fussell Importance Factor
20.4 Associated Exercises
References

21 Probabilistic Calculations for Large Boolean Models
21.1 Overcoming the Sylvester-Poincaré Shortcomings
21.1.1 Issue Identification
21.1.2 A Step Forward to the Solution
21.1.3 Shannon Decomposition
21.1.4 Binary Decision Diagrams (BDDs)
21.1.5 BDDs of RBDs and FTs
21.2 BDD Calculations
21.2.1 System Failure and Success Probabilities
21.2.2 Conditional Probabilities
21.2.3 Cut and Tie Sets
21.3 Conclusions on BDDs
21.4 Associated Exercises
References

22 Time-Dependent Probabilistic Calculations
22.1 Introduction of Time and Generalities
22.2 Availability/Unavailability Calculations
22.2.1 General Case
22.2.2 RBD and FT-Driven Markov Processes
22.3 Average Availability/Unavailability Calculations
22.3.1 Average Over a Given Interval [0, T]
22.3.2 Asymptotic Availability or Unavailability
22.4 Failure Frequency and Derived Parameters
22.4.1 Average Failure Frequency, Number of Failures and MTBF
22.4.2 Instantaneous Failure Frequency/Birnbaum Importance Factor
22.4.3 Combination of Sub-FTs for Unavailability and Frequency Calculations
22.5 Reliability Calculations
22.5.1 General Case
22.5.2 Systems Made of Non-repaired Items
22.5.3 Systems Made of Repaired Items
22.6 Dynamic Fault Trees
22.7 Associated Exercises
References

23 CCF Modelling with FTs and RBDs
23.1 Introduction
23.2 Modelling Tangible CCFs
23.2.1 Introduction of Tangible CCFs in RBD and FT Models
23.3 Modelling Non-tangible CCFs
23.3.1 Beta-Factor Model
23.3.2 Shock Model
23.4 Considerations with Regards to Item Repairs
23.5 Lineage CCFs
23.6 Use of Minimal Cut Sets
23.7 Associated Exercises
References

24 Critical States and Importance Factors
24.1 Critical and Non-critical States
24.1.1 Minterms and Exclusive and Inclusive Cofactors
24.1.2 Critical States
24.1.3 Non-critical States
24.1.4 Link Between Critical and Non-critical States
24.1.5 Graphical Synthesis of the Concepts
24.2 Importance Factors
24.2.1 Generalities About Importance Factors
24.2.2 Vesely-Fussell Importance Factor
24.2.3 Birnbaum Importance Factor (MIF)
24.2.4 Lambert Importance Factor (CIF)
24.2.5 Diagnostic Importance Factor (DIF)
24.2.6 Risk Achievement Worth (RAW), Risk Reduction Worth (RRW)
24.2.7 Differential Importance Measure (DIM)
24.2.8 Barlow-Proschan Importance Factor (BPIF)
24.2.9 Application and Remarks About Importance Factors
24.3 Associated Exercise
References

25 Uncertainty Handling with RBDs and FTs
25.1 Introduction
25.2 Principle and Application to Non-correlated Events
25.3 Application to Correlated Events
25.4 Considerations About the Pseudo Error Factor
25.5 Conclusions About Uncertainty Propagation
25.6 Associated Exercise
References

26 Sequential Analysis Methods
26.1 Introduction
26.2 Cause-Consequence Diagram
26.2.1 Presentation of the Method
26.2.2 CCD Procedure
26.2.3 Graphical Symbols
26.2.4 Cause-Consequence Diagram Analysis
26.2.5 Worked Example 26.1
26.2.6 Strengths and Weaknesses
26.2.7 Use with Other Methods
26.3 Event Tree
26.3.1 Presentation of the Method
26.3.2 ETA Procedure

26.3.3 Graphical Symbols
26.3.4 Event Tree Analysis
26.3.5 Worked Example 26.2
26.3.6 Dynamic Event Trees
26.3.7 Strengths and Weaknesses
26.3.8 Use with Other Methods
26.4 Bowtie Method
26.4.1 Presentation of the Method
26.4.2 Bowtie Procedure
26.4.3 Worked Example 26.3
26.4.4 Strengths and Weaknesses
26.5 LOPA
26.5.1 Presentation of the Method
26.5.2 LOPA Procedure
26.5.3 Resources for the Method
26.5.4 Worked Example 26.4
26.5.5 Strengths and Weaknesses
26.5.6 Use with Other Methods
26.6 Comparison of the Sequential Methods and Conclusions
References
27 Combinations or Links of Boolean Models with Other Techniques
27.1 Introduction
27.2 Combination with FMEA/FMECA
27.3 Combination RBD/FT and Vice Versa
27.4 Combination with Cause-Consequence, Event Tree or Bowtie Analyses
27.5 Combination with Markov Processes
27.6 Combination with Petri Nets
27.7 Link with Root Cause Analysis
27.8 Link with Belief Networks
27.8.1 Principle of Belief Networks
27.8.2 Description of Belief Networks
27.8.3 Construction of Belief Networks
27.8.4 Utilisation of Belief Networks
References

28 Automated Fault Tree Building
References
29 Boolean Family Exercises
29.1 Description of the Overpressure Protection System (OPPS)
29.2 Reliability Data
29.3 Description of the Exercises Related to the OPPS
29.4 Solutions of the Exercises Related to the OPPS
29.4.1 Exercise 15.1: RBD Building
29.4.2 Exercise 15.2: Tie Set Identification
29.4.3 Exercise 16.1: FT Building
29.4.4 Exercise 16.2: Cut Set Identification
29.4.5 Exercise 20.1: Semi-quantitative Analysis (Basic)
29.4.6 Exercise 20.2: Semi-quantitative Analysis with Partial and Full Stroking Tests
29.4.7 Exercise 20.3: Vesely-Fussell Importance Factor
29.4.8 Exercise 20.4: Semi-quantitative Analysis with CCF Analysis
29.4.9 Exercise 21.1: BDD Building
29.4.10 Exercise 21.2: Comparison of Probabilistic Results (Disjoint Paths Versus Minimal Cut Sets)
29.4.11 Exercise 22.1: Unavailability, Failure Frequency and Unreliability Calculations
29.4.12 Exercise 22.2: Unavailability Calculation with Partial and Full Stroking Tests
29.4.13 Exercise 22.3: Unavailability Calculation with Common Cause Failures
29.4.14 Exercise 22.4: Unavailability Calculation with Test Staggering
29.4.15 Exercise 24.1: Importance Factor Calculations
29.4.16 Exercise 25.1: Uncertainty Propagation
Reference

Part IV Dynamic Systems and Stochastic Processes

30 Introduction to Dynamic Systems and Stochastic Processes
30.1 Miscellaneous Dynamic Aspects
30.1.1 Dynamic Aspect Linked to System Operation
30.1.2 Dynamic Aspect Linked to System Maintenance
30.2 Notion of Stochastic (Random) Processes
30.3 Dynamic Methods and Tools
30.4 Systems Typology to Select a Relevant Approach
References
31 Markovian Modelling
31.1 Basis of the Classical Markov Approach
31.1.1 Introduction and Overview of the Markovian Approach
31.1.2 Graphical Representation of Markov Process
31.2 Mathematical Foundations
31.2.1 Basic Formula for Time-Dependent Calculations
31.2.2 Basic Formula for Asymptotic Calculations
31.3 Link with Basic Definition
31.3.1 Preamble
31.3.2 Availability
31.3.3 Reliability
31.3.4 Vesely Failure Rate and Failure Frequency
31.3.5 Failure Rate and Failure Density
31.3.6 Comparison λ(t) Versus λ_V(t) and f(t) Versus w(t)
31.3.7 Repair Intensities
31.3.8 MUT, MDT, MTBF and MTTF
31.4 Analytical Calculations of Markov Processes
31.4.1 Classical Calculation Techniques
31.4.2 Matrix Exponentiation
31.5 Advanced Modelling
31.5.1 Failure on Demand and Zero-Duration State
31.5.2 Sequence Modelling
31.5.3 Multistate Modelling and Production Availability
31.5.4 Multiphase Modelling
31.6 Reducing the Size of the Markov Models
31.6.1 Aggregation of States
31.6.2 FT and RBD-Driven Markov Processes
31.7 Specific Modelling
31.7.1 CCF Modelling
31.7.2 Maintenance Modelling
31.7.3 Cold, Hot and Mixed Redundancy
31.8 Limitation and Conclusions
31.9 Associated Exercises
References

32 Monte Carlo Simulation
32.1 Introduction to Monte Carlo Simulation
32.2 History and Principle
32.3 Generation of Probabilistic Laws
32.3.1 General Principle for Generating Random Delays
32.3.2 Random Number Generation
32.3.3 Simulation of Typical Probabilistic Laws
32.4 Accuracy of Results
32.4.1 Accuracy Related to Monte Carlo Itself
32.4.2 Qualitative Appreciation of the Accuracy
32.5 Uncertainty Propagation
32.6 Parameters Changing When Conditions Change
32.6.1 Introduction and Context
32.6.2 Updating Occurrence Dates (Principle)
32.6.3 Various Approaches to Manage the Distribution Changes
32.6.4 General Approach to Update Failure Dates
32.6.5 Generalities About the Application to Weibull Distributions
32.6.6 Detailed Application to Weibull Distributions
32.6.7 Examples of Application
32.7 Comparison Between Analytic and Monte Carlo Calculations
32.8 Associated Exercises
References

33 Petri Net Modelling
33.1 Quest for Complex Behaviour Modelling
33.2 History
33.3 Petri Net Use Within Automation and Dependability Fields
33.4 Basic Principles
33.4.1 Graphical Elements
33.4.2 Validation of Transitions and Firing Rules
33.4.3 Managing Conflicts
33.4.4 Introduction of Delays
33.4.5 Simple Examples
33.5 Extensions of the Basic PNs
33.5.1 Weighted Arcs, Inhibitor Arcs and Repeated Places
33.5.2 Predicates and Assertions/Messages
33.5.3 New Validation of Transitions and Firing Rules
33.6 Other Extensions
33.6.1 Priority of the Transitions
33.6.2 Suspended Events (Transition with Memory)
33.6.3 Probabilistic Switches
33.6.4 Dynamic Transitions
33.7 Miscellaneous Modelling Techniques
33.7.1 Common Cause Failure Modelling
33.7.2 Modelling Maintenance and Maintenance Supports
33.8 Undertaking System Modelling
33.8.1 Modelling of the System
33.8.2 Monte Carlo Simulation of the Model
33.8.3 Timetable
33.8.4 Pre-Processing and Table of Impacted Transitions
33.8.5 Preventing Endless Loops
33.8.6 Markov Graph Generation
33.9 Undertaking System Calculations
33.9.1 Availability and Unavailability
33.9.2 MTBF, MUT and MDT
33.9.3 Reliability and MTTF
33.9.4 Token Counting Related Results
33.9.5 Production Availability Calculation
33.10 Accuracy of Results and Data Uncertainty Handling
33.11 Building PNs Related to Large Systems
33.11.1 Main Drawback: Legibility Problem
33.11.2 Increasing Legibility of Large PNs
33.11.3 Modularization of Large PNs
33.11.4 Modelling of Binary Systems
33.11.5 Modelling of Multistate Systems
33.12 Coloured Petri Nets
33.13 Conclusion About PNs
33.14 Associated Exercises
References

34 Dynamic Modelling Exercises
34.1 Markovian Approach Exercises
34.1.1 Example: Pumping System
34.1.2 Description of the Exercises Related to the Pumping System
34.1.3 Solutions of the Exercises Related to the Pumping System
34.2 Petri Net Approach Exercises
34.2.1 Example: Service Station
34.2.2 Description of the Exercises Related to the Service Station
34.2.3 Solutions of the Exercise Related to the Service Station
Reference
Part V Production Availability and Functional Safety (SIL) Modelling and Calculations

35 Production Availability Related Modelling and Calculations
35.1 Characteristics of Production Systems
35.1.1 Size and Complexity of the Systems
35.1.2 Multistate and Multiphase Systems
35.1.3 Multiple Product Systems
35.1.4 Multiple Information Sources
35.2 Classification of Failure and Restoration Events
35.2.1 Failure Events
35.2.2 Restoration Events
35.2.3 Planned Maintenance
35.3 Characteristics of Production Availability Studies
35.3.1 Economic Calculations
35.3.2 Rare Events
35.4 Case Study for Comparison of Production Availability Models
35.4.1 Description of the Production System
35.4.2 Modelling with Flow Diagrams
35.4.3 Modelling with Reliability Block Diagrams
35.4.4 Modelling with Markov Graphs
35.4.5 Modelling with Petri Nets
References

36 Functional Safety Related Modelling and Calculations
36.1 Introduction and Standardization
36.2 Safety Integrity Concepts
36.2.1 Establishing the Safety Integrity Levels (SIL) Requirements
36.2.2 Low Demand Versus High Demand Mode of Operation
36.2.3 Probabilistic Requirements: PFDavg and PFH
36.2.4 Failure Classification
36.2.5 Loss-of-Power Versus Emission-of-Power Safety Systems
36.2.6 Safe Failure Fraction: The False Good Idea
36.2.7 Fault Tolerance (Architectural Constraints)
36.2.8 Use of k Out of n Logic
36.3 Probabilistic Calculations
36.3.1 Input Data Needs and Conservativeness
36.3.2 Simplified Analytical Approach
36.3.3 Markovian Approach
36.3.4 Boolean Approach
36.3.5 Petri Net Approach
36.3.6 Uncertainty Handling in SIL Calculations
36.4 Conclusions
36.5 Associated Exercises
References

Part VI Standardization, Data Collection and Uncertainties
37 Standardization
37.1 Introduction to Standardization
37.2 Standardization Versus Regulation and Certification
37.3 Standardization Organization Overview
37.3.1 Standardization Bodies
37.3.2 Development of a Standard
37.3.3 Type and Content of Standards
37.4 Safety and Dependability Related Standardization
37.5 Concluding Remarks About Standardization
References
38 Data Collection and Uncertainties
38.1 Introduction
38.2 The Bare Necessity of Input Data
38.3 Data Collection Standards and Databases
38.3.1 IEC and ISO Data Collection Related Standards
38.3.2 Databases
38.4 Reliability Data Estimation
38.5 Data Uncertainty Modelling
38.5.1 Data Accuracy Versus Field Feedback
38.5.2 Uniform and Triangular Distributions: Expert Judgment
38.5.3 Chi-Square Distribution: Statistics from Field Feedback
38.5.4 Bayesian Approach and Gamma Distribution
38.5.5 Log-Normal Distribution: Practical Approach
References
Index

Abbreviations and Notations

Notations: General Convention for Boolean and Logic Variables
X    Name of an item
X, X̄    Boolean variables (X in good state/faulty)
x, x̄    Logic variable (X in good state/faulty)
S; S, S̄; s, s̄    Name, states and logic variables related to a system
m/n    m out of n (majority vote logic)

Subscripts/Superscripts
as    Asymptotic
avg    Average
ccf    Common cause failures
eq    Equivalent
ind    Independent
no    Non-operating
o    Operating
S    Safe or System state
DD    Dangerous detected
DU    Dangerous undetected

Greek Letters aj ak;j ðtÞ b

Shape parameter Transition rate from state k to state j Common cause failure ratio of 2 items out of 2

β    Shape parameter of Weibull probability distribution
β    Inverse scale parameter
ε_m    State (m) efficiency
γ    Probability of failure upon demand
γ_i    Conditional probability of failure of item i upon a non-lethal shock
δ(t)    Dirac probability distribution
θ    First test interval
λ    Constant failure rate
μ    Mean of the probability distribution
η    Scale parameter of Weibull probability distribution/expected number of failures
ω    Occurrence rate for lethal shocks
ρ    Occurrence rate for non-lethal shocks
ρ    Planned production rate
σ    Standard deviation
λ(t)    Instantaneous failure rate
λ_V(t)    Instantaneous conditional failure intensity/Vesely failure rate
μ(t)    Instantaneous active repair rate
Λ    Constant failure rate of a system
Μ    Constant repair rate of a system
Φ    Impossible event
Ω    Certain event

Latin Numbers
I, II, etc.    Severity class

Latin Letters
h    Hour
n    Number of items, variables, etc.
q    Pseudo error factor
t    Elapsed time/instant
f(t)    Failure density
w(t)    Unconditional failure intensity/failure frequency
w̄(T)    Average failure frequency
I    Input


O T Tr Pl A, B I AðtÞ Ast ; Asa ApdðTÞ BPIF S ðCk Þ kÞ CIF S ðC DIF S ðCk Þ kÞ ðC DIM Hi  S FðtÞ Lð:Þ MðtÞ MIF S ðCk Þ N ð0; 1Þ PdrðtÞ PdyðtÞ PrðxÞ RAW S ðCk Þ RRW S ðCk Þ RRW 0 S ðCk Þ RðtÞ S#Ck SCk TeqðTÞ UðtÞ U st ; U sa kÞ VF RS ðC


Output Time period, interval of time Transition (Petri net) Place (Petri net) Matrix Identity matrix Instantaneous availability Steady-state, asymptotic availability Accumulated production over the duration T Barlow-Proschan importance factor related to item Ck belonging to system S Critical (Lambert) importance factor related to the down state of item Ck with regards to the system down state Diagnostic importance factor related to item Ck belonging to system S Differential importance measure related to the down state of item Ck with regards to the system down state and under hypothesis H i Unreliability Laplace transform Maintainability Marginal (Birnbaum) importance factor related to item Ck belonging to system S Standard normal probability distribution Instantaneous production rate Instantaneous production availability Probability of x Risk achievement worth related to item Ck belonging to system S Risk reduction worth related to item Ck belonging to system S Inverse of RRW S ðCk Þ Reliability Exclusive co-factor related to the up state of item Ck with regards to the system up state Inclusive co-factor related to the up state of item Ck with regards to the system up state Equivalent production time over the duration T Instantaneous unavailability Steady-state, asymptotic unavailability Vesely-Fussell importance factor related to the down state of item Ck belonging to system S


Abbreviations and Acronyms
Note: when used as variables, acronyms are noted in italics (e.g. MTTF is the acronym and MTTF is its value).

AST    Accumulated sojourn time
BBN    Bayesian belief network
BDD    Binary decision diagram
BDV    Blowdown valve
BN    Belief network
BPCS    Basic process command–control system
CCD    Cause–consequence diagram
CCF    Common cause failure
CDF    Cumulative distribution function
CMF    Common mode failure
CPT    Conditional probability table
DAG    Directed acyclic graph
DD    Dangerous detected
DET    Dynamic event tree
DFT    Dynamic fault tree
DRBD    Dynamic reliability block diagram
DT    Down time
DU    Dangerous undetected
ESD    Emergency shutdown
ETA    Event tree analysis
FD    Flow diagram
FIFO    First in, first out
FMEA    Failure mode and effects analysis
FMECA    Failure modes, effects and criticality analysis
FT    Fault tree
FTA    Fault tree analysis
GALE    Globally at least equivalent
GHA    Gross hazard analysis
HACCP    Hazard analysis and critical control points
HAZID    Hazard identification
HAZOP    Hazard and operability
HCR    Human cognitive reliability
HEART    Human error assessment and reduction technique
HEF    Hazardous event frequency
HFT    Hardware fault tolerance
HIPPS    High integrity pressure protection system
HIPS    High integrity protection system
HMI    Human–machine interface
HRA    Human reliability assessment
IEC    International Electrotechnical Commission
IEV    International electrotechnical vocabulary
ILS    Integrated logistic support
IPL    Independent protection layer
ISO    International Organization for Standardization
LCC    Life cycle cost
LCV    Level control valve
LOPA    Layer of protection analysis
LS    Logic solver
LT    Level transmitter
MAD    Mean administrative delay
MART    Mean active repair time
mcs    Minimal cut set
MDT    Mean down time
MFDT    Mean fault detection time
MIF    Marginal importance factor/Birnbaum importance factor
MLD    Mean logistic delay
MLE    Maximum likelihood estimator
MORT    Mean overall repair time
MRT    Mean repair time
MST    Mean sojourn time
MTBF    Mean time between successive failures (of repaired item)
MTTF    Mean time to failure
MTTFF    Mean time to first failure (of repaired item)
MTTRes    Mean time to restoration (of repaired item)
MUT    Mean up time
N.A.    Not applicable
OREDA    Offshore and onshore reliability data
P&ID    Piping and instrumentation diagram
PFDavg    Average probability of dangerous failure on demand (average unavailability)
PFH    Probability of failure per hour (average dangerous failure frequency)
PHA    Preliminary hazard analysis
PHL    Preliminary hazard list
PM    Preventive maintenance
PN    Petri net
PS    Pressure sensor
PSH    Pressure sensor high
PSHH    Pressure sensor high high
PSLL    Pressure sensor low low
PSS    Process safety system
PST    Process safety time
PSV    Pressure safety valve
PT    Pressure transmitter
PU    Process unit
PV    Pressure control valve
RAMS    Reliability availability maintainability safety
RBD    Reliability block diagram
RBI    Reliability-based inspection
RCM    Reliability-centered maintenance
RRF    Risk reduction factor
SD    Safe detected
SdF    Sûreté de fonctionnement (in English: RAMS)
SDV    Shutdown valve
SFF    Safe failure fraction
SIF    Safety instrumented function
SIL    Safety integrity level
SIS    Safety instrumented system
SU    Safe undetected
SV    Safety valve
SWIFT    Structured what-if technique
THERP    Technique for human error rate prediction
TTF    Time to failure
TTR    Time to restoration
UT    Up time

Part I

Introduction, Background and Overview

Chapter 1

Introduction

1.1 Human Enterprises Involve Risks

Since ancient times, human beings and even their predecessors have had to rely on their tools and weapons to survive in a wild and dangerous natural environment. This is why they designed simple but effective artefacts made of wood, bone and stone. Plenty of objects from the Stone Age have been exhumed from archaeological sites spread all over the planet, and this testifies to an intensive lithic craft industry in which a kind of standardization can even be observed (raw material, form, size). The Stone Age was followed by the Bronze and Iron Ages, during which the artefacts were improved. At this point, it is interesting to notice that the lifetime of the artefacts began to decrease: this is mainly observable for iron objects, which are rapidly destroyed by corrosion, so that an iron sword a few centuries old is less likely to be intact when exhumed than a flint biface several millennia old.

Agricultural activities were dominant in historical times, but dedicated corporations and guilds of craftsmen continued to develop and improve the production of objects until the industrial revolution occurred in the nineteenth century, when industry as it is known nowadays was born. Since that time, it seems that the world has entered the Anthropocene era, in which industrial systems become more and more complex and more and more likely to produce new artificial hazards from which people and the environment have to be protected (safety). At the same time, the economic point of view becomes more and more prevalent, and designing efficient and cost-effective (i.e. dependable) systems also becomes more and more important (dependability). Thus, any human enterprise implies some risks, those risks increase with the complexity of the systems developed nowadays, and numerous events have occurred as reminders:
• Fukushima (Japan, 2011), Chernobyl (USSR, now Ukraine, 1986), Three Mile Island (USA, 1979) for the nuclear risk.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_1


• Boeing 737 Max 8 (Ethiopia, 2019), Concorde (France, 2000), Tenerife (Spain, 1977) for the aeronautical risk.
• Ariane V (France, 1996), Challenger (USA, 1986), Apollo 13 (USA, 1970) for the space risk.
• Bhopal (India, 1984), Flixborough (United Kingdom, 1974), Seveso (Italy, 1976) for the chemical risk.
• Elgin (North Sea, United Kingdom, 2012), Macondo (Gulf of Mexico, 2010), Piper Alpha (North Sea, Scotland, 1988) for the oil and gas risk.
• Grande America (France, 2019), Prestige (Spain, 2002), Exxon Valdez (Alaska, 1989), Torrey Canyon (Scilly Islands, United Kingdom, 1967) for the oil spill risk.
• Lac-Mégantic (Canada, 2013), Santiago de Compostela (Spain, 2013), Eschede (Germany, 1998) for the railway risk.
• Costa Concordia (Italy, 2012), Estonia (Baltic Sea, 1994), Herald of Free Enterprise (Belgium, 1987), RMS Titanic (North Atlantic Ocean, 1912) and several ferries in Indonesia or Korea for the naval transportation risk.

The various blackouts observed throughout the world (New York, France), the stock exchange crashes (New York, worldwide), the pandemics (plague, Spanish influenza, AIDS, COVID-19) and climatic disturbances could be added to this list, which validates the commonly claimed assertion that any human enterprise involves risks and that zero risk does not exist. As a matter of fact, many managers were very reluctant to accept this assertion in the seventies because they claimed that applying rules and regulations was necessary and sufficient to make the risk disappear. Nowadays, some of the same managers have completely changed their minds, as they have realized that they can use it as an alibi to explain that accidents occur because everyone knows that zero risk does not exist!

1.2 Philosophy to Master the Risks

If zero risk is a utopia, the reduction of risk to an acceptable level has been a quest since the beginning of human history. From the trial and error approach used since the early days to the sophisticated approaches used nowadays, the key point is to use past experience to improve the future. In this view, the thought of the French positivist philosopher Auguste Comte, who said about philosophy “like the plain common sense, the true philosophical mind consists in knowing what is, to predict what will be, in order to improve it as far as possible”, can be adopted as a way of thinking for reliability engineers/risk managers: knowing ⇒ forecasting ⇒ improving.


The various approaches developed in the reliability field, including those proposed in this book, have been developed to help do that, provided that they are combined with a minimum of plain common sense on the part of the analysts.

Chapter 2

Background

2.1 A Short Story of Reliability Analysis

2.1.1 Premises

Before looking at the future, it can be very instructive to look at the past. At all times, the authorities have tried to master the risks by issuing laws and regulations. This was generally done after some detrimental event had been observed. A good example of the way risks were managed in the nineteenth century is given by the orders from Napoléon I to the Prefect of Var (a department in the south-east of France) about the forest fires occurring there: “Shoot on the spot anyone suspected of having lit the fires, and you will be replaced by a new Prefect in case of new fires”. History has not recorded what actually happened, but this is likely to have had only a small influence, as fires still occur every year in this area!

Even if the strong repression described above is no longer in use, rules or regulations continue to be issued almost every time new serious detrimental events occur. Deterministic in nature, this very old trial and error approach is expected to prevent the unwanted events from occurring again and is still the basis for designing safe systems. Nevertheless, rules and regulations are effective only to some extent, as the unwanted events are seldom completely prevented and, in addition, they have no impact on events which have never been observed. Probabilistic oriented techniques were therefore introduced as a complement, through a relatively slow process which started at the end of the First World War. At that time, the idea of quantitatively comparing systems designed to do the same things was born in the minds of specialists in aeronautics. As they had noticed that twin-engine aircraft seemed less prone to crashes than single-engine aircraft, they decided to calculate an indicator equal to the ratio of the number of crashes to the number of flight hours. As expected, this indicator was higher for single-engine than for twin-engine aircraft, and this confirmed the feeling of the specialists and also proved the usefulness of redundancy with regard to reliability. This simple statistical approach was used to compare aircraft from a crash point of view until the thirties, when a new idea emerged: beyond the comparison between aircraft from past events, this indicator could be useful to predict future crashes. This is the starting point of a new discipline, which was recognized and named reliability theory later on, in the fifties.
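The indicator described above, and its later use for prediction, can be sketched in a few lines of Python. This is only an illustration: the function names and all the figures (numbers of crashes and flight hours) are hypothetical and are not taken from historical records.

```python
def crash_rate(crashes: int, flight_hours: float) -> float:
    """Indicator: number of crashes divided by the number of flight hours."""
    if flight_hours <= 0:
        raise ValueError("flight_hours must be positive")
    return crashes / flight_hours


def expected_crashes(rate: float, future_flight_hours: float) -> float:
    """Predictive use: expected number of crashes, assuming the past rate stays valid."""
    return rate * future_flight_hours


# Hypothetical observations (illustrative values only, not historical data)
single_engine = crash_rate(crashes=12, flight_hours=40_000)
twin_engine = crash_rate(crashes=3, flight_hours=30_000)

print(f"single-engine: {single_engine:.1e} crashes per flight hour")
print(f"twin-engine:   {twin_engine:.1e} crashes per flight hour")
print(f"expected twin-engine crashes over 50,000 h: "
      f"{expected_crashes(twin_engine, 50_000):.1f}")
```

The first function corresponds to the comparative use of the indicator (past events), the second to its predictive use, which is valid only as long as the observed rate is assumed to remain representative of future operation.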

2.1.2 The Beginning

It was during the Second World War that reliability theory really took off. The most famous example is, unfortunately, its use for developing the German V1 flying bomb and V2 rocket. At the beginning, all the V1s exploded on their launch pads or fell into the Channel. This lasted until the aircraft engineer Robert Lusser became involved in the project and found what was wrong in the V1 design. Making an analogy with a simple chain, he brought to light that “a chain cannot be stronger than its weakest link”. Translated into the reliability field, this led to the first and fundamental reliability property: a system made of several components in series cannot be more reliable than the least reliable of these components. Translated in turn into a probabilistic form, this led to the famous Lusser theorem: “The probability of success of a series of components is equal to the product of the probabilities of success of each of these components”. From this point, it was understood that identifying and improving the weak points was of utmost importance for improving the reliability of a given system. Even if this seems only common sense, applying this simple idea probably made it possible to improve the probability of “success” in launching the V1 and then the V2… but this was certainly not seen as progress by those who were exposed to these bombs. Redundancy is a technique widely used to increase reliability, and nowadays the hunt for weak points often consists in identifying the possible common cause failures between redundant elements.

In the forties and fifties, the reliability approach spread mainly in the aeronautic, nuclear and military industries, and mainly in the United States. During the Second World War, 50% of the spares and equipment in storage became unserviceable before use.
At the beginning of the Korean War, about 70% of US Navy electronic gear did not operate properly, and the failure analysis of items was initiated later on (Kececioglu 2002). When the US Department of Defense noticed that for every $1 of electronic equipment, $2 was spent on maintenance per year (Villemeur 1988 or 1992), it became clear that the equipment should be reliable by design: the science of failure was born. Reliability requirements were introduced in the calls for tender for electronic components but, at that time, the analysis techniques were not available and the engineering know-how was not sufficient to meet the requirements. Therefore, the providers willing to bid had to demonstrate the reliability of their products from statistics obtained by undertaking long-term tests. Later on, the results were used to issue the very famous reliability data handbook Military Handbook 217, “Reliability Prediction of Electronic Equipment”, which has been used for decades (first issue in 1965) to assess the reliability of electronic systems. The latest issues of this document (Mil HDBK 217 F 1995; Quanterion 217Plus 2015) might be replaced by the standard IEC 63142, based on the FIDES project (UTE 2011), once it is issued.

It was also in the forties that FMECA (failure modes, effects and criticality analysis) was originally developed by the US Army. This bottom-up (inductive) approach is still widely in use nowadays and has been standardized in IEC 60812 (2019). At the same time as the science of failure began, the development of the science of the restoration of failed equipment (maintainability) also began, with the aim of decreasing maintenance costs. Then, at the end of the fifties, the importance of human failures was brought to light and the first attempts to take them into account were made, mainly in aeronautics. Reliability engineering started to develop into a separate discipline in 1952; the first National Symposium on Reliability and Quality Control was organized as early as 1954 and the first Annual Reliability and Maintainability Symposium was held in 1962 in the United States (Kececioglu 2002).
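Lusser's theorem, quoted earlier in this section, can be checked numerically. The sketch below assumes that the components fail independently (an assumption implicit in the product rule); the function name and the component values are hypothetical and chosen only for illustration.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Probability of success of components in series (product rule).

    Assumes independent components; each value must lie in [0, 1].
    """
    values = list(component_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"probability out of range: {r}")
    return prod(values)


# Hypothetical probabilities of success of three components in series
components = [0.99, 0.95, 0.90]
system = series_reliability(components)

print(f"series system reliability: {system:.5f}")
print(f"weakest component:         {min(components):.5f}")

# The chain analogy: the series system is never better than its weakest link
assert system <= min(components)
```

With these illustrative values the product is lower than the reliability of the weakest component, which is the chain analogy expressed in probabilistic form, and it shows why improving the weakest component has the largest effect on the result.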

2.1.3 A Step Forward of the Reliability Approach

In the sixties, the work carried out in the fifties was extended to systems made of electrical, mechanical or hydraulic components. This required the development of new analysis techniques better adapted than those developed for electronic components. It is in this context that H. A. Watson of Bell Laboratories designed the so-called FTA (fault tree analysis) method, which was used to assess the safety of the launching of the Minuteman missiles (Watson 1961). Very quickly this top-down (deductive) approach spread throughout industry because it made it possible to describe how complex (or rather, complicated) systems can fail. From the beginning it was used by NASA for the Mercury, Gemini and Apollo (after the Apollo 1 launch pad fire) programs and by Boeing to design commercial aircraft. Since the sixties, fault tree analysis has been widely used, often in combination with FMECA. It is still widely used by reliability engineers and has been standardized in IEC 61025 Ed. 3.0 (in progress), and it is invaluable as it is the only top-down approach used to model and analyse system reliability in its broad sense. It should be noted that RBD (reliability block diagram) models were probably in use at this time, but history has not kept track of who invented them or when. It was in the sixties that HAZOP (hazard and operability study) was developed in the chemical industry. Like FMECA, it is a bottom-up approach; it has been standardized in IEC 61882 (2016) and is still widely used nowadays.

The first comprehensive textbook exclusively devoted to reliability engineering was published by Igor Bazovsky in 1961 (Bazovsky 2004) and MIL-STD-882 “System Safety Program Requirements” was first issued in 1969. The PHA (preliminary hazard analysis) was formally instituted and promulgated by the developers of MIL-STD-882A. Like FTA and HAZOP, it is still widely used. In parallel with the technological part, A. D. Swain developed, in 1963, the THERP method (technique for human error rate prediction) (Guttmann and Swain 1983) to assess the safety of nuclear weapons. This is the ancestor of the HRA (human reliability assessment) (Bell et al. 2009) approach used for evaluating and reducing the probability of human error occurring during a specific task. With the arrival of the Boeing 747, a wide-body aircraft, airline operators realized that their maintenance activity would require considerable change due to a large increase in scheduled maintenance costs. They jointly organized the so-called Maintenance Steering Group (MSG), which issued the MSG-1 document in 1968. The term “reliability centered maintenance” (RCM), which first appeared in the civil aviation industry, is now commonly used in all industries (IEC 60300-3-11 2017). As reliability modelling improved, the need for reliability data increased: no probabilistic calculation is possible without reliability data. Therefore, the first tables providing reliability data for components (pumps, valves) or human factors were established and issued from the beginning of the sixties.

2.1.4 Consolidation of the Reliability Approach

Following the lead of the aerospace industry, the nuclear power industry began to use these methods in the design and development of nuclear power plants in the seventies and brought about the main improvements observed during that period. Due to its “original sin”, Hiroshima, the nuclear industry was obliged, for the first time in industry, to prove that the risk was acceptable even before the plants were built, that is to say without experience from previously operated installations. Therefore, the lack of field feedback had to be compensated by modelling, analysis and calculation, and many engineers were mobilised for this purpose. It is in this context that N. Rasmussen and his teams, working for the US Nuclear Regulatory Commission, completed and issued the first comprehensive risk analysis of nuclear power plants: PWR (pressurized water reactor) and BWR (boiling water reactor). The FTAs invented in the sixties were used to analyse the system failures and were then combined through event trees to identify the scenarios leading to various detrimental consequences and to calculate their probabilities. The THERP method was also used to assess the influence of the human factor on nuclear safety. This resulted in the large Wash 1400 report (Rasmussen 1975) which, beyond the results about nuclear safety, consolidated the success of FTA and established that of ETA (event tree analysis). It also brought to light the importance of common cause failures. The ETA approach has been widely used since the seventies and is still in use nowadays. The success of
the event trees which are now standardized in IEC 62502 (2010) has, unfortunately, almost completely obliterated the cause-consequence diagram approach (Nielsen 1971) which has been previously developed and was easier to handle: it could be interesting to re-discover this approach. In France, the first methods and tools allowing to perform probabilistic safety studies have been introduced in the sixties in telecom by Schwob and Peyrache (1969) and then in aeronautics by Lievens (1976) and nuclear industries by Pagès and Gondran (1986) and by Villemeur (1988). A bureau devoted to probabilistic safety studies has been created in the French atomic energy commission (CEA—Commissariat à l’énergie atomique) since the beginning of the seventies. The approaches developed at this time included FMECA, FTA, ETA, cause consequence diagrams and Markov graphs. A great effort has been made since 1974 to develop software packages dealing with probabilistic calculations: analytical calculations based on fault trees or Markov graphs and Monte Carlo simulation based on specific models. It has to be noted that, developed in 1906, the Markov approach is certainly the oldest approach used for reliability calculations. It is very much used for academic works. Like Monte Carlo simulation, the use of Markov calculations increases thanks to the increasing computer calculation power. Nowadays it is often used in combination with fault trees and has been standardized in IEC 61165 (2006). At the end of the seventies, the UKAEA (United Kingdom atomic energy authority) has applied the techniques developed in the nuclear industry to perform the complete safety analysis of the Canvey Island (Canvey 1978) petrochemical complex. This study, which has been almost entirely published, has been the first such study performed on non-nuclear installations. No casualties but huge economical losses were reported from the Three Mile Island accident (Rogovin et al. 
1979) which occurred at the end of the seventies. As usual, this stimulated research and development in the domain of nuclear safety in general and reliability in particular. Within the same period, the Aerospatiale company (now Airbus Industry) developed new reliability analysis approaches (Lievens 1976) adapted to aircraft development and based on the combination of summarized significant failures. They were systematically used to design the supersonic aircraft Concorde and then for the Airbus program, and have certainly contributed to the high degree of reliability of these aircraft. Nowadays, air transport is one of the safest means of transportation even if, unfortunately, accidents occur from time to time, which consolidates the idea that zero risk does not exist and confirms that designers have to remain vigilant at all times about reliability and field feedback. It is during the development of Concorde that the first probabilistic regulations were promulgated. Failures were classified as minor, significant, critical or catastrophic, and risk objectives were set in terms of failures per flight hour. For example, at that time the objective was set to 10⁻⁷ per hour for catastrophic failures (i.e. crash of the aircraft). The aim was to divide by 10 what was observed on the previous aircraft generation in order to greatly increase flight safety and to keep the crash frequency at an acceptable level. This was done for obvious safety reasons
but also to avoid economic problems due to flight ban from safety authorities or to user rejection of aircraft transportation: the single accident of the Concorde has quickly led to its abandonment.
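
To give an order of magnitude of what an objective of 10⁻⁷ per flight hour means, the following minimal sketch compares it with a rate ten times higher (the previous generation, following the "divide by 10" aim). It assumes a constant failure rate (exponential law), which is an assumption made here for illustration and not a statement of the regulation; the fleet size and yearly utilisation are purely hypothetical figures.

```python
import math


def prob_at_least_one(rate_per_hour: float, hours: float) -> float:
    """Probability of at least one event over `hours`, assuming a constant rate."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is very small
    return -math.expm1(-rate_per_hour * hours)


CATASTROPHIC_OBJECTIVE = 1e-7                       # per flight hour (figure quoted in the text)
PREVIOUS_GENERATION = 10 * CATASTROPHIC_OBJECTIVE   # objective = previous level divided by 10

# Hypothetical exposure: 100 aircraft flying 3,000 hours per year each
fleet_hours_per_year = 100 * 3_000

for label, rate in [
    ("previous generation", PREVIOUS_GENERATION),
    ("objective", CATASTROPHIC_OBJECTIVE),
]:
    p = prob_at_least_one(rate, fleet_hours_per_year)
    print(f"{label}: {rate:.0e} per hour -> P(at least one event per year) = {p:.3f}")
```

Under these assumptions, the yearly probability of at least one catastrophic event in the hypothetical fleet drops from roughly 26 % to roughly 3 %.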

2.1.5 Dissemination in All the Industry Sectors

The link between the seventies and the eighties was made by the Seveso accident, which occurred in Italy in 1976. This accident made European citizens suddenly aware of the risks linked to chemical processes: a cloud of dioxin was released into the atmosphere and, in many respects, this was similar to a nuclear accident, with toxic fallout that the exposed inhabitants were not able to detect. They could not know whether they were contaminated or not, because some experts explained in the media that there was absolutely no risk while, at the same time, others explained that dioxin was one of the most dangerous products ever produced by human beings! This frightened the European population and, although no casualties were recorded, this accident is still considered to have been very serious. It led the European Union to issue in 1982 the Seveso directive, which requires the hazards related to major-risk plants to be identified and safety reports to be produced assessing the probability and consequences of these identified hazards (Seveso III 2012). It has to be noted that the 7,575 official immediate deaths (and certainly many more) of the Bhopal accident (India) in 1984 (Wikipedia 2020) did not have such an impact on regulations … but it was far away. The Seveso 1 directive was the first regulation in which probabilistic assessment and fault trees were explicitly mentioned and, at the present time, the 3rd version of the directive has been in use since 2015. Beyond the aeronautics, space and nuclear sectors already mentioned, this directive as well as the need to improve the economic aspects have progressively led to the dissemination of the reliability approaches in most industry sectors: oil and gas, chemistry, petrochemistry, railways, automobile, etc. 
Further work has been undertaken to improve the previous approaches, new approaches have been developed and this process is still in progress nowadays. It has been boosted by the huge increase in the computing power of personal computers, which in turn has made it possible to develop powerful reliability software packages that help analysts perform, at low cost, accurate analyses and exact calculations on large industrial systems. This has even opened the way to the use of Monte Carlo simulation which, previously, was too time-consuming and costly to be commonly used for reliability studies. In parallel to these improvements, an intensive standardization effort has been undertaken toward civil industry. It is difficult to list all the developments achieved since the eighties, but the following can be mentioned:

• improvement in Boolean model calculations (fault trees, reliability block diagrams, event trees): implementation of the binary decision diagrams (BDD) (Bryant et al. 1986);
• improvement in Markov model calculations: increase of the model size and use in combination with fault trees (FT-driven Markov processes);
• development related to the functional safety (IEC 61508 2010; ISO/TR 12489 2013): safety integrity levels (SIL) requirements for safety instrumented systems (SIS) and techniques like risk graphs, LOPA (layer of protection analysis) or bowtie models (CCPS 2001; Torres-Echeveria 2014; IEC 61511 2016; ISO/IEC 31010 2019; Wikipedia Bowtie 2020);
• development related to maintenance: reliability centred maintenance (RCM) (IEC 60300-3-11 2017), reliability-based inspection (RBI), integrated logistic support (ILS) (IEC 60300-3-12 2011);
• development related to dynamic systems: dynamic reliability block diagrams (DRBD), dynamic fault trees (DFT) and especially Petri nets (IEC 62551 2012) used in combination with Monte Carlo simulation;
• extension to economic aspects (dependability), e.g. production availability of production systems (ISO 20815 2018);
• development of high-level formal languages to model both the good functioning and the system failures: specialized language (e.g. AltaRica (Batteux et al. 2019)) or deriving from model-based engineering (e.g. SysML, UML, AADL) (Roques 2013; SysML 2020; Wikipedia SysML 2020; UML 2020; Wikipedia UML 2020; SAE 2012; Wikipedia AADL 2020);
• development of the link with the LCC (life cycle cost) (IEC 60300-3-3 2017; ISO 15663 Ed. 1.0 2021);
• development of new approaches to take the human factor into account: human cognitive reliability (HCR), or human error assessment and reduction technique (HEART) (William JC 1985).
A complete corpus of methods and tools is now available for reliability studies, which implies that more and more reliability or operational data are needed to feed the models and to perform accurate calculations. Less effort has been devoted to reliability data collection than to improving models and, in addition, the existing data bases are generally dedicated to a given industry sector and difficult to use outside this sector. This is why analysts are often starved of data but, nevertheless, the following can be mentioned:
• data bases developed within the military equipment framework (MIL-HDBK-217 F 1995);
• data bases developed within the regulatory framework of nuclear industry (e.g. SRDF (Aupied and Procaccia 1984));
• OREDA data base developed in oil and gas industry since 1982 (OREDA 2020);
• PERD (Process Equipment Reliability Database) developed in the chemical industry (CCPS 2020);
• FIDES (IEC 63142 to be issued; UTE 2011), already mentioned, developed for electronic components;
• etc.

The standards IEC 60300-3-2 (2004) (Collection of dependability data from the field), ISO 7385 (1983) (Nuclear power plants—Guidelines to ensure quality of collected data on reliability), ISO 14224 (2016) (Collection and exchange of reliability and maintenance data for equipment) and ISO 6527 (1982) (Nuclear power plants. Reliability data exchange) can be used as bases to undertake effective reliability data collections.

2.2 Why, When and How to Implement Reliability Studies

2.2.1 Why

As shown in the short history above, the safety of industrial systems was ensured for a long time by the know-how of engineers (the state of the art) and by the rules and regulations promulgated after serious accidents had occurred. This practice was effective for a long time but, since the nineteenth century, the complexity of systems has continuously increased, as has the severity of the consequences of accidents. At some point it became clear that, although the application of the state of the art, rules and regulations was still of utmost importance, it was no longer able to ensure an acceptable level of risk for high risk activities: the above approach, deterministic in nature, remains necessary but is no longer sufficient. In addition, nowadays and beyond safety, the increased commercial pressure on companies also requires accurate assessments of economically oriented risks (e.g. plant economic effectiveness based on availability calculations). This is the aim of dependability analyses. New complementary approaches being obviously needed, reliability and risk analyses were developed to close the gap: they have proven to be very effective to identify, analyse, manage and mitigate risks.

2.2.2 When

Carrying out a reliability study is relatively costly and, like an insurance premium, it seems too expensive … as long as no problem occurs. Therefore, the decision to perform such an analysis should be taken with great care, and the main reasons for deciding to launch a study are the following:
• Novelty: the past experience gained from similar systems in operation is the main basis for designing industrial systems. Therefore, when designing a new system for which such experience is not available, it is necessary to anticipate the events which are likely to occur once it is actually in operation. The use of reliability models is an easy way to help the analysts create this missing experience on paper. They can play with the models and perform calculations to identify
the strengths and weaknesses of the system under study and implement measures preventing the unwanted events (related to safety or economic losses) from occurring even before they can be observed on the system in operation.
• Complexity: the increasing size, the involvement of more and more specialized disciplines (electronics, electrotechnics, hydraulics, software, …) and the introduction of more and more automation make system reliability more and more difficult to achieve by design. Since Aristotle it is well known that the whole is not the sum of its parts, and this is confirmed more recently by Bellman’s theorem (Wikipedia Bellman 2020) which says that a system made of optimized parts is not necessarily optimum itself (see Fig. 2.1). Therefore, even when each part of a system is designed as well as possible (according to the specific discipline), this does not guarantee that the whole system is going to work well, especially when it is made of various interacting subsystems whose interactions can be difficult to identify and take into account. As there is no specialist to put the parts and the subsystems back together, difficulties often occur when they are assembled to form the final system. Fortunately, the analysis of failures does not care about the breakdown of the systems between various disciplines and different subsystems. The reliability approaches naturally direct the analysts toward a systemic point of view and provide efficient means to cope with the above difficulties.

Fig. 2.1 Non optimum system made of optimum parts

• Competing risks: safety and dependability generally lead to antagonistic risks: improving safety without regard to the economic aspects is likely to degrade dependability and vice versa. With the increasing system complexity, other
competing risks are emerging: for example, closing a valve to prevent an overpressure (safety action) can lead in turn to safety problems when reopening the valve. Therefore, many trade-offs have to be made and reliability and risk analyses prove to be very effective as decision aids to determine the best ones.
• Severity of consequences: the application of the traditional deterministic approach based on know-how, rules and regulations is sufficient when the expected consequences are low, whereas it is not sufficient for high risk activities. This is why, in addition, reliability studies are commonly performed in the industry sectors where events can occur with potentially severe consequences with regard to safety, environment, assets or production. This makes it possible to identify and implement preventive actions, to decrease the probability of occurrence of the detrimental events and to implement means to mitigate the consequences when, unfortunately, the detrimental events occur.
• Occurrence of detrimental events: the occurrence of a detrimental event with severe consequences for safety, environment, assets or production is always an opportunity for the operators of similar systems to wonder what to do to prevent this event from occurring again. This generally triggers changes in the internal company rules and a significant increase in the number of reliability studies/risk analyses performed in this industry sector. This has been observed in the nuclear field after the Three Mile Island (1979), Chernobyl (1986) or Fukushima (2011) accidents as well as in the oil and gas field after the Piper Alpha (1988) or Deepwater Horizon (2010) accidents. When the detrimental event is symptomatic of a deep problem, new regulations are also often promulgated and/or new standards are developed, as has been the case in Europe after the Seveso accident or in the oil and gas field after the Deepwater Horizon accident.

2.2.3 How

As previously written, know-how, rules and regulations are the bases for designing systems. In fact, it would not be realistic to try to design a new system from scratch by using only risk analyses/reliability studies. Figure 2.2 illustrates how deterministic and probabilistic approaches can be combined to design a system according to safety and operational performance requirements. The safety requirements come from regulations (from safety authorities), standards (international or sectoral) and company rules, whereas the operational requirements are established only to satisfy company needs. The design is generally not a one-shot process but an iterative one where, at each stage, the safety and operational performances are compared to the criteria to be fulfilled. As long as some criteria are not fulfilled, the design has to be improved and the result compared again to the criteria; the process stops when all the criteria are fulfilled. When the criteria are easy to fulfil by using the deterministic approach, no complementary approach is needed. When they are difficult to fulfil, it is often worthwhile to switch to the probabilistic approach to determine which improvements are really needed to reach an acceptable risk level while satisfying the operational requirements.

Fig. 2.2 Complementarity between deterministic and probabilistic approaches (flowchart; box labels: Safety principles, Operational requirements, Deterministic criteria, Probabilistic criteria, Design with regards to safety and operation, Deterministic approach, Probabilistic approach, Risk analysis/reliability, Decision aid, Acceptable (Yes/No), Accepted, Verification/Survey)

The problem often occurs when multiple deterministic safety criteria, independent from each other and related to different purposes, have to be applied. This can lead to requirements which are too conservative and, sometimes, contradictory. In this case, only a systemic approach can help the designers to decide what to do. In addition, the deterministic approach is Manichean in nature and there is no room for an intermediate answer between “the rule is applied and the system is OK” and “the rule is not applied and the system is not OK”. To cope with this problem, a more nuanced and flexible approach is useful. This is precisely what can be expected from the probabilistic approach (risk/reliability analyses) which has proven, as shown in Fig. 2.2, to be very useful as a complement to the deterministic approach. In this case, probabilistic criteria (e.g. probability of occurrence of unwanted events) can be used instead of deterministic criteria to get the final design. The probabilistic approach can also be used to verify which risk level can actually be expected from a system designed in a deterministic way. In addition, when contradictory rules are encountered, it is a way to prove to a safety authority that the safety level is acceptable even if not all the regulations have been implemented.

2.3 Name for the New Discipline

Depending on the industry sector, the probabilistic approaches mentioned above have been gathered under various denominations like reliability studies, risk analysis, probabilistic risk assessment, risk management, cindynics (from the Greek cindynos: danger/hazard), aleatics (from the Latin alea: dice game/randomness), etc.

In France, the term adopted has been “sûreté de fonctionnement (SdF)”, which means something like “functioning sureness” where “sure” is used with the same meaning as in “assurance”. This relates to the confidence in the good functioning of systems and encompasses a wide range of points of view, from the impact of events on safety to their impact on economic aspects. This is a good point of view as those aspects should be considered at the same time to find the best trade-offs between them. SdF has no exact equivalent in English but it is close to the meaning of the acronym RAMS (reliability, availability, maintainability and safety) which is often used. The main difficulty comes from the fact that safety and economic aspects (dependability) are considered separately in the standardization field, where dependability and safety are treated in different technical committees: the IEC TC56 committee alone for dependability and numerous other committees for safety. French and English being the two languages used for international standardization, equivalent terms should be used in the two languages: this is not the case for “dependability”, which has been improperly translated as “sûreté de fonctionnement”. This is quite confusing as safety is, in principle, excluded from the scope of the IEC TC56 although it is the main topic addressed by French reliability engineers working in the SdF field. Even if the gap has closed a little in recent years (safety is no longer completely taboo in the TC56), this terminology problem remains and analysts should be aware of it and cautious about it. It is to avoid these confusing terms that “Reliability assessment of safety and production systems” has been adopted for the title of this book, which aims to cover both safety and dependability.

2.4 Notion of Risk

2.4.1 Etymology. Danger Versus Peril, Risk and Hazard

Until now in the book, the terms risk, reliability, etc. have been used in their vernacular acceptation, as most people would use them. In fact, according to the needs of various industries, disciplines, people or standards, these terms and many others have been defined in different ways and have become highly polysemic: this is especially the case of the term “risk”, for which dozens of definitions can be found (e.g. in the ISO standards). To catch the very meaning of the term “risk”, it is necessary to look at its origin. English dictionaries indicate that “risk” comes from the French word “risque”. This is interesting as it implies that the word has, in principle, the same meaning in both English and French. About the etymology of these words, the French dictionaries indicate that they come from the old Italian “risco” which, in turn, comes from the Latin “risicus, resecare” (cut). This Latin etymology (i.e. what is cutting) has led to the acceptation of steep rock kept in the Spanish “risco” meaning
reef. This last term is clearly associated with the risks incurred by goods transported by boat. From this origin, the present-day meaning is therefore the result of a long semantic process. It has always been associated with a detrimental (negative) aspect and never with a beneficial (positive) one. According to dictionaries, it means: “Danger/losses more or less foreseeable”. When looking in the usual dictionaries for the difference between risk, hazard, danger and peril, one enters a kind of vicious circle because the meanings are very close and each term is used to define the others. In the end, they are presented as if they were quasi-synonyms. This is not accurate enough for our purpose but, fortunately, a key is given in the Littré, which is a reference dictionary for the usage of the French language. It indicates that “Risk may be easily distinguished from danger/hazard as it contains less the idea of peril than that of random chance but considered on the wrong side”. Although this was written at the end of the nineteenth century, it clearly identifies the characteristics of the risk concept:
• random chance: i.e. probability or frequency of occurrence of an event;
• detrimental consequences of this event if it occurs.
The danger in itself is simply related to the list of hazardous events that a system is able to generate, with no link to their probability of occurrence.

2.4.2 Safety Versus Risk Management Definitions

Among the dozens of definitions of risk, two are very important as they are used as bases to develop standards:
• ISO/IEC guide 51 (2014): “Combination of the probability of occurrence of harm and the severity of that harm”
• ISO guide 73 (2009): “Effect of uncertainty on objectives”.
The ISO/IEC guide 51 is used for developing safety standards whereas the ISO guide 73 is used for developing risk management standards. The ISO/IEC guide 51 is fully in line with the acceptation of the term risk as explained in the previous Subsection 2.4.1: something detrimental/unpleasant (harm) which can occur (probability of occurrence). This would be perfect were this definition not more related to the measure of the risk than to the concept of risk in itself. This is why the risk managers have designed the definition found in the ISO guide 73, which intends to overcome the shortcomings of the definition found in the ISO/IEC guide 51. The problem is that several difficulties arise when using this definition:
• uncertainty: this concept is not defined in the ISO guide 73 and therefore, according to the rules for building consistent terminology, the concept of risk is not defined either;
• objective: this implies that no risk exists when no objective is defined;
• positive risk: as the effect can be positive or negative, the users of this definition tend to consider, against the sound meaning of the concept, that the risk is no longer only negative but can also be positive.

Uncertainties: They are usually classified into two different categories, aleatory uncertainties and epistemic uncertainties:
• The aleatory uncertainties are linked to randomness which, according to B. Mandelbrot (Wikipedia 2020), can be split into mild randomness and wild randomness. Wild randomness cannot be managed because it leads to chaotic behaviours.
• The epistemic uncertainties are linked to the ignorance of the phenomena. Obviously, something which is completely ignored is not manageable. As a consequence of the incompleteness theorems developed by Kurt Gödel (Wikipedia 2020) in 1931 in the formal logic framework, the introduction of smart components can lead to unpredictable behaviours. As such components are more and more widely used, they are likely to become an important source of epistemic uncertainties in the near future.
Finally, only mild randomness, i.e. uncertainty linked to the usual probability distributions, is actually manageable, and this comes back to the usual probability framework mentioned in the ISO/IEC guide 51.

Objective: With the ISO guide 73 definition, the risk disappears when the objective is not defined: no objective implies no risk! This would be an easy route to a head-in-the-sand policy! Joking apart, the definition is problematic as it obviously excludes the risks against which nothing can be done, like natural risks (climate problems, earthquake, meteorite, giant electromagnetic pulse, …), or which are incurred but not identified yet (unknown diseases, exposure to substances with unknown toxic effect, …).

Positive Risk: According to its definition, the risk is deeply linked to detrimental potential events and claiming that it can be positive is a little bit puzzling. 
This mistake is often made by people who say that they have taken the risk of winning the lottery when they have actually taken the risk of losing their bet! Following those people, the term “positive risk” is often used in casual conversation between the users of the risk management standards (ISO 31000 2018). However, it does not appear in the ISO guide 73 (2009), which only mentions that the effects and the consequences can be positive or negative. There is nothing new in that, as the deviation from the expected values is the direct result of the random aspects of both the occurrence and the consequences of the related events. More interestingly, and beyond natural randomness, the aim seems to be to cope with changing situations which
can become less favourable or more favourable than expected. This is a part of the epistemic uncertainties which are revealed and which require decisions to be taken in a reactive manner. The ISO 31000 talks about threats and opportunities and this explains why the term “opportunity” is often associated with the “positive” (sic) risk. Anyway, even if an opportunity is properly seized, this changes the objective and, once again, the risk is that this new objective is not reached. The philosophy of the risk/reliability analysis is to systematically use conservative or, at least, best estimate assumptions. Therefore, if the results are better than expected, this is the result of rational reasoning and of voluntary actions, not the result of a stroke of good fortune as the term “positive risk” may imply.

Contingency and Reconciliation between Definitions: The consideration of the contingency concept is a way to solve the “positive risk” issue. Contingency is related to what can happen or not happen. It may then be considered as the source of the uncertainties mentioned in the ISO guide 73 definition, and this allows the following definitions to be introduced:
• Risk: potentiality of detrimental contingent effects.
• Opportunity: potentiality of beneficial contingent effects.
Of course, whether effects are detrimental or beneficial is a matter of point of view and depends on the stakeholders. Potentiality can be measured in terms of chance, probability or likelihood, and effects in terms of consequences. Therefore, a way of reconciliation can be found in the ISO guide 73 NOTE 4, which indicates that the “risk is often expressed in terms of a combination of the consequences of an event and the associated likelihood of occurrence” and which is almost the same definition as in the ISO/IEC guide 51 (2014). Both guides thus agree on the way to measure the risk.
Beyond the terminology problems raised above, there is no real discrepancy between them in spirit, if not in letter.

2.4.3 Risk Overview in Industrial Context

Risk as a Two Dimension Concept

Events occur rather frequently to remind us that every human enterprise implies risks (see Chap. 1). This is precisely the case at the time of writing these lines (March 2019): the media report that the chemical plant of Yancheng has exploded in China (78 deaths and hundreds of casualties) and that a cruise ship, the Viking Sky, is stuck in a very rough Norwegian Sea (see Chap. 5). She risks capsizing due to the breakdown of her four propulsion engines and, while watching the rescue of the passengers on TV, the idea of having caught a beautiful common cause failure for the book is gently gaining ground in our mind! Fortunately, all the passengers have been rescued and the consequences have been only economic.


This consolidates the points of view adopted by the ISO/IEC guide 51 (for the definition of risk) and by the ISO guide 73 (for the measure of risk). It also confirms the pragmatic point of view adopted by reliability engineers since the origin of reliability studies, long before the guides were issued. The risk is a two-dimensional concept and its severity is related to these two dimensions:
• Chance for a detrimental event to occur. This can be expressed in terms of frequency, probability or likelihood.
• Consequence when the detrimental event has occurred.
This implies that the risk can be represented in a system of coordinates with two axes with, for example, the chance (probability/frequency) on the ordinate and the consequence severity on the abscissa. This allows the risk space to be divided between an acceptable and a not acceptable zone, as illustrated in Fig. 2.3 for various cases:
• On the left-hand side of the figure, the risks are considered to be equivalent provided that the product probability × consequence is the same: this leads to splitting the space between an acceptable and a non-acceptable zone with a linear curve. This is the most simplistic approach.
• In the middle of the figure, it is considered that the high probability × low consequence risks (e.g. car accident) are more acceptable to society than the low probability × high consequence risks (e.g. aircraft crash, nuclear power plant accident), and this leads to a non-linear curve to split the space between an acceptable and a non-acceptable zone. This kind of curve was introduced by F. R. Farmer for analysing the nuclear power plant risks (e.g. in the WASH 1400 report, Rasmussen (1975)) and is known as the Farmer curve.

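Fig. 2.3 Example of various risk matrices (left and middle panels: probability / frequency versus consequence severity, split into an acceptable and a not acceptable risk zone; right panel: probability / frequency classes versus consequence severity classes, split into acceptable, tolerable and not acceptable risk zones; arrows 1, 2 and 3 indicate the risk reduction paths)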

• On the right-hand side of the figure, the chances and the consequences have been split into discrete classes and this leads to a risk matrix where three zones are identified instead of only two: the acceptable zone where there is nothing to do, the not acceptable zone where the design must imperatively be improved, and the tolerable zone where the risk has to be analysed in more detail to determine whether it is really acceptable as it is (e.g. as low as reasonably practicable, ALARP) or whether further improvements are needed (HSE 2020; Wikipedia ALARP 2020).
The classes of probability are


generally described in qualitative terms (e.g. certain, likely, possible, unlikely), as are the consequence severity classes (e.g. negligible, marginal, critical, catastrophic). Such matrices are very useful for discussions between safety engineers, and the wording can change according to the industrial domains and users. They are more flexible for taking decisions than the simple curves splitting the space between an acceptable and a non-acceptable zone, and numerical values can be associated with the classes to characterize them in a more rigorous way. The use of such matrices is illustrated in Chap. 36 in order to handle the risk linked to instrumented safety system failures. This also implies that there are two ways to decrease the risk from the non-acceptable to the acceptable zone: decreasing the probability of occurrence (1), mitigating the consequences (2) or both (3) (see Fig. 2.3).

Safety-oriented versus Dependability-oriented Risks

No assumption is made above about the nature of the consequences and therefore the combination of chance and consequence provides a very broad definition encompassing all kinds of risks. Consequences are specific to each industry sector but can nevertheless be split according to their impact:
• safety impact: safety-oriented risks. E.g. safety properly speaking (including functional safety) and environment issues;
• economic impact: dependability-oriented risks. E.g. production availability, profitability and assets issues.
The risk issues related to a given system are therefore not reducible to a single indicator and this is why, as already mentioned, the acronym RAMS (reliability, availability, maintainability and safety) is often used when various risks have to be considered at the same time. This is always the case in industry, where safety and dependability are associated in order to operate the systems safely and with the maximum of benefits.
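As an illustration of the right-hand risk matrix of Fig. 2.3 discussed above, the following Python sketch classifies a risk from its probability class and its consequence severity class. The class names are those quoted in the text as examples; the scoring rule (sum of the class indices) and the two thresholds are purely hypothetical assumptions and would have to be agreed by the users for any real application.

```python
# Hypothetical risk matrix: the class names follow the examples given in the
# text, the zone boundaries below are illustrative assumptions only.
PROBABILITY_CLASSES = ["unlikely", "possible", "likely", "certain"]
SEVERITY_CLASSES = ["negligible", "marginal", "critical", "catastrophic"]


def risk_zone(probability: str, severity: str) -> str:
    """Return the zone of the matrix for a (probability, severity) pair."""
    score = PROBABILITY_CLASSES.index(probability) + SEVERITY_CLASSES.index(severity)
    if score <= 2:
        return "acceptable"
    if score == 3:
        return "tolerable"  # to be analysed in more detail (e.g. ALARP)
    return "not acceptable"


if __name__ == "__main__":
    print("initial risk            :", risk_zone("likely", "critical"))
    print("(1) lower probability   :", risk_zone("unlikely", "critical"))
    print("(2) mitigated consequence:", risk_zone("likely", "negligible"))
    print("(3) both                :", risk_zone("possible", "marginal"))
```

The three calls after the initial one mimic the risk reduction paths (1), (2) and (3) of Fig. 2.3: each moves the same initial risk out of the not acceptable zone by acting on the probability, on the consequence, or on both.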
Unfortunately, safety and dependability objectives are antagonistic most of the time: improving one is likely to be detrimental to the other and vice versa. This universal problem should not be forgotten by designers, especially when dealing with safety systems for which the safety action can be either inhibited or untimely triggered by component failures. The first case is dangerous because the safety action is not achieved when needed and an accident can occur. The second case is safe as, the protected system being shut down, any danger is supposed to disappear. However, real life is more complicated because a safe failure is safe only with regard to a given situation but can be dangerous with regard to another one. For example, the spurious closure of an emergency shutdown valve, devoted to protecting a plant against overpressure, can produce a water hammer detrimental to the installation piping. It can also lead to an increasing pressure upstream of the valve, inducing a new risk when reopening the valve. On the other hand, too many spurious actions are likely to increase the probability of human failure and the safety system can


even be, purely and simply, taken out of service if the frequency of spurious actions leads to unacceptable production losses. Indeed, in this last case, the safe failure is transformed into a very dangerous one! This is why achieving safety to the detriment of dependability should be avoided. For example, it is wise to be cautious about an indiscriminate use of the SFF (safe failure fraction) encouraged by the functional safety standard IEC 61508 (2010), which assumes that increasing the probability of safe failure is always beneficial for safety (see Chap. 36). At the limit, considering only safety without taking dependability into account leads to a super safe installation because it is no longer even able … to start, and considering only dependability without taking safety into account leads to a super profitable plant … between accidents. Therefore, between these two extremes, tradeoffs are needed and the designers should be vigilant about obtaining a good balance between safe and dangerous failures. This is why, for example, 2 out of 3 majority vote logic is often adopted: two dangerous failures are needed to inhibit the safety action but two safe failures are also needed to untimely trigger it. This is a good compromise improving safety and dependability at the same time. Consequently, safety and dependability issues should be considered together (see for example Ciliberti et al. 2019) and by the same analysts but, unfortunately, this is seldom observed in industry. Nevertheless, the systemic probabilistic approaches described in this book are useful to design systems which are both safe and dependable.

Safety versus Dependability from a Probability Point of View

This book being devoted to reliability modelling and calculations, it is useful to analyse the differences between the safety and dependability aspects which have to be considered during the studies. These differences are twofold:
• Probability and consequence point of view:
– Safety studies generally deal with events with a low probability of occurrence (rare events which, hopefully, are not really expected to be observed during the life of the installation) but with heavy consequences in terms of casualties, environment or asset impacts.
– Dependability studies generally deal with frequent events (expected to be seen several times in the life of the installation) with small consequences each time the events occur.
• End users of the study point of view:
– Safety studies are generally carried out for safety authorities in order to be allowed to operate a given installation. Therefore, they aim to provide conservative estimations in order to prove that the risk is lower than a threshold set by the safety authority.
– Dependability studies are generally carried out for internal use as a decision aid to take balanced decisions. Therefore, they aim to provide best estimations as close as possible to the actual risk level.


Therefore, the two types of analysis are very different from the probabilistic point of view. For safety, the probabilities being low, the usual analytic approximations work well and simplifying assumptions can be used, provided that they are proven to be conservative and that the target risk level is reached. For dependability studies, it is completely the opposite: the events are frequent, the usual analytic approximations do not work and the assumptions have to be as close as possible to real life. The result is that dependability studies imply the use of more detailed models than safety studies and that analytical calculations are generally not manageable and must be replaced by Monte Carlo simulations. Nevertheless, for safety studies with complex interrelationships, it may also be necessary to use Monte Carlo simulation. This is increasingly easy thanks to the growing calculation power of present-day computers (personal computers or mainframe computers). As a side effect, this growing calculation power leads to increasingly demanding requirements from the project leaders ordering the studies. This is why increasingly powerful models and tools have been developed and are now available. Unfortunately, data collection lags somewhat behind and struggles to feed them.

Safety & Dependability Related Constituent Parts

As mentioned above, the traditional constituent parts of safety and dependability studies are reliability, availability, maintainability and safety (RAMS, see Chap. 4), but numerous other topics are now commonly considered, like production availability, integrated logistic support (ILS), reliability centered maintenance (RCM), risk based inspection (RBI), life cycle cost (LCC), security (computer abuse, hacking), legal risk (penalties due to regulation violation), fault tolerance (redundancy), confidentiality, … Indeed, safety and dependability studies are also closely related to risk management, asset management and quality.
In fact, all the approaches mentioned above pursue a common goal: mastering the risks. They are interlinked and when addressing one of them, and provided that the analysis is accurate enough, all the others are likely to be concerned to some extent at some point.
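The balance between safe and dangerous failures obtained with 2 out of 3 voting, and the behaviour of the usual analytic approximations, can be illustrated numerically. The Python sketch below is only an illustration under simplifying assumptions which are ours and not taken from the text: the channels are identical and independent, each channel has a given probability of being in a dangerous failed state and a given probability of having failed safe, and the function names and numerical values are hypothetical.

```python
from math import comb


def at_least_k_of_n(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent channels are in a given
    state, each channel being in that state with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def inhibition_probability(m: int, n: int, p_dangerous: float) -> float:
    """M-out-of-N voting: the safety action is inhibited when fewer than M
    channels can respond, i.e. at least N - M + 1 channels failed dangerously."""
    return at_least_k_of_n(n - m + 1, n, p_dangerous)


def spurious_probability(m: int, n: int, p_safe: float) -> float:
    """M-out-of-N voting: the action is untimely triggered when at least M
    channels have failed safe."""
    return at_least_k_of_n(m, n, p_safe)


if __name__ == "__main__":
    p = 0.01  # illustrative value, used here for both failure types
    for m, n in [(1, 1), (1, 2), (2, 2), (2, 3)]:
        print(f"{m}oo{n}: inhibition={inhibition_probability(m, n, p):.3e} "
              f"spurious={spurious_probability(m, n, p):.3e}")

    # Rare-event approximation for 2oo3 (3 p^2) against the exact value.
    for p in (1e-3, 1e-1, 3e-1):
        exact = inhibition_probability(2, 3, p)
        approx = 3 * p**2
        print(f"p={p:g}: exact={exact:.4e} approx={approx:.4e} "
              f"relative error={(approx - exact) / exact:.1%}")
```

Under these assumptions, 2oo3 reduces both probabilities compared with a single channel whenever the channel probabilities are below 0.5, whereas 1oo2 and 2oo2 each improve one of them at the expense of the other. The approximation 3p² is conservative and very close to the exact value for small p, but the gap widens as p grows, which is consistent with the remark that the usual approximations suit rare events better than frequent ones.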

References
Aupied JR, Procaccia H (1984) SRDF: a system for collecting reliability data from French PWR power plants. Method of failure analysis. Application to the processing of valves data. Nuclear Eng Des 81(1):127–137, Elsevier
Batteux M, Prosvirnova T, Rauzy A (2019) AltaRica 3.0 in 10 modeling patterns. Int J Crit Comput Based Syst 9(1–2):133–165. Inderscience Publishers. https://doi.org/10.1504/ijccbs.2019.098809
Bazovsky I (2004) Reliability theory and practice. Dover Publications Inc
Bell J, Holroyd J (2009) Review of human reliability assessment methods. RR679. HSE, Buxton, UK


Bryant R (1986) Graph based algorithms for Boolean functions manipulation. IEEE Trans Comput 35(8):677–691. IEEE, USA
Canvey Island (1978) An investigation of potential hazards from operations in the Canvey Island/Thurrock area. HMSO, London
CCPS (2001) Layer of protection analysis: simplified process risk assessment. American Institute of Chemical Engineers, Center for Chemical Process Safety, New York, USA
CCPS (2020) Process Equipment Reliability Database (PERD). https://www.aiche.org/ccps/resources/process-equipment-reliability-database-perd. Accessed 18 Apr 2020
Ciliberti V, Ostebo R, Selvik J, Alhanati F (2019) Optimize safety and profitability by use of the ISO 14224 standard and big data analytics. OTC-19634-MS. Houston, USA
Guttmann H, Swain A (1983) Handbook of human reliability analysis with emphasis on nuclear power plant application, NUREG/CR-1278. USNRC, Washington
HSE (2020) ALARP at a glance. https://www.hse.gov.uk/risk/theory/alarpglance.htm. Accessed September 2020
IEC 61025 Ed. 3 (in progress) Fault tree analysis (FTA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61165 Ed. 2 (2006) Application of Markov techniques. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical / electronic / programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61511 Ed. 2.0 (2016) Functional safety. Safety instrumented systems for the process safety sector (3 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61882 Ed. 2 (2016) Hazard and operability studies (HAZOP studies): application guide. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62502 Ed. 1.0 (2010) Analysis techniques for dependability. Event tree analysis (ETA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-12 (2011) Dependability management: application guide, integrated logistic support. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62551 Ed. 1.0 (2012) Analysis techniques for dependability. Petri net techniques. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-11 (2017) Dependability management: application guide, reliability centred maintenance. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-2 Ed. 2.0 (2004) Dependability management, Part 3-2: Application guide, collection of dependability data from the field. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-3 Ed. 3.0 (2017) Dependability management, Part 3-3: Application guide, life cycle costing. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60812 Ed. 3.0 (2019) Failure modes and effects analysis (FMEA and FMECA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 63142 (in progress) A global methodology for reliability data prediction of electronic components. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 31000 Ed. 2.0 (2018) Risk management. Guidelines. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 15663 Ed. 1.0 (2021) Petroleum, petrochemical and natural gas industries. Life cycle costing. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 20815 Ed. 2.0 (2018) Petroleum, petrochemical and natural gas industries. Production assurance and reliability management. International Organization for Standardization (ISO), Geneva, Switzerland


ISO 6527 Ed. 1.0 (1982) Nuclear power plants. Reliability data exchange. General guidelines. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 7385 Ed. 1.0 (1983) Nuclear power plants. Guidelines to ensure quality of collected data on reliability. International Organization for Standardization (ISO), Geneva, Switzerland
ISO guide 73 Ed. 1.0 (2009) Risk management: vocabulary. International Organization for Standardization (ISO), Geneva, Switzerland
ISO/IEC 31010 (2019) Risk management: risk assessment techniques. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/IEC Guide 51 Ed. 3.0 (2014) Safety aspects. Guidelines for their inclusion in standards. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
Kececioglu D (2002) Reliability engineering handbook, revised edition. DEStech Publications Inc, Lancaster
Lievens C (1976) Sécurité des systèmes. Cepadues-Editions, Toulouse, France
MIL-HDBK 217 F notice 2 (1995) Military handbook: reliability prediction of electronic equipment. Department of Defense, Washington DC, USA
MIL-STD-882E (2012) Standard practice: system safety. US Department of Defense, Washington, USA
MSG-1 (1968) Maintenance evaluation and program development. Air Transport Association steering group and US Federal Aviation Administration, USA
Nielsen D S (1971) The cause-consequence diagram method as a basis for quantitative accident analysis. RISO-M-1374. AEK Riso, Roskilde, Denmark
OREDA (2020) https://www.oreda.com/. Accessed September 2020
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer
Quanterion 217Plus (2015) Handbook of 217Plus. Reliability prediction models. Quanterion Solutions Inc, Utica NY, USA
Rasmussen C (1975) Reactor safety study. An assessment of accidents risks in U.S. commercial power plants. WASH 1400 (NUREG 75/014). U.S. Nuclear Regulatory Commission, Washington, USA
Rogovin M, Frampton G F (1979) Three Mile Island: a report to the commissioners and to the public, Vol 1 to 3. NUREG/CR-1250. USNRC, USA
Roques P (2013) Modélisation des systèmes complexes avec SysML. Eyrolles, France
SAE Aerospace standard (2012) Architecture analysis and design language (AADL). AS 5506. http://www.sae.org
Schowb M, Peyrache G (1969) Traité de fiabilité. Masson & Cie éditeurs, Paris
SEVESO III (2012) Directive 2012/18/EU of the European Parliament and of the Council of 4 July 2012 on the control of major-accident hazards involving dangerous substances, amending and subsequently repealing Council Directive 96/82/EC
SysML (2020) Open Source Project. https://sysml.org/. Accessed September 2020
Torres-Etcheverria A (2014) On the use of LOPA and Risk Graph for SIL determination. TEES 17th annual international symposium. College Station, Texas, USA
UML (2020) https://www.uml.org/. Accessed September 2020
UTE C80-811A (2011) Reliability methodology for electronic systems, FIDES guide, Issue A. AFNOR éditions, France
Villemeur A (1988) Sûreté de fonctionnement des systèmes industriels. Collection de la Direction des Etudes et Recherche d’Electricité de France, Eyrolles
Villemeur A (1992) Reliability, availability, maintainability and safety assessment. Wiley, England
Watson HA (1961) Launch Control Safety Study. Section VII Vol 1. Bell Laboratories, Murray Hill, New Jersey, USA


Wikipedia Bowtie (2020) https://www.cgerisk.com/knowledgebase/The_bowtie_method. Accessed September 2020
Wikipedia AADL (2020) https://en.wikipedia.org/wiki/Architecture_Analysis_&_Design_Language. Accessed September 2020
Wikipedia UML (2020) https://fr.wikipedia.org/wiki/UML_(informatique). Accessed September 2020
Wikipedia ALARP (2020) https://en.wikipedia.org/wiki/ALARP. Accessed September 2020
Wikipedia Bellman (2020) https://en.wikipedia.org/wiki/Richard_E._Bellman. Accessed September 2020
Wikipedia Bhopal (2020) https://fr.wikipedia.org/wiki/Catastrophe_de_Bhopal. Accessed September 2020
Wikipedia Gödel (2020) https://fr.wikipedia.org/wiki/Kurt_Gödel. Accessed September 2020
Wikipedia Mandelbrot (2020) https://fr.wikipedia.org/wiki/Benoit_Mandelbrot. Accessed September 2020
Wikipedia SysML (2020) https://fr.wikipedia.org/wiki/Systems_Modeling_Language. Accessed September 2020
William JC (1985) HEART: a proposed method for achieving high reliability in process operation by means of human factors engineering technology. In: Proceedings of a Symposium on the Achievement of Reliability in Operating Plant, Safety and Reliability Society (SaRS). NEC, Birmingham

Chapter 3

Reliability Study Overview

3.1 Overview

Figure 3.1 illustrates the steps of a reliability study: after the boundaries and the goal of the system under study have been defined, it is essential to understand how the system works (functional analysis) before analysing how it fails (dysfunctional analysis). A satisfactory result is generally not obtained at once and some iterations are needed. At each iteration (illustrated by the dotted arrow), the design is improved and/or the goal is modified. When the result is considered to be correct, conclusions can be drawn about various topics including the following:
• System operation:
– Specification of some components.
– Maintenance policy (test frequency, spare parts, special tools,…).
• System failure:
– Possible test plans for some components.
– Main contributors (weak points), sensitivity studies and importance factors.
– Reliability, availability, maintainability estimations.
– Spurious failure frequency (for safety systems).
– Production availability for production systems.

It has to be noted that a reliability study is not an ordinary study, because what has to be analysed is not completely known from the beginning and has to be identified as the study progresses, especially at the risk identification stage. This implies consolidating the scope of work according to the progress of the study, each time a new problem is identified. When the study is performed by a contractor, it should be clearly specified by the client that this consolidation of the scope of work is a part of the analysis and not an opportunity for a costly change order.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_3


Fig. 3.1 Steps of a reliability study (boxes: goal and system definitions; functional analysis; dysfunctional analysis, comprising issue identification, system modelling, reliability and operational data selection, qualitative analysis and quantitative analysis; discussions with field specialists; synthesis and decisions; conclusions)

3.2 Goal and System Definition

The first stage of the analysis is fundamental: defining the system to be analysed as well as its purpose. This seems trivial but, in fact, it is not that easy because numerous mistakes, omissions and misinterpretations can occur which are likely to have important impacts on the conduct, relevance and accuracy of the reliability study. Therefore, it is of utmost importance to be sure to identify the right system to be analysed and this implies gathering information/documents like:
• Targets of the project: this depends on the type of system (e.g. safety or production system), the design stage (e.g. preliminary or basic engineering, in construction, in operation,…) or the wanted level of detail.
• Technical definition of the system: definition of the boundaries (battery limits) of the system and various diagrams like process flow diagrams (PFDs), process and instrumentation diagrams (P&IDs) or safety flow charts, describing the system in a schematic way.
• Procedures: installation, start-up and sometimes dismantling stages.
• Philosophy of operation: document describing how the system will be operated.
• Philosophy of maintenance: works organization (e.g. work order management) for curative and preventive maintenance operations, spare part management, test intervals for safety systems, etc.
• Intervention means: on-site and off-site emergency response capabilities (firefighters, rescuers,…) in case of accident.
• Environment: location close to or far from a settlement or a sensitive area (river, sea, protected from an ecologic point of view,…).
It has to be remarked that the systems are increasingly interlinked and increasingly using computer codes (e.g. command-control, automata, tele-maintenance, tele-operations). This makes the boundaries not necessarily easy to determine. It is


even more difficult with open systems (Web, electric grid,…) without actual bounds or for which the bounds change at any time. From an environment point of view, it is also obvious that if the system is operated in a desert or on an unmanned installation, the problems are minimised compared to a system operated within a town (which has expanded until it surrounds the plant). It is also of great importance to determine which types of probabilistic parameters are of interest. For subcontracted studies, it is even more important due to the polysemy of the terms: a reliability study (“reliability” used as a generic umbrella term) is not necessarily devoted to calculating the system reliability (“reliability” used as a specific probabilistic term). For example, a “reliability” study of a safety system may be devoted to estimating its average unavailability (PFDavg) or its average failure frequency (PFH). In the same way, and even if this seems obvious, the availability of a production system will not necessarily be interpreted by the contractor as being the production availability. Therefore, to avoid nasty surprises and costly “change orders”, it is very important to specify exactly, in the scope of work, which probabilistic parameters have to be calculated during the study. Of course, depending on the level of detail of the study, all or only part of the above information is needed.
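To illustrate why the probabilistic parameter must be specified, the Python sketch below computes two different quantities for the same item: its reliability over a mission time and its average unavailability (PFDavg) over a test interval. It is only a sketch under assumptions which are ours: a single component with a constant failure rate, whose failures are revealed only by periodic tests, which is as good as new after each test, with negligible test and repair durations. The function names and the numerical values are hypothetical.

```python
from math import exp, expm1


def reliability(failure_rate: float, mission_time: float) -> float:
    """Probability of no failure over [0, mission_time] (exponential law)."""
    return exp(-failure_rate * mission_time)


def pfd_avg(failure_rate: float, test_interval: float) -> float:
    """Average unavailability over one test interval.

    Between two tests U(t) = 1 - exp(-lambda * t); averaging over the
    interval gives 1 - (1 - exp(-x)) / x with x = lambda * test_interval.
    """
    x = failure_rate * test_interval
    return 1.0 + expm1(-x) / x


if __name__ == "__main__":
    lam = 1.0e-6  # failures per hour (hypothetical value)
    tau = 8760.0  # test interval: one year, in hours
    print(f"PFDavg over one test interval : {pfd_avg(lam, tau):.3e}")
    print(f"Approximation lambda*tau/2    : {lam * tau / 2:.3e}")
    print(f"Reliability over ten years    : {reliability(lam, 10 * tau):.4f}")
```

The two results differ by orders of magnitude and answer different questions, which is why a scope of work asking only for “the reliability” of a safety system is ambiguous. For small values of the product of the failure rate and the test interval, the exact average is slightly below the common approximation λτ/2.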

3.3 How It Works (Functional Analysis)

The second stage is to understand how the system works. This consists in identifying the various functions comprised within the system and analysing the relationships between these functions. For simple systems, this can be done simply by establishing the list of the different functions whereas, for more complex systems, a more elaborate modelling technique like the functional analysis can be implemented. This is a graphical representation based on functional block diagrams¹ which shows which subsystems are necessary for the good functioning of another subsystem (including the whole system itself). A better name would have been functioning analysis instead of functional analysis because of the possible confusion with the formal techniques also named functional engineering analyses, which are used to design a system rather than to describe how it works (e.g. structured analysis (Wikipedia SA 2020a), structured systems analysis and design (Wikipedia SADT 2020b) or RELIASEP (Vogin 1988; Pitto 1996)). These formal functional analyses are used to think in terms of functions to be performed rather than in terms of actual physical components performing these functions. This allows the choice of actual physical components to be postponed as far as possible, which is useful to avoid points of no return in the design due to premature choices: it is said that, without such precautions, 90% of the future design is frozen after only 5% of the time devoted to the design stage.

Of course, the application of formal functional analysis has an important beneficial impact on system reliability. Nevertheless, these formal techniques are outside the scope of this book, where only the functional (functioning) analysis is presented.

¹ Functional block diagrams are similar to the reliability block diagrams described in Chapter 15 but the blocks represent functions instead of components.

3.4 How It Fails (Dysfunctional Analysis)

3.4.1 Point About Terminology

After having understood how the system works (and only after), it is time to understand how it fails. Several steps are needed for that and, in French, they are gathered under the umbrella term of “analyse dysfonctionnelle”, i.e. “dysfunctional analysis”. This is logical as the Greek prefix “dys” means “bad/abnormal” and, therefore, dys-functional is the opposite of functional. Unfortunately, in English this term has connotations of family problems and is seldom used for physical systems, even if it is more and more often encountered for this purpose. The terms misfunctional or malfunctional are also sometimes used instead. It is not the purpose of this book to settle the issue and, following the French from which this word has come into English, the term “dysfunctional analysis” will be used to avoid heavy circumlocution when talking about the analysis of how a system can fail.

3.4.2 Issue Identification

The first step of the dysfunctional analysis is to identify the potential issues (risks) related to the system under study and relevant with regard to the objectives of the study. This is of utmost importance as, in case of mistakes at this step, all further analyses would be irrelevant. The approaches described in Part 2 are very helpful for that purpose. They are inductive in nature (cause ⇒ effects, bottom-up). The general philosophy is to consider potential causes at subsystem/component level and to look at the effects of these causes at the overall system level. The difference between these approaches comes from the types of causes and effects which are considered and also from the way they are conducted (alone or in a group). Preliminary hazard analysis (PHA), hazard and operability studies (HAZOP) and failure modes, effects and criticality analysis (FMECA) are widely used for this purpose. These approaches are mainly qualitative in nature but semi-quantitative estimations can be made when using FMECA (the criticality is given by a combination of the probability of the cause and the severity of the effect). When sequential aspects have to be considered, the cause consequence analysis (see Chap. 26) is useful to complete the previous approaches and to describe both the functional and dysfunctional aspects of the system under study.


When all potential causes are analysed in a systematic way, the identified effects range from no-effect to incident and accident. Therefore, the effects can be sorted by degree of severity. It is wise to consolidate these results through close collaboration with specialists of the system under study in order to identify the main issues and to prioritise what can be discarded, what has to be analysed and what has to be analysed first. Further analyses can then focus only on major issues and no time is wasted on second order problems. As previously mentioned, it is at this stage that the scope of work has to be reviewed to specify more accurately the issues which have to be analysed within the reliability study. When the study is performed by a contractor, this should be considered as an integral part of the reliability study and should not be an opportunity for a change order. Beyond the identification of the issues, the systematic application of the above techniques allows the analyst to:
• understand in detail the usefulness of any components;
• understand the impact of component failures on the overall system;
• identify the detection means of the failures and the related safety actions;
• identify the possible improvement of the system under study with regard to the objectives.

The implementation of the inductive approaches does not require complicated theory, only common sense and good will. Despite this apparent simplicity, they are formidably efficient and, when properly undertaken, they provide the analyst with a very deep knowledge of the functioning and dysfunctioning of the system under study. This knowledge can often be deeper than that of operators who have run similar systems for years on actual plants. Their use is essential when performing reliability analyses.
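The semi-quantitative use of FMECA mentioned above, where criticality combines the probability of the cause with the severity of the effect, can be sketched as follows. This is only an illustration: the ordinal scales, the choice of a product as the combination rule and the three FMECA rows are assumptions made for the example, not values taken from this book or from a standard.

```python
# Ordinal scales: illustrative assumptions, not taken from a standard.
PROBABILITY = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "certain": 5}
SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}


def criticality(probability: str, severity: str) -> int:
    """Combine the two ordinal classes; the product is one common convention."""
    return PROBABILITY[probability] * SEVERITY[severity]


# Hypothetical FMECA rows: (component, failure mode, probability, severity).
rows = [
    ("pump", "fails to start", "possible", "marginal"),
    ("valve", "external leak", "unlikely", "catastrophic"),
    ("sensor", "frozen reading", "likely", "critical"),
]

# Sort the rows so that the most critical issues are analysed first.
ranked = sorted(rows, key=lambda r: criticality(r[2], r[3]), reverse=True)
for component, mode, p, s in ranked:
    print(f"{component:7s} {mode:15s} criticality={criticality(p, s)}")
```

Sorting by such a score is one way to decide what has to be analysed first and what can be discarded.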

3.4.3 System Modelling

3.4.3.1 Generalities

When the main issues have been identified and prioritised, the next stage is normally to proceed to some kind of system modelling. The main models are represented in Fig. 3.2. For simple systems with little or no redundancy, this has already been done when identifying the issues (see 3.4.2) because PHA, HAZOP and FMECA are in fact the simplest reliability models, suited to the analysis of single (i.e. simple) events (e.g. causes, failures,…). Further studies may not be useful, and the above elementary approaches are then necessary and sufficient for achieving the wanted reliability study. This is often the case in industry, where they are well known by engineers and widely used. However, when high risks are involved, the systems are generally redundant and no single cause would, normally, lead to a complete failure (e.g. an accident) of the whole system. Therefore, while the above simple approaches are still necessary, they are no longer sufficient and have to be completed by more advanced approaches able to handle combinations of several events. This leads to the use of the advanced models presented in Fig. 3.2, where they are classified according to the way probabilistic calculations are performed.

[Figure 3.2: Overview of reliability models. Elementary models: PHA, HAZOP, FME(C)A, etc. Advanced models: Boolean models (FTA, RBD), sequential models, Markovian model, behavioural models (PN), formal models, FT driven Markov process, RBD driven Markov process, RBD driven Petri nets. Calculation techniques: analytical calculations, Taylor expansion, Monte Carlo simulation. Scale: qualitative, semi-quantitative, quantitative.]

Fig. 3.2 Overview of reliability (safety and dependability) models

Beyond this classification, they can also be classified between static and dynamic models:

• static models: reliability block diagrams (Chap. 15), fault trees (Chap. 16);
• dynamic models:
  – sequential models: e.g. cause consequence diagrams, event trees, LOPA (Chap. 26);
  – Markov models (Chap. 31);
  – behavioural models: Petri nets (Chap. 33), specific formal models (not developed in this book).

As shown in Fig. 3.2, a whole corpus of methods and tools is available nowadays to deal with safety as well as with dependability aspects. Software packages are also available to help the analyst implement this corpus of methods and tools. They have to be used cautiously, and users should know enough about the pros and cons of each to use it in a relevant way.

3.4.3.2 Static Models

A system is called “static” when the relationships linking its state to the states of its components do not depend on time. When, in addition, the states can be gathered into two classes (operating/failed, up/down, true/false), Boolean models can be used to model the logic linking the system state to the component states. These models constitute an important family encompassing the reliability block diagrams (RBD) and the fault tree analysis (FTA), and they can be extended to the usual sequential models linked to the use of fault trees: e.g. cause-consequence diagrams, event tree analysis (ETA). Therefore, the sequential models also belong to static models. All these approaches have graphical representations which make them easy to handle (by the analyst who builds the models) and to understand (when they are shared for discussion with other engineers). Among them, the FTA plays a particular role as it is the only approach based on deductive (effect ⇒ cause, top-down) reasoning: it starts from unwanted effects and aims to identify the causes of these effects. This unique feature gives an invaluable analysis power to this approach, which is widely used by reliability engineers to analyse failure combinations.
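As an illustration of such a Boolean model, the sketch below writes the same logic twice, once as a reliability block diagram (success logic) and once as the equivalent fault tree (failure logic), and obtains the system availability by enumerating the component states. The three-component layout, the function names and the figures are hypothetical, and the components are assumed independent.

```python
from itertools import product


def system_up(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical RBD: block A in series with the parallel pair (B, C)."""
    return a and (b or c)


def system_failed(a: bool, b: bool, c: bool) -> bool:
    """Equivalent fault tree: TOP = A failed OR (B failed AND C failed)."""
    return (not a) or ((not b) and (not c))


def availability(p_up):
    """Sum the probabilities of all component-state vectors in which the
    system is up, assuming independent components."""
    total = 0.0
    for states in product([True, False], repeat=len(p_up)):
        prob = 1.0
        for up, p in zip(states, p_up):
            prob *= p if up else 1.0 - p
        if system_up(*states):
            total += prob
    return total


print(availability((0.99, 0.9, 0.9)))
```

The relationship between component states and system state involves no time variable, which is what makes the model static.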

3.4.3.3 Dynamic Models

A system is called “dynamic” when it is not “static”. More precisely, a dynamic system jumps from state to state after random delays related to failures, repairs or any other events, and this can be modelled by using stochastic processes.

Markov processes

Among the stochastic processes, the Markov process is the most famous one. It has been used since the beginning of the twentieth century in plenty of domains ranging from fundamental physics to contemporary music (stochastic music) and including reliability analysis. It belongs to the state-transition models and can be graphically represented by a Markov graph showing the system states and the transitions from state to state. This is an analytical approach which is relatively easy to use but, unfortunately, it has two main limitations: only constant failure rates (exponential distributions) can be used, and the size of the model grows exponentially with the number of components (i.e. in the order of magnitude of 2^n states for n components). Nevertheless, it is very useful for modelling small complex systems and also for explaining the calculation of the various probabilistic parameters like reliability, availability, failure frequency, failure rate,… One of the main interests of the Markov approach is that it can be combined with FTs (FT-driven Markov processes) or RBDs (RBD-driven Markov processes), where the FT or RBD models the logic of the system under study and small Markov processes model the individual components. This requires the components to behave independently from each other, but when this assumption holds, it makes it possible to build tractable models equivalent to very large Markov processes.
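The smallest Markov model is a single repairable component with two states (up and down), a constant failure rate and a constant repair rate. The sketch below, in which the function names and the numerical values are assumptions chosen for the example, compares the well-known closed-form unavailability of this two-state process with a plain numerical integration of its state equation.

```python
import math


def unavailability_closed_form(lam: float, mu: float, t: float) -> float:
    """U(t) = lam / (lam + mu) * (1 - exp(-(lam + mu) * t)), item up at t = 0."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))


def unavailability_euler(lam: float, mu: float, t: float, steps: int = 100_000) -> float:
    """Integrate dP_down/dt = lam * P_up - mu * P_down with explicit Euler steps."""
    dt = t / steps
    p_down = 0.0
    for _ in range(steps):
        p_down += dt * (lam * (1.0 - p_down) - mu * p_down)
    return p_down


# Hypothetical rates per hour: one failure per 1000 h, mean repair time 10 h.
lam, mu = 1e-3, 0.1
print(unavailability_closed_form(lam, mu, 100.0))
print(unavailability_euler(lam, mu, 100.0))
```

For long times the unavailability tends to lam / (lam + mu), which is the steady-state value of this two-state process.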


Other stochastic processes

Markov processes may be unusable for several reasons:

• too many states to be handled in practice:
  – risk of error when identifying the states, impossibility to identify all the states;
  – computational problems: approximation not realistic, numerical instabilities, model too big to be stored,…
• non-exponential distributions.

The first case occurs for large systems when FT/RBD-driven Markov processes are not usable due to a lack of independence between the components. However, it has to be noted that, for the Markovian approach, the notion of “large” is very relative: a system with 10 components has potentially 2^10 = 1024 states and a system with 30 components has potentially 2^30 ≈ 1.073 billion states. Therefore a 10-component system is already difficult to manage by hand, and it is illusory to think of modelling 30-component systems with the Markovian approach. What can be done to deal with an actual industrial system with 300 components and more than 2^90 states? New models overcoming these limitations are obviously needed.

Fortunately, they are provided by the behavioural models, which also belong to the state-transition models. Among them, Petri nets are the most popular, but some specific formal languages are also available for this purpose (e.g. the AltaRica language (Aupetit 2020)). As the size of these models is linear with regard to the number of components, they are much more compact than Markov graphs. This is why they are sometimes used as Markov graph generators to produce, without error, large Markov graphs with a huge number of states. But it is when they are used as a support for Monte Carlo simulation (Chap. 32) that they show their full modelling and calculation power.

Monte Carlo simulation (named by reference to gambling) is based on the generation of random numbers used to calculate the random delays making up particular realizations (trajectories) of the modelled stochastic processes. When plenty of such trajectories have been simulated, the probabilities of interest are estimated by simple statistical calculations. The secret of the power of Monte Carlo simulation is that, during the simulation, the number of potential states does not matter as only the scenarios with the highest probabilities are actually observed. The modelling capabilities are virtually endless and the spectrum of the results is practically limited only by the analyst's imagination (production availability, maintenance costs, spare parts needs,…). The price to pay is the main drawback of Monte Carlo simulation: the computing time increases quickly when the probabilities to be estimated decrease. This is why this approach was set aside for a long time, until the calculation power of computers became sufficient to obtain accurate enough results. Nowadays it has already begun to supersede the analytical techniques.
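The principle described above, random delays generating trajectories whose outcomes are then counted, can be sketched on a deliberately tiny case: two identical, independent, repairable components in parallel, the system being unavailable when both are down. The function names, the rates and the number of runs are assumptions made for this illustration; a real study would rely on a behavioural model (e.g. a Petri net) rather than on hand-written loops.

```python
import random


def component_down_at(t_end: float, lam: float, mu: float, rng: random.Random) -> bool:
    """Follow one component through alternating exponential up/down delays
    and report whether it is down at time t_end (it starts in the up state)."""
    t, up = 0.0, True
    while True:
        # Draw the delay before the next transition: failure if up, repair if down.
        t += rng.expovariate(lam if up else mu)
        if t > t_end:
            return not up
        up = not up


def estimate_unavailability(t_end: float, lam: float, mu: float,
                            n_runs: int, seed: int = 0) -> float:
    """Fraction of simulated trajectories in which both components are down at t_end."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_runs):
        a_down = component_down_at(t_end, lam, mu, rng)
        b_down = component_down_at(t_end, lam, mu, rng)
        failures += a_down and b_down
    return failures / n_runs


# Deliberately large rates so that few runs are enough for a stable estimate.
estimate = estimate_unavailability(t_end=20.0, lam=0.5, mu=1.0, n_runs=20_000)
print(estimate)
```

With these rates each component is down about one third of the time in steady state, so the estimate should come out close to 1/9. With realistic, much smaller failure rates, far more trajectories would be needed for the same relative accuracy, which is the drawback noted above.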

3.4.3.4 Extension to Multistate and Multiphase Systems

Multistate systems

Safety systems are generally considered as typical systems with two classes of states: ready to trigger the safety action/safety action inhibited. Therefore, the Boolean models are particularly well adapted to them and are widely used for this purpose. Nevertheless, if the untimely triggering of safety actions is considered, a third class of states is introduced and the Boolean approaches can no longer be used, so it is necessary to switch to the dynamic models. Having more than two classes of states is also the case for production systems where, between the perfect state (production rate of 100%) and the completely failed state (production rate of 0%), numerous degraded states can exist with production rates between 0% and 100%. These systems are called multistate systems to indicate that more than two classes of states are considered. Taking into account all the states (perfect, degraded, completely failed) extends the classical concept of availability to a more general concept sometimes named effectiveness or efficiency. For production systems, the terms production availability or production assurance are often used: this is the mathematical expectation of the system production taking into account all the random events (failures, restorations,…) which can occur over the period of observation. Beyond the traditional reliability parameters used when dealing with conventional systems, this makes the link with economic parameters related to dependability aspects.

Multiphase systems

The systems presenting several operating phases constitute another modelling issue. This is typically the case when dealing with the dangerous undetected failures of safety systems. In this case, the failures remain hidden until periodic tests are performed. These failures are therefore repairable when they are revealed by a periodic test but not repairable within the test interval. The same problems arise for systems which are repairable during one period of time and not repairable during another period (e.g. a subsea platform in a rough sea environment which is repairable in spring, summer and autumn but not in winter) or for systems whose production requirements change over time. More generally, this happens when the operating configuration is not permanent but changes from time to time. These systems are called multiphase systems as sequences of several operating phases are considered rather than a single one. The Boolean approaches alone are not able to model multiphase systems properly, and the Markovian approach is very well adapted but only for very small systems, e.g. components. Therefore, as already mentioned above, when the components are relatively independent, the two approaches can be combined. This is the case of safety systems, for which the use of FT-driven Markov processes has proven to be very effective to model the dangerous undetected failures (see Chap. 36). This makes it possible to perform the calculations on large safety systems in an analytical way.
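The hidden-failure case described above can be illustrated on a single periodically tested component. The sketch assumes a constant dangerous undetected failure rate, perfect tests and a negligible repair duration, so that within each test interval the unavailability is 1 - exp(-lam * t) and restarts from zero after each test. The function names and the figures are chosen for the example only.

```python
import math


def average_unavailability(lam_du: float, tau: float) -> float:
    """Exact mean of 1 - exp(-lam_du * t) over one test interval [0, tau]."""
    x = lam_du * tau
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is small.
    return 1.0 - (-math.expm1(-x)) / x


def approx_average_unavailability(lam_du: float, tau: float) -> float:
    """First-order approximation lam_du * tau / 2, valid when lam_du * tau << 1."""
    return lam_du * tau / 2.0


# Hypothetical component: 1e-6 dangerous undetected failures per hour, yearly test.
lam_du, tau = 1e-6, 8760.0
print(average_unavailability(lam_du, tau))
print(approx_average_unavailability(lam_du, tau))
```

The saw-tooth shape of the unavailability (growing within a phase, reset at each test) is what a purely Boolean model cannot represent on its own.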


In more complex situations where the components are not independent or what occurs in one phase depends on what has occurred in the previous ones, it is necessary to switch to the behavioural models and to the Monte Carlo simulation.
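Returning to the multistate case, the production availability defined above as the mathematical expectation of the system production can be sketched as a probability-weighted sum over the classes of states. The state probabilities and production rates below are hypothetical inputs; in practice they would come from a Markov model or from a Monte Carlo simulation.

```python
# Hypothetical steady-state results for a two-train plant:
# (state description, probability, production rate as a fraction of nominal).
states = [
    ("both trains running", 0.90, 1.00),
    ("one train running", 0.08, 0.50),
    ("plant shut down", 0.02, 0.00),
]


def production_availability(states) -> float:
    """Expected production rate, i.e. the sum of probability x rate over all states."""
    total_probability = sum(p for _, p, _ in states)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * rate for _, p, rate in states)


# Classical two-class view: the plant is "available" unless completely failed.
classical_availability = sum(p for _, p, rate in states if rate > 0.0)

print(production_availability(states), classical_availability)
```

The gap between the two figures comes entirely from the degraded state, which a two-class model counts as fully available.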

3.4.4 Reliability and Operational Data Selection

None of the approaches described above, even the simplest one, can be undertaken without a minimum of information about reliability and operational data. This information can take the form of categories based on the judgment of the experts performing the analyses: for example, certain, likely, possible, unlikely and rare to characterize the event occurrence, and catastrophic, critical, marginal and negligible for the severity of the consequences of the occurring events. Such rough approaches are widely used to assess the criticality of the events analysed in FMECA and to build risk matrices to verify that an acceptable level has been reached. Of course, when accurate probabilistic figures are wanted, expert judgment reaches its limits and more objective data have to be used. They are normally obtained through field feedback. This consists in organizing data collection on similar systems already in operation and using a statistical approach (see Chap. 38) to estimate reliability parameters like failure or repair rates. The more accurate the data collection, the more accurate the statistical estimations and the more accurate the calculations performed with the reliability models described above. The accuracy of the estimations depends on the size of the collected samples, and a single operator is often unable to collect enough data. This is why it is wise to collect data in a standardized way (e.g. ISO 14224 (2016)) in order to be able to exchange data with other operators of similar systems.
Even when implementing the most sophisticated model on the most powerful computer, it is naive to expect accurate probabilistic results without using relevant reliability data. Unfortunately, this is sometimes forgotten… Therefore, when undertaking a reliability study, a careful survey of the reliability data available in the industry sector has to be carried out (general data bases, specific data bases, in-house data bases) in order to be sure that the chosen model can be fed properly. When data are missing, data bases from other sectors can be used with some adaptations based on expert judgment, or a specific study can be launched to collect the field feedback available in-house and to estimate the missing data. As the sophistication of the available models and tools increases, the project leaders ordering the reliability studies are more and more demanding and require calculations beyond the classical reliability results. This implies that, beyond the classical failure and repair rates, more and more data are needed to get closer and closer to the real world. This arises especially when behavioural models (e.g. Petri nets) are used for economic estimation purposes. In this case, information about operation and maintenance philosophies (mobilisation of maintenance tools, spare parts management, priority for repairs, test and preventive maintenance frequency,…) is often needed.
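As a minimal illustration of the field feedback step, the sketch below estimates a constant failure rate as the number of observed failures divided by the cumulated operating time, and a repair rate as the inverse of the mean observed repair duration. The function names and the data are hypothetical, and the sketch leaves out the confidence intervals that a real statistical treatment would add.

```python
from statistics import mean


def estimate_failure_rate(n_failures: int, cumulated_operating_time: float) -> float:
    """Maximum-likelihood estimate of a constant failure rate (exponential law)."""
    return n_failures / cumulated_operating_time


def estimate_repair_rate(repair_durations) -> float:
    """Inverse of the mean observed repair duration."""
    return 1.0 / mean(repair_durations)


# Hypothetical pooled feedback from two operators running similar pumps (hours).
operators = [
    {"failures": 1, "operating_hours": 80_000.0},
    {"failures": 3, "operating_hours": 120_000.0},
]
total_failures = sum(op["failures"] for op in operators)
total_hours = sum(op["operating_hours"] for op in operators)

failure_rate = estimate_failure_rate(total_failures, total_hours)
repair_rate = estimate_repair_rate([6.0, 10.0, 8.0, 16.0])
print(failure_rate, repair_rate)
```

Pooling the two samples is only meaningful if the data were collected in a comparable way, which is the point of a common collection format.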

3.4.5 Qualitative Analysis

The first outputs which can be expected from the simplest models are qualitative. As described in 3.4.2, the elementary (inductive) approaches are very effective for producing qualitative results on the impact of potential detrimental causes (e.g. component failure, overpressure,…) and for identifying the issues needing further analysis. They are also effective for verifying that relevant means are implemented to detect a given detrimental cause. This is important qualitative information because a detrimental cause is obviously less dangerous when it is detected than when it is not: escalation can occur when it is not detected, whereas mitigation actions can be undertaken when it is detected. Therefore, identifying the non-detected detrimental causes is of utmost importance. It enables new detection means to be implemented, or periodic tests to be performed when no means of detection are available.

The inductive approaches are effective only for analysing single causes, and a step forward is to use the fault tree analysis or the reliability block diagrams to identify the combinations of events at component level leading to an unwanted event at system level. These combinations are called minimal cut sets (see Chap. 17). The order of a minimal cut set (i.e. the number of events it contains) provides a rough qualitative estimation of its probability of occurrence: the more events in the combination, the less probable the minimal cut set. The sequential models, which also provide combinations of events but organized in sequence, are also useful from a qualitative analysis point of view: the shorter the sequence, the more probable it should be. This makes it possible to sort the combinations by increasing order and to improve the systems by considering the lowest orders first. The single failures (order 1) are generally the weak points to be improved first for maximum efficiency. The Markovian and the behavioural models are mainly dedicated to probabilistic calculations and seldom used for qualitative analysis.
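The identification of minimal cut sets and their sorting by order can be sketched by brute force on a small example. The fault tree below is hypothetical, and the enumeration (testing every combination of failed components by increasing size) is a naive technique chosen for readability. It is valid only for coherent logic, where one more failure never repairs the system, and it does not scale to industrial fault trees, for which dedicated algorithms are used.

```python
from itertools import combinations

COMPONENTS = ["A", "B", "C", "D"]


def top_event(failed: set) -> bool:
    """Hypothetical fault tree: TOP = A OR (B AND C) OR (B AND D)."""
    return "A" in failed or ("B" in failed and ("C" in failed or "D" in failed))


def minimal_cut_sets(components, top):
    """Enumerate combinations by increasing order and keep those that trigger
    the top event without containing an already found smaller cut set."""
    found = []
    for order in range(1, len(components) + 1):
        for combo in combinations(components, order):
            candidate = set(combo)
            if any(mcs <= candidate for mcs in found):
                continue  # contains a smaller cut set, so it is not minimal
            if top(candidate):
                found.append(candidate)
    return found


for mcs in minimal_cut_sets(COMPONENTS, top_event):
    print(len(mcs), sorted(mcs))
```

Because the search proceeds by increasing size, the result is already sorted by order, with the single failures listed first.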

3.4.6 Quantitative Analysis

Semi-quantitative Analysis

When reliability data are available, they can be used to increase the accuracy of the qualitative results obtained above. This is generally done simply by calculating the probabilities of the single events and multiplying them to obtain the probabilities of the combinations. This gives a more accurate and detailed sorting than the one obtained on a purely qualitative basis and consolidates the identification of the points to be improved first. The severity of the combination (minimal cut set or sequence) can also be considered in order to sort according to the two criteria of probability and severity, i.e. risk.

Purely Quantitative Analysis

A step forward is to perform probabilistic calculations involving the system as a whole rather than considering the event combinations separately. All the advanced models (Boolean, Markovian and behavioural) can be used to perform the following calculations (see Chap. 4) more or less easily and more or less accurately:

• unavailability and average unavailability;
• failure frequency (unconditional failure intensity) and average failure frequency;
• Vesely failure rate (conditional failure intensity);
• unreliability;
• failure rate.
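The two levels of analysis can be contrasted on a small numerical sketch. The semi-quantitative step multiplies the probabilities of the events of each combination and sorts the combinations; the quantitative step evaluates the system as a whole, here by exhaustive enumeration of the component states. The component unavailabilities and the cut sets are hypothetical, the components are assumed independent, and the enumeration is only practical for a handful of components.

```python
from itertools import product
from math import prod

# Hypothetical component unavailabilities and minimal cut sets.
q = {"A": 1e-4, "B": 1e-2, "C": 2e-2, "D": 5e-3}
cut_sets = [{"A"}, {"B", "C"}, {"B", "D"}]


def cut_set_probability(cut_set, q) -> float:
    """Product of the probabilities of the events in the combination."""
    return prod(q[name] for name in cut_set)


# Semi-quantitative step: rank the combinations by probability.
ranked = sorted(cut_sets, key=lambda cs: cut_set_probability(cs, q), reverse=True)
# Adding the cut set probabilities gives an upper bound of the system unavailability.
rare_event_sum = sum(cut_set_probability(cs, q) for cs in cut_sets)


def exact_unavailability(cut_sets, q) -> float:
    """System as a whole: add up the probability of every component-state
    vector in which at least one cut set is fully failed."""
    names = sorted(q)
    total = 0.0
    for states in product([True, False], repeat=len(names)):
        failed = {n for n, is_failed in zip(names, states) if is_failed}
        if any(cs <= failed for cs in cut_sets):
            total += prod(q[n] if n in failed else 1.0 - q[n] for n in names)
    return total


print([sorted(cs) for cs in ranked], rare_event_sum, exact_unavailability(cut_sets, q))
```

With these figures the double failure of B and C ranks above the single failure of A, an ordering that the purely qualitative sorting by order would not reveal.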

Unreliability and failure rate are rather difficult to calculate with the Boolean approaches. This involves the use of the Birnbaum importance factor (see Chaps. 22 and 24) and can be done only approximately. Fortunately, when the failures are quickly detected and repaired, the approximations are pretty good. Warning: a reliability block diagram is not really devoted to calculating the system reliability but rather to calculating its probability of failure or its unavailability. The FT or RBD-driven Markov processes inherit the same limitations as the Boolean models. The Markovian models alone and the behavioural models overcome these difficulties and, in addition, make it possible to perform calculations devoted to multistate systems: system efficiency with regard to a given capability (e.g. production availability). They also make it possible to calculate maintainability, MUT (mean up time), MDT (mean down time), MTBF (mean time between failures), etc. (see Chap. 4). The behavioural models are the most flexible with regard to the results which can be obtained, and this is why they are increasingly used. The possibilities are virtually endless, but the price to pay is an increased computation time, which is less and less of a problem with present-day computers.


3.5 Comparisons and Decision

The qualitative and semi-quantitative analyses described above can be used to identify the top contributors to the issues identified at the first stage of the dysfunctional analysis. These can be weak points or combinations of events (critical paths). In addition, and beyond the probabilistic results regarding safety and/or dependability, the purely quantitative analyses can also provide information about:

• the impact of a given event on the results (see importance factor, Chap. 24);
• the sensitivity of the results with regard to the reliability parameter of a given component (or of a family of components).

When all the qualitative, semi-quantitative and purely quantitative results are obtained, they can be used as a decision aid by the managers in charge of the project to verify whether the regulatory, safety and economic objectives are achieved and, when they are not, to decide on the trade-offs necessary to improve the system in the most effective way.
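The two kinds of information listed above can be illustrated with the Birnbaum factor, taken here as one example of an importance factor: it is the difference between the system unavailability with the component surely failed and with the component surely working. The system, its formula and the figures below are hypothetical, and independent components are assumed.

```python
def system_unavailability(q) -> float:
    """Hypothetical system: fails if A fails, or if B fails together with C or D."""
    c_or_d = 1.0 - (1.0 - q["C"]) * (1.0 - q["D"])
    return 1.0 - (1.0 - q["A"]) * (1.0 - q["B"] * c_or_d)


def birnbaum(q, name: str) -> float:
    """System unavailability with the component surely failed minus surely working."""
    return system_unavailability({**q, name: 1.0}) - system_unavailability({**q, name: 0.0})


q = {"A": 1e-4, "B": 1e-2, "C": 2e-2, "D": 5e-3}
importance = {name: birnbaum(q, name) for name in q}

# Sensitivity check: change in system unavailability per unit change of q["B"].
h = 1e-3
slope_b = (system_unavailability({**q, "B": q["B"] + h}) - system_unavailability(q)) / h

for name, value in sorted(importance.items(), key=lambda item: item[1], reverse=True):
    print(name, value)
print("slope for B:", slope_b)
```

For independent components the system unavailability is linear in each component unavailability, so the finite-difference slope computed for B coincides with its Birnbaum factor: the same quantity measures both the impact of the event and the sensitivity to the component parameter.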

3.6 Prevention and Risk Mitigation

Figure 3.3 illustrates the various stages of a risk analysis, at several of which reliability studies can be undertaken. This figure shows how a situation can escalate step by step from a potential cause to a major event (accident). It gives the example of a gas leak which can produce a gas cloud which can be ignited and then produce an explosion if relevant actions (i.e. mitigating measures) are not undertaken to avoid the escalation.

[Figure 3.3: escalation from potential causes to a major event, illustrated by a gas leak. Stages: potential causes, dangerous drifts (gas leakage, initiating event), hazardous situation (gas cloud), accidental event (fire), escalation factor, major event (explosion), residual risk. Mitigating measures: preventive measures, corrective measures, passive protection, means and ultimate procedures.]

Fig. 3.3 Overview of reliability (safety and dependability) models

The first way to avoid such major events is to implement preventive and corrective measures to eliminate the potential causes (e.g. replace screwed connections by welded connections, use fireproof or explosion-proof components), decrease their probability of occurrence (e.g. use more reliable components or a better system design) or quickly detect the problems (e.g. install gas sensors in the right locations) and trigger a safety action before the escalation has time to begin.

When the preventive and corrective measures are not able to completely eliminate the causes or the escalation, it is then possible to implement physical features limiting the severity of the consequences on operators, other systems and the environment. For example, the avoidance of closed areas favouring explosions, the building of merlons or the establishment of a no-man’s-land can be used to limit the impact of explosions in the case of gas leaks.

At each step, reliability studies can be undertaken to identify weak points and the most detrimental failure scenarios and to estimate their probabilities of occurrence. This makes it possible to decide on the most effective trade-offs for reducing the risk until an acceptable level is reached: improvement of the design (e.g. implement redundancy), use of better components (e.g. better quality, atmosphère explosive (ATEX) approved), implementation of effective detection and safety systems,…

Even if relevant preventive, corrective and passive measures are taken, a major event can nevertheless occur. It is then necessary to ascertain the readiness of the intervention means (e.g. fire fighters) and ultimate procedures (e.g. evacuation of the hazardous area) needed to face such a situation.

References

Aupetit B (2020) Calcul d’indicateurs de sûreté de fonctionnement de modèles AltaRica 3.0 par simulation stochastique. Doctoral thesis of the University Paris-Saclay prepared at Centrale Supélec. Paris, France

ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International organization for standardization (ISO), Geneva, Switzerland

Pitto JP (1996) RELIASEP – a Technique for Safe Design. In: Product Assurance Symposium and Software Product Assurance Workshop, Proceedings of the meetings held 19–21 March, 1996 at ESTEC, Noordwijk, the Netherlands, p 345. Edited by Michael Perry. EAS SP-377, European Space Agency

Vogin R (1988) RELIASEP: une méthode d’analyse fonctionnelle de la fiabilité, revue de la sûreté de fonctionnement n 2. France, Paris

Wikipedia SA (2020): https://en.wikipedia.org/wiki/Structured_analysis. Accessed September 2020

Wikipedia SADT (2020): https://en.wikipedia.org/wiki/Structured_analysis_and_design_technique. Accessed September 2020

Chapter 4

Introduction of Basic Core Concepts

4.1 Preamble

Until now in this book, the reliability-related terms (reliability, availability, failure rate, etc.) have been used in their general, commonly understood meanings. However, when actually performing reliability studies, sound definitions are needed to avoid confusion and to ensure a common understanding between the various stakeholders. Many of these terms have long been used both in vernacular and technical languages and from various points of view (lay people, engineers, reliability engineers, maintenance engineers, managers, academics, …). Little by little, semantic drifts have taken place, leading to the significant polysemy observed nowadays: numerous terms in current use have several close but different meanings. The purpose of this chapter is to provide definitions and explanations of the core concepts used in the book. To avoid adding to the existing confusion, they are based, as far as possible, on the IEC 60050-192 (2015) standard (IEV 192 for short) which, developed by the IEC TC56 technical committee (see Chap. 37), should be the basis for reliability-related definitions. As this committee is mainly focused on dependability-related standards, these definitions will be completed by, and compared with, definitions provided by safety-related standards: IEC 61508 (2010) or ISO/TR 12489 (2013). IEC definitions are also available on the IEC web site Electropedia: The World’s Online Electrotechnical Vocabulary (IEC Electropedia 2020).

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_4


4.2 Item Definition

IEV 192 provides the following generic term to define the subject of a study:

Item: subject being considered (IEV 192-01-01).

The meaning is clarified in the notes added to this definition: an item may be an individual part, component, device, functional unit, equipment, subsystem or system, and each of them may consist of hardware, software, people or any combination thereof. This is in accordance with the definitions found in dictionaries (element, entity), and it is typically an umbrella term encompassing everything which can be considered as a whole in order to be analysed. This definition is self-contained and does not need further explanation.

4.3 States of an Item

4.3.1 Up and Down States

An item evolves over time and the international electrotechnical vocabulary, the IEV 192 (2015), proposes a list of various periods of time to be considered. This is represented in Fig. 4.1 where each period of time is related to a given state with the same name. This figure shows a first dichotomy of the overall time, between up and down times. This implies that the item states are divided into two complementary classes of states (up and down state classes), each gathering several other states (also represented in the figure). IEV 192 provides the following definitions of up and down states:

Up state: state of being able to perform as required (IEV 192-02-01).

Down state: state of being unable to perform as required, due to internal fault, or preventive maintenance (IEV 192-02-20).

[Figure 4.1: time line of an item. Up time (192-02-02), with mean MUT (192-08-09); down time (192-02-21), with mean MDT (192-08-10); operating time (192-02-05); non-operating time (192-02-05), comprising standby time (192-02-13), idle time (192-02-13), externally disabled time (192-02-13) and preventive maintenance (192-07-05); time to restoration (192-07-06).]

Fig. 4.1 Various times comprised within the up time


A difficulty appears as these complementary events (if the item is not in up state, it is in down state and vice versa) are not defined in a complementary form. The definition of down state seems to indicate that there is room for a third state where the item would be unable to perform as required, due to something different from an internal fault or from preventive maintenance. In fact, there is no such third state and, from a terminology point of view, the second part of the definition (due to internal fault, or preventive maintenance) is simply superfluous and should be moved into an explanatory note. Therefore, up and down states should be defined in the following complementary form:

Up state: state of being able to perform as required (IEV 192-02-01).

Down state: state of being unable to perform as required.

The above definitions are used throughout the remaining parts of this book.

4.3.2 Operating and Non-operating States

Figure 4.1 shows a second dichotomy of the overall time, between operating and non-operating times. Again, this implies that the item states are divided into two complementary classes of states (operating and non-operating state classes), each gathering several other states (also represented in the figure). It has to be noted that:

• The dichotomy between up and down states is different from the dichotomy between operating and non-operating states, and this is a little confusing as, in addition, a third dichotomy (not presented in the figure) between enabled and disabled states is proposed in IEV 192.
• Due to the splitting of the preventive maintenance between up and down states, the up state comprises a part of the non-operating state where the item still remains “able to perform as required”.

IEV 192 provides the following definitions of operating and non-operating states:

• Operating state: state of performing as required (IEV 192-02-04).
• Non-operating state: state of not performing any required function (IEV 192-02-06).

As shown in Fig. 4.1, operating and non-operating states are complementary states: the item is either operating or not operating. Therefore, their definitions must also be complementary to be consistent. This is not the case, as “performing as required” is not exactly equivalent to “performing any required function”. If the definition of operating state were correct, then the definition of non-operating state would be “state of not performing as required”. This does not seem correct because when the item is idle, externally disabled or under maintenance, it actually performs as required.


Therefore, it is the definition of non-operating state which is correct, and it leads to the complementary definition:

Operating state: state of performing any required function.

The above definition seems more in line with the intent of this concept, and it clarifies the subtle difference between up and operating states: in the up state the item is only able to perform a required function, whereas in the operating state it is actually performing this function. In the same way, in the down state the item is not able to perform a function, whereas in the non-operating state it is just not performing this function. However, while this seems correct when the item is idle, externally disabled or under preventive maintenance, it is questionable when the item is in standby state, as for example in the following cases:

• a redundant item operated in hot standby state but which is ready to replace a similar item as soon as it fails;
• a safety system operating in demand mode which spends most of the time in standby position but is ready to trigger a safety action as soon as a demand occurs.

Therefore, as is done for example in ISO 14224, it would be wise to consider the standby states as operating states. This is indicated by the arrow in Fig. 4.1. The above analysis illustrates the difficulty of designing a corpus of definitions taking all the particular cases into consideration.
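The two dichotomies discussed above can be laid out as a small data structure. This is only one reading of Fig. 4.1 and of the text: the class name, the field names, the state labels and the optional flag that counts standby as operating (the ISO 14224 convention mentioned above) are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemState:
    name: str
    able_to_perform: bool  # up / down dichotomy
    performing: bool       # operating / non-operating dichotomy


STATES = [
    ItemState("operating", True, True),
    ItemState("standby", True, False),
    ItemState("idle", True, False),
    ItemState("externally disabled", True, False),
    ItemState("preventive maintenance (item still able)", True, False),
    ItemState("preventive maintenance (item unable)", False, False),
    ItemState("under restoration", False, False),
]


def is_operating(state: ItemState, standby_counts_as_operating: bool = False) -> bool:
    """Apply the operating definition, optionally treating standby as operating."""
    if state.name == "standby" and standby_counts_as_operating:
        return True
    return state.performing


for state in STATES:
    print(f"{state.name:42s} up={state.able_to_perform!s:5s} operating={is_operating(state)}")
```

The table makes the asymmetry visible: every operating state is an up state, but several up states are non-operating.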

4.3.3 Restoration States Figure 4.1 shows that the down time is split between the restoration of item failures and the part of the preventive maintenance where the item is not able to perform as required. Both of them gather several other times, and Fig. 4.2 shows which times/states can be considered within the time/state to restoration. It brings together the dependability (IEV 192) and the functional safety (IEC 61508 or ISO/TR 12489) points of view. The proposal of IEV 192 is very comprehensive and easy to follow when the whole picture is presented. However, the differences between restoration, corrective maintenance and repair times are not readily understandable when they are considered independently from each other. The proposal of IEC 61508/ISO 12489 is less detailed but easier to understand and handle: the time to restoration is only split into four different times which can be gathered into two relevant times: • the fault detection time; • the overall repair time. This highlights the main difference between production and safety systems with regard to the restoration time:

Fig. 4.2 Various times comprised within the down time (excluding preventive maintenance) [Diagram not reproduced. Times shown from the IEV 192 point of view: maintenance time (192-07-02); preventive maintenance time (192-05-05); time to restoration (192-07-06; MTTR, 192-07-23); corrective maintenance time (192-07-07); fault detection time (192-07-11); administrative delay (192-07-12; MAD, 192-07-26); active corrective maintenance time (192-07-10; MACMT, 192-07-22); logistic delay (192-07-13; MLD, 192-07-27); technical delay (192-07-15); repair time (192-07-19; MRT, 192-07-21); fault localisation (192-07-18); fault correction (192-07-14); function checkout (192-07-16). Times shown from the functional safety point of view (IEC 61508, ISO 12489): time to restoration (IEC 61508-4, 3.6.21; MTTRes); fault detection time (MFDT); preparation and/or delay; effective (active) repair time (MART); de-consignment; (overall) repair time (MRT, IEC 61508-4, 3.6.22; MORT).]

• when a production system fails, the failure is generally revealed at once and the fault detection time is negligible; • when a safety system operated in demand mode fails due to an undetected dangerous failure, a long time can elapse before the failure is revealed and it is the overall repair time which is negligible. The functional safety decomposition is compatible with the IEV 192 decomposition except for the part of the administrative delay devoted to the final authorisation to put the item back in service (de-consignment), which is considered separately in functional safety standards. It has to be noted that among all the times identified above, only the repair time (also called effective or active repair time) is intrinsic to the item itself. All the other times depend on the operation and maintenance philosophies implemented in the plants in which this item is used. Therefore, while the (active) repair time is similar from one plant to another, the other times can vary considerably from one installation to another (e.g. between a subsea installation and an onshore installation). In IEV 192, the preventive maintenance time is also decomposed into various other times. This decomposition is similar to that of the corrective maintenance time.
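The contrast between production and safety systems described above can be sketched numerically. The following is only an illustration: the class name, field names and all numerical values are assumptions, and the 4380 h detection time assumes a yearly periodic test with a failure revealed, on average, halfway through the test interval.

```python
from dataclasses import dataclass


@dataclass
class RestorationTimes:
    """Functional-safety view of the time to restoration (mean values, in hours).

    All figures used with this class are hypothetical.
    """
    fault_detection: float  # mean fault detection time
    overall_repair: float   # mean overall repair time

    @property
    def time_to_restoration(self) -> float:
        # The time to restoration gathers the two relevant times.
        return self.fault_detection + self.overall_repair

    @property
    def detection_share(self) -> float:
        # Fraction of the time to restoration spent before the fault is revealed.
        return self.fault_detection / self.time_to_restoration


# Production system: the failure is revealed at once.
production = RestorationTimes(fault_detection=0.0, overall_repair=24.0)

# Safety system in demand mode: hidden failure revealed by a yearly test
# (assumed mean detection time = 8760 h / 2).
safety = RestorationTimes(fault_detection=4380.0, overall_repair=24.0)

print(production.time_to_restoration, production.detection_share)
print(safety.time_to_restoration, round(safety.detection_share, 4))
```

With these assumed values, the detection time contributes nothing for the production system, whereas it accounts for more than 99% of the time to restoration of the safety system.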

4.3.4 Degraded and Critical States Another way to classify the states of a system is to consider how far they are from perfect and down state classes.

Fig. 4.3 Example of degraded and critical states [State-transition diagrams not reproduced. Labels shown: up state class (perfect and degraded states; non-critical up state and critical up states) and down state class (down states and critical down states).]

This is illustrated in Fig. 4.3 by a state-transition diagram. On the left-hand side of the figure, the up state class has been split between the perfect state class and the degraded state class: • Perfect state: state where no failures have occurred; • Degraded state: state where some failures have occurred but where the system is still able to perform as required when required. When dealing with safety systems, the degraded states are characterized by an increased probability of failure when a safety action is demanded. When dealing with production systems, the degraded states are generally characterized by lower than nominal production levels. In the same Fig. 4.3, the up state class has been also split between non-critical and critical state classes regarding the transition from up state class to down state class: • Non-critical up state: up state distant from the down state class by more than one transition (e.g. more than one failure); • Critical up state: up state distant from the down state class by only one transition (e.g. one failure). The concept of critical state is very important as it is the basis to define the system failure density, failure frequency, failure intensities or failure rate (see Chaps. 22 or 31). It has to be noted that the perfect state can, itself, be a critical state: this is the case for e.g. the up state of non-redundant components comprised within a system. This concept can be extended to non-critical and critical down states regarding the transition from down state class to up state class (e.g. with regards to repairs). This is illustrated on the right-hand side of Fig. 4.3: • Non-critical down state: down state distant from the up state class by more than one transition (e.g. more than one repair); • Critical down state: down state distant from the up state class by only one transition (e.g. one repair).


In a similar way as above, the critical down states are the key for repair frequency, repair intensities or repair rate (see Chap. 31).
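The notions of critical up and critical down states can be made concrete on a small state-transition graph. The sketch below is only an illustration under stated assumptions: the function name and the graph encoding (a dictionary mapping each state to the set of states reachable in one transition) are hypothetical, and the example system is a 2-out-of-3 structure whose states are labelled by the number of failed components.

```python
def critical_states(transitions, up, down):
    """Return (critical up states, critical down states).

    A critical up state has at least one direct transition to the down class;
    a critical down state has at least one direct transition to the up class.
    """
    critical_up = {s for s in up if transitions.get(s, set()) & down}
    critical_down = {s for s in down if transitions.get(s, set()) & up}
    return critical_up, critical_down


# Hypothetical 2-out-of-3 system: state = number of failed components.
# Failures move one step to the right, repairs one step to the left.
transitions = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
up, down = {0, 1}, {2, 3}

crit_up, crit_down = critical_states(transitions, up, down)
print("critical up:", crit_up, "non-critical up:", up - crit_up)
print("critical down:", crit_down, "non-critical down:", down - crit_down)

# Non-redundant component: the perfect state is itself a critical state.
single_up, _ = critical_states({0: {1}, 1: {0}}, {0}, {1})
print("critical up (single component):", single_up)
```

In this example, state 1 (one failure away from the down class) is the only critical up state and state 2 (one repair away from the up class) is the only critical down state.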

4.4 Failure and Fault Concept 4.4.1 Failure Definition Figures 4.1 and 4.2 deal with the time to restoration and the corrective maintenance time, which means that some failures can occur which have to be repaired. The term failure is now analysed, starting with the definition given in IEV 192: Failure: loss of ability to perform as required (IEV 192-03-01). From simple logical reasoning, it is clear that the item has to be able to perform as required before it can lose its ability to perform as required. This definition implies that a failure can occur only from a state of being able to perform as required, i.e. from the up state. It also implies that the result is a state of being unable to perform as required, i.e. the down state. Therefore, this definition leads to seeing the failure as a jump from the up state to the down state. This view is simple and widely used but, unfortunately, failures can occur from every state of the item: for example, a hidden failure can occur during the preventive maintenance or the restoration operations, and it will be revealed only when the item is used to perform a required function. Such failures are often due to human errors and, in this case, a failure is a jump from one state of the down state class to another state of the down state class. This implies that a further failure of an item can occur even if the ability to perform as required is already lost. In addition, Fig. 4.1 also shows that an item can jump from the up state to the down state due to the start of preventive maintenance: this is a normal operation and obviously not a failure. Another problem occurs when dealing with the safe failures of safety systems: not only is the ability to perform the safety action as required not lost, it is even improved. In fact, this improvement with regard to safety is a failure with regard to spurious safety actions. This highlights that the nature of the event (failure or not) depends on the context and on the concerns of the analyst (safety, dependability or both), and this is likely to arise when antagonistic functions are implemented. To cope with the above difficulties, it could be useful to slightly change the definition to: Failure: loss of ability to perform a given function as required, not due to preventive maintenance.


With this definition, a failure is not triggered by the beginning of normal preventive maintenance operations and the failure is not seen globally but required function by required function: if the item is failed with regards to a given function, it can fail also with regards to another required function.

4.4.2 Fault Definition The concept of fault is closely linked to that of failure and this is why they are analysed together. The definition of IEV 192 is the following: Fault: inability to perform as required, due to an internal state (IEV 192-04-01). This is not very clear but, fortunately, it is clarified by a note: Note 1 to entry: A fault of an item results from a failure, either of the item itself, or from a deficiency in an earlier stage of the life cycle, such as specification, design, manufacture or maintenance. From this note it is clear that a fault is a state resulting from a failure, which is an event. Unfortunately, the following definition is also found in IEV 192: Software failure: failure that is a manifestation of a dormant software fault (IEV 192-03-22). Here, it is the failure which appears to be the result of a fault, in contradiction with the note above. This is also a constant source of confusion between reliability engineers, who consider that a fault results from a failure, and software engineers, who consider the contrary. In fact, there is a chain which begins with a failure, e.g.: coding failure ⇒ dormant software fault ⇒ software failure. Therefore, it is the initial failure which is actually the cause of the software failure. Anyway, in the remaining part of this book, it is considered that a fault is a state resulting from a failure, which is an event. This is illustrated in Fig. 4.4 for a failure occurring when the item is in the up state after a time equal to TTF (Time To Failure).

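Fig. 4.4 Failure occurring in up state and corresponding faulty state [Timing diagram not reproduced: the item is in the up state until the failure (event) occurs at time TTF, after which it is in the down state, i.e. the fault (state).]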


4.4.3 Failure and Fault Classification The failures/faults can be classified according to different criteria: • hardware versus software and human failures; • systematic versus random failures; • revealed versus not revealed failures; • time-dependent versus demand-dependent failures; • dangerous versus safe failures. Failures are analysed in detail in the subsections detailed hereafter.

4.4.3.1 Hardware Versus Software and Human Failures

An industrial system is made of physical components (hardware) which are becoming smarter and smarter (software) and which are operated by human beings. This provides three different sources of failures/errors: • hardware; • human beings; • software. At present, hardware, software and human failures/errors are handled by different specialists who often have difficulty communicating with each other because their domains of expertise are far apart. The result is that the three types of failures are mainly analysed separately and that their interactions are not accurately analysed. Developing an integrated approach able to handle the three types of failures within the same framework is certainly a challenge for the future. The present book is focused on the methods and tools developed mainly to handle hardware failures. Some of them can handle software or human failures to some extent, but the very specific approaches related to the analysis of software and human failures are beyond the scope of this book.

4.4.3.2 Random Versus Systematic Failures

Only systematic failures are defined in IEV 192, so the definition of random failure below is proposed to complement it: Random failure: failure whose occurrence is aleatory in time. Systematic failure: failure that consistently occurs under particular conditions of handling, storage or use (IEV 192-03-10). A random failure is an ordinary failure which occurs in a random way and whose times to failure can be modelled by using a probability distribution. Therefore, it occurs with a given probability. It is generally linked to the natural physical degradation of the item and it cannot be removed. However, its probability can be reduced by using an item of better quality. It is this kind of failure which is modelled when quantitative calculations are performed. A systematic failure is a failure which occurs in a deterministic way as soon as particular conditions are met. Therefore, its probability of occurrence is the probability of occurrence of these particular conditions. If these particular conditions never occur, the systematic failure will not occur, and this is why it may be observed only after a long time, when the operating conditions have changed. A systematic failure is generally linked to errors in specification, design or fabrication. Contrary to a random failure, it can be removed, but this requires modifying the item design or the item software. It is difficult to obtain reliability data about such failures, which are generally analysed on a qualitative basis. Hardware and human beings are both subject to random failures and some examples are proposed in Table 4.1.

Table 4.1 Examples of random failures
Hardware failures        Human failures
Early failures           Omissions/errors
Catalectic failures      Non-routine task
Wear out failures        Operation in hurry/delayed operation
Leaks                    Communication problems
Mechanical jam           Repetitive spurious safety action

In Table 4.1, the early failures, also called youth or infant failures, occur when an item begins to be used. They can be either random (with a high probability of occurrence) or systematic. Debugging is generally undertaken to identify and remove such failures as soon as possible by improving the design or the operating procedures. This can be done off-line or on-line. The wear out failures are the failures due to the progressive degradation of the item when it is used. Preventive maintenance is generally undertaken to delay these failures as long as possible. The catalectic failures occur after the early life period and before the wear out period, during the useful life of the item. The term catalectic failure is an important core concept. It has been used since the origin of reliability works but seems almost forgotten nowadays. It is not defined in IEV 192 but, fortunately, has been brought back to memory by ISO/TR 12489: Catalectic failure: sudden and complete failure (ISO/TR 12489).
The distribution of the time to such a failure is an exponential distribution with a constant failure rate (see 4.7.6.1). This is the basis for the Markovian approach (see Chap. 31) and without this property most of the analytical probabilistic calculations would not be tractable. However, constant failure rates are not used only because they facilitate the calculations: they are also pretty well adapted to model the failures of electronic devices (for which they were first used) and of items within their useful life (see 4.7.6.3). Any analyst using constant failure rates should be aware that he or she is dealing with catalectic failures and that such failures occur without warning and cannot be forecast just by examining the item. This is the contrary of failures which are the result of a progressive degradation (e.g. wear out) and can be prevented by inspection and/or tests. The counterpart of catalectic failures are the failures whose probability increases when the item is in use or just because time elapses: Wear out failure: failure due to cumulative deterioration caused by the stresses imposed in use (IEV 192-03-15). Ageing failure: failure whose probability of occurrence increases with the passage of calendar time due to cumulative deterioration (IEV 192-03-16). The wear out failures are related to items actually in use whereas ageing failures are related, for example, to items in storage. Hardware, software and human beings are subject to systematic failures and some examples are given in Table 4.2.

Table 4.2 Examples of causes of systematic failures
Hardware failures          Software failures        Human failures
Specification              Specification            Lack of training
Design                     Coding                   Wrong mental picture
Installation               Implementation           Communication problem
Unexpected constraints     Inappropriate updates    Wrong procedures
Wrong circulating fluid    Inappropriate tests      Ergonomic problems (HMI)

Mistakes/errors made at early stages are an important source of early hardware or software failures. They can generally be detected quickly after the item enters into use: this is the debugging phase. Incidentally, these kinds of software failures are called “bugs”. Systematic failures also occur when the items are used in inappropriate ways or when inappropriate changes are performed. Systematic human failures are often due to a misunderstanding of what is actually happening, inappropriate procedures or confusing human-machine interfaces (HMI).

4.4.3.3 Revealed Versus Not Revealed (Hidden) Failures

When a failure occurs, the corresponding fault remains hidden until it is detected. If the detection is done before it leads to the occurrence of an incident/accident, a restoration can be undertaken and the incident/accident avoided. If it is not detected, then the corresponding incident/accident can actually occur and this would, de facto, reveal that the failure has occurred.


Therefore, strictly speaking, any failure is revealed at one moment or another, but the term revealed failure is used for a failure detected before it actually leads to the occurrence of an incident/accident. This leads to another classification, between the failures which are known and those which are unknown: Revealed failure: failure which has become evident to operations and maintenance personnel; Not revealed failure/hidden failure: failure which has not become evident to operations and maintenance personnel. The above definitions are slightly different from what is proposed in ISO/TR 12489, and the faults due to hidden failures are close to the latent and dormant faults introduced in IEV 192. Among the revealed failures, several classes can be identified: • Self-revealed failures: failures revealed as soon as they occur (e.g. a motor which stops); • Failures revealed by diagnostic tests: failures quickly detected by diagnostic tests performed at high frequency (e.g. short circuit of a sensor); • Failures revealed by periodic tests: failures whose detection is delayed until a specific test is performed. In the functional safety standards, the first two classes are gathered into a single class named “detected failures”. When a failure (or the corresponding fault) is revealed, corrective maintenance can be undertaken to prevent related incidents/accidents. Of course, this is more effective when the failure is quickly revealed and this is why the detection of failures is so important with regard to the prevention of incidents/accidents. This is also why it is so important to undertake the inductive approaches (FMECA, HAZOP, see Chap. 3) to identify the potential failures and their detection means. Nevertheless, the delay before starting the repair depends on the criticality of the revealed failure. When several revealed failures/faults have to be repaired at the same time, the priority is generally the following: 1. dangerous failures (e.g. related to safety systems); 2. failures resulting in the immediate loss of a required function (e.g. valve closure stopping the production flow); 3. failures resulting in the loss of a required function only while they are being repaired (e.g. valve stuck open in a production pipe); 4. failures with no impact on the required functions (e.g. a burned-out light bulb).

4.4.3.4 Operating Versus Non-operating Failures

As said in Sect. 4.4.1, an item failure can occur from any state and therefore the failures can also be classified according to the state from which they occur. The general classification is then:

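Fig. 4.5 Example of operating and non-operating failures [Timing diagrams not reproduced: two panels, one for an operating failure and one for a non-operating failure, each showing the up, operating (Op.), non-operating (Non-op.) and down states against time T, up to the time to failure TTF.]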

• Operating failure: failure occurring when the item is in operating state (i.e. during the operating time); • Non-operating failure: failure occurring when the item is in non-operating state (i.e. during the non-operating time). This is illustrated in Fig. 4.5 where the time to failure (TTF) is the sum of the time $UT_O$ spent in operating state and of the time $UT_{nO}$ spent in non-operating state. The item fails during $UT_O$ on the right-hand side but not on the left-hand side. Obviously, $UT_O$ does not play the same role in both cases and this will be analysed further in Sect. 4.6.3. This classification can be detailed by taking into account the various states belonging to the non-operating state class as, for example: • Standby failure: failure occurring when the item is in standby state (i.e. during the standby time); • Maintenance failure: failure occurring when the item is in maintenance state (i.e. during the maintenance time). The above analysis could be refined by introducing idle failures, corrective maintenance failures, repair failures, etc. It is the job of the analyst to identify exactly which kind of failure is being dealt with.

4.4.3.5 Time-Dependent Versus Demand-Dependent Failures

Failures can also be classified depending on whether their probability of occurrence is time-dependent or time-independent: Time-dependent failure: failure whose probability of occurrence depends on the elapsed time. Such a failure is characterised by a random time to failure, i.e. by a time-dependent failure distribution $F(t) = \Pr(TTF \le t)$ such as, for example, an exponential distribution. Typical examples of time-dependent failures are early life, catalectic and wear out failures. Time-independent (demand-dependent) failure: failure whose probability of occurrence depends on the number of operations or cycles. Such a failure is characterised by a constant probability of failure, γ, when the item is started or stopped. Typical examples of time-independent failures are the following:


– mechanical failure of a relay depending on the number of position changes; – over-voltage failure of an electric device which undergoes over-voltages each time it is switched on or switched off; – mechanical failure of a standby device due to the rapid change of state when it is started. The common feature of the above examples is that the probability of the time-independent failures depends on the number of cyclic changes rather than on the time elapsed between these changes. When the frequency of the cycles is constant, it is possible to define a probability of failure per unit of time but this probability is a simple ratio which has nothing to do with a failure distribution. Standby systems are subject to time-independent failures when they are activated to perform a required function. This is particularly the case for safety systems used in demand mode of operation. It has to be noted that the time-independent failures occurring on demand were called “on-demand failures” for a long time, until the meaning of this term changed when the functional safety standards decided to use it to name the “failures observed when a demand occurs”, which is not the same thing. This explains why the genuine on-demand failures are, most of the time, just ignored in the analyses of safety instrumented systems. This inaccurate appellation is a source of ambiguity consistently observed in the functional safety field. The periodically tested items with the two types of failures are analysed in detail in ISO/TR 12489 (2013).
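The difference between the two kinds of failures can be sketched as follows. This is a minimal illustration, not a method taken from the standards: the function names and numerical values are hypothetical, the time-dependent case assumes an exponential distribution with constant failure rate, and the demand-dependent case assumes that successive demands are independent.

```python
import math


def time_dependent_failure_probability(failure_rate, t):
    """F(t) = Pr(TTF <= t) for an exponentially distributed time to failure."""
    return 1.0 - math.exp(-failure_rate * t)


def demand_dependent_failure_probability(gamma, n_demands):
    """Probability of at least one failure over n demands.

    gamma is the constant probability of failure per demand; the demands
    are assumed independent (an assumption of this sketch).
    """
    return 1.0 - (1.0 - gamma) ** n_demands


def failures_per_unit_of_time(gamma, demands_per_unit_of_time):
    """Simple ratio available when the demand frequency is constant.

    This is not a failure distribution, only an average number of failures
    per unit of time.
    """
    return gamma * demands_per_unit_of_time


# Hypothetical values: failure rate of 1e-4 per hour, gamma of 1e-2 per demand.
print(time_dependent_failure_probability(1e-4, 1000.0))  # grows with elapsed time
print(demand_dependent_failure_probability(1e-2, 2))     # grows with the number of demands
print(failures_per_unit_of_time(1e-2, 4.0))              # e.g. 4 demands per year
```

The first probability changes when only the elapsed time changes, whereas the second one changes only with the number of demands, whatever the time elapsed between them.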

4.4.3.6 Dangerous Versus Safe Failures

Another classification of failures is related to their impact on safety. With this point of view a failure can be safe or unsafe. These concepts are defined in the functional safety standards and the definitions proposed in ISO/TR 12489 fit with this purpose: • Safe failure: failure of a safety system which tends to favour a given safety action (ISO/TR 12489 2013). • Unsafe failure, dangerous failure: failure of a safety system which tends to impede a given safety action (ISO/TR 12489 2013). The concept of safe failure is directly linked to the concept of fail-safe systems: • Fail-safe system: system with only safe failures. The above definitions encompass several degrees of safeness/unsafeness and safe failures can be split between: • Non-critical safe failures: safe failures which just increase the probability of success of a safety action; • Critical safe failures, spurious failures: failures which untimely trigger this safety action.


It has to be noted that spurious failures can lead to dangerous situations (e.g. when restarting the installation). A failure which is safe with regard to a given situation can be unsafe with regard to another one, and the analyst should be very cautious about that. For example, too many spurious failures are likely to lead to the disconnection of the safety system, which obviously changes the safe failure into an unsafe failure. Unsafe failures can also be split between: • Non-critical unsafe failures: unsafe failures which just decrease the probability of success of a safety action; • Critical unsafe failures, critical dangerous failures: failures which completely inhibit this safety action. It has to be noted that even if all unsafe failures are called “dangerous” in the functional safety standards, the non-critical dangerous failures are not really dangerous as the safety action is still available. Only the critical dangerous failures are really dangerous. It is important to notice that, contrary to the classifications analysed in the previous subsections, these properties of criticality are not “intrinsic” but “systemic”: a given safe/unsafe failure may be non-critical or critical depending on the state of the whole safety system. Let us consider a system of three sensors (A, B and C) organized in 2 out of 3 logic (see Chap. 36): • If A has a safe failure first, it is a non-critical safe failure, but if B has already had a safe failure, the same safe failure of A will lead to a spurious failure. • If A has an unsafe failure first, it is a non-critical unsafe failure, but if B has already had an unsafe failure, the same unsafe failure of A will lead to a critical dangerous failure. Although their definitions are very different, the terms systematic and systemic are sometimes mixed up because they sound and look alike. The difference is explained in more detail in Selvik and Signoret (2020).
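The systemic nature of criticality in the 2 out of 3 example can be sketched as a small classification function. The function name, its arguments and the generalisation to a k-out-of-n vote are assumptions of this sketch: a safe-failed sensor is taken to vote permanently for the safety action, and an unsafe-failed sensor is taken to be unable to vote.

```python
def classify_sensor_failure(kind, earlier_same_kind, k=2, n=3):
    """Classify a new sensor failure in a k-out-of-n voting logic.

    kind: "safe" or "unsafe".
    earlier_same_kind: number of other sensors already failed in the same way.
    """
    if kind not in ("safe", "unsafe"):
        raise ValueError("kind must be 'safe' or 'unsafe'")
    total = earlier_same_kind + 1  # failures of this kind once the new one has occurred
    if kind == "safe":
        # k sensors voting for the action are enough to trigger it untimely.
        return "spurious failure" if total >= k else "non-critical safe failure"
    # With n - k + 1 silent sensors, fewer than k sensors remain able to vote.
    if total >= n - k + 1:
        return "critical dangerous failure"
    return "non-critical unsafe failure"


# Sensor A fails first, then fails after B has already failed in the same way.
print(classify_sensor_failure("safe", 0))    # non-critical safe failure
print(classify_sensor_failure("safe", 1))    # spurious failure
print(classify_sensor_failure("unsafe", 0))  # non-critical unsafe failure
print(classify_sensor_failure("unsafe", 1))  # critical dangerous failure
```

The same physical failure of sensor A thus receives a different classification depending only on the state of the other sensors.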

4.4.4 Failure Cause, Failure Mode A given item can fail due to various causes and in various ways. The causes are simply called failure causes and the ways are called failure modes. This is simply defined in IEV 192 as: Failure cause: set of circumstances that leads to failure (IEV 192-03-11). Failure mode: manner in which failure occurs (IEV 192-03-17).


This can be illustrated by the failure modes of a valve, which are the following: – stuck open or stuck closed; – spurious opening/closure; – external/internal leaks. The concept is important as a failure mode can lead to the loss of a given function and have no impact on another one, or can be dangerous for a given function and safe for another one. Let us consider, for example, an emergency shutdown valve protecting a production system: – the failure mode “stuck open” leads to a dangerous situation as the safety action (closure of the valve) is inhibited, but it has no impact on the production function; – the failure mode “spurious closure” is safe with regard to the safety action but has a strong detrimental impact on production; – the failure mode “external leak” has an impact on both safety (with regard to operators) and production (the installation will have to be shut down).
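The emergency shutdown valve example above amounts to a table giving, for each failure mode, its impact on each function. A minimal sketch of such a table is given below; the names and the impact categories are hypothetical, and classifying the external leak as dangerous for safety is an interpretation of "impact on safety with regard to operators".

```python
from enum import Enum


class Impact(Enum):
    NONE = "no impact"
    SAFE = "safe"
    DANGEROUS = "dangerous"
    DETRIMENTAL = "detrimental"


# Failure mode -> impact on each function (illustrative encoding of the text).
ESD_VALVE_FAILURE_MODES = {
    "stuck open": {"safety": Impact.DANGEROUS, "production": Impact.NONE},
    "spurious closure": {"safety": Impact.SAFE, "production": Impact.DETRIMENTAL},
    "external leak": {"safety": Impact.DANGEROUS, "production": Impact.DETRIMENTAL},
}


def modes_impairing(function):
    """Failure modes with an adverse impact on the given function."""
    adverse = (Impact.DANGEROUS, Impact.DETRIMENTAL)
    return sorted(
        mode
        for mode, impacts in ESD_VALVE_FAILURE_MODES.items()
        if impacts[function] in adverse
    )


print(modes_impairing("safety"))
print(modes_impairing("production"))
```

Listing the modes function by function makes it visible that the same failure mode can be harmless for one function and adverse for another.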

4.4.5 Common Cause, Common Mode and Single Failures When several similar items are used and operated in the same way within a system, they can also fail in the same way and this leads to the concept of common mode failures: Common mode failures (CMFs): failures of different items characterized by the same failure mode (IEV 192-03-19). When several similar items are used and operated in the same way within a system, they can also fail due to the same cause and this leads to the concept of common cause failures: Common cause failures (CCFs): failures of multiple items, which would otherwise be considered independent of one another, resulting from a single cause (IEV 192-03-18). Common cause failures can lead to common mode failures. CCF is a very important concept as it is the main limiting factor when implementing redundancy (fault tolerance) to reduce the probability of failure of a system: Redundancy: provision of more than one means for performing a function (IEV 192-10-02). Fault tolerance: ability to continue functioning with certain faults present (IEV 192-10-09). The limitation of fault tolerance is illustrated in Fig. 4.6 where the redundancy increases from 0 (1oo1 architecture on the left-hand side) to 5 (1oo6 architecture on the right-hand side). The components used in these systems are similar, with a probability of failure equal to $10^{-1}$. According to the classic probabilistic calculations, the probabilities should decrease from $10^{-1}$ to $(10^{-1})^6 = 10^{-6}$ whereas, in fact, they only decrease from $10^{-1}$ to $10^{-3}$ due to a CCF with a probability of $10^{-3}$. This figure shows an important impact when the redundancy increases from 0 to 2, a small improvement when going from 2 to 3 and almost nothing after. This illustrates the fact that it is not possible to decrease the probability of failure to zero just by implementing redundancy (see Chap. 23).

Fig. 4.6 Limitation of the effectiveness of redundancy by common cause failures [Plot not reproduced: probability of failure of the k/n architecture, on a logarithmic scale from 1 down to $10^{-3}$, for the 1oo1 to 1oo6 architectures, with the CCF contribution indicated.]

This limitation can be mitigated to some extent by implementing particular forms of redundancy: – diverse redundancy (IEV 192-10-13), i.e. components from various providers or based on several principles. But this may be difficult to manage from a maintenance point of view; – standby redundancy (IEV 192-10-04) or mixed redundancy instead of active redundancy (IEV 192-10-04), i.e. a single item or a few items operating simultaneously instead of all the items. Anyway, this limit cannot be completely removed and this is the occasion to talk about the ultimate CCF, which defines a limit impossible to cross: Ultimate CCF: the statistical experience about the life duration of the universe is 13 billion years and that about the life duration of Earth is 3.5 billion years. Therefore, the probability that the universe disappears next year is about $7.7 \times 10^{-11}$ and the probability that the Earth disappears is about $2.9 \times 10^{-10}$. This implies that every study concluding with a probability of failure lower than $10^{-10}$ over the next year has just forgotten this ultimate common cause of failure. Outside the redundancy consideration, CCFs have to be thoroughly identified and analysed to be sure that events which are expected to be independent are really independent. This is particularly important when dealing with highly reliable systems, where it is of utmost importance to identify the single failures, which are called single-point failures in IEV 192: Single-point failure: system failure caused by the failure of only one of its constituent items (IEV 192-10-01). Single failures generally constitute weak points of the systems under study and are often due to common causes.
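The limitation of redundancy by common cause failures discussed in this section can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the CCF is assumed to be independent of the individual component failures, and the same CCF probability is applied to every architecture, including 1oo1 where it is negligible.

```python
def failure_probability_1oon(p, n, p_ccf):
    """Probability of failure of a 1-out-of-n architecture with a CCF.

    The system fails if all n components fail independently (p ** n)
    or if the common cause failure occurs (p_ccf).
    """
    independent_part = p ** n
    return 1.0 - (1.0 - independent_part) * (1.0 - p_ccf)


def ultimate_ccf_probability(observed_years):
    """One event over the observed duration, expressed per year."""
    return 1.0 / observed_years


# Component probability of failure 1e-1 and CCF probability 1e-3, as in the text.
values = [failure_probability_1oon(0.1, n, 1e-3) for n in range(1, 7)]
for n, value in enumerate(values, start=1):
    print(f"1oo{n}: without CCF {0.1 ** n:.1e}, with CCF {value:.2e}")

print(f"universe: {ultimate_ccf_probability(13e9):.1e} per year")
print(f"Earth:    {ultimate_ccf_probability(3.5e9):.1e} per year")
```

Under these assumptions the probability of failure drops quickly for the first levels of redundancy and then stays just above the CCF probability, however many components are added.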

4.4.6 Critical Failures and Repairs/Restorations Figure 4.7 illustrates the counterpart of the critical up and down states described in 4.3.4: the critical failures and the critical repairs/restorations: • Critical failure: failure leading directly from an up to a down state; • Critical repair/restoration: repair leading directly from a down state to an up state. Combining the critical up state probabilities with the corresponding critical failure probabilities makes it possible to calculate the system failure density, failure frequency, failure rate or failure intensity (see Chaps. 22 and 31). In the same way, combining the critical down state probabilities with the corresponding critical repair/restoration probabilities makes it possible to calculate the system repair/restoration density, repair/restoration frequency, repair/restoration rate or repair/restoration intensity (see Chap. 31).

Fig. 4.7 Example of critical failures and critical repairs [State-transition diagrams not reproduced. Labels shown: up state class (perfect and degraded states; non-critical up state and critical up states), down state class (down states and critical down states), critical failures and critical repairs.]


4.5 Maintenance Related Concepts

4.5.1 Maintenance, Restoration and Repair Definitions

The concept which comes immediately to mind after “failure” is, obviously, “maintenance”. These two concepts are closely related, as shown by the definition provided in IEV 192:

Maintenance: combination of all technical and management actions intended to retain an item in, or restore it to, a state in which it can perform as required (IEV 192-06-01).

The aim of maintenance is therefore twofold (retain and restore) and the maintenance operations can be split accordingly:

Preventive maintenance: maintenance carried out to mitigate degradation and reduce the probability of failure (IEV 192-06-25).

Corrective maintenance: maintenance carried out after fault detection to effect restoration (IEV 192-06-06).

The above definition of preventive maintenance does not cover the periodic tests performed to reveal hidden failures before they lead to a detrimental situation. In this case, the preventive maintenance is not used to reduce the probability of the failure itself but to reduce the probability that this failure leads to a detrimental consequence. Such periodic tests are systematically performed to detect the dangerous undetected failures of the safety systems operating in demand mode (see Chap. 36) and they are the basis to ensure the required probability of success of the safety action when a demand occurs. They are part of the maintenance policy and do belong to the preventive maintenance.

Figure 4.2 shows that, according to IEV 192, the corrective maintenance time does not encompass the administrative delays (e.g. to issue work orders) and this is also not in line with the above definition. Therefore, the definitions of preventive and corrective maintenance have to be completed in the following ways:

Preventive maintenance: maintenance carried out to mitigate degradation, reveal hidden failures and reduce the probability of failure.

Corrective maintenance: maintenance carried out after fault detection and administrative delays to effect restoration.

Figure 4.2 also shows that the maintenance time is split into the preventive maintenance time and the time to restoration. Therefore, the concept of restoration is the next one to be defined:

Restoration: event at which an up state is re-established after failure (IEV 192-06-23).

According to this definition, the restoration is an event where the item is re-established in the up state. That means that the item was in the complementary state just before this event occurs and, finally, the restoration is an event where the item moves from the down state to the up state: this is the counterpart of the failure, where the item generally moves from the up state to the down state.

Figure 4.2 also shows that, according to functional safety standards, the time to restoration is split between the fault detection time and the overall repair time. This splitting clearly identifies the period where no repair is possible because the fault is unknown (fault detection time) and the period when a repair is possible because the failure has been revealed. This is very important when dealing with safety systems operated in demand mode, for which failures are detected by performing periodic tests.

Fault detection time: time interval between failure and detection of the resulting fault (IEV 192-07-11).

Overall repair time: restoration time excluding fault detection time. The overall repair time starts when the fault has been detected and finishes when the restoration is completed.

The last term to be defined in this subsection is “repair” because it is used in different ways. In IEV 192, repair and repair time are defined as follows:

Repair: direct action taken to effect restoration (IEV 192-06-14).

Repair time: part of active corrective maintenance time taken to complete repair action (IEV 192-07-19).

Surprisingly, the term “repair” is not used to define “repair time” and the definition of “repair time” mentions the “repair action” whereas repair is already defined as an action. This is not completely consistent from a terminology point of view. It has to be noted that IEC 61508 talks about “effective time to repair” and ISO 14224 and ISO/TR 12489 about “active repair time” instead of “repair time”. This seems more accurate and meets the “repair time” definition where the “repair” is a part of active corrective maintenance.

4.5.2 Repairable Versus Repaired Items

With regard to failures, the items can be split into two different categories according to whether they are repairable or not. Again, IEV 192 provides the following definitions:
• Repairable item: item that can, under given conditions, after a failure, be returned to a state in which it can perform as required (IEV 192-01-11).
• Non-repairable item: item that cannot, under given conditions, after a failure, be returned to a state in which it can perform as required (IEV 192-01-11).

These definitions are based on the concept of “failure”, which has already been analysed above. Surprisingly, the concept of repair is not actually used in the definitions: “state in which it can perform as required” is used instead. Therefore, these concepts are more related to restoration than to repair as defined in IEV 192-06-14 (see 4.5.1). As the return to an up state is clearly mentioned, these definitions could be more simply stated as:

Repairable item: an item that can, under given conditions, after a failure, be returned to an up state;

Non-repairable item: an item that cannot, under given conditions, after a failure, be returned to an up state.

In fact, with regard to the reliability analysis, what is relevant is not that an item is repairable but that this repairable item is actually repaired after it has failed. The concepts of repaired/non-repaired items were used in previous IEV 192 issues instead of repairable/non-repairable items, but they are now deprecated. As they are no longer in use in this standard, they can be used here with the following meanings:

Repaired item: repairable item which is actually restored when faulty.

Non-repaired item: non-repairable item, or repairable item which is not actually restored when faulty.

These definitions take into account how the items are actually operated when they fail, and the last one makes it possible to handle the non-repairable items and the repairable items not actually repaired within the same mathematical framework of non-repaired items. This is useful when reliability calculations are performed (see e.g. Chap. 22).

4.6 Acronyms and Operational Concepts

4.6.1 General Considerations

The mean values of many of the times that make up the up and down times are of interest and specific acronyms have been given to them. Some of them (MUT, MDT, MTTF, MTTR …) are indicated in Figs. 4.2 and 4.4. As these various times are random variables, the mean values are defined as expected values and this implies that they fall into the probabilistic domain. However, when field data feedback is implemented, these expected values can be estimated by statistical calculations. Therefore, the probabilistic parameter and its statistical estimation should not be confused. Some of these mean values (e.g. MTTF) lead directly to the estimation of reliability parameters (e.g. failure rate) to be used in probabilistic calculations and, in turn, the results of these predictions can be verified from the field feedback. Therefore, the mean time values constitute an efficient bridge between real life (items in operation) and probabilistic forecasting.

4.6.2 MUT and MDT

Up and down times are defined in 4.3.1 and illustrated in Figs. 4.1 and 4.2. The definitions of their mean values are the following:

Mean up time (MUT): expectation of the up time (IEV 192-08-09).

Mean down time (MDT): expectation of the down time (IEV 192-08-10).

If the preventive maintenance is not considered, the item moves from up state to down state when a failure occurs (see Sect. 4.4.1) and moves from down state to up state when the corresponding fault is repaired. This is illustrated in Fig. 4.8 for a repaired item i considered over a period T: $UT_i^j$ is the up time elapsing before failure j occurs, and $DT_i^j$ the down time elapsing before the fault is repaired. It has to be noted that in Fig. 4.8 the last value of the up time, $UT_i^3$, is censored because the 3rd failure has not occurred yet.

From this figure, the MUT can be estimated as the average value of $UT_i^j$ and the MDT as the average value of $DT_i^j$. In real life, such estimations are not performed from the feedback of a single component but from the feedback of a sample of similar components in order to gather more data and achieve a more accurate estimation. The maximum likelihood estimate (see Chap. 38) is generally used for doing that. This consists in summing all the up or down times (including censored data) and dividing these sums by the number of observed failures. Therefore, for n similar items and k observed failures, the MUT can be estimated as:

$$\mathrm{MUT} \approx \widehat{\mathrm{MUT}}_k = \frac{\sum_{i=1}^{n} \sum_{j} UT_i^j}{k} \tag{4.1}$$

In this formula, index k has been used to indicate how many failures have been used to obtain the estimation $\widehat{\mathrm{MUT}}_k$. In the same way, the MDT can be estimated as:

$$\mathrm{MDT} \approx \widehat{\mathrm{MDT}}_k = \frac{\sum_{i=1}^{n} \sum_{j} DT_i^j}{k} \tag{4.2}$$

Fig. 4.8 Behaviour of a repaired item between up and down states (1st failure, 1st restoration, 2nd failure, 2nd restoration over the period T)
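The estimators (4.1) and (4.2) can be sketched in a few lines. The function name and the field data below are hypothetical and only serve as an illustration; as described in the text, censored up times are included in the sum while only observed failures are counted in k.

```python
from typing import Sequence


def estimate_mean_time(times_per_item: Sequence[Sequence[float]], n_failures: int) -> float:
    """Sum all recorded durations (censored ones included) and divide by k."""
    if n_failures <= 0:
        raise ValueError("at least one observed failure is needed")
    total = sum(sum(times) for times in times_per_item)
    return total / n_failures


# Hypothetical field data (hours) for n = 2 similar items.
up_times = [
    [1000.0, 1400.0, 600.0],  # item 1: the third up time is censored
    [1200.0, 800.0],          # item 2: the second up time is censored
]
down_times = [
    [10.0, 14.0],             # item 1: two completed restorations
    [12.0],                   # item 2: one completed restoration
]
k = 3  # observed failures: two on item 1, one on item 2

mut_hat = estimate_mean_time(up_times, k)
mdt_hat = estimate_mean_time(down_times, k)
print(f"MUT estimate: {mut_hat:.1f} h, MDT estimate: {mdt_hat:.1f} h")
```

Leaving the censored up times out of the sum would bias the MUT estimate downwards, which is why they are kept in the numerator.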

It has to be noted that, when the distribution of $UT_i^j$ is the same regardless of j ($UT_i^j \equiv UT_i$) and when the distribution of $DT_i^j$ is the same regardless of j ($DT_i^j \equiv DT_i$), the chronogram presented in Fig. 4.8 is the result of a renewal process (Cox 1962) for which MUT and MDT converge toward asymptotic values when T goes to infinity. It is in this case that these parameters find their main interest with regard to reliability analyses. This implies that the item is as-good-as-new after each restoration (i.e. the restoration/repair is perfect).

When the up time can be split between operating and non-operating time, it can be interesting to consider only the mean up time related to the operating time. This leads to:

$$UT_i^j = UT_{O,i}^j + UT_{nO,i}^j \tag{4.3}$$

where $UT_{O,i}^j$ is the accumulated operating time during the up-time interval j of item i and $UT_{nO,i}^j$ the accumulated non-operating time during the same interval.

MUT related to operating failures ($\mathrm{MUT}^O$): ratio of the sum of the accumulated operating time to failure to the number of operating failures.

$$\mathrm{MUT}^O = \frac{\sum_{i=1}^{n} \sum_{j} UT_{O,i}^j}{q} \tag{4.4}$$

where q ≤ k is the number of operating failures.

In the same way, the down time can be split between the part due to operating failures and the part due to non-operating failures:

$$DT_i^j = DT_{O,i}^j + DT_{nO,i}^j \tag{4.5}$$

where $DT_{O,i}^j$ is the accumulated down time due to operating failures during the down-time interval j of item i and $DT_{nO,i}^j$ the accumulated down time due to non-operating failures during the same interval.

MDT related to operating failures ($\mathrm{MDT}^O$): ratio of the sum of the accumulated down time due to operating failures to the number of operating failures.

$$\mathrm{MDT}^O = \frac{\sum_{i=1}^{n} \sum_{j} DT_{O,i}^j}{q} \tag{4.6}$$

where q ≤ k is the number of operating failures.


These concepts are useful with regards to the analysis of MTBF performed in Sect. 4.6.4.

4.6.3 MTTF and Related Acronyms

4.6.3.1 Problem with the Definitions

Whereas MUT and MDT are defined in similar ways in various standards or publications, many different definitions can be found for MTTF. This is surprising, as this is a core concept directly related to reliability calculations (see Sect. 4.7.4). Table 4.3 gives a sample of the MTTF definitions that can be found in the literature.

Table 4.3 Different definitions of the MTTF

No | Definition | References
1 | Expectation of the operating time to failure | IEV 192 (2015)
2 | Mean time to failure | IEC 61508-6 (2010)
3 | Expected value of the time to failure | Kumamoto and Henley (1996)
4 | Expected time before the item fails | ISO 14224 (2016), ISO/TR 12489 (2013)
5 | Average of the durations before failure | Pagès and Gondran (1986)
6 | Average time of failure | Tobias and Trindade (2012)
7 | Expected value for the time to failure, $\int_0^\infty R(t)\,dt$ | Kapur and Pecht (2014)
8 | $\int_0^\infty R(t)\,dt$ | Dhillon (2007)
9 | Average time when the component is in UP | Zio (2007)
10 | Total cumulative time observed to the total number of failures observed for non-repairable items | Smith (2017)
11 | Expected time to failure of non-repairable components | Modarres et al. (2017)
12 | Mean (expected) time to failure of non-repairable components | Rausand and Høyland (2009)
13 | Mean time till dangerous failure | Boulanger (2013)
14 | Mean time to first dangerous undetected failure | Rausand (2014)
15 | Average time before a device first failure | CCPS (2007)
16 | Expected time between two successive failures | Elsayed (2012)
17 | Expected time for system to degrade to the point that no maintenance, minor maintenance or major maintenance action is needed | Ossai (2019)
18 | Mean lifetime | Nakagawa (2011)
19 | Expected lifetime of the components | Cui and Lie (2007)
20 | Mean of the distribution of a product life | Okaro and Tao (2016)


From Table 4.3, it is evident that the standardized definition given in IEV 192 is not, at least not yet, commonly accepted. It is not even used in other standards, such as IEC 61508, ISO 14224 or ISO/TR 12489. Overall, the most common definition seems to be captured in lines 2–10 of the table. The wording is somewhat different but these lines provide definitions with more or less the same meaning: “mean time to failure”. It has to be noted that the definitions in lines 7 and 8 directly make the link between MTTF and the reliability function, which will be analysed in Sect. 4.7.4. Further, in lines 10–12 of the table, the MTTF is associated with non-repairable items and in lines 13 and 14 it is associated with dangerous failures. It can be questioned why the scope of the term needs to be restricted in this way. In lines 14 and 15 the MTTF is associated with first failures, although it is common to use the acronym MTTFF for that. In line 16 the MTTF is associated with successive failures and this is even more surprising, as the acronym MTBF is normally used for that. The definition suggested in line 17 is also a bit challenging, as it seems to redefine what a failure is. To finish this review, the definitions in lines 18–20 associate the MTTF with life duration, which is a commonly observed error. This is not an extensive review, but it gives a general idea of the existing confusion in the terminology related to this acronym, and some clarifications are needed to avoid misunderstandings.
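The link made in lines 7 and 8 of Table 4.3 between the MTTF and the reliability function can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes an exponential reliability function R(t) = exp(−λt) with a hypothetical failure rate, and approximates the integral with a plain trapezoidal rule truncated at a finite horizon. The function name is not from the book.

```python
import math
from typing import Callable


def mttf_from_reliability(reliability: Callable[[float], float],
                          horizon: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of the integral of R(t) from 0 to horizon."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


LAMBDA = 1e-3  # hypothetical constant failure rate, per hour

# The horizon is truncated at 20 / lambda, where R(t) is negligible.
mttf = mttf_from_reliability(lambda t: math.exp(-LAMBDA * t), horizon=20.0 / LAMBDA)
print(f"numerical MTTF: {mttf:.3f} h, 1/lambda: {1.0 / LAMBDA:.3f} h")
```

For this particular reliability function the integral has the closed form 1/λ, which gives a convenient check of the numerical result.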

4.6.3.2 Analysis of the Various Times to Failure

The MTTF is defined in IEV 192-05-11 (2015) as the expectation of the operating time to failure. This definition leads to two important remarks:
• It makes the difference between operating and non-operating times.
• But it does not make the difference between failures occurring during operating times and failures occurring in non-operating times.

Therefore, this definition mixes operating times with failures occurring during the operating times (left-hand side of Fig. 4.5) as well as failures occurring in the non-operating times (right-hand side of Fig. 4.5). This is not very consistent and is likely to lead to meaningless results (see formula 4.14 in Sect. 4.6.3.4). In fact, analysing Figs. 4.4 and 4.5 leads to the identification of 5 different times to failure:
(1) time to undifferentiated failures, TTF in Fig. 4.5 left and right;
(2) operating time to operating failures, $UT_O$ in Fig. 4.5 left;
(3) non-operating time to operating failures, $UT_{nO}$ in Fig. 4.5 left;
(4) operating time to non-operating failures, $UT_O$ in Fig. 4.5 right;
(5) non-operating time to non-operating failures, $UT_{nO}$ in Fig. 4.5 right.

This analysis is limited here to operating and non-operating failures but could be extended to every state where an item can fail. According to the nature of the up times and failures considered, various mean times to failure can be defined, and this is analysed in the following subsection in order to clarify the situation.

4.6.3.3 Classical MTTF

Figure 4.4 is in fact the envelope of Fig. 4.5 when the nature of the failures is not considered. In this case, the time to failure is equal to the up time and this is the ideal case directly linked to reliability calculations (see Sect. 4.7.4). This is also the case leading to the common definition of MTTF:

MTTF: expected time before the item fails (e.g. ISO 14224).

This parameter can be estimated from field feedback from the observation of a sample of n non-repaired items by cumulating the TTFs of the items and dividing by the number k of observed failures:

$$\mathrm{MTTF} \approx \widehat{\mathrm{MTTF}}_k = \frac{\sum_{i=1}^{n} TTF_i}{k} \tag{4.7}$$
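A minimal sketch of the estimator (4.7) is given below for a sample of non-repaired items. The `Observation` type and the figures are hypothetical. Following the approach described in Sect. 4.6.2, the sketch assumes that the observation time of items that have not yet failed (censored data) is included in the sum, while only actual failures are counted in k.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Observation:
    time: float   # observed up time of one non-repaired item (hours)
    failed: bool  # False means the item was still up when observation stopped


def estimate_mttf(sample: Sequence[Observation]) -> float:
    """Cumulated observed times divided by the number k of observed failures."""
    k = sum(1 for obs in sample if obs.failed)
    if k == 0:
        raise ValueError("no failure observed: the MTTF cannot be estimated")
    return sum(obs.time for obs in sample) / k


# Hypothetical sample of n = 4 items with k = 3 observed failures.
sample = [
    Observation(900.0, True),
    Observation(1500.0, True),
    Observation(2000.0, False),  # censored observation
    Observation(600.0, True),
]
mttf_hat = estimate_mttf(sample)
print(f"MTTF estimate: {mttf_hat:.1f} h")
```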

If repaired items are considered instead of non-repaired items, the MTTF is related to the first item failures and this is why it is named mean time to first failure:

MTTFF: expected time before a repaired item fails for the first time.

In this case, this is also the average of the up time before the first failure (see Fig. 4.8):

$$\mathrm{MTTFF} \approx \widehat{\mathrm{MTTFF}}_k = \frac{\sum_{i=1}^{n} UT_i^1}{k} \tag{4.8}$$

And now, under the assumption that the restoration is perfect (item as good as new after restoration), each restoration is a renewal point where the item is exactly in the same state as it was at time zero. When repair is effective, this assumption is realistic and makes it possible to increase the number of values available to estimate the MTTF, which is also the MTTFF and the MUT of the item. Under this assumption and as represented in Fig. 4.9, the chronogram of Fig. 4.8 provides three simple chronograms like the one illustrated in Fig. 4.4. Three values are then available to estimate the MTTF instead of a single one.

Fig. 4.9 Three sub-chronograms extracted from Fig. 4.8 under the assumption of as good as new after restoration

Finally, under the assumption of perfect restoration, the MTTF of n similar items with k observed failures can be estimated as:

$$\mathrm{MTTF} \equiv \mathrm{MTTFF} \equiv \mathrm{MUT} \approx \widehat{\mathrm{MUT}}_k = \frac{\sum_{i=1}^{n} \sum_{j} UT_i^j}{k} \tag{4.9}$$

4.6.3.4 Extended MTTF ($\mathrm{MTTF}^O$, $\mathrm{MTTF}^{nO}$, $\mathrm{MTTFF}^O$, $\mathrm{MTTFF}^{nO}$)

In Sect. 4.6.3.2, three different types of times to failure are identified (undifferentiated, operating and non-operating) and also three types of failures (undifferentiated, operating and non-operating failures). Therefore, this leads to 9 potential mean time estimations:
(1) mean time to undifferentiated failure (MTTF);
(2) mean time to operating failure;
(3) mean time to non-operating failure;
(4) mean operating time to undifferentiated failure;
(5) mean operating time to operating failure;
(6) mean operating time to non-operating failure;
(7) mean non-operating time to undifferentiated failure;
(8) mean non-operating time to operating failure;
(9) mean non-operating time to non-operating failure.

The first one has been analysed in Sect. 4.6.3.3: this is the classical MTTF. The other 8 can be sorted into those which combine operating times with non-operating failures and vice versa (2, 3, 4, 6, 7, 8) and those which are consistent with regard to the nature of failures and times (5 and 9). Mixing operating times with non-operating failures and vice versa seems of little interest and only the mean times 5 and 9 are retained:

Mean operating time to operating failure ($\mathrm{MTTF}^O$): expected operating time before an operating failure occurs.

Mean non-operating time to non-operating failure ($\mathrm{MTTF}^{nO}$): expected non-operating time before a non-operating failure occurs.

As the natures of times and failures are consistent, the acronyms have been simplified (i.e. $\mathrm{MTTF}^O$ instead of MOTTOF and $\mathrm{MTTF}^{nO}$ instead of MnOTTnOF). In addition, this is a reminder that the estimation of these parameters can be used to estimate failure rates in the same way as the MTTF. Figure 4.10 illustrates how the data relevant to operating times and operating failures can be extracted from Fig. 4.5. In the same way, Fig. 4.11 illustrates how the data relevant to non-operating times and non-operating failures can be extracted from Fig. 4.5.

Fig. 4.10 Sub-chronograms extracted from Fig. 4.5 for operating failures and operating times

Fig. 4.11 Sub-chronograms extracted from Fig. 4.5 for non-operating failures and non-operating times

It has to be noted that, according to the operation needs, the operating and non-operating times are not necessarily of a single piece but can be split into several intervals. Therefore, in Fig. 4.10, $UT_O$ can be seen as the sum of several operating times and $UT_{nO}$ as the sum of several non-operating times. This is illustrated in Fig. 4.12, which provides one operating time to operating failure and one censored non-operating time to non-operating failure.

Fig. 4.12 Example of gathering of operating and non-operating times of a single non-repaired component


Figures 4.10, 4.11 and 4.12 right are similar to Fig. 4.4 in the case of non-repaired items and therefore the corresponding times to failure $UT_O$ and $UT_{nO}$ can be used in the same way as the TTFs are used to estimate the MTTF. Let us consider that r operating failures and q non-operating failures have been observed on a sample of n similar non-repaired items. $\mathrm{MTTF}^O$ and $\mathrm{MTTF}^{nO}$ can be estimated as:

$$\mathrm{MTTF}^O \approx \widehat{\mathrm{MTTF}}^O_r = \frac{\sum_{i=1}^{n} UT_{O,i}}{r} \tag{4.10}$$

and

$$\mathrm{MTTF}^{nO} \approx \widehat{\mathrm{MTTF}}^{nO}_q = \frac{\sum_{i=1}^{n} UT_{nO,i}}{q} \tag{4.11}$$

For repaired items, the above formulae can be used to estimate the $\mathrm{MTTFF}^O$ related to the first operating failure and the $\mathrm{MTTFF}^{nO}$ related to the first non-operating failure. For repaired items and under the assumption of perfect restoration, the same calculations as those done above for the MTTF can be done for estimating $\mathrm{MTTFF}^O$ and $\mathrm{MTTFF}^{nO}$ and this leads to:

$$\mathrm{MTTFF}^O \equiv \mathrm{MTTF}^O \approx \widehat{\mathrm{MTTF}}^O_r = \frac{\sum_{i=1}^{n} \sum_{j} UT_{O,i}^j}{r} \tag{4.12}$$

and

$$\mathrm{MTTFF}^{nO} \equiv \mathrm{MTTF}^{nO} \approx \widehat{\mathrm{MTTF}}^{nO}_q = \frac{\sum_{i=1}^{n} \sum_{j} UT_{nO,i}^j}{q} \tag{4.13}$$

In IEV 192, the MTTF or MOTBF are defined as the expectation of the operating time to failure (IEV 192-05-11). This is equivalent to the mean operating time to undifferentiated failure, which is number 4 in the list given at the beginning of Sect. 4.6.3.4. Such a parameter should be estimated by considering the sum k = (r + q) of operating and non-operating failures:

$$\mathrm{MTTF}_{IEV} \approx \frac{\sum_{i=1}^{n} UT_{O,i}}{k} \tag{4.14}$$
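The difference between the estimators (4.10), (4.11) and (4.14) can be made concrete with a small sketch. The `ItemRecord` type, the function name and the data are hypothetical; the sample is made of non-repaired items, one of which is censored.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class ItemRecord:
    operating_time: float      # accumulated operating up time UT_O,i (hours)
    non_operating_time: float  # accumulated non-operating up time UT_nO,i (hours)
    failure: Optional[str]     # "operating", "non-operating" or None (censored)


def estimate_extended_mttf(records: Sequence[ItemRecord]) -> Dict[str, float]:
    r = sum(1 for rec in records if rec.failure == "operating")
    q = sum(1 for rec in records if rec.failure == "non-operating")
    if r == 0 or q == 0:
        raise ValueError("at least one failure of each kind is needed")
    total_operating = sum(rec.operating_time for rec in records)
    total_non_operating = sum(rec.non_operating_time for rec in records)
    return {
        "MTTF_O": total_operating / r,           # formula (4.10)
        "MTTF_nO": total_non_operating / q,      # formula (4.11)
        "MTTF_IEV": total_operating / (r + q),   # formula (4.14)
    }


# Hypothetical sample: r = 2 operating failures, q = 1 non-operating failure.
records = [
    ItemRecord(800.0, 200.0, "operating"),
    ItemRecord(500.0, 1500.0, "non-operating"),
    ItemRecord(700.0, 300.0, "operating"),
    ItemRecord(1000.0, 1000.0, None),
]
estimates = estimate_extended_mttf(records)
print(estimates)
```

With such data the value obtained from (4.14) differs from the one obtained from (4.10) only because non-operating failures are counted in the denominator while non-operating times are left out of the numerator.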

As said above, this parameter, which mixes operating times with non-operating failures, seems of little interest and $\mathrm{MTTF}^O$ should be used instead.

The same problem arises with the MTTFF defined in IEV 192 as the expectation of the operating time to first failure (IEV 192-05-12) without specification of the nature of this first failure. Again, this parameter, which mixes operating times with non-operating failures, seems of little interest and $\mathrm{MTTFF}^O$ should be used instead.

The analysis performed above on operating and non-operating failures can be extended, on the same principle, to any kind of failure. For example, ISO 14224 has defined, for the needs of data collection and exchange, the detailed subdivision of operating times presented in Fig. 4.13. According to this figure, the hot standby state belongs to the operating state and this is different from IEV 192, which considers that the standby state is a part of the non-operating state. The ISO standard is right, as the hot standby state is close to the running state which also appears in Fig. 4.13. For running and hot standby states, this leads to the definition of the following mean times:
– mean running time to running failure: $\mathrm{MTTF}^{Ru}$;
– mean hot standby time to hot standby failure: $\mathrm{MTTF}^{HSb}$.

Those parameters can be estimated by cumulating the running times (respectively, hot standby times) and dividing by the number of observed running failures (respectively, hot standby failures). Of course, the same could be done for other states and also for the first failure.

Fig. 4.13 Example of detailed operating times from ISO 14224 (labels: up time, operating time, non-operating time; start-up, running, run down, hot standby, cold standby, idle)

4.6.3.5 Case of Hidden Failures ($\mathrm{MTTF}^H$ and γ)

In the previous subsection, it is implicitly assumed that the times to failure are known and this implies that the dates of occurrence of the failures are known. This is true only in the case of failures which are revealed as soon as they occur. For failures which are never revealed (except when a detrimental event occurs due to them), it is not possible to gather field feedback, the formulae developed above cannot be used and the estimations of the mean times to failure are not possible. However, between the immediately revealed failures and the never revealed failures, there is room for the failures revealed by periodic tests (i.e. by preventive maintenance). This is typically the case for the dangerous undetected failures (see Chap. 36) of safety systems which are operated in standby position.

Figure 4.14 gives an example of a hidden failure which is revealed by periodic tests performed at regular intervals of duration τ. The item is operating during $UT_i$ and the failure remains undetected until the next test is performed at $\Theta = \nu \cdot \tau$, where ν is the number of tests performed since t = 0. Therefore, the failure remains hidden during a duration equal to $\nu \cdot \tau - UT_i$. It has to be noted that the failure can also be revealed when an actual demand for a safety action occurs, but this normally arises far less frequently than the tests.

Fig. 4.14 Example of periodically tested hidden failure

From a field feedback point of view, when such a hidden failure is detected, its date of occurrence is unknown: it is just possible to say that it has occurred within the interval $[(\nu - 1) \cdot \tau,\ \nu \cdot \tau]$. Therefore, for a given parameter, lower and upper bounds can be obtained by replacing $UT_i$ by $(\nu - 1) \cdot \tau$ and by $\Theta_i$, respectively, in the equations established in the previous sections (e.g. in Sect. 4.6.3.3). However, under the assumption of constant failure rate, it can be demonstrated that, when a failure is detected, it has occurred, on average, half a test interval earlier (see Chap. 36). Then, the average time spent to detect the failure can be estimated as τ/2 and $UT_i$ can be estimated as $UT_i \approx \Theta_i - \tau/2$.

Proceeding in this way makes it possible to estimate, for example, the mean standby time to hidden failures ($\mathrm{MTTF}^H$) or the mean standby time to first hidden failures ($\mathrm{MTTFF}^H$). When the hidden failure is a dangerous undetected failure, this leads to the mean standby time to dangerous undetected failures ($\mathrm{MTTF}^{DU}$) or the mean standby time to first dangerous undetected failures ($\mathrm{MTTFF}^{DU}$).

Nevertheless, it has to be noted that when a failure is observed when a test is performed, it can be the result of:
– a failure which occurred before the test, as shown in Fig. 4.14;
– a failure due to the test itself, as shown in Fig. 4.15.

Fig. 4.15 Example of periodically tested hidden failures (failure due to test)

Such failures are clearly different in nature: the failures are time-dependent in Fig. 4.14, whereas they are demand-dependent in Fig. 4.15. Therefore, these failures cannot be combined, and they should be analysed separately. It is the task of the analyst, when a failure is collected, to sort out whether it is time-dependent or demand-dependent.

Figure 4.15 illustrates a failure occurring due to a test demanding a standby item to start in order to verify that it is in good operating condition. For such failures due to demands, the number $d_i$ of demands can be estimated as:
– $d_i = \mathrm{INT}(UT_i/\tau)$ when a failure due to a demand is detected over the period of observation T (see Fig. 4.15);
– $d_i \approx \mathrm{INT}(T/\tau)$ when no failure is detected over the period of observation T;
– $d_i = \mathrm{INT}(\Theta_i/\tau - 1)$ when a time-dependent failure is detected over the period of observation T.

Note: The function INT(x) takes the integer part of x.

Then, if k failures due to demand have been observed, the probability, γ, of such failures can be estimated as:

$$\gamma \approx \hat{\gamma}_k = \frac{k}{\sum_{i=1}^{n} d_i} \tag{4.15}$$
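The treatment of periodically tested items described in this subsection can be sketched as follows. The sketch assumes that τ denotes the test interval and that a time-dependent failure is detected at the test time Θ; the function names and the data are hypothetical. The first function returns the bounds and the mid-interval estimate of the up time of a hidden failure, the second one counts the demands seen by an item according to the three cases listed above, and the last lines apply formula (4.15).

```python
from typing import Optional, Tuple


def hidden_failure_up_time(detection_time: float,
                           test_interval: float) -> Tuple[float, float, float]:
    """Lower bound, mid-interval estimate and upper bound of the up time."""
    lower = detection_time - test_interval           # failure just after the previous test
    estimate = detection_time - test_interval / 2.0  # constant failure rate assumption
    return lower, estimate, detection_time


def demand_count(test_interval: float, period: float,
                 kind: Optional[str] = None, event_time: float = 0.0) -> int:
    """Number of test demands d_i seen by one item over the observation period."""
    if kind is None:        # no failure detected over the period T
        return int(period / test_interval)
    if kind == "demand":    # failure caused by the test itself, at time UT_i
        return int(event_time / test_interval)
    if kind == "time":      # time-dependent failure revealed by the test at Theta_i
        return int(event_time / test_interval - 1)
    raise ValueError(f"unknown failure kind: {kind}")


TAU, T = 1000.0, 10000.0  # hypothetical test interval and observation period (hours)
# One tuple per item: (kind of failure or None, time of the event).
items = [(None, 0.0), ("demand", 4000.0), ("time", 7000.0), ("demand", 2000.0)]

demands = [demand_count(TAU, T, kind, when) for kind, when in items]
k = sum(1 for kind, _ in items if kind == "demand")
gamma_hat = k / sum(demands)  # formula (4.15)

print(hidden_failure_up_time(7000.0, TAU))
print(demands, f"gamma estimate: {gamma_hat:.4f}")
```

Keeping the two kinds of failure in separate branches mirrors the recommendation that time-dependent and demand-dependent failures be analysed separately.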

4.6.4 MTBF

4.6.4.1 Classical MTBF and Problem with the Definition

The original intent of the MTBF is to estimate the mean time elapsing between successive failures of repaired items and this leads to the following definition:

Mean time between failures (MTBF): expected time between successive failures of a repairable item (ISO/TR 12489, 3.1.10).

According to Fig. 4.16, a similar parameter can also be introduced:

Mean time between restorations (MTBRes): expected time between successive restorations of a repairable item (ISO/TR 12489, 3.1.10).

Fig. 4.16 Example of times between failures and times between restorations (failures F1 to F5, restorations R1 to R5)

According to Fig. 4.16, the time between the successive failures j and j + 1 of component i is equal to:

$$TBF_i^{j,j+1} = DT_i^j + UT_i^{j+1} \tag{4.16}$$

And the time between the successive restorations of failures j and j + 1 of component i is equal to:

$$TBR_i^j = UT_i^j + DT_i^j \tag{4.17}$$

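Therefore, for n similar items and k observed failures, the MTBF can be estimated as:

$$\mathrm{MTBF} \approx \widehat{\mathrm{MTBF}} = \frac{\sum_{i=1}^{n} \sum_{j} TBF_i^{j,j+1}}{k} = \frac{\sum_{i=1}^{n} \sum_{j} DT_i^j}{k} + \frac{\sum_{i=1}^{n} \sum_{j} UT_i^{j+1}}{k} \tag{4.18}$$

In the same way, the mean time between restorations, MTBRes, can be estimated as:

$$\mathrm{MTBRes} \approx \widehat{\mathrm{MTBRes}} = \frac{\sum_{i=1}^{n} \sum_{j} TBR_i^j}{k} = \frac{\sum_{i=1}^{n} \sum_{j} UT_i^j}{k} + \frac{\sum_{i=1}^{n} \sum_{j} DT_i^j}{k} \tag{4.19}$$

When the sample of observed data is large, the MTBF and MTBRes converge toward the same value, which is the sum of the mean up time and the mean down time:

$$\mathrm{MTBF} \approx \mathrm{MTBRes} \approx \mathrm{MUT} + \mathrm{MDT} \tag{4.20}$$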
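The relationship (4.20) can be illustrated on a hypothetical chronogram of a single repaired item. In this sketch the function names and the data are illustrative, and the averages are taken over the number of complete intervals actually available rather than over k as in (4.18) and (4.19); this is a simplification which makes no difference for large samples.

```python
from statistics import mean
from typing import List, Sequence


def times_between_failures(up: Sequence[float], down: Sequence[float]) -> List[float]:
    """TBF between failures j and j+1: down time j plus up time j+1, as in (4.16)."""
    return [down[j] + up[j + 1] for j in range(min(len(down), len(up) - 1))]


def times_between_restorations(up: Sequence[float], down: Sequence[float]) -> List[float]:
    """TBR: up time j plus down time j, as in (4.17)."""
    return [u + d for u, d in zip(up, down)]


# Hypothetical chronogram (hours): four failures, four completed restorations.
up_times = [100.0, 120.0, 80.0, 100.0]
down_times = [10.0, 8.0, 12.0, 10.0]

mtbf_hat = mean(times_between_failures(up_times, down_times))
mtbres_hat = mean(times_between_restorations(up_times, down_times))
mut_plus_mdt = mean(up_times) + mean(down_times)

print(mtbf_hat, mtbres_hat, mut_plus_mdt)
```

The data were chosen so that the three quantities coincide exactly; with real field data they would only be close to each other.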

This formula is the traditional presentation of the MTBF. The MTBF is defined in a different way in IEV 192-05-13:

$\mathrm{MTBF}_{IEV}$ or $\mathrm{MOTBF}_{IEV}$: expectation of the duration of the operating time between failures.

This definition leads to the same remarks as those already made for the $\mathrm{MTTF}_{IEV}$:
• It makes the difference between operating and non-operating times.
• But it does not make the difference between failures occurring during operating times and failures occurring in non-operating times.

Fig. 4.17 Example of detailed times between failures

Applying this definition needs a more detailed representation of the up time: this has been done in Fig. 4.17, where the up times have been detailed into operating and non-operating times. From its definition and from Fig. 4.17, the $\mathrm{MOTBF}_{IEV}$ is then the average value of the $UT_{O,i}^j$. It should be estimated as the cumulated operating times divided by the number k of observed failures:

$$\mathrm{MOTBF}_{IEV} \approx \frac{\sum_{i=1}^{n} \sum_{j} UT_{O,i}^j}{k} \tag{4.21}$$

This leads to several issues: – k containing both operating and non-operating failures, the physical meaning of this definition is difficult to grasp and the usefulness of the above definitions seems questionable. – If the number q of operating failures is used instead of k, the above formula becomes meaningful but becomes similar to this of the mean operating time to operating failure (MUTO or MTTFO ). Therefore, the MOTBFIEV is not really a “time between failures”. The MTBFO introduced hereafter in Sect. 4.6.4.2 should be used instead. In Fig. 4.18, the times between operating failures have been illustrated. They include non-operating times and non-operating restoration times. Therefore, the average value of the time between operating failures seems of very little interest. ,

Fig. 4.18 Example of times between operating failures and non-operating failures (operating failures OF1 to OF3 and non-operating failure nOF1 along the time axis T)

In the same way, the times between successive non-operating failures include operating times and operating restoration times and, therefore, the average value of the time between non-operating failures seems also of little interest.

4.6.4.2 MTBF Related to a Specific Failure (MTBFO, MTBFnO)

The next idea is to analyse the mean operating time between operating failures or the mean non-operating time between non-operating failures. Unfortunately, they suffer from the problem raised for MTBFIEV: they are similar to some kinds of MTTFs or MUTs but are definitely not “times between failures”. The only way to go further is to extract from Fig. 4.18 the part relevant to the operating times and the part relevant to the non-operating times. This is done in Fig. 4.19, where the parts of Fig. 4.18 specifically related to operating failures have been filtered to include only operating times, operating failures and restorations of operating failures. This figure is similar to Fig. 4.16 and makes it possible to define an MTBF and an MTBR specific to operating failures. Then, if q operating failures have been observed, formulae similar to 4.18 and 4.19 can be obtained:

$$\mathrm{MTBF}_{O} \approx \mathrm{MTBR}_{O} \approx \frac{\sum_{i=1}^{n} \sum_{j} UT_{O,i}^{j}}{q} + \frac{\sum_{i=1}^{n} \sum_{j} DT_{O,i}^{j}}{q} \tag{4.22}$$

$$\mathrm{MTBF}_{O} \approx \mathrm{MTBR}_{O} \approx \mathrm{MUT}_{O} + \mathrm{MDT}_{O} \tag{4.23}$$

This leads to the definition of the MTBFO : MTBF related to operating failures (MTBFO ): ratio of the sum of the cumulated operating time to failure and the cumulated restoration time of operating failures, by the number of operating failures. Then, like the traditional MTBF, the MTBFO is the sum of the MUTO (part of the up time related to operating failures) and of the MDTO (part of the down time related to operating failures).

Fig. 4.19 Example of times between failures and between restorations specific to operating failures (operating failures OF1 to OF3 and their restorations OR1 to OR4)

Fig. 4.20 Example of times between failures and between restorations specific to non-operating failures (non-operating failures NOF1 and NOF2 and their restorations NOR1 to NOR3)

In Fig. 4.20, the relevant part of Fig. 4.18 related to non-operating failures has been filtered to include only non-operating times, non-operating failures and restorations of non-operating failures. By analogy with the operating failures, when r non-operating failures are observed, this leads to:

$$\mathrm{MTBF}_{nO} \approx \mathrm{MTBR}_{nO} \approx \frac{\sum_{i=1}^{n} \sum_{j} UT_{nO,i}^{j}}{r} + \frac{\sum_{i=1}^{n} \sum_{j} DT_{nO,i}^{j}}{r} \tag{4.24}$$

$$\mathrm{MTBF}_{nO} \approx \mathrm{MTBR}_{nO} \approx \mathrm{MUT}_{nO} + \mathrm{MDT}_{nO} \tag{4.25}$$

And to the following definition: MTBF related to non-operating failures (MTBFnO ): ratio of the sum of the cumulated non-operating time to failure and the cumulated restoration time of non-operating failures, by the number of non-operating failures. Then, like the traditional MTBF, the MTBFnO is the sum of the MUTnO (part of the up time related to non-operating failures) and of the MDTnO (part of the down time related to non-operating failures). The above analysis has been done for operating and non-operating failures but, exactly in the same way, it can be extended to any kind of failure, e.g. MTBFRu for running failures, MTBFHSb for standby failures or MTBFH for hidden failures.
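The estimation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mtbf_specific` and the field data are hypothetical, and the inputs are assumed to be already filtered for one failure category (as done in Figs. 4.19 and 4.20).

```python
def mtbf_specific(up_times, down_times, n_failures):
    """Estimate (MUT, MDT, MTBF) for one failure category, e.g. operating failures.

    up_times:   durations spent up that are relevant to this category (hours)
    down_times: restoration durations of the failures of this category (hours)
    n_failures: number of observed failures of this category (q or r in the text)
    """
    if n_failures <= 0:
        raise ValueError("at least one observed failure is needed")
    mut = sum(up_times) / n_failures    # cumulated up time divided by the number of failures
    mdt = sum(down_times) / n_failures  # cumulated down time divided by the number of failures
    return mut, mdt, mut + mdt


# Hypothetical field feedback for operating failures (q = 3)
operating_up_times = [1200.0, 800.0, 1500.0, 500.0]   # several operating periods may precede one failure
operating_down_times = [10.0, 30.0, 20.0]
mut_o, mdt_o, mtbf_o = mtbf_specific(operating_up_times, operating_down_times, n_failures=3)
print(f"MUT_O = {mut_o:.1f} h, MDT_O = {mdt_o:.1f} h, MTBF_O = {mtbf_o:.1f} h")
```

The same function can be reused for non-operating, running, standby or hidden failures, provided that the times passed to it have been filtered accordingly.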


4.6.5 Maintenance Related Acronyms (MTTR, MRT, MFDT…)

In Sect. 4.5.1, the concepts related to maintenance have been introduced and Fig. 4.2 shows the acronyms related to most of these concepts. From this figure it is obvious that different acronyms can be found when various standards are considered:
– The acronyms MTTR and MTTRes are used in IEV 192 IEC 60050-192 (2015) and ISO/TR 12489 (2013) for the same thing: “mean time to restoration”.
– The acronym MRT is used for different things in IEV 192 and IEC 61508-4 (2010).
– The acronyms MRT and MART are used in IEV 192 and ISO/TR 12489 for the same thing: “mean (effective/active) repair time”.

This is confusing, mainly for MTTR, which has been used for decades and is still currently used with the meaning of “mean time to repair”. The problem comes from the use of the letter “R” for two different meanings: restoration and repair. As this letter means “repair” in the other acronyms mentioned above, it makes sense to keep it for this meaning and to use the acronym MTTRes, which is not confusing. According to Fig. 4.2 in Sect. 4.3.3, the “repair time” defined in IEV 192 belongs to the active corrective maintenance time and it makes sense to follow ISO/TR 12489 and to use MART instead of MRT. In IEC 61508-4, the MRT is defined as the “expected overall repair time” and it therefore makes sense to use the acronym MORT instead. Finally, the following acronyms should be retained:
– Mean time to restoration (MTTRes): expectation of the time to restoration;
– Mean active repair time (MART): expected active repair time;
– Mean overall repair time (MORT): expected overall repair time.

In Fig. 4.2, three other acronyms are mentioned which do not seem to pose any problem:
– Mean fault detection time (MFDT): expected time needed to detect a fault (ISO/TR 12489, 3.1.35);
– Mean administrative delay (MAD): expectation of the administrative delay (IEV 192-07-26);
– Mean logistic delay (MLD): expectation of the logistic delay (IEV 192-07-27).
From the above analysis, the following important formula can be highlighted:

$$\mathrm{MTTRes} = \mathrm{MFDT} + \mathrm{MORT} \tag{4.26}$$

This formula shows that the restoration of an item failure is split between two periods different in nature: before it is revealed and after it is revealed.


4.7 Probabilistic Concepts

4.7.1 Introduction to Random Processes

As said in Chap. 1, the philosophy behind reliability analyses is to use the existing knowledge to predict what can happen in order to make improvements as far as possible. The term “predict” implies estimating, in one way or another, the chances that the future will be detrimental or beneficial. Engineering judgment based on previous experience can be used first for doing that, but for complex systems this has to be completed and consolidated by building models and performing probabilistic calculations on these models. Probability theory is a part of mathematics, and this is why the concepts handled when performing such calculations have sound mathematical definitions. These definitions should be properly understood by the analysts using them, in order to be aware of the underlying limitations and approximations which are likely to appear due to the difference between the ideal world of mathematics and the real physical world in which the systems are actually operating.

Talking about probability is talking about the chances of seeing something happen (or not happen) like, for example, a failure which can occur (or not) during a given period. The terms used to qualify such phenomena governed by “chance” are random, aleatory or stochastic, which are synonymous (at least for the purpose of this book). The processes behind these phenomena are called random or stochastic processes and the variables related to these processes are called random variables:

Random variable: variable whose possible values are numerical outcomes of a phenomenon governed by chance. There are two types of random variables, discrete and continuous.

Random or stochastic process: collection of random variables.

An example of a simple random process is the tossing of a coin: two outcomes, heads or tails, are possible and the underlying random variable is a discrete random variable. If the coin is not biased, then the probability of heads or tails is the same and equal to 1/2 = 0.5. Nevertheless, this is not exactly true as the coin can land on its edge with a low probability! This is a so-called rare event. Another example of a simple random process is the throwing of a die. When the die is not loaded, the underlying discrete random variable can take one of 6 values (1–6), each with the same probability of 1/6. Card games also provide endless examples of random processes and discrete random variables. For discrete random variables, each outcome is characterized by a constant probability value.

The occurrence of a failure is different from the examples above as the underlying random variable is not discrete but continuous. This underlying continuous random variable is the time to failure (TTF), which has already been introduced in Sect. 4.4.2. In this case, TTF varies continuously in the interval [0, ∞) and the probability that a value is observed between t and t + dt is f (t) · dt, where f (t) is a probabilistic function called failure density. For continuous random variables, the outcomes are characterized by probabilistic distributions.

4.7.2 Basic Random Process

The simplest random process encountered when dealing with reliability calculations is the process describing how an item can fail. This is illustrated in Fig. 4.21, where a state-transition diagram is represented at the top and a possible outcome (trajectory of the random process) at the bottom. In a state-transition model, the states are represented by circles and the transitions between states by arrows. Markov graphs (Chap. 31) and Petri nets (Chap. 33) are the main state-transition models described in this book.

As said in Sect. 4.4.1, a failure can occur from all the item states. Therefore, the state before the failure occurs is named “OK” and the faulty state reached after it has occurred is named “KO”. This keeps what is said in this subsection general. Anyway, in most cases, “OK” is an up state and “KO” a down state. In this model, the failure occurs when the item moves from OK to KO. The item is not repaired, so it stays in KO ad vitam aeternam after it has failed.

Figure 4.22 illustrates the state-transition model related to a repaired item. In this case, the item is restored and moves from KO to OK when the restoration occurs. In this model the time to restoration (TTR) is another random variable, and the random process represented in this figure comprises the two random variables TTF and TTR. The random processes presented in Figs. 4.21 and 4.22 are very important as they constitute the basis for the remaining part of this subsection.

Fig. 4.21 Random process related to a failure occurrence (top: state-transition diagram from OK to KO through the failure; bottom: trajectory over time T, with the failure as an event and the fault as a state)

Fig. 4.22 Random process related to a repaired item (top: state-transition diagram between OK and KO through failure and restoration; bottom: trajectory over time T, with the 1st and 2nd failures and restorations)
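A trajectory of the random process related to a repaired item can be sketched with a short simulation. This is only a minimal illustration: the function name `simulate_repaired_item`, the exponential distributions and the mean values of 1000 h and 24 h are assumptions made for the example, not data from this book.

```python
import random


def simulate_repaired_item(sample_ttf, sample_ttr, horizon):
    """Return one trajectory as a list of (time, event) pairs over [0, horizon].

    The item starts in the OK state; sample_ttf and sample_ttr are functions
    drawing one time to failure and one time to restoration, respectively.
    """
    t, events = 0.0, []
    while True:
        t += sample_ttf()              # sojourn in OK until the next failure
        if t >= horizon:
            break
        events.append((t, "failure"))
        t += sample_ttr()              # sojourn in KO until the restoration
        if t >= horizon:
            break
        events.append((t, "restoration"))
    return events


rng = random.Random(42)  # fixed seed so that the trajectory is reproducible
trajectory = simulate_repaired_item(
    sample_ttf=lambda: rng.expovariate(1 / 1000.0),  # assumed mean TTF: 1000 h
    sample_ttr=lambda: rng.expovariate(1 / 24.0),    # assumed mean TTR: 24 h
    horizon=10_000.0,
)
for time, event in trajectory[:4]:
    print(f"{time:8.1f} h  {event}")
```

Removing the restoration step (i.e. stopping at the first failure) gives the trajectory of the non-repaired item of Fig. 4.21.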

4.7.3 (Un)Reliability Versus (Un)Availability

4.7.3.1 Reliability Versus Instantaneous Availability

One of the main parameters calculated within reliability analyses is, of course, the reliability of the item under study. This sentence highlights the polysemy of the term “reliability”, which is used as a domain of knowledge (as has been done until now in reliability assessment or reliability analysis), as an ability (to perform as required over a given time interval) and also as a mathematical concept with a sound definition:

Reliability, R(t): probability of performing as required for a given time interval, under given conditions.

This definition is sometimes mixed up with the definition of the instantaneous availability, which is the following:

Instantaneous availability, A(t): probability of performing as required at a given instant.

Fig. 4.23 Reliability and availability of an item at time t (top: reliability chronogram; bottom: availability chronogram; both between the OK and KO states)

The difference is illustrated in Fig. 4.23, which provides outcomes of random processes related to a non-repaired item at the top and to a repaired item at the bottom:
– At the top, the item has continuously been in the OK state until time t and this is what is required in the reliability definition.
– At the bottom, the item is available at time t but it has had one failure before, which has been repaired.

Whereas no failure is allowed before t when reliability is considered, earlier failures do not matter for availability calculations, provided that they have been repaired before t. This makes a big difference between the two parameters. In Fig. 4.23, a given failure is considered, and the item is reliable or available with regard to this specific failure. From a probabilistic point of view, the reliability, R(t), of the item over the interval [0, t] can be written as:

$$R(t) = \Pr(\text{OK over } [0, t]) = \Pr(\mathrm{TTF} > t) \tag{4.27}$$

And the instantaneous availability as:

$$A(t) = \Pr(\text{OK at } t) \tag{4.28}$$

It has to be noted that:
– For a non-repaired item, R(t) ≡ A(t) because, if the item is available at t, it has necessarily been continuously in the OK state until t.
– For a repaired item, R(t) is the probability that the 1st failure has not yet occurred at t.

Figure 4.24 gives a typical example of the evolution of reliability and instantaneous availability as time elapses:
– On the left-hand side, the reliability is a decreasing (or rather non-increasing) function which tends to 0 when time goes to infinity. This is normal as the failure is not repaired when it has occurred.
– On the left-hand side, the instantaneous availability is related to a repaired item with immediately revealed failures and as good as new after restoration. It rather quickly reaches an asymptotic value different from zero: the faster the restoration, the faster the asymptotic value is reached.
– On the right-hand side, the instantaneous availability is related to a periodically tested item as good as new after restoration. It does not reach an asymptotic value but an asymptotic shape of the curve.

Fig. 4.24 Simple example of reliability and availability curves (left: reliability R(t) and availability A(t) versus time; right: availability A(t) of a periodically tested item versus time, with the tests marked on the time axis; vertical axes from 0 to 1)

The above examples are simple and in the general case the availability does not necessarily reach an asymptotic value nor an asymptotic shape. Figure 4.25 shows a more general example mixing a periodically tested failure with a non-repaired failure.

Fig. 4.25 Availability curve of an item mixing a non-repaired failure with periodically tested failures (availability A(t) from 0 to 1 versus time, with the tests marked on the time axis)

It has to be noted that the concept of availability can be extended to production systems with several production levels. This leads to the concept of production availability described in Sect. 6.1.
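The three typical curves discussed above can be reproduced numerically in a simple case. The sketch below is only an illustration under strong assumptions which are not made in the text: constant failure rate `lam`, constant restoration rate `mu`, item OK at time 0 and, for the periodically tested item, instantaneous tests and restorations every `tau` hours. All names and numerical values are hypothetical.

```python
import math


def reliability(t, lam):
    """R(t) of a non-repaired item with an assumed constant failure rate lam."""
    return math.exp(-lam * t)


def availability_repaired(t, lam, mu):
    """A(t) of a repaired item with immediately revealed failures.

    Two-state model (OK/KO) with assumed constant failure rate lam and
    constant restoration rate mu, the item being OK at time 0.
    """
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def availability_tested(t, lam, tau):
    """A(t) of a periodically tested item, as good as new after each test.

    Assumes instantaneous tests and restorations every tau hours, so only
    the time elapsed since the last test matters.
    """
    return math.exp(-lam * (t % tau))


lam, mu, tau = 1.0e-4, 1.0 / 24.0, 4380.0  # hypothetical values (per hour, per hour, hours)
for t in (0.0, 100.0, 1000.0, 5000.0, 10000.0):
    print(f"t = {t:7.0f} h   R = {reliability(t, lam):.4f}   "
          f"A_repaired = {availability_repaired(t, lam, mu):.4f}   "
          f"A_tested = {availability_tested(t, lam, tau):.4f}")
```

With these assumptions, the reliability tends to 0, the availability of the repaired item settles at mu / (lam + mu), and the availability of the tested item repeats the same saw-tooth pattern between tests.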

4.7.3.2 Unreliability Versus Instantaneous Unavailability

In fact, when dealing with failures, it is rather the complementary probabilities which are of interest:

Unreliability, F(t): probability of not performing as required for a given time interval, under given conditions.

Instantaneous unavailability, U(t): probability of not performing as required at a given instant.

This is illustrated in Fig. 4.26 for a given failure and a time t. From their definitions, unreliability and instantaneous unavailability are the complementary probabilities of reliability and instantaneous availability:

$$F(t) = 1 - R(t) = \Pr(\mathrm{TTF} \le t) \tag{4.29}$$

$$U(t) = 1 - A(t) = \Pr(\text{KO at } t) \tag{4.30}$$

These parameters should not be mixed up:
– the unreliability is the probability that the first failure occurs before time t;
– the unavailability is just the probability that the item is faulty at time t, no matter whether this is for the first time or not.

Fig. 4.26 Unreliability and unavailability of an item at time t (top: reliability chronogram; bottom: availability chronogram; both between the OK and KO states)

It has to be noted that, for a non-repaired item, F(t) ≡ U(t) because the item failure is necessarily its first failure. Figure 4.27 gives typical examples of the evolution of unreliability and instantaneous unavailability as time elapses:
– On the left-hand side, the unreliability is an increasing (or rather non-decreasing) function which tends to 1 when time goes to infinity. This is normal as the failure is not repaired when it has occurred and it is sure that it occurs when the time is long enough.
– On the left-hand side, the instantaneous unavailability is related to a repaired item with immediately revealed failures and as good as new after restoration. It rather quickly reaches an asymptotic value different from 1: the faster the restoration, the faster the asymptotic value is reached.
– On the right-hand side, the instantaneous unavailability is related to a periodically tested item as good as new after restoration. It does not reach an asymptotic value but an asymptotic shape of the curve.

Fig. 4.27 Example of unreliability and unavailability curves (left: unreliability F(t) and unavailability U(t) versus time; right: unavailability U(t) of a periodically tested item versus time, with the tests marked on the time axis; vertical axes from 0 to 1)

The above examples are simple and in the general case the unavailability does not necessarily reach an asymptotic value nor an asymptotic shape. Figure 4.28 shows a more general example mixing a periodically tested failure with a non-repaired failure.

Fig. 4.28 Unavailability curve of an item mixing a non-repaired failure with periodically tested failures (unavailability U(t) from 0 to 1 versus time, with the tests marked on the time axis)

4.7.4 Failure Distribution and Link with MTTF

In Sect. 4.7.3.2, the unreliability has been defined as F(t) = Pr(TTF ≤ t). According to the theory of probability, this is precisely the definition of the cumulative distribution function (CDF) of the random variable TTF:

Unreliability, F(t): cumulative distribution function of the item failure.

This is an important result because it makes the link between the failure distribution and the item reliability, R(t), and also because the CDF is the key to enter the Monte Carlo simulation world (see Chap. 32). Therefore F(t) can be directly used to simulate random failures and, among the functions analysed above, it is the only one to have this property.

When the CDF is known, it is easy to calculate the failure density, f (t), which is its derivative. Then, f (t) · dt = Pr(t < TTF ≤ t + dt) is the probability that the random variable TTF lies between t and t + dt, and f (t) is simply obtained by differentiating F(t):

Failure density: probability density function of the time to failure.

$$f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt} \tag{4.31}$$

From the failure density, the unreliability and the reliability can be calculated in turn by integration:

$$F(t) = 1 - R(t) = \int_0^t f(\tau)\, d\tau \tag{4.32}$$

As f (t) is the probability density of the TTF, it can be used directly to calculate the expected value of this random variable:

$$\mathrm{MTTF} = \int_0^\infty t \cdot f(t)\, dt$$

Then, this formula can be integrated by parts, using f (t) · dt = −dR(t), and this gives:

$$\mathrm{MTTF} = \int_0^\infty R(t)\, dt - \big[t \cdot R(t)\big]_0^\infty$$

The bracketed term vanishes at 0 and, provided that the MTTF is finite, also at infinity, and finally:

$$\mathrm{MTTF} = \int_0^\infty R(t)\, dt \tag{4.33}$$

This result highlights an important general property: the MTTF is directly linked to the reliability function R(t) and this property is independent of the form of the reliability function. Therefore, it is possible to make the link between the theoretical calculations of the MTTF and its estimation from data collected on items in operation (field feedback). Of course, this MTTF is related to the specific failure which has been considered in the above mathematical development. If operating failures are considered, this will lead to the reliability RO (t) of the item with regard to operating failures and to the corresponding MTTFO (see Sect. 4.6.3.4). If non-operating failures are considered, this will lead to RnO (t) and to MTTFnO, etc. For an item subject to both operating and non-operating failures, the estimations of MTTFO and MTTFnO proposed in Sect. 4.6.3.4 are consistent with the above theoretical results only if these failures are independent. If they are not independent, the estimation from field feedback will take the dependencies into account but this is not the case for the above theoretical calculations.
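Both properties discussed in this subsection (the use of the CDF to simulate random failures and the link between the MTTF and the integral of R(t)) can be checked numerically. The sketch below is only an illustration: the Weibull distribution, its parameters and all function names are assumptions chosen for the example, the Weibull law being used because its CDF can be inverted analytically.

```python
import math
import random


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)**shape) for an assumed Weibull time to failure."""
    return math.exp(-((t / scale) ** shape))


def mttf_by_integration(reliability, upper, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h


def sample_ttf(u, shape, scale):
    """Inverse of F(t) = 1 - exp(-(t/scale)**shape), applied to u in [0, 1)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


shape, scale = 2.0, 1000.0  # hypothetical Weibull parameters (scale in hours)

# MTTF from formula 4.33 (the integral is truncated where R(t) is negligible)
mttf_numeric = mttf_by_integration(lambda t: weibull_reliability(t, shape, scale), upper=10 * scale)

# Closed-form mean of the Weibull distribution, used here only as a cross-check
mttf_exact = scale * math.gamma(1.0 + 1.0 / shape)

# Monte Carlo estimate: uniform random numbers are mapped to TTFs through the inverse CDF
rng = random.Random(1)
samples = [sample_ttf(rng.random(), shape, scale) for _ in range(50_000)]
mttf_simulated = sum(samples) / len(samples)

print(f"integral of R(t): {mttf_numeric:.2f} h, closed form: {mttf_exact:.2f} h, "
      f"simulation: {mttf_simulated:.2f} h")
```

The first two values agree to within the numerical integration error, whereas the simulated value fluctuates around them with the usual statistical uncertainty of a Monte Carlo estimate.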

4.7.5 Average and Asymptotic Availability/Unavailability

The instantaneous availability and unavailability change as time elapses, and this is why the average values of these parameters are commonly used instead. The IEV 192 standard provides the following definitions:

Mean or average availability: average value of the instantaneous availability over a given time interval (t1, t2) (IEV 192-08-05).

$$\bar{A}(t_1, t_2) = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} A(t)\, dt \tag{4.34}$$

When this parameter is calculated for an observation time T starting from 0, it can be written:

$$\bar{A}(T) = \frac{1}{T} \int_0^T A(t)\, dt \tag{4.35}$$

Mean or average unavailability: average value of the instantaneous unavailability over a given time interval (t1, t2) (IEV 192-08-06).

$$\bar{U}(t_1, t_2) = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} U(t)\, dt \tag{4.36}$$

When this parameter is calculated for an observation time T starting from 0, it can be written:

$$\bar{U}(T) = \frac{1}{T} \int_0^T U(t)\, dt \tag{4.37}$$

It has to be noted that the mean unavailability is (improperly) called PFDavg (average of the probability of failure on demand) in functional safety standards.

The average availability can also be considered as the ratio of the accumulated sojourn time of the item in the OK state, $AST_{OK}(T)$, to the observation time T: this is even the way to estimate this parameter from the field feedback. Combined with formula 4.35, this gives:

$$\bar{A}(T) = \frac{AST_{OK}(T)}{T} = \frac{1}{T} \int_0^T A(t)\, dt$$

And finally:

$$AST_{OK}(T) = \int_0^T A(t)\, dt \tag{4.38}$$

Therefore, the accumulated time spent in the OK state over a given period is equal to the integral of A(t) over this same period. This is the opportunity to highlight the difference between repaired items and non-repaired items:
– repaired item: $\lim_{T \to \infty} \int_0^T A(t)\, dt = \infty$
– non-repaired item: $\lim_{T \to \infty} \int_0^T R(t)\, dt = \mathrm{MTTF}$

For a repaired item, the accumulated time in the OK state increases continuously whereas it reaches an asymptotic value (the MTTF) for a non-repaired item. In the same way, the average unavailability can be considered as the ratio of the accumulated sojourn time of the item in the KO state to the observation time, and this gives:

$$\bar{U}(T) = \frac{AST_{KO}(T)}{T} = \frac{1}{T} \int_0^T U(t)\, dt$$

And finally:

$$AST_{KO}(T) = \int_0^T U(t)\, dt \tag{4.39}$$


Then, the accumulated time spent in the KO state over a given period is equal to the integral of U (t) over this same period. When T goes to infinity, it goes to infinity more quickly for non-repaired items than for repaired items. The average availability with regard to a given failure can be estimated from the field feedback just by accumulating the time spent in the OK state (see Sect. 4.7.3) and dividing by the observation time, T:

$$\bar{A}(T) = \frac{AST_{OK}(T)}{T} \approx \hat{\bar{A}}(T) = \frac{\widehat{AST}_{OK}}{T} \tag{4.40}$$

In the same way, the average unavailability can be estimated by:

$$\bar{U}(T) = \frac{AST_{KO}(T)}{T} \approx \hat{\bar{U}}(T) = \frac{\widehat{AST}_{KO}}{T} \tag{4.41}$$
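The estimators 4.40 and 4.41 amount to a simple bookkeeping of the sojourn times, as sketched below. The record format (a list of state/duration pairs) and the numerical values are hypothetical and only serve the illustration.

```python
def average_availability(sojourns):
    """Estimate the average availability and unavailability from field feedback.

    sojourns: list of (state, duration) pairs covering the whole observation
    period, where state is "OK" or "KO" and duration is in hours.
    """
    total = sum(duration for _, duration in sojourns)                     # observation time T
    ok_time = sum(duration for state, duration in sojourns if state == "OK")
    if total <= 0:
        raise ValueError("the observation period must have a positive duration")
    return ok_time / total, (total - ok_time) / total


# Hypothetical observation log (hours)
log = [("OK", 700.0), ("KO", 20.0), ("OK", 1250.0), ("KO", 30.0)]
a_hat, u_hat = average_availability(log)
print(f"estimated average availability: {a_hat:.4f}, unavailability: {u_hat:.4f}")
```

In this example the item has spent 1950 h out of 2000 h in the OK state, which gives an estimated average availability of 0.975.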

Of course, OK and KO are replaced in most cases by the up and down states in the above formulae. It has to be noted that the average availability and the average unavailability exist in any case and should not be mixed up with the steady state (asymptotic) availability and unavailability, which exist only when a steady state regime is reached after some time, as illustrated on the left-hand sides of Figs. 4.24 and 4.27.

Steady state or asymptotic availability, Aas: limit, if it exists, of the instantaneous availability when time tends to infinity (IEV 192-08-07).

Steady state or asymptotic unavailability, Uas: limit, if it exists, of the instantaneous unavailability when time tends to infinity (IEV 192-08-08).

The steady states are characteristic of renewal processes (Cox 1962). There are no constraints on failure and restoration distributions except that the item is repaired and as good as new after restoration. In this case, the average availability converges toward the asymptotic value, Aas, of the instantaneous availability:

$$\lim_{T \to \infty} \bar{A}(T) = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(t)\, dt = \lim_{T \to \infty} \frac{1}{T} A_{as} \int_0^T dt = \lim_{T \to \infty} \frac{T \cdot A_{as}}{T} = A_{as} \tag{4.42}$$

It is the same for the average unavailability, which converges toward the asymptotic value, Uas, of the instantaneous unavailability:

$$\lim_{T \to \infty} \bar{U}(T) = U_{as} \tag{4.43}$$

If MOKT is the mean time spent in the OK state and MKOT the mean time spent in the KO state (see Figs. 4.23 and 4.26 bottom), then it can be demonstrated that:

$$A_{as} = \frac{\mathrm{MOKT}}{\mathrm{MOKT} + \mathrm{MKOT}} \quad \text{and} \quad U_{as} = \frac{\mathrm{MKOT}}{\mathrm{MOKT} + \mathrm{MKOT}} \tag{4.44}$$


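When OK and KO are the classical up and down states (see Fig. 4.8) then the following formulae are obtained:

$$A_{as} = \frac{\mathrm{MUT}}{\mathrm{MUT} + \mathrm{MDT}} \quad \text{and} \quad U_{as} = \frac{\mathrm{MDT}}{\mathrm{MUT} + \mathrm{MDT}} \tag{4.45}$$

As MUT + MDT is also the MTBF, this leads to:

$$A_{as} = \frac{\mathrm{MUT}}{\mathrm{MTBF}} \quad \text{and} \quad U_{as} = \frac{\mathrm{MDT}}{\mathrm{MTBF}} \tag{4.46}$$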

MUT, MDT and MTBF can be estimated from field feedback (see Sects. 4.6.3 and 4.6.4). It results that Aas and Uas can also be estimated from field feedback (provided that the assumption of perfect restoration holds).
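The convergence of the average availability toward MUT / (MUT + MDT), whatever the failure and restoration distributions, can be illustrated by simulation. The sketch below is only an example: the uniform distributions, their bounds, the observation time and the function names are assumptions chosen to show that non-exponential laws also lead to formula 4.45 when the item is as good as new after restoration.

```python
import random


def asymptotic_availability(mut, mdt):
    """Return (A_as, U_as) from the mean up time and the mean down time."""
    return mut / (mut + mdt), mdt / (mut + mdt)


def simulated_average_availability(sample_ttf, sample_ttr, horizon):
    """Fraction of [0, horizon] spent in the up state along one simulated trajectory."""
    t, up_time = 0.0, 0.0
    while t < horizon:
        up = min(sample_ttf(), horizon - t)   # up period, truncated at the horizon
        up_time += up
        t += up
        if t >= horizon:
            break
        t += min(sample_ttr(), horizon - t)   # down period, truncated at the horizon
    return up_time / horizon


rng = random.Random(7)
a_as, u_as = asymptotic_availability(mut=1000.0, mdt=20.0)
a_avg = simulated_average_availability(
    sample_ttf=lambda: rng.uniform(500.0, 1500.0),  # assumed law, mean up time 1000 h
    sample_ttr=lambda: rng.uniform(10.0, 30.0),     # assumed law, mean down time 20 h
    horizon=1.0e6,
)
print(f"A_as = {a_as:.4f}, simulated average availability = {a_avg:.4f}")
```

The simulated value only approaches the asymptotic one: the difference shrinks as the observation time increases.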

4.7.6 Failure Rate and Failure Intensities

4.7.6.1 Failure Rate

The failure rate is an important parameter when reliability calculations are performed. It can be defined first by considering a set of N similar non-repaired items evolving between OK and KO states. Some of the items fail as time elapses and, if n(t) is the number of items alive at time t, then the reliability can be estimated as:

$$\hat{R}(t) = \frac{n(t)}{N} \tag{4.47}$$

Let m(t) be the number of items failing during an interval [t, t + Δt]. It is realistic to consider that this number is proportional to the number of survivors, n(t), and also proportional to Δt when this duration is small: m(t) ∝ n(t) · Δt. This formula is not homogeneous from a dimensional point of view and a third factor has to be introduced to make it homogeneous:

$$m(t) = n(t) \cdot \bar{\lambda}(t, \Delta t) \cdot \Delta t \tag{4.48}$$

This new coefficient, $\bar{\lambda}(t, \Delta t)$, is the average value over [t, t + Δt] of the failure rate related to the set of items. Its dimension is [Time]^−1. The probability for one of the n(t) items to fail during Δt is then:

$$P_f(t, \Delta t) = \frac{m(t)}{n(t)} = \bar{\lambda}(t, \Delta t) \cdot \Delta t \tag{4.49}$$

Therefore $\bar{\lambda}(t, \Delta t) \cdot \Delta t$ is the probability for an item surviving until t to fail between t and t + Δt and:

$$\bar{\lambda}(t, \Delta t) = \frac{P_f(t, \Delta t)}{\Delta t} \tag{4.50}$$

According to this formula, the failure rate appears to be a probability per unit of time. Finally, the failure rate is obtained when Δt goes to zero:

$$\lambda(t) = \lim_{\Delta t \to 0} \bar{\lambda}(t, \Delta t) = \lim_{\Delta t \to 0} \frac{P_f(t, \Delta t)}{\Delta t} = \frac{P_f(t, dt)}{dt} \tag{4.51}$$

Figure 4.29 shows how the failure rate is related to the item failure at time t:
– the item has survived until t, with the probability R(t);
– it fails with the probability λ(t) · dt;
– or it remains in the OK state with the probability 1 − λ(t) · dt.

The failure rate is then related to the single failure of a non-repaired item and to the first failure of a repaired item. From a probabilistic calculation point of view, this introduces a strong condition for repaired systems: the component failures can be repaired only as long as the system has not failed yet and if they do not produce an overall (i.e. complete) system failure. This is the main difficulty for the reliability calculations of repaired items. The mathematical definition of the failure rate is obtained when Δt goes to zero:

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(\text{failure between } t \text{ and } t + \Delta t \mid \text{no failure over } [0, t])}{\Delta t}$$

Then:

$$\lambda(t) = \frac{\Pr(\text{failure between } t \text{ and } t + dt \mid \text{no failure over } [0, t])}{dt} \tag{4.52}$$
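Formulae 4.47 to 4.50 translate directly into an empirical estimation from the failure times of a set of non-repaired items. The sketch below is only an illustration: the function names and the ten failure times are hypothetical.

```python
def empirical_reliability(failure_times, t):
    """Estimate R(t) as n(t) / N, the fraction of items still alive at time t."""
    survivors = sum(1 for x in failure_times if x > t)
    return survivors / len(failure_times)


def empirical_failure_rate(failure_times, t, dt):
    """Estimate the average failure rate over [t, t + dt] as m(t) / (n(t) * dt)."""
    survivors = sum(1 for x in failure_times if x > t)           # n(t)
    failing = sum(1 for x in failure_times if t < x <= t + dt)   # m(t)
    if survivors == 0:
        raise ValueError("no item is alive at time t")
    return failing / (survivors * dt)


# Hypothetical failure times (hours) of N = 10 similar non-repaired items
failure_times = [50, 120, 180, 260, 300, 410, 520, 640, 800, 950]

r_hat = empirical_reliability(failure_times, t=200)               # 7 survivors out of 10
rate_hat = empirical_failure_rate(failure_times, t=200, dt=200)   # 2 failures among 7 survivors over 200 h
print(f"R(200 h) is about {r_hat:.2f}, average failure rate over [200 h, 400 h] "
      f"is about {rate_hat:.2e} per hour")
```

Note that the number of failures is divided by the number of survivors n(t) and not by the initial number of items N: this is what makes the failure rate a conditional quantity.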

Several definitions of the failure rate are available. They are more or less relevant and the following one, which is consistent with the development above, is retained for this book:

Instantaneous failure rate, λ(t): conditional probability per unit of time that the item fails between t and t + dt, provided that it performs as required at 0 and that the failure has not occurred within the interval [0, t].

Fig. 4.29 Illustration of the failure rate (the item stays in the OK state until t with the probability R(t), then the 1st failure moves it to the KO state between t and t + dt with the probability λ(t) · dt)

In other words, λ(t) · dt is the conditional probability of failing between t and t + dt, provided that the item has had no failure over [0, t].

4.7.6.2 Links Between Failure Rate, Reliability and Failure Density

According to Fig. 4.29, the infinitesimal increment of the probability of failure (unreliability) is linked to the reliability, R(t), and the failure rate, λ(t), by the following formula:

$$dF(t) = R(t) \cdot \lambda(t) \cdot dt \tag{4.53}$$

Then:

$$f(t) = \frac{dF(t)}{dt} = R(t) \cdot \lambda(t) \tag{4.54}$$

This is the opportunity to define the failure density in the same way as the failure rate has been defined:

$$f(t) = \frac{\Pr(\text{1st failure between } t \text{ and } t + dt \mid \text{no failure at } 0)}{dt} \tag{4.55}$$

Failure density, f (t): conditional probability per unit of time that the item fails for the first time between t and t + dt, provided that it performs as required at 0.

Only the conditions change between the definitions of f (t) and λ(t). According to formula 4.54, the failure rate is directly linked to the failure density f (t). As R(t) = 1 − F(t), this implies dR(t) = −dF(t) = −R(t) · λ(t) · dt. Then:

$$\lambda(t) = -\frac{dR(t)/dt}{R(t)} \tag{4.56}$$

And the failure rate is directly linked to the logarithmic derivative of the reliability function:

$$\lambda(t) = -\frac{d\{\ln[R(t)]\}}{dt} \tag{4.57}$$

Integrating this relationship over [0, t], with R(0) = 1, finally gives:

$$R(t) = e^{-\int_0^t \lambda(\tau)\, d\tau} \tag{4.58}$$


The above formula constitutes the basic relationship between reliability and failure rate: every reliability function can be expressed from a failure rate and vice versa. In fact, failure rate and reliability function cover the same knowledge about the related item failure.
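Formula 4.58 can be evaluated numerically for any failure rate function. The sketch below is only an illustration: the function name, the trapezoidal integration and the linearly increasing failure rate λ(t) = a · t are assumptions chosen for the example, the latter because it gives the closed form R(t) = exp(−a · t²/2) for cross-checking.

```python
import math


def reliability_from_rate(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral of failure_rate over [0, t]), with the trapezoidal rule."""
    if t == 0:
        return 1.0
    h = t / steps
    integral = 0.5 * (failure_rate(0.0) + failure_rate(t))
    integral += sum(failure_rate(i * h) for i in range(1, steps))
    return math.exp(-integral * h)


# Assumed linearly increasing failure rate (hypothetical coefficient, in h^-2)
a = 2.0e-6
r_numeric = reliability_from_rate(lambda tau: a * tau, 1000.0)
r_exact = math.exp(-a * 1000.0 ** 2 / 2)   # closed form for this particular failure rate

# Constant failure rate: formula 4.58 reduces to an exponential law
r_constant = reliability_from_rate(lambda tau: 1.0e-4, 1000.0)

print(f"increasing rate: R(1000 h) = {r_numeric:.6f} (closed form {r_exact:.6f})")
print(f"constant rate:   R(1000 h) = {r_constant:.6f}")
```

Conversely, the failure rate can be recovered from a reliability function through formula 4.57, which is why both functions carry the same information.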

4.7.6.3 Constant Failure Rate and MTTF

Formula 4.58 becomes very simple when the failure rate is not time-dependent. In this case, it is constant and the formula of reliability becomes:

$$R(t) = e^{-\lambda \cdot t} \tag{4.59}$$

This leads to the following formula for the unreliability:

$$F(t) = 1 - e^{-\lambda \cdot t} \tag{4.60}$$

This last formula is well known and widely used because, when F(t) ≪ 1, it is approximated by λ · t, which is very useful for quick calculations:

$$F(t) \ll 1 \Rightarrow F(t) \approx \lambda \cdot t \tag{4.61}$$

Formula 4.33 gives the relationship between R(t) and the MTTF:

$$\mathrm{MTTF} = \int_0^\infty e^{-\lambda \cdot t}\, dt = \left[-\frac{e^{-\lambda \cdot t}}{\lambda}\right]_0^\infty = \frac{1}{\lambda} \tag{4.62}$$

This highlights the direct relationship between the constant failure rate and the MTTF:

$$\lambda = \frac{1}{\mathrm{MTTF}} \tag{4.63}$$

Therefore, the MTTF estimations presented in Sect. 4.6.3 can be directly used to estimate the constant failure rates from the field feedback. It is even their main interest for reliability engineers, as the MTTF is a poor reliability indicator. Then the operating failure rate can be obtained from the MTTFO, the non-operating failure rate from the MTTFnO, etc. Nevertheless, when the item is subject to different failures, this can be done only under the assumption that the various failures have no impact on each other. The estimation of the failure rates and of their confidence intervals is developed in Chap. 38 about field feedback collection and processing. The link between failure rate and MTTF is a good opportunity to show that the MTTF has nothing to do with the life duration nor with a period free of failure for the related item.

Fig. 4.30 Typical bathtub curve of the failure rate (failure rate λ(t) on a logarithmic scale from 10^−5 h^−1 to 10^−1 h^−1 versus time from 0 to 50,000 h; successive periods: early failure, useful life, wear out)

Figure 4.30 shows the typical bathtub shape of the failure rate of an item in operation. Three different periods of time are identified:
– Early failure (youth) period, when the item is debugged of most of its weak points. It is generally a short period;
– Useful life period, which is the period of time where the item is actually used. In this period, thanks to the maintenance (corrective and preventive), it is maintained reasonably as good as new after restoration;
– Wear out period, where the maintenance is no longer able to maintain the item as good as new and where the wear increases and quickly leads the item to fail.

Therefore, the failure rate decreases during the early life period, reaches a zone of stability where it is more or less constant (useful life) and increases in the wear out period. In Fig. 4.30, the failure rate is 10^−5 h^−1 during the useful life and this corresponds to an MTTF of 100,000 h, i.e. about 11.4 years. However, according to the same figure, the life duration is about 50,000 h, i.e. 5.7 years only. Therefore, the MTTF in the useful life period has nothing to do with the actual lifetime of the item of interest. The lifetime (TL) could be estimated by observing a set of n similar items for a time period longer than 50,000 h and waiting for the n failures of the n items. However, in this case the estimation 1/TL has nothing to do with a constant failure rate.

In practical applications, the field feedback generally comes from repaired items which are expected to be as good as new after restoration because the maintenance compensates for the wear out. Therefore, the failure rate remains approximately constant during a long period (the useful life). When the maintenance is no longer able to compensate for the wear out of the item, it is normally replaced by a new one, otherwise it would fail quite soon. In the example, the item should be replaced after about 38,000 h (4.3 years), i.e. after less than half of its MTTF in useful life.

A similar example is obtained by considering a set of 40-year-old people, whose failure rate is about $1.25 \times 10^{-3}$ year$^{-1}$. This is equivalent to an MTTF of about 800 years, i.e. about 10 times the life duration which, in France, is currently greater than 83 years. Unfortunately, beyond 40 years, the failure rate of human beings increases rather quickly, and this explains these results!
Another observation can be made with regard to the MTTF: for a constant failure rate λ (i.e. MTTF = 1/λ), the probability of observing a failure within a time period equal to the MTTF is equal to the unreliability F(MTTF) of the item considered:

$$F(\mathrm{MTTF}) = 1 - e^{-\lambda \cdot \mathrm{MTTF}} = 1 - e^{-1} \approx 0.63 \tag{4.64}$$

It means that there is a probability of 63% (i.e. more than 1 chance in 2) that the item fails before the end of the MTTF period. This is far from negligible, and it clearly does not make sense to consider the MTTF as a failure-free period. Consequently, the MTTF is a poor estimator of the life duration, and also a poor estimator of a failure-free period: it should not be used for these purposes.
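The 63% figure does not depend on the value of the failure rate, as the following sketch suggests. It assumes a constant failure rate (exponential law); the function name and the third rate are illustrative only.

```python
import math

def unreliability(rate: float, t: float) -> float:
    """F(t) = 1 - exp(-rate * t) for a constant failure rate."""
    return 1.0 - math.exp(-rate * t)

# Rates taken from the two examples of the text plus one arbitrary value;
# the time unit only has to be consistent between rate and MTTF.
for rate in (1e-5, 1.25e-3, 0.03):
    mttf = 1.0 / rate
    print(f"rate={rate:g}  MTTF={mttf:g}  F(MTTF)={unreliability(rate, mttf):.4f}")
```

Whatever the rate, F(MTTF) stays at about 0.632, which is why the MTTF cannot be read as a failure-free period.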

4.7.6.4 Failure Intensities. Link with MTBF

Figure 4.31 is similar to Fig. 4.29 except that it is related to a failure which is not the first one. In this figure, $\lambda_V(t) \cdot dt$ is the probability that the item fails over the interval [t, t + dt] provided that it is performing as required at time t. More precisely, the parameter $\lambda_V(t)$ is named conditional failure intensity and its mathematical definition is the following:

$$\lambda_V(t) = \frac{\Pr(\text{failure between } t \text{ and } t + dt \mid \text{no failure at time } 0 \text{ and } t)}{dt} \tag{4.65}$$

Instantaneous conditional failure intensity, $\lambda_V(t)$: conditional probability per unit of time that the item fails between t and t + dt, provided that it performs as required at 0 and at t.
The definition looks similar to the definition of the failure rate. The difference lies in the condition, which is less binding: the item only has to perform as required at 0 and at t instead of over the whole interval [0, t]. Considering items performing as required at 0 makes it possible to exclude non-repaired items already failed at this time.

Fig. 4.31 Illustration of the conditional failure intensity

Fig. 4.32 Illustration of transitions between t and t + dt

The conditional failure intensity $\lambda_V(t)$ is also named Vesely failure rate, even though it is not really a failure rate and there is no link between $\lambda_V(t)$ and R(t). This is a source of confusion with the genuine failure rate λ(t). The confusion is reinforced by the fact that λ(t) and $\lambda_V(t)$ are numerically close, for example, when failures are quickly revealed and restored. Then $\lambda_V(t)$ provides a good approximation of λ(t) and, as it is easier to calculate, it is used instead of λ(t). This is the basis for reliability calculations made with the Boolean models (Chap. 22).
In the case analysed here, the repaired item can be either in the OK state or in the KO state at time t. As illustrated in Fig. 4.32, the unavailability at t + dt is equal to the sum of two terms:
– the item is in the OK state at t and one failure occurs between t and t + dt;
– the item is in the KO state at t and no restoration/repair occurs between t and t + dt.
This leads to:

$$U(t + dt) = A(t) \cdot \lambda_V(t) \cdot dt + U(t) \cdot [1 - \mu(t) \cdot dt] \tag{4.66}$$

The increment of the unavailability between t and t + dt is therefore $dU(t) = A(t) \cdot \lambda_V(t) \cdot dt - U(t) \cdot \mu(t) \cdot dt$.
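Reading formula (4.66) as the update of the unavailability from t to t + dt, it can be stepped numerically. The sketch below is only an illustration under strong assumptions which are not in the text: constant values of $\lambda_V$ and μ, U(0) = 0, and hypothetical numerical values. The closed-form expression used for comparison is the classical result for a single item with constant failure and repair rates.

```python
import math

def unavailability(lam_v: float, mu: float, horizon: float, dt: float) -> float:
    """Step formula (4.66) from U(0) = 0 up to `horizon`, with constant rates."""
    u = 0.0
    for _ in range(round(horizon / dt)):
        a = 1.0 - u                                # A(t) = 1 - U(t)
        u = a * lam_v * dt + u * (1.0 - mu * dt)   # U(t + dt)
    return u

lam_v, mu = 1e-3, 0.1   # hypothetical rates, per hour
u_num = unavailability(lam_v, mu, horizon=200.0, dt=0.01)
u_exact = lam_v / (lam_v + mu) * (1.0 - math.exp(-(lam_v + mu) * 200.0))
print(f"stepped U(200 h) = {u_num:.6e}, closed form = {u_exact:.6e}")
```

With these values the unavailability settles close to $\lambda_V / (\lambda_V + \mu)$, i.e. slightly below 1%.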

Therefore, contrary to the reliability, R(t), which is linked to the failure rate, there is no direct relationship between the availability and the Vesely failure rate, $\lambda_V(t)$, which, in spite of its name and as mentioned above, is not really a failure rate. This should not be forgotten by the analysts using it for reliability calculations: even if the results are generally good, they are only approximations.
Another parameter can be defined from Fig. 4.31: the unconditional failure intensity, w(t).
Unconditional failure intensity, w(t): probability per unit of time that the item fails between t and t + dt, provided that it performs as required at 0.
This parameter is named “unconditional” because the condition is very weak and to distinguish it from the conditional failure intensity. According to its definition, w(t) is linked to the conditional failure intensity by the following formula:

$$w(t) = A(t) \cdot \lambda_V(t) \tag{4.67}$$

Mathematically speaking, from formula 4.67, w(t) can be defined as:

$$w(t) = \frac{\Pr(\text{failure between } t \text{ and } t + dt \mid \text{no failure at time } 0)}{dt} \tag{4.68}$$

This formula is similar to the one given for f(t). The difference is that the failure is not necessarily the first one. Let us consider the instantaneous failure frequency of the item:

$$z(t) = \lim_{\Delta t \to 0} \frac{E[N(t + \Delta t) - N(t)]}{\Delta t}$$

In this formula, E[·] represents the expectation of the value between the brackets and N(t) represents the number of failures occurring over [0, t]. A physical item cannot fail several times within [t, t + Δt] when Δt → 0. Therefore, N(t + Δt) − N(t) is equal to 1 if a new failure occurs during [t, t + Δt] and equal to 0 otherwise. Finally, when Δt → 0, E[N(t + Δt) − N(t)] is equal to the probability of having one failure within the interval [t, t + Δt], and z(t) and w(t) are equivalent. This is why the unconditional failure intensity is also named failure frequency. Therefore, the average failure frequency of a repaired item over [0, T] can be obtained by the following integral:

$$\bar{w}(T) = \frac{1}{T} \int_0^T w(t)\, dt \tag{4.69}$$

Average failure frequency, $\bar{w}(T)$: expected number of failures over the interval [0, T] divided by T.
According to the above formula, the expected number of failures of a repaired item over an interval [0, T], Nbf(T), is equal to $\bar{w}(T) \cdot T$. This number can also be calculated by simply dividing T by the MTBF of the item. Then:

$$Nbf(T) = \int_0^T w(t)\, dt = \bar{w}(T) \cdot T = \frac{T}{\mathrm{MTBF}} \;\Rightarrow\; \bar{w}(T) = \frac{1}{\mathrm{MTBF}} \tag{4.70}$$
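Formula (4.70) can be illustrated by simulating a repaired item and counting its failures. The sketch below is not taken from the book: it assumes exponentially distributed up and down times, takes the MTBF as the sum of the mean up time and the mean down time, and uses hypothetical values and function names.

```python
import random

def count_failures(mttf: float, mttr: float, horizon: float,
                   rng: random.Random) -> int:
    """Number of failures of one repaired item over [0, horizon]."""
    t, failures = 0.0, 0
    while True:
        t += rng.expovariate(1.0 / mttf)   # up time until the next failure
        if t > horizon:
            return failures
        failures += 1
        t += rng.expovariate(1.0 / mttr)   # down time until restoration

rng = random.Random(0)                          # fixed seed for repeatability
mttf, mttr, horizon = 1000.0, 50.0, 100_000.0   # hypothetical values, in hours
runs = 2000
mean_failures = sum(count_failures(mttf, mttr, horizon, rng)
                    for _ in range(runs)) / runs
mtbf = mttf + mttr   # assumption: MTBF = mean up time + mean down time
print(f"simulated Nbf(T) = {mean_failures:.2f}, T / MTBF = {horizon / mtbf:.2f}")
```

For a horizon much longer than the MTBF, the simulated mean number of failures should stay close to T/MTBF, in line with formula (4.70).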

Then w(t), which is not a failure distribution, is not related to the MTTF, but its average value $\bar{w}(T)$ is directly linked to the MTBF. Then the average failure frequency can be obtained from the $\mathrm{MTBF}_O$, the non-operating failure frequency from the $\mathrm{MTTF}_{nO}$, etc. Again, when the item is subject to different failures, this can be done only under the assumption that the various failures have no impact on each other. It also has to be noted that the average failure frequency, $\bar{w}(T)$, is sometimes used where the reliability, R(t), should be used. This is, for example, the case in the functional safety domain where it is (improperly) called PFH (probability of dangerous failure per hour) and used for high demand or continuous mode of operation in safety standards.

Table 4.4 Difference between failure rate and failure frequency

| Failure rate | Average failure frequency |
|---|---|
| Non-repaired item | Repaired item |
| Conditional probability per unit of time | Number of failures per unit of time |
| λ = 0.03 per year | w = 0.03 failure per year |
| 3 failures for 100 similar items over 1 year | 3 failures for a single item over 100 years |

Beyond PFH calculations, the average failure frequency is useful when it is necessary to “count” the number of failures, Nbf(T), occurring over a given interval of time (e.g. when dealing with spurious actions of safety systems). This is also the way to extend the probabilistic calculations to items having several failures over the period of interest: in this case $\bar{w}(T) \cdot T$ gives the expected number of failures, whereas the unreliability F(t) goes to 1 and then becomes useless.

4.7.6.5 Comparison Failure Rate/Failure Intensity

In the case of a renewal process, the availability and the Vesely failure rate converge rather quickly toward asymptotic values, $A_{as}$ and $\lambda_V$. Therefore, the failure frequency and the average failure frequency also converge toward an asymptotic value w:

$$w = A_{as} \cdot \lambda_V \tag{4.71}$$

The Vesely failure rate, $\lambda_V$, provides a good approximation of λ when the failures of the item are quickly revealed and repaired. In this case, the availability $A_{as} \approx 1$ and, therefore, $\lambda \approx \lambda_V \approx w$. These parameters, which may have close numerical values and seem to have the same dimension (a number of failures per unit of time), are often mixed up. The difference is explained in Table 4.4.
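The effect of the availability on formula (4.71) can be made visible with a small numerical sketch. It assumes a single item with constant rates, for which the asymptotic availability is taken as $\mu / (\lambda_V + \mu)$; this expression, the function name and the values are assumptions made for the illustration only.

```python
def asymptotic_failure_frequency(lam_v: float, mu: float) -> tuple:
    """Return (A_as, w) with A_as = mu / (lam_v + mu) and w = A_as * lam_v (4.71)."""
    a_as = mu / (lam_v + mu)
    return a_as, a_as * lam_v

lam_v = 1e-3                       # hypothetical Vesely failure rate, per hour
for mu in (1.0, 1e-2, 1e-3):       # hypothetical repair rates, per hour
    a_as, w = asymptotic_failure_frequency(lam_v, mu)
    print(f"mu={mu:g}  A_as={a_as:.4f}  w={w:.3e}  w/lambda_V={w / lam_v:.3f}")
```

When repairs are fast (μ much greater than $\lambda_V$), w is practically equal to $\lambda_V$; when repairs are as slow as failures, w is only half of it.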

4.7.7 Restoration/Repair Rate

The definition of the maintainability in IEV 192 is the following:
Maintainability, M(t1, t2): probability that a given maintenance action, performed under stated conditions and using specified procedures and resources, can be completed within the time interval (t1, t2) given that the action started at t = 0 (IEV 192-07-01).
In this definition, the nature of the maintenance action is not specified and it can be related to the repair of failures and more specifically to the active repair time (ART). This is a restricted meaning which is used hereafter because only the active repair time is specific to a given item, the other times being specific to the operation/maintenance philosophy of the plant (see Fig. 4.2).
Figure 4.33 illustrates a repair which has started at time t = 0. When it is completed, the item moves from the KO state to the OK state. This is the counterpart, for repair, of Fig. 4.21 which is related to failure occurrence. According to the definition and with regard to repair, the maintainability can be written as:

$$M(t) = \Pr(\mathrm{ART} \le t) \tag{4.72}$$

Therefore, the maintainability is the counterpart, for repair, of the unreliability, for failures. This implies that similar developments can be undertaken. This also implies that 1 − M(t) is the counterpart of the reliability R(t). Then, the mean active repair time is equal to:

$$\mathrm{MART} = \int_0^{\infty} [1 - M(t)] \cdot dt \tag{4.73}$$
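Formula (4.73) can be checked numerically for a simple case. The sketch below assumes a constant active repair rate μ, so that M(t) = 1 − exp(−μ·t) and the expected result is 1/μ; the rate value, the truncation of the integral at 200 h and the function names are illustrative choices, not taken from the book.

```python
import math

def mean_active_repair_time(maintainability, upper: float,
                            steps: int = 100_000) -> float:
    """Trapezoidal approximation of the integral of 1 - M(t) over [0, upper]."""
    h = upper / steps
    total = 0.5 * ((1.0 - maintainability(0.0)) + (1.0 - maintainability(upper)))
    total += sum(1.0 - maintainability(i * h) for i in range(1, steps))
    return total * h

mu = 0.25   # hypothetical constant repair rate, per hour

def m_exp(t: float) -> float:
    """M(t) for a constant repair rate mu."""
    return 1.0 - math.exp(-mu * t)

mart = mean_active_repair_time(m_exp, upper=200.0)
print(f"MART = {mart:.4f} h, 1/mu = {1 / mu:.4f} h")
```

The same function can be fed with any other maintainability law, provided the upper bound is large enough for 1 − M(t) to be negligible beyond it.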

Fig. 4.33 Random process related to failure repair

By similarity with Fig. 4.29, Fig. 4.34 illustrates the active repair rate, which can be defined as follows:
Instantaneous active repair rate, μ(t): conditional probability per unit of time that the active repair is completed between t and t + dt, provided that it started at 0 and has not been completed within the interval [0, t].

Fig. 4.34 Illustration of the active repair rate

By similarity with Fig. 4.31, Fig. 4.35 illustrates the conditional active repair intensity, η(t), from which the repair frequency could be calculated in the same way as the failure frequency has been calculated from the conditional failure intensity.

Fig. 4.35 Illustration of the conditional active repair intensity

Of course, if other maintenance actions are considered, similar definitions can be introduced for restoration rate/intensity/frequency, active repair rate/intensity/frequency, etc.

4.8 Conclusion About the Reliability Concepts

The above exercise is instructive as it shows that the definitions inside a corpus are not independent but are linked to each other through a more or less complicated network. Therefore, developing a consistent corpus of definitions is not an easy task, as modifying one of them is likely to have side-effects on many others. This is why the “improvement” of a definition has to be done very cautiously and only after having identified the potential side-effects on all the other definitions belonging to the same network. This has been the aim pursued throughout this chapter.

References

Boulanger JL (2013) Safety of computer architecture. Wiley-ISTE
Cox DR (1962) Renewal theory. Methuen & Co. Ltd, London. Chapman & Hall, London (Reprinted)
CCPS (2007) Guidelines for safe and reliable instrumented protective systems. Center for Chemical Process Safety, Wiley-Blackwell
Cui L, Li H (2007) Analytical method for reliability and MTTF assessment of coherent systems with dependent components. In: Reliability engineering and system safety (RESS), vol 92. Elsevier, pp 300–307
Dhillon BS (2007) Applied reliability and quality: fundamentals, methods and procedures. Springer-Verlag, London, UK
Elsayed EA (2012) Reliability engineering, 2nd edn. Wiley
IEC 60050-192 (IEV192) (2015) International electrotechnical vocabulary, part 192: dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC Electropedia (2020) http://www.electropedia.org/. Accessed Sept 2020
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
Kapur KC, Pecht M (2014) Reliability engineering. Wiley-Blackwell
Kumamoto H, Henley EJ (1996) Probabilistic risk assessment and management for engineers and scientists. IEEE, Piscataway, NJ, USA
Modarres M, Kaminskiy MP, Krivtsov V (2017) Reliability engineering and risk analysis: a practical guide, 3rd edn. CRC Press
Nakagawa T (2011) Maintenance theory of reliability. Springer, London
Okaro IA, Tao L (2016) Reliability analysis and optimisation of subsea compression system facing operational covariate stresses. In: Reliability engineering and system safety (RESS), vol 156. Elsevier, pp 159–174
Ossai CI (2019) Remaining useful life estimation for repairable multi-state components subjected to multiple maintenance actions. In: Reliability engineering and system safety (RESS), vol 182. Elsevier, pp 142–151
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer
Rausand M (2014) Reliability of safety-critical systems: theory and applications. Wiley, Hoboken, New Jersey, USA
Rausand M, Høyland A (2009) System reliability theory: models, statistical methods, and applications. Wiley-Interscience
Selvik JT, Signoret JP (2020) Risk management of complex systems: understanding the difference between systematic and systemic failures. In: Engineering assets and public infrastructures in the age of digitalization, LNME. Springer Nature, Switzerland, pp 128–136
Smith DJ (2017) Reliability, maintainability and risk: practical methods for engineers, 9th edn. Butterworth-Heinemann, Oxford, UK
Tobias PA, Trindade DC (2012) Applied reliability, 3rd edn. CRC Press
Zio E (2007) An introduction to the basics of reliability and risk analysis. World Scientific Publishing Co. Pte. Ltd., Singapore

Chapter 5

Dependent and Common Cause Failures

© Springer Nature Switzerland AG 2021. J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_5

5.1 Introduction to Dependent and Common Cause Failures

5.1.1 Identification of the Problem

Achieving a highly reliable system S can be done in two ways:
1. Selection of a highly reliable single item I.
2. Use of redundant items $I_i$.
Unfortunately, the first solution is often not tractable because such items do not exist or are highly expensive. Therefore, the second solution is generally implemented in industry.
For example, the probability of failure, $\Pr(\bar{S})$, of a system S comprising 2 independent redundant items $I_1$ and $I_2$ can be calculated from the probabilities of failure, $\Pr(\bar{I}_1)$ and $\Pr(\bar{I}_2)$, by the following formula:

$$\Pr(\bar{S}) = \Pr(\bar{I}_1 \cap \bar{I}_2) = \Pr(\bar{I}_1) \times \Pr(\bar{I}_2) \tag{5.1}$$

If it is required that $\Pr(\bar{S}, T) \le 10^{-6}$ over a given time interval [0, T], applying formula (5.1) shows that the use of two identical redundant items having a failure probability $\Pr(\bar{I}_1, T) = \Pr(\bar{I}_2, T) \le 10^{-3}$ would make it possible to reach the target. Formula 5.1 can be extended to any number of redundant items and it seems that reaching a high level of success (i.e. a low probability of failure) is just a matter of adding a sufficient number of redundant items. Unfortunately, this is an illusion as the items are not necessarily fully independent and in this case formula 5.1 is no longer valid. For example, when $\bar{I}_1$ and $\bar{I}_2$ are not independent, the following formula 5.2 has to be used instead:

$$\Pr(\bar{I}_1 \cap \bar{I}_2) = \Pr(\bar{I}_1 \mid \bar{I}_2) \times \Pr(\bar{I}_2) \tag{5.2}$$

In this formula, $\Pr(\bar{I}_1 \mid \bar{I}_2)$ denotes the conditional probability of $\bar{I}_1$ given that $\bar{I}_2$ has occurred. This implies:

$$\Pr(\bar{I}_1 \mid \bar{I}_2) = \frac{\Pr(\bar{I}_1 \cap \bar{I}_2)}{\Pr(\bar{I}_2)} \tag{5.3}$$

When $\bar{I}_1$ and $\bar{I}_2$ are independent, formula 5.3 leads to $\Pr(\bar{I}_1 \mid \bar{I}_2) = \Pr(\bar{I}_1)$. This result constitutes a criterion of independence between $\bar{I}_1$ and $\bar{I}_2$. In the same way, $\Pr(\bar{I}_2 \mid \bar{I}_1)$ can be calculated by the following formula:

$$\Pr(\bar{I}_2 \mid \bar{I}_1) = \frac{\Pr(\bar{I}_1 \cap \bar{I}_2)}{\Pr(\bar{I}_1)} = \Pr(\bar{I}_1 \mid \bar{I}_2) \times \frac{\Pr(\bar{I}_2)}{\Pr(\bar{I}_1)} \tag{5.4}$$

Then, when $\bar{I}_1$ and $\bar{I}_2$ are dependent, the dependency exists on both sides: the occurrence of $\bar{I}_1$ depends on the occurrence of $\bar{I}_2$ and vice versa. Therefore, dependencies described by conditional probabilities such as $\Pr(\bar{I}_1 \mid \bar{I}_2) \neq \Pr(\bar{I}_1)$ or $\Pr(\bar{I}_2 \mid \bar{I}_1) \neq \Pr(\bar{I}_2)$ do not imply any causal link between the events $\bar{I}_1$ and $\bar{I}_2$, but rather a positive or a negative mutual influence between them. These dependencies are the symptom of underlying (or root) causes affecting both items. When the impact is detrimental, these underlying causes are called common cause failures. However, an underlying cause is often considered to be a “true” common cause failure (CCF) only if the failures of two or more individual items impacted by this cause occur within a limited interval of time.
Let us consider that $I_1$ and $I_2$ are two bulbs which fail to light. This can be due to a loss of the electrical power supply, $\bar{E}$, which obviously is a common cause leading to $\bar{I}_1$ and $\bar{I}_2$. Therefore, $\bar{E}$ is a cause of $\bar{I}_1$ and a cause of $\bar{I}_2$, and it is because $\bar{I}_1$ and $\bar{I}_2$ share this common cause that they are not independent.
$\Pr(\bar{I}_1 \mid \bar{I}_2)$ being greater than $\Pr(\bar{I}_1)$ most of the time (positive dependency), using formula 5.1 instead of formula 5.2 leads to non-conservative results. Of course, the more dependent the events, the more optimistic formula 5.1. This is the core of the dependent/common cause failures difficulties: as soon as the items are linked by any dependency, the system failure probability cannot be assessed just by multiplying the item failure probabilities: conditional probabilities have to be considered instead.
As already described in Chap. 4 (see Fig. 4.6), the item failures being never completely independent, decreasing the probability of failure of a system just by increasing the redundancy is limited by existing dependencies due to common cause failures (CCFs). That means that the impact of CCFs increases when the probability of the system failure decreases. At the limit, a CCF negligible for an individual item can become the topmost contributor when redundancy is high. Then, the dependent failures and their underlying common causes often constitute the weak points of redundant systems.
Therefore, it is important to take CCFs into account but, except in explicit (tangible) cases (e.g. the loss of electrical power supply analysed above), assessing conditional probabilities is usually difficult. This is why approximate models have been developed to take them into account to some extent and to provide safeguards against non-conservative results.
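The two-bulb example can be turned into numbers to see how optimistic formula 5.1 becomes when a common cause exists. The sketch below rests on assumptions which are not in the text: each bulb has the same intrinsic failure probability, the intrinsic failures are independent of each other and of the power supply, and the numerical values are hypothetical.

```python
def two_bulbs(p_bulb: float, p_supply: float) -> dict:
    """Failure probabilities of two bulbs fed by one shared power supply."""
    # A bulb is dark if the supply is lost or, otherwise, if the bulb itself fails.
    p_item = p_supply + (1 - p_supply) * p_bulb
    # Both bulbs are dark if the supply is lost or if both bulbs fail on their own.
    p_both = p_supply + (1 - p_supply) * p_bulb ** 2
    return {
        "item": p_item,
        "both": p_both,
        "conditional": p_both / p_item,   # formula (5.3)
        "naive": p_item ** 2,             # formula (5.1), wrongly assuming independence
    }

r = two_bulbs(p_bulb=1e-3, p_supply=1e-4)   # hypothetical values
for name, value in r.items():
    print(f"{name:12s} {value:.4e}")
print(f"formula 5.1 underestimates by a factor of {r['both'] / r['naive']:.0f}")
```

With these values the conditional probability is close to 0.09, far above the unconditional item failure probability of about 0.001, and the naive product is optimistic by nearly two orders of magnitude.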

5.1.2 Definition

As pointed out by Modarres et al. (2017), “there is no unique and universal definition for CCFs”. However, a fairly general definition of CCF is given by Mosley et al. (1988) as “a subset of dependent events in which two or more component fault states exist at the same time or in a short time interval, and are direct results of a shared cause”.
In the absence of a generally accepted definition of CCF, as already mentioned, many information media such as standards, guidelines and textbooks have proposed their own more or less equivalent definition, often depending on their application domain and context (Rausand 2014). Among them is the one cited in Chap. 4 in accordance with the IEV 192 standard, IEC 60050-192 (2015), as follows:
Common cause failures (CCFs): failures of multiple items, which would otherwise be considered independent of one another, resulting from a single cause (IEV 192-03-18).
Some other publications define the CCFs by the mechanism leading to them. For example, in Mosley et al. (1988):
Common cause failures (CCFs): result of the coexistence of two main factors:
• A susceptibility for items to fail because of a particular root cause.
• A coupling mechanism (or factor).
The root cause is the most basic cause of the failure that, if corrected, would prevent its occurrence (e.g. abnormal environmental stress, design inadequacy, human actions, procedure inadequacy).

The coupling factor characterizes components susceptible to the same causal mechanism, which creates the condition for multiple items to fail (e.g. shared hardware design, shared maintenance/test schedule).
In addition, it is expected that the failures of two or more individual items due to CCFs occur within a limited interval of time.
It has to be noted that common cause failures can lead to common mode failures, which are defined as follows in the IEV 192:
Common mode failures (CMFs): failures of different items characterized by the same failure mode (IEV 192-03-19).
Moreover, the distinction between common cause and common mode failure may be difficult and some sources do not hesitate to consider CMFs as a sub-category of CCFs.

5.1.3 Dependency Classifications

Dependencies can be classified in many ways with regard to the type of failures. This is analysed hereafter, where useful classifications are identified.

5.1.3.1 Intrinsic Versus Extrinsic Dependencies

The first classical dependency classification consists in splitting them as follows:
• Dependencies intrinsic to the system. These dependencies find their roots in the design of the system. There are two classes:
– Functional dependencies: the functional status of an item depends on the functional status of another one. Failures due to functional dependencies are often due to design errors.
– Cascade failures or propagating failures: the failure of one item is the cause of the failure of another item which, in turn, can be the cause of failure of another item, etc. The cascading effect is often identified and taken into account by designers and operators.
• Dependencies extrinsic to the system. Typical extrinsic dependencies are:
– Loss of support systems (e.g. loss of hydraulic, pneumatic, electric power).
– Abnormal working environmental conditions (e.g. operating conditions exceeding the design ones).
– Aggression from environment (e.g. excessive heat, corrosive atmosphere, lightning) impacting items located in the same area.
– Item inadequacy (e.g. off specifications due to poor manufacturing).
– Human interactions (e.g. items maintained by the same maintenance team).

This classification is useful to identify the origin of the dependencies and then of the common cause failures.

5.1.3.2 Logic Versus Dynamic and Lineage Dependencies

A second classification can be to split the dependencies between:
• Logic dependencies: they belong to the Boolean framework and can be described by conditional probabilities as has been done in 5.1.1. The impacted items immediately fail.
• Dynamic dependencies: the effect of the underlying common cause is softer. It does not immediately trigger the failures of the impacted items but increases their probabilities of occurrence (e.g. the failure of the air conditioning increases the failure rate of electronic devices which, in turn, decreases the time to failure of the system). Dynamic dependencies can be intrinsic (e.g. functional dependency, soft cascade failure) or extrinsic (e.g. abnormal environment). When the effects begin to be perceptible in a short interval of time, they can be assimilated to “true” CCFs. Due to the non-immediate effect, this type of dependency is sometimes named semi-catastrophic.
• “Lineage”¹ dependencies: they are linked to common causes impacting in the same way the probabilistic parameters of all the related components (e.g. when they come from a bad batch or a good batch of components).
This classification is useful with regard to the available probabilistic models which make it possible to take them into consideration: the logic dependencies can be handled by Boolean family models (e.g. reliability block diagrams, fault trees, see Part 3 of the book), the dynamic dependencies by dynamic models (e.g. Markovian or Petri net approaches, see Part 4 of the book) and the lineage dependencies by using the uncertainty propagation techniques (see e.g. Chaps. 25, 32, 26 or 38).
It has to be noted that the item failures coming from a CCF are expected to occur within a limited time interval (see 5.1.1 and 5.1.2). Therefore, the qualification of CCFs strongly depends on what is meant by “limited time interval”. This is in particular the case for dynamic and lineage dependencies. For example, if an external event multiplies the item failure rates by 100, it is likely to appear as a common cause of failure but, if it multiplies the item failure rates only by 2, it is likely to remain unnoticed. In fact, there is a continuum between failures occurring at the same time and failures occurring over a time interval, and the qualification of “true” CCF may be a bit subjective.
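The remark on the multiplication of the failure rates can be quantified with a rough sketch. It assumes two items with the same constant failure rate, failing independently once the external event has multiplied this rate, and an arbitrary observation window; the values and the function name are hypothetical.

```python
import math

def prob_both_fail(rate: float, window: float, factor: float) -> float:
    """Probability that two items both fail within `window` when their
    common failure rate is multiplied by `factor`."""
    p_one = 1.0 - math.exp(-factor * rate * window)
    return p_one ** 2

rate, window = 1e-5, 100.0   # hypothetical: rate per hour, 100 h observation window
base = prob_both_fail(rate, window, 1)
for factor in (1, 2, 100):
    p = prob_both_fail(rate, window, factor)
    print(f"factor={factor:>3}  Pr(both fail)={p:.3e}  ratio to nominal={p / base:.0f}")
```

A factor of 2 multiplies the probability of a double failure within the window by about 4, which is easily lost in the statistical noise, whereas a factor of 100 multiplies it by several thousand and is likely to be perceived as a common cause failure.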

¹ This term is used in this book to qualify non-tangible dependencies impacting items of the same origin, design, manufacturing, provider, that is to say items sharing the same pedigree (same lineage) and consequently of the same quality (good, medium or bad).

5.1.3.3 Tangible Versus Non-tangible Dependencies

A third classification can be to split the dependencies between:
• Tangible or explicit dependencies: they are the result of causes which can be clearly identified by performing in-depth system analyses (e.g. loss of power supply, pipe plugging, fire, flooding, etc.).
• Non-tangible or non-explicit dependencies: they are the result of causes difficult to apprehend due to the absence of field feedback or ignorance of the phenomena (epistemic problems).
This classification is useful with regard to the choice of the techniques used to take dependencies into account in the probabilistic modelling and calculations:
• The tangible common cause failures are easy to identify and should be identified, analysed and processed as any other event in the safety and dependability models developed in this book.
• The non-tangible common cause failures are difficult to identify or to qualify and generally constitute the residual failures whose causes cannot be explicitly modelled. However, they cannot be ignored without risking over-optimistic evaluations. Then broad approaches have been developed (e.g. beta-factor model, shock model) aiming to provide safeguards against underestimating their impact.

5.1.3.4 Human and Software Related Dependencies

This chapter would not be complete without talking about the human factor, which is often considered as the main source of common cause failures. The human interactions with the systems can be roughly classified as:
• Errors in the design, construction and installation: they generally belong to the non-tangible dependencies mentioned above.
• Errors in operation: they generally belong to tangible dependencies (e.g. shutting off a manual valve on the tapping of a sensor) and, as such, can be identified and quantified on their own.
Nowadays, the components and systems become smarter and smarter. That means that more and more software is introduced, and software errors now tend to supersede the human factor as a source of common cause failures. Unfortunately, no really effective approaches have been developed yet to properly model the interactions between hardware, human factor and software. Then, the software “reliability” is generally evaluated separately by implementing specific techniques (see IEC 62628 (2012)) and often in a purely qualitative way (see IEC 61508 (2010)). This is why, within the reliability model of a system, the software is generally considered as a single isolated item constituting an extrinsic dependency. Without further analyses, it can even be considered as a non-tangible common cause of failure in many cases. Therefore, the software should be, at least, considered through the broad models (e.g. beta-factor model, shock model) by adjusting the parameters of these models (e.g. by engineering judgment).
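The way such a broad model limits the benefit of redundancy can be sketched with the beta-factor model named above. This section does not detail the model, so the form used here is an assumption: the usual textbook form, in which a fraction β of the item failure rate is attributed to a common cause failing all the items at once. The non-repaired items, the 1-out-of-n architecture, the parameter values and the function name are also hypothetical.

```python
import math

def system_failure_probability(rate: float, beta: float,
                               mission: float, n: int) -> float:
    """1-out-of-n system: fails if the common cause occurs or if all the
    n items fail independently during the mission."""
    p_common = 1.0 - math.exp(-beta * rate * mission)
    p_indep = 1.0 - math.exp(-(1.0 - beta) * rate * mission)
    return 1.0 - (1.0 - p_common) * (1.0 - p_indep ** n)

rate, beta, mission = 1e-5, 0.05, 100.0   # hypothetical values (per hour, -, hours)
for n in (1, 2, 3, 4):
    p = system_failure_probability(rate, beta, mission, n)
    print(f"n={n}  Pr(system failure)={p:.3e}")
```

Under these assumptions, going from one to two items divides the failure probability by about 20, but adding a third or a fourth item brings almost nothing: the result is bounded below by the common cause contribution, close to β·λ·T.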

5.2 Examples of CCFs Observed in Real Life

The impact of CCFs is not a purely theoretical view: they are involved in many situations which occur frequently. The latest one at the time of writing this book is the emergence of COVID-19, which in a few days locked down almost all the activities of countries all over the planet and has killed several hundred thousand people. This pandemic is the latest of the calamities which, over the centuries, have unfortunately and periodically struck a part of the planet: plague, leprosy, Ebola, influenza (Spanish, Asian, avian), HIV, SARS, etc., or starvation due to bad climatic conditions.
Hereafter are briefly described some typical accidents where CCFs are involved and some CCFs detected from reliability data collection.

5.2.1 Examples of Typical Accidents Due to CCFs

Opening a newspaper is all you have to do to hear about CCFs: newspapers frequently report incidents/accidents where CCFs are involved. Some illustrative examples are described hereafter.
Example 5.1 The Apollo 13 service module was provided with 2 oxygen tanks (Jones 2016). On April 13th 1970, one oxygen tank exploded and the explosion destroyed the second oxygen tank, causing the failure of the mission. The 3-man crew managed to return safely to the Earth. The failure of the second tank is a cascade failure.
Example 5.2 On March 3rd 1974, a McDonnell Douglas DC-10 airplane crashed into the Ermenonville forest outside Paris (France) (JO 1976). The cargo doors of the airplane were designed to open outwards, and a specific latching system was provided to keep each of the cargo doors locked shut under pressure. After take-off, one of the latching systems failed and the rear left cargo door opened. A section of the cabin floor was ejected. However, even with such damage, the pilots could have kept the airplane under control. Unfortunately, all the redundant control cables ran beneath the damaged floor and the pilots were then unable to keep the airplane in the air. 346 people were killed. A proper zone analysis would have identified that the redundant control cables were located in the same area, and a design change would have prevented the loss of control of the airplane.

Example 5.3 On March 28th 1979, an accident has occurred in unit 2 of the Three Mile Island nuclear power plant and the core partially melted down (Rogovin and Frampton 1979). This was the result of a combination of equipment malfunctions, designrelated problems and human errors. Among the causes, several pressuriser valves had been mispositioned after a periodic proof test and this constitutes a typical example of common cause failure due to human error. Although no injury had been reported, this accident is often mentioned as a textbook case of common cause failures due to human factor. A better training of the operators may decrease the probability of such an event. Example 5.4 The Viking Sky is a cruise ship launched in 2016 (Wikipedia Viking Sky 2019). Despite the storm warnings which had been issued, she was sailing from Tromsø to Stavanger (Norway) on March 23th 2019 with 1,373 people on board when she suffered from the failure of her four engines. A loss of lubrication oil pressure has been the common cause for the shutdown of the four engines. Due to rough conditions, the rescue was difficult (another common cause!). 470 passengers have been evacuated by helicopter before that three engines have been restarted in the night of March 24th and that the Viking Sky has been able to sail again. Sixteen people have been injured, three of them seriously. Avoiding common utilities (here the lubrication system) is a good way to prevent such common cause of failures. Example 5.5 The electric blackouts are a typical common cause of loss of electrical power supply for many people at the same time and at the level of towns, regions or even whole countries. The blackout is often itself the result of cascading failures from a single cause (e.g. high-voltage line break, electrical power plant shutdown, over consumption leading to protective circuit breaker opening). 
This happens rather often; see, for example, Wikipedia Blackouts (2020) or Wikipedia Outages (2020):
• United States: 1965 (230 kV line failure, Northeast), 1977 (storm, New York City), 2003 (electric power station failure, Northeast, 55 million consumers impacted), 2008 (storm, California), 2008 (electric station failure, Florida), 2019 (Manhattan).
• France: 1978 (grid collapse), 1987 (winter cold), 1999 (storm).
• Japan: 1987 (winter cold), 2011 (magnitude 9 earthquake).
• Canada: 1989 (solar wind), 1998 (freezing rain).
• Europe: 2006 (grid collapse, 15 million consumers impacted).
• Argentina, Paraguay, Uruguay: 2019 (grid collapse, 48 million consumers impacted).
• India: 2012 (grid collapse, 670 million consumers impacted).


Beyond technical failures, the above list highlights the impact of meteorological or environmental conditions (e.g. earthquakes or solar wind) as a source of common cause failures.

5.2.2 Examples of Typical CCFs Detected from Field Feedback

Reliability data collection is also an opportunity to identify common cause failures, as highlighted by the two examples presented hereafter.

Example 5.6 A collection and analysis of CCF data was performed in 2014-2015 on offshore platforms in the North Sea for several safety items (Hauge et al. 2015). Some of the CCF events recorded for Pressure Safety Valves (PSV) involved:
• Pilot exhaust lines plugged.
• Rust within the valves.

This means that, when the collected data related to similar items having failed within a short interval of time were analysed, these failures were found to be due to a true CCF rather than to independent failures.

Example 5.7 The Organisation for Economic Co-Operation and Development (OECD) Nuclear Energy Agency (NEA) has set up the International Common Cause Data Exchange (ICDE) Project to collect and analyse CCF events in nuclear power plants. The main causes of CCF for batteries are (NEA/CSNI/R(2003)19 2003):
• Battery design or manufacture inadequacy.
• Maintenance-induced failures.
• Internal malfunction.

Once more, the analysis of the collected data related to similar items having failed within a short interval of time showed that these failures were due to a true CCF rather than to independent failures.

5.3 Dependent Failures Identification

When analysing a system, it is important to identify the potential CCFs as soon as possible in order to prevent them as far as possible. This is particularly important with regard to redundancy, which is likely to be seriously impaired when CCFs are present. A basic principle in dependent failure identification is therefore to consider that dependent failures can occur as soon as items are in redundancy. Any of the analysis techniques described in this book offers, to some extent, opportunities to identify common cause failure candidates, provided that it is oriented toward this purpose. Specific approaches have also been designed for CCF identification. Three simple approaches which can be used for this purpose are described hereafter:

Identification during preliminary analyses
All the inductive approaches described in Chaps. 7–12 can be used. For example, hazard identification methods such as checklists (Chap. 11), Preliminary Hazard Analysis (Chap. 8) or even FMEA (Chap. 10) prove to be effective for identifying potential common cause failures (e.g. external events or possible cascade failures).

Minimal cut sets analysis
Reliability block diagram (Chap. 15) or fault tree (Chap. 16) approaches can also be used through the thorough analysis of each minimal cut set (see Chap. 16) of order two or more. This is an effective technique to help identify which causes are good candidates to produce CCFs (see Chap. 17).

Zone analysis
The zone analysis (Desroches et al. 2015) is a specific technique focused on the identification of failure modes and accident scenarios by considering the geographical arrangement of the system throughout its mission. It is the only method for determining what could happen to redundant items located close to each other, for example in the same room or the same cabinet (e.g. fire, flooding, overheating, corrosion).
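The combination of minimal cut set analysis and zone analysis can be illustrated with a minimal Python sketch. It screens cut sets of order two or more for pairs of items sharing a location or a utility. The item names, their attributes and the cut sets are hypothetical and only serve the illustration; a real study would take them from the system model.

```python
from itertools import combinations

# Hypothetical item attributes: "zone" is the physical location,
# "utility" is a shared support system (assumptions for illustration).
ITEMS = {
    "pump_A": {"zone": "room_1", "utility": "bus_1"},
    "pump_B": {"zone": "room_1", "utility": "bus_2"},
    "valve_C": {"zone": "room_2", "utility": "bus_1"},
    "sensor_D": {"zone": "room_3", "utility": "bus_3"},
}

# Hypothetical minimal cut sets, e.g. as produced by a fault tree analysis.
CUT_SETS = [
    {"sensor_D"},
    {"pump_A", "pump_B"},
    {"pump_A", "valve_C"},
    {"pump_B", "valve_C", "sensor_D"},
]


def ccf_candidates(cut_sets, items):
    """Return (cut set, attribute, shared value, item pair) for every pair of
    items in a cut set of order >= 2 that shares an attribute value."""
    findings = []
    for cut_set in cut_sets:
        if len(cut_set) < 2:
            continue  # order-one cut sets are single failures, not CCF candidates
        members = tuple(sorted(cut_set))
        for a, b in combinations(members, 2):
            for attribute in ("zone", "utility"):
                if items[a][attribute] == items[b][attribute]:
                    findings.append((members, attribute, items[a][attribute], (a, b)))
    return findings


findings = ccf_candidates(CUT_SETS, ITEMS)
for finding in findings:
    print(finding)
```

Each finding is only a candidate: whether a shared room or a shared bus can actually fail both items has to be judged by the analyst.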

5.4 CCF Data Collection

As for any other probabilistic calculation, no relevant result involving common cause failures can be obtained without sound estimations of the related probabilistic parameters. Like other probabilistic parameters, they are expected to be obtained from the statistical analysis of field feedback collected through reliability data collection systems (see Chap. 38). However, CCF events are typically rare events (far scarcer than the events involved in ordinary reliability data collection). It is therefore beyond the capability of a single operator to collect CCF data alone: the CCF field feedback from many operators in different countries has to be collected and combined to perform meaningful estimations. This requires rigorous data collection frameworks to gather, exchange and combine CCF data. As presented hereafter, this is done, for example, in the nuclear industry and in the oil and gas industry.


Experience on CCF is scarce and most of it comes from the nuclear industry:
• The US Nuclear Regulatory Commission issues many documents:
  – explaining how to code, collect and process CCF data (NUREG/CR 4780 1988; NUREG/CR-5485 1998; NUREG/CR 6268 Rev. 1 2007);
  – providing CCF parameters for valves, pumps, heat exchangers, etc. (NUREG/CR-5497 2016).
• The Nuclear Energy Agency (NEA) of the OECD manages an international joint project focused on CCF data exchange and also issues documents such as NEA/CSNI/R(2003)19 (2003) about batteries, NEA/CSNI/R(2011)12 (2012) about CCF data exchange or NEA/CSNI/R(2019)4 (2019) about nuclear power plant modifications.

In the oil and gas industry, work is also carried out to collect and process CCF data. For example, Hauge et al. (2013) provides CCF data derived mainly from the OREDA database (OREDA 2020). This database, shared between several operators, implements the ISO 14224 (2016) standard, which has been developed to collect and exchange reliability data in the oil and gas industry and provides some guidance about CCF data. The latest CCF data collection exercises related to safety instrumented systems (SINTEF A26922 2015) show that:
• The time interval to consider in the definition of a CCF is limited to 1 year (the most common interval between proof tests, see Chap. 36).
• Due to the limited information on CCF frequency, only approaches with a low number of parameters are reasonably usable: the higher the number of parameters of a method, the higher the number of assumptions needed to implement it.
• There are few CCF events on items belonging to the same system, so the past data collection exercises have been extended to identical items belonging to different systems. As a consequence, additional assumptions have been introduced for calculating the parameters of the various methods.
It has to be noted that, when dealing with safety systems, the unavailability is caused by dangerous failures and mainly by dangerous undetected failures. A proper identification and quantification of CCF is therefore crucial for dangerous undetected failures. This is why the data collection exercises reported in SINTEF A26922 (2015) and NEA/CSNI/R(2003)19 (2003) are focused on such failures. This leads to modelling methods related to dangerous undetected failures. These methods are often considered to be applicable to dangerous detected failures or spurious failures as well, but this is questionable.


5.5 CCF Modelling

5.5.1 Introduction

Let us consider a component $I_1$ belonging to a set of two similar components $\{I_1, I_2\}$. Its failures can be split into independent failures $\bar{I}_1^{Ind}$ and failures $\bar{I}_1^{2}$ due to logic common causes (see 5.1.3) between $\bar{I}_1$ and $\bar{I}_2$. This leads to:

$$\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2} \tag{5.5}$$

If the same component belongs to a set of three similar components $\{I_1, I_2, I_3\}$, its failures can be split in addition into the common causes $\bar{I}_1^{3}$ between $\bar{I}_1$ and $\bar{I}_3$ and also into the common causes $\bar{I}_1^{2,3}$ between $\bar{I}_1$, $\bar{I}_2$ and $\bar{I}_3$. This leads to:

$$\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2} \cup \bar{I}_1^{3} \cup \bar{I}_1^{2,3} \tag{5.6}$$

It has to be noted that in the above formulae $\bar{I}_1^{2} \equiv \bar{I}_2^{1}$, $\bar{I}_1^{3} \equiv \bar{I}_3^{1}$ and $\bar{I}_1^{2,3} \equiv \bar{I}_2^{1,3} \equiv \bar{I}_3^{1,2}$.

For a set of four components, this leads to:

$$\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2} \cup \bar{I}_1^{3} \cup \bar{I}_1^{4} \cup \bar{I}_1^{2,3} \cup \bar{I}_1^{2,4} \cup \bar{I}_1^{3,4} \cup \bar{I}_1^{2,3,4} \tag{5.7}$$

And so on: even if some terms do not exist, their number increases very quickly with the size of the set of components and, except when tangible common causes are clearly identified, it is generally not possible in practice to take into account all the events involved in the above formulae. This is why simplified broad parametric models involving a limited number of terms have been developed to model the impact of logic CCFs on system failures. They are the main models used for CCF analyses and they aim to:
• provide a realistic prediction (i.e. conservative but not excessively so) of the probability of failure of systems comprising redundant components or, more generally, involving minimal cut sets of an order greater than one;
• avoid unrealistic non-conservative predictions in the case of non-tangible CCFs;
• help to determine the weak points involving CCFs and to identify which defences against them are the most efficient.

The broad CCF modelling methods can be classified according to the number of required parameters (single or multiple) and according to their impact on item failures (i.e. whether or not the failures occur with a probability equal to one). The main broad models described hereafter are the beta-factor model (single parameter) and the shock model (three parameters). Both models are devoted to logic CCFs and are recommended by IEC 61508-6 (2010).
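The growth in the number of terms in formulae 5.5 to 5.7 can be made concrete with a short Python sketch. It lists, for one item of a group, the independent failure plus one common cause event for each non-empty subset of the other items. The function name and the string labels are illustrative assumptions, not notation from the text.

```python
from itertools import combinations


def ccf_terms(n_items, item=1):
    """List the failure events of `item` within a group of n_items similar
    items: the independent failure ("Ind") plus one common cause event per
    non-empty subset of the other items (labelled by the other items' indices)."""
    others = [i for i in range(1, n_items + 1) if i != item]
    terms = ["Ind"]
    for size in range(1, len(others) + 1):
        for subset in combinations(others, size):
            terms.append(",".join(str(i) for i in subset))
    return terms


for n in (2, 3, 4, 10):
    # The number of terms doubles with each additional item in the group.
    print(n, len(ccf_terms(n)))
```

For three items the labels reproduce the superscripts of formula 5.6, and the number of terms is 2 to the power (N - 1), which is what makes a complete treatment impractical for large groups.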


5.5.2 The Beta-Factor Model

This is the simplest parametric model. It is described in IEC 61508-6 (2010), ISO/TR 12489 (2013), Humphreys and Proc (1987) and SINTEF A26922 (2015). This approach considers that the failure rate $\lambda$ of an item is the sum of an independent failure rate $\lambda_{Ind}$ and of a CCF failure rate $\lambda_{ccf}$:

$$\lambda = \lambda_{Ind} + \lambda_{ccf} \tag{5.8}$$

Then the beta factor, $\beta$, is defined as the ratio of the CCF failure rate to the (total) failure rate (i.e. the failure rate generally found in reliability data handbooks such as OREDA (2015)):

$$\beta = \frac{\lambda_{ccf}}{\lambda} = \frac{\lambda_{ccf}}{\lambda_{Ind} + \lambda_{ccf}} \tag{5.9}$$

With: $1 > \beta > 0$

Then, from this definition:

$$\lambda_{ccf} = \beta \cdot \lambda \quad \text{and} \quad \lambda_{Ind} = (1 - \beta) \cdot \lambda \tag{5.10}$$

Values up to 10% can be considered for the beta factor (IEC 61508-6 2010, Table D1). This model is interesting because it can be easily modelled by Markov graphs (Chap. 31), Petri nets (Chap. 33) and Boolean models (Chap. 23). It has to be noted that the same definition is also used to split the probability of failure on demand $\gamma$:

$$\gamma = \gamma_{Ind} + \gamma_{ccf} \tag{5.11}$$

With:

$$\gamma_{ccf} = \beta \cdot \gamma \quad \text{and} \quad \gamma_{Ind} = (1 - \beta) \cdot \gamma \tag{5.12}$$

Being very straightforward, this method is well known and widely used. It can be easily implemented in the various models (reliability block diagrams, fault trees, Markovian approach or Petri nets) described in the other parts of this book. Its main drawbacks are:
• without further assumptions, it can be used only with items exhibiting the same failure rate;
• it only models the failure of all the impacted items.

Compared to the formulae established in 5.5.1, this approach consists in keeping only the last terms:
• $\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2}$ for two components;
• $\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2,3}$ for three components;
• $\bar{I}_1 = \bar{I}_1^{Ind} \cup \bar{I}_1^{2,3,4}$ for four components;
• etc.

Of course, the beta-factor model can be extended to more than one parameter in order to take other terms into account, as is done in various ways by the other modelling methods mentioned in 5.5.4.
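As an illustration of how the beta-factor split of formula 5.10 feeds a system calculation, the following Python sketch evaluates a redundant pair of identical, non-repaired items. The exponential laws, the absence of repair and the numerical values are assumptions of this sketch, not statements from the text; the function names are hypothetical.

```python
import math


def beta_factor_split(lam, beta):
    """Split a total failure rate into (independent, CCF) parts, formula 5.10.
    beta = 0 is accepted here only to allow a comparison without CCF."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    return (1.0 - beta) * lam, beta * lam


def unreliability_redundant_pair(lam, beta, t):
    """Probability that a redundant pair has failed by time t: either the
    common cause event has occurred, or both items have failed independently.
    Exponential laws and no repair are assumptions of this sketch."""
    lam_ind, lam_ccf = beta_factor_split(lam, beta)
    both_independent = (1.0 - math.exp(-lam_ind * t)) ** 2
    no_ccf = math.exp(-lam_ccf * t)
    return 1.0 - no_ccf * (1.0 - both_independent)


lam = 1.0e-5  # hypothetical total failure rate, per hour
t = 8760.0    # one year, in hours
for beta in (0.0, 0.02, 0.10):
    print(beta, unreliability_redundant_pair(lam, beta, t))
```

Even a beta factor of a few percent visibly raises the result compared with the purely independent case, which is why neglecting CCFs in redundant architectures tends to be non-conservative.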

5.5.3 The Shock Model

The shock model, also named the binomial failure rate (BFR) model (Atwood 1986), is a three-parameter CCF model:
• $\omega$: occurrence rate of lethal shocks.
• $\rho$: occurrence rate of non-lethal shocks.
• $\gamma$: conditional probability of failure of each item, given a non-lethal shock.

According to the definition of the above parameters, the common cause failure rate, $\lambda_{ccf}$, of a given item impacted by a shock (lethal or non-lethal) is given by:

$$\lambda_{ccf} = \omega + \gamma \cdot \rho \tag{5.13}$$

And the total failure rate of this item can be written as:

$$\lambda = \lambda_{Ind} + \omega + \gamma \cdot \rho \tag{5.14}$$

As for the beta-factor model, $\lambda_{ccf}$ and $\lambda_{Ind}$ can be expressed as percentages of the item failure rate, $\lambda$:
• $\lambda_{Ind} = (1 - \beta) \cdot \lambda$.
• $\lambda_{ccf} = \beta \cdot \lambda$.

For the shock model, $\lambda_{ccf}$ can be split in turn with regard to the lethal and non-lethal shocks:
• Lethal failure rate: $\omega = \beta_{LC} \cdot \lambda$.
• Non-lethal failure rate: $\lambda_{nLc} = \gamma \cdot \rho = \beta_{nLC} \cdot \lambda$.

This implies:
• $\beta_{nLC} = \gamma \cdot \rho / \lambda$
• $\beta_{LC} = \omega / \lambda$
• $\beta = \beta_{LC} + \beta_{nLC}$

And finally:

$$\lambda_{ccf} = \beta \cdot \lambda = (\beta_{LC} + \beta_{nLC}) \cdot \lambda \tag{5.15}$$

$$\lambda_{Ind} = (1 - \beta) \cdot \lambda = [1 - (\beta_{LC} + \beta_{nLC})] \cdot \lambda \tag{5.16}$$

Methods for assessing the parameters of the shock model are given in IEC 61508-6 (2010) (annex D), ISO/TR 12489 (2013) and Leroy (2018).

When a collection of N similar items is affected by a lethal shock, all of the N items fail and this has a similar impact as in the beta-factor model described above. The same applies to the non-lethal shock and, when $\gamma = 0$, $\gamma = 1$ or $\rho = 0$, this approach is equivalent to the beta-factor model, which takes only lethal shocks into account. As explained above for the beta factor, this can be easily modelled by Markov graphs (Chap. 31), Petri nets (Chap. 33) or Boolean models (Chap. 23).

Concerning the non-lethal shock, the probability of failure between $t$ and $t + dt$ of one impacted item is given by $\gamma \cdot \rho \cdot e^{-\rho t}\,dt$: while this can be easily modelled by Markov graphs (Chap. 31) or Petri nets (Chap. 33), it cannot be handled directly by Boolean models (Chap. 23). Fortunately, integrating this formula over an interval $[0, t]$ gives the probability of failure of an impacted item as $\gamma \cdot (1 - e^{-\rho t})$. This formula, which combines the probability that the non-lethal shock has occurred over $[0, t]$ and the probability of failure of the impacted item, can be handled by Boolean models (see Chap. 23). Nevertheless, this is an approximate approach which should be used only when the CCFs cannot be clearly identified or when the related reliability data are not easily available (e.g. non-tangible CCFs).

When a collection of N similar items is affected by a non-lethal shock of rate $\rho$, the probability that k among them fail is equal to $C_N^k \cdot \gamma^k (1 - \gamma)^{N-k}$, where $C_N^k$ is the number of combinations of k items among N. This binomial formula explains the name of binomial failure rate given to the approach. With regard to a given item this implies:

$$\lambda_{nLc} = \gamma \cdot \rho \equiv \gamma \cdot \rho \cdot \sum_{k=0}^{N-1} C_N^k \cdot \gamma^k (1 - \gamma)^{N-k} \tag{5.17}$$

where $\sum_{k=0}^{N-1} C_N^k \cdot \gamma^k (1 - \gamma)^{N-k} = 1$ because it encompasses all the possible cases of failure (from 0 to N − 1) of the N − 1 remaining items. This gives:

$$\lambda_{nLc} = \rho \cdot \sum_{k=0}^{N-1} C_N^k \cdot \gamma^{k+1} (1 - \gamma)^{N-k} = \beta_{nLC} \cdot \lambda \tag{5.18}$$

And then:

$$\beta_{nLC} = \frac{\rho \cdot \sum_{k=0}^{N-1} C_N^k \cdot \gamma^{k+1} (1 - \gamma)^{N-k}}{\lambda} \tag{5.19}$$
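The binomial behaviour of the non-lethal shock can be sketched in Python as follows. The first function evaluates the probability that exactly k of N items fail given one non-lethal shock, and the second the approximate probability $\gamma \cdot (1 - e^{-\rho t})$ usable in Boolean models. The function names and the numerical values are hypothetical.

```python
import math


def failures_given_shock(n_items, gamma):
    """Probabilities that exactly k = 0..n_items items fail, given that one
    non-lethal shock has hit a group of n_items similar items."""
    return [
        math.comb(n_items, k) * gamma**k * (1.0 - gamma) ** (n_items - k)
        for k in range(n_items + 1)
    ]


def item_failure_probability(gamma, rho, t):
    """Approximate probability that a given item has failed on a non-lethal
    shock over [0, t]: gamma * (1 - exp(-rho * t))."""
    return gamma * (1.0 - math.exp(-rho * t))


# Hypothetical values: 4 items, 10% conditional failure probability.
probabilities = failures_given_shock(4, 0.1)
for k, p in enumerate(probabilities):
    print(k, p)

# Hypothetical non-lethal shock rate of 1e-5 per hour over one year.
print(item_failure_probability(0.1, 1.0e-5, 8760.0))
```

With a small $\gamma$ the probabilities fall off quickly with k, which is the property exploited when tuning $\gamma$ to make high-order multiple failures negligible.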

The above formulae can be simplified because, when dealing with a collection of numerous components affected by the same non-lethal shock, the probability of e.g. four component failures is negligible: this has never been observed for non-tangible CCFs and this is generally realistic for industrial systems. A conservative approach may be to consider that quadruple failures due to a non-lethal shock are certainly at least 10 times less likely than double failures due to the same non-lethal shock. This implies $C_N^2 \cdot \gamma^2 (1 - \gamma)^{N-2} > 10 \cdot C_N^4 \cdot \gamma^4 (1 - \gamma)^{N-4}$, which leads to:

$$\frac{\gamma}{1 - \gamma} < \sqrt{\frac{C_N^2}{10 \cdot C_N^4}} \tag{5.20}$$

When $\gamma < 0.1$ then $\frac{\gamma}{1 - \gamma} \approx \gamma$, and choosing $\gamma \approx \sqrt{\frac{C_N^2}{10 \cdot C_N^4}}$ is certainly a conservative assumption. This makes it possible to "tune" the value of $\gamma$ so that the impact of CCF vanishes when the number of impacted items actually failing on a non-lethal shock increases.

Finally, for a set of N similar items with the same global failure rate, $\lambda$, and impacted by a non-lethal shock, the parameters can be estimated as follows:
• Estimation of the global beta factor: $\beta$.
• Independent failure rate: $\lambda_{Ind} = (1 - \beta) \cdot \lambda$.
• Estimation of the lethal part of the beta factor: $\beta_{LC}$.
• Lethal failure rate: $\lambda_{Lc} = \beta_{LC} \cdot \lambda$.
• Non-lethal part of the beta factor: $\beta_{nLC} = \beta - \beta_{LC}$.
• Non-lethal shock failure rate: $\lambda_{nLc} = \beta_{nLC} \cdot \lambda$.
• Lethal failure rate: $\omega = \beta_{LC} \cdot \lambda$.
• Non-lethal failure rate: $\lambda_{nLc} = \beta_{nLC} \cdot \lambda$.
• Conditional probability of failure on non-lethal shock: $\gamma \approx \sqrt{\frac{C_N^2}{10 \cdot C_N^4}}$.
• Occurrence rate of non-lethal shock: $\rho = \frac{\lambda_{nLc}}{\gamma}$.

The above approach is effective when a great number of items are impacted by a non-lethal shock (e.g. a water hammer in a hydraulic system): it introduces the split between lethal and non-lethal shocks in such a way that the beta factor is kept constant and that, beyond double or triple failures, the contribution of unrealistic multiple failures is neglected.


The shock model is thus closer to the physical reality than the simple beta-factor model. The price to pay is the need for three parameters, $\beta_{LC}$, $\rho$ and $\gamma$, instead of one. However, this number is independent of the number of items affected in the system.
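The estimation steps of the shock model parameters can be sketched in Python as follows. The sketch assumes that the global beta factor and its lethal part are already known, and it uses the rule of thumb of formula 5.20 for $\gamma$; the function name, the dictionary keys and the numerical inputs are hypothetical.

```python
import math


def shock_model_parameters(lam, beta, beta_lc, n_items):
    """Derive shock model parameters from the total failure rate lam, the
    global beta factor and its lethal part beta_lc, for n_items similar items.
    gamma follows the rule of thumb sqrt(C(N,2) / (10 * C(N,4)))."""
    if n_items < 4:
        raise ValueError("the rule of thumb for gamma needs at least four items")
    if not 0.0 <= beta_lc <= beta < 1.0:
        raise ValueError("expected 0 <= beta_lc <= beta < 1")
    lam_ind = (1.0 - beta) * lam        # independent failure rate
    omega = beta_lc * lam               # lethal shock rate
    lam_nlc = (beta - beta_lc) * lam    # non-lethal CCF failure rate of an item
    gamma = math.sqrt(math.comb(n_items, 2) / (10.0 * math.comb(n_items, 4)))
    rho = lam_nlc / gamma               # non-lethal shock occurrence rate
    return {"lam_ind": lam_ind, "omega": omega, "lam_nlc": lam_nlc,
            "gamma": gamma, "rho": rho}


# Hypothetical inputs: 20 items, total failure rate 2e-6 per hour,
# beta = 5% of which 2% is attributed to lethal shocks.
params = shock_model_parameters(lam=2.0e-6, beta=0.05, beta_lc=0.02, n_items=20)
for name, value in params.items():
    print(name, value)
```

By construction, the independent rate, the lethal rate and the product of $\gamma$ and $\rho$ add up to the total failure rate, as in formula 5.14. The rule of thumb only yields $\gamma$ below 0.1 for sufficiently large groups, so the approximation $\gamma/(1-\gamma) \approx \gamma$ should be checked for small N.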

5.5.4 Other Modelling Methods

From the seventies until now, several methods have been developed to extend the beta-factor model, for example:
• Basic parameter model (NUREG/CR-5485 1998): it involves all the conditional probabilities identified in formula 5.7. As they are normally not readily available, other models with less stringent requirements on data were developed.
• Multiple Greek letter model (Fleming 1989; NUREG/CR-5497 2016): it is one of the models used by the US Nuclear Regulatory Commission for the assessment of the parameters of the CCFs.
• Alpha-factor model (NUREG/CR-5485 1998; NUREG/CR 6268 2007; NUREG/CR-5497 2016): it is also one of the models used by the US Nuclear Regulatory Commission for the assessment of the parameters of the CCFs.
• PDS method: developed for safety instrumented systems within the framework of the offshore petroleum industry (STF50 A0603 2006).

The above approaches are designed to be implemented at the level of a whole group of impacted items and they model how, among m impacted items, 2, 3, 4 … items fail. They are interesting, on a case-by-case basis, to model the CCFs of such groups. However, they cannot be easily implemented in systemic models (e.g. reliability block diagrams, fault trees or Petri nets) for modelling large systems involving many other items. This is why they are not described in more detail in this book.

References

Atwood CL (1986) The binomial failure rate common cause model. Technometrics 28(2). American Statistical Association and American Society for Quality, USA
Desroches A, Leroy A, Vallée F (2015) La gestion des risques, Principes et pratiques. Lavoisier-Hermès, Paris, France, pp 41–67
Fleming KN (1989) Parametric models for common cause failure analysis. In: Advanced seminar on common cause failure analysis in probabilistic safety assessment, Ispra Course. Ispra, Italy, pp 159–174
Hauge et al (2013) Reliability prediction method for safety instrumented systems: PDS method handbook, 2013 edn. SINTEF, Trondheim, Norway
Hauge S et al (2015) Common cause failures in safety instrumented systems: beta-factor and equipment specific check-lists based on operational experience, SINTEF A26922. Trondheim, Norway
Humphreys R, Proc A (1987) Assigning a numerical value to the beta factor common-cause evaluation. Proceedings of the National Reliability Conference, UK, 1987
IEC 60050-192 (IEV192) (2015) International electrotechnical vocabulary, Part 192: Dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61508-6 Ed. 2.0 (2010) Functional safety of electrical/electronic/programmable electronic safety-related systems, Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3, edition 2.0. Geneva, Switzerland
IEC 62628 (2012) Guidance on software aspects of dependability. International Electrotechnical Commission, Geneva, Switzerland
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
JO (1976) Rapport final de la Commission d’Enquête sur l’accident de l’avion D.C. 10 TC-JAV des Turkish Airlines survenu à Ermenonville le 3 mars 1974. (Année 1976. N°27. 12 mai 1976) Journal Officiel de la République Française
Jones H (2016) Common cause failures and ultra reliability. https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/. NASA Technical report server. Accessed September 2020
Leroy A (2018) Production availability and reliability. Use in the oil and gas industry, 1st edn. Wiley-ISTE, London, UK
Modarres M, Kaminskiy MP, Krivtsov V (2017) Reliability engineering and risk analysis: a practical guide, 3rd edn. CRC Press
Mosleh et al (1988) Procedure for treating common cause failures in safety and reliability studies, NUREG/CR-4780, vol I and II. US Nuclear Regulatory Commission, Washington, DC, USA
NEA/CSNI/R(2003)19 (2003) ICDE Project report: collection and analysis of common-cause failures of batteries. OECD Nuclear Energy Agency, Paris, France
NEA/CSNI/R(2011)12 (2012) International common cause failure data exchange (ICDE). General coding guidelines. Nuclear Energy Agency, Paris, France
NEA/CSNI/R(2019)4 (2019) Collection and analysis of common-cause failures due to nuclear power plant modifications. Nuclear Energy Agency, Paris, France
NUREG/CR-5485 (1998) Guidelines on modelling common-cause failures in probabilistic risk assessment. US Nuclear Regulatory Commission, Washington
NUREG/CR-5497 2015 update (2016) Common-cause failure parameters estimations, 2015 update. US Nuclear Regulatory Commission, Washington
NUREG/CR 6268 Rev. 1 (2007) Common-cause failure database and analysis system: event data collection, classification, and coding. US Nuclear Regulatory Commission, Washington
OREDA Handbook (2015) Ed. 6.0 Offshore and onshore reliability data. Prepared by SINTEF and NTNU. Hovik, Norway
OREDA database (2020) https://www.oreda.com/database/. Accessed September 2020
Rausand M (2014) Reliability of safety-critical systems: theory and applications. Wiley, Hoboken, New Jersey, USA
Rogovin M, Frampton GF (1979) Three Mile Island: a report to the commissioners and to the public, vol 1 to 3. NUREG/CR-1250. USNRC, USA
Wikipedia Blackouts (2020) https://fr.wikipedia.org/wiki/Liste_de_pannes_de_courant_importantes. Accessed September 2020
Wikipedia Outages (2020) https://en.wikipedia.org/wiki/List_of_major_power_outages. Accessed September 2020
Wikipedia Viking Sky (2019) https://en.wikipedia.org/wiki/MV_Viking_Sky. Accessed September 2020

Chapter 6

Extensions to Production Availability and Functional Safety Analyses

6.1 From Availability to Efficiency

6.1.1 Binary Items and Introduction of the Efficiency Concept

Figure 6.1 represents a reliability block diagram (see Chap. 15) modelling a redundant system made of two similar components A and B. This could be, for example, the redundant sensors of a safety instrumented system (see Chap. 36). This system has four states:

– $E_1$: A and B are in up state.
– $E_2$: A is in up state and B is in down state.
– $E_3$: A is in down state and B is in up state.
– $E_4$: both A and B are in down state.

These states are disjoint because the system can be in only one of them at a time. This is a typical binary system where the states can be split into two classes:

– up state class: $[E_1, E_2, E_3]$
– down state class: $[E_4]$

This splitting is the basis for defining the reliability, $R(t)$, and instantaneous availability, $A(t)$, functions described in Chap. 4. States $E_1$ to $E_4$ being disjoint, the availability of this system can then be written as:

$$A(t) = \Pr(E_1, t) + \Pr(E_2, t) + \Pr(E_3, t) \tag{6.1}$$

A more general way for writing this formula is the following one:

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_6


[Fig. 6.1 Binary state system: reliability block diagram with two blocks, A and B]

$$A(t) = 100\% \times [\Pr(E_1, t) + \Pr(E_2, t) + \Pr(E_3, t)] + 0\% \times \Pr(E_4, t) \tag{6.2}$$

In this formula, a coefficient has been introduced to indicate to which class a given state belongs with regards to the overall service expected from the system (i.e. the planned service). This coefficient is the efficiency of the state:

State efficiency, $\varepsilon_m$: percentage of the planned service provided by state m of an item.

This coefficient is equal to 100% for the states belonging to the up state class (i.e. they provide the planned service) and equal to 0% for the states belonging to the down state class (i.e. the service has dropped to 0). The advantage of introducing this parameter is to generalize the presentation of the availability function for an item comprising m different disjoint states:

$$A(t) = \sum_m \varepsilon_m \times \Pr(E_m, t) \tag{6.3}$$

This formula can be written as:

$$A(t) = \sum_m \varepsilon_m \times A_m(t) \tag{6.4}$$

where $A_m(t)$ is the availability related to state $E_m$. In Chap. 4 it has been demonstrated that:

$$AST_{OK}(T) = \int_0^T A(t)\,dt \tag{6.5}$$

where $AST_{OK}(T)$ is the accumulated sojourn time in the OK state over the interval $[0, T]$. In the example above, the OK state is the up state class. Therefore, the accumulated time in the up state class can be written as:

$$AST_{OK}(T) = \sum_m \varepsilon_m \times \int_0^T A_m(t)\,dt \tag{6.6}$$

Since $AST_m(T) = \int_0^T A_m(t)\,dt$ is the accumulated sojourn time in state $E_m$, the above formula can be simplified to:

$$AST_{OK}(T) = \sum_m \varepsilon_m \times AST_m(T) \tag{6.7}$$

Finally, the average availability can be written as:

$$\bar{A}(T) = \frac{AST_{OK}(T)}{T} = \frac{\sum_m \varepsilon_m \times AST_m(T)}{T} \tag{6.8}$$

Similarly, the average unavailability can be written as:

$$\bar{U}(T) = 1 - \bar{A}(T) = 1 - \frac{\sum_m \varepsilon_m \times AST_m(T)}{T} \tag{6.9}$$
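The efficiency-weighted sums of formulae 6.3, 6.8 and 6.9 can be illustrated with a minimal Python sketch for the binary system of Fig. 6.1. The independence of the two components, their availabilities and the sojourn times are assumptions introduced only for the illustration.

```python
def state_probabilities(avail_a, avail_b):
    """Probabilities of states E1..E4 for two components A and B.
    Independence of A and B is an assumption of this sketch."""
    return {
        "E1": avail_a * avail_b,
        "E2": avail_a * (1.0 - avail_b),
        "E3": (1.0 - avail_a) * avail_b,
        "E4": (1.0 - avail_a) * (1.0 - avail_b),
    }


# State efficiencies of the binary system: 100% for the up class, 0% otherwise.
EFFICIENCY = {"E1": 1.0, "E2": 1.0, "E3": 1.0, "E4": 0.0}


def efficiency_weighted_sum(efficiency, values):
    """Sum over the states of efficiency times the state quantity."""
    return sum(efficiency[state] * values[state] for state in efficiency)


# Formula 6.3 with hypothetical component availabilities at a given instant.
probabilities = state_probabilities(0.95, 0.95)
availability = efficiency_weighted_sum(EFFICIENCY, probabilities)

# Formulae 6.8 and 6.9 with hypothetical accumulated sojourn times (hours).
T = 1000.0
sojourn = {"E1": 900.0, "E2": 48.0, "E3": 49.0, "E4": 3.0}
average_availability = efficiency_weighted_sum(EFFICIENCY, sojourn) / T
average_unavailability = 1.0 - average_availability

print(availability, average_availability, average_unavailability)
```

The same helper works unchanged for any set of state efficiencies, which is the point of writing the availability as a weighted sum.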

6.1.2 Extension to Multistate Systems

As said above, reliability, $R(t)$, and availability, $A(t)$, are related to binary items with only two state classes. While this is realistic for simple components, it is questionable for systems made of several components, for which there is room for degraded states between the perfect state and the completely faulty state: when in a degraded state, the system can provide services which should not be neglected. This is particularly the case for production systems, as illustrated in Fig. 6.2. This is a simple pumping system comprising two pumps providing different flow rates.

[Fig. 6.2 Multistate pumping system: pump A (72 m³/h) and pump B (48 m³/h)]

As in the previous subsection, four different states can be identified:

– $E_1$: A and B are in up state;
– $E_2$: A is in up state and B is in down state;
– $E_3$: A is in down state and B is in up state;
– $E_4$: both A and B are in down state.

The difference with the previous example is that, now, the services provided by states $E_2$ and $E_3$ are different from each other and different from the service provided by $E_1$:

– 120 m³/h for $E_1$;
– 72 m³/h for $E_2$;
– 48 m³/h for $E_3$.

The service provided by $E_4$ is still equal to 0%. This is a typical multistate system for which the common reliability analysis finds its limits because it is not able to give a relevant answer to a simple question: how should these four states be split into two relevant classes, up and down? This splitting is indispensable to apply the conventional rules of probabilistic modelling and calculations, and three attempts are possible:

– gathering $[E_1, E_2, E_3]$ in the up class because some production is provided and $[E_4]$ in the down class;
– gathering $[E_1, E_2]$ in the up class because a production greater than 50% is provided and $[E_3, E_4]$ in the down class;
– gathering $[E_1]$ in the up class and $[E_2, E_3, E_4]$ in the down class because the production is partly lost.

The first attempt is too optimistic because it treats degraded states as perfect states, the second one is arbitrary, and the third one is too pessimistic because it treats the degraded states as completely failed states. Therefore, the classification into up and down state classes is not relevant and has to be abandoned. The solution to overcome this difficulty is to consider the four states separately. This can be done by extending the formulae established for the availability in Sect. 6.1.1. The state efficiency being the percentage of the service provided by a given state, this leads to:

• $\varepsilon_1 = \frac{120\ \mathrm{m^3/h}}{120\ \mathrm{m^3/h}} = 100\%$,
• $\varepsilon_2 = \frac{72\ \mathrm{m^3/h}}{120\ \mathrm{m^3/h}} = 60\%$,
• $\varepsilon_3 = \frac{48\ \mathrm{m^3/h}}{120\ \mathrm{m^3/h}} = 40\%$,
• $\varepsilon_4 = \frac{0\ \mathrm{m^3/h}}{120\ \mathrm{m^3/h}} = 0\%$.
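The difference between the three binary groupings and the efficiency-weighted treatment can be illustrated with a short Python sketch. The flow rates are those of the pumping example; the state probabilities are hypothetical values chosen only for the comparison.

```python
# Flow rates of the four states of the pumping example, in m3/h.
FLOW_RATES = {"E1": 120.0, "E2": 72.0, "E3": 48.0, "E4": 0.0}
PLANNED_RATE = 120.0

# State efficiencies: fraction of the planned service provided by each state.
efficiency = {state: rate / PLANNED_RATE for state, rate in FLOW_RATES.items()}

# Hypothetical state probabilities at a given instant (they sum to 1).
probability = {"E1": 0.90, "E2": 0.05, "E3": 0.04, "E4": 0.01}


def binary_availability(up_states):
    """Availability obtained when only `up_states` are counted as up."""
    return sum(probability[state] for state in up_states)


attempt_1 = binary_availability(["E1", "E2", "E3"])  # optimistic grouping
attempt_2 = binary_availability(["E1", "E2"])        # arbitrary 50% threshold
attempt_3 = binary_availability(["E1"])              # pessimistic grouping

# Efficiency-weighted value: each state counts for the service it provides.
weighted = sum(efficiency[state] * probability[state] for state in probability)

print(efficiency)
print(attempt_1, attempt_2, attempt_3, weighted)
```

With these values the weighted result lies between the pessimistic and optimistic groupings, while the intermediate grouping lands wherever its arbitrary threshold puts it.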


If $\rho$ is the planned production rate of the production system (e.g. 120 m³/h), $\rho \cdot \varepsilon_m \cdot \Pr(E_m, t)$ is the part of the production provided by state m at time t. Then the instantaneous production rate, $Pdr(t)$, is given by:

$$Pdr(t) = \rho \cdot \sum_m \varepsilon_m \cdot \Pr(E_m, t) \tag{6.10}$$

This formula can be written $Pdr(t) = \rho \cdot \varepsilon_S(t)$, where $\varepsilon_S(t)$ is the instantaneous system efficiency at time t. In the case of a production system, $\varepsilon_S(t)$ is called the productivity of the system. This gives:

$$Pdy(t) = \varepsilon_S(t) = \frac{Pdr(t)}{\rho} = \sum_m \varepsilon_m \cdot \Pr(E_m, t) \tag{6.11}$$

This formula is similar to formula 6.3 established for the instantaneous availability: $A(t) = \sum_m \varepsilon_m \times \Pr(E_m, t)$. Therefore, the productivity and the availability are two particular cases of a more general concept which can be called the efficiency of the item under consideration:

Instantaneous efficiency, $\varepsilon_S(t)$: percentage of the planned service provided at a given instant t.

This general definition can be specialized into two particular cases:

Instantaneous availability, $A(t)$: instantaneous efficiency related to an availability model¹ (i.e. percentage of time spent in up state at a given instant t).

Instantaneous production availability, $Pdy(t)$: instantaneous efficiency related to a production system, i.e. percentage of the planned production provided at a given instant t.

If $\rho$ is the production rate (in m³/h) of the production system, then $\rho \cdot \varepsilon_m \cdot AST_m(T)$ is the part of the production provided by state $E_m$ (see formula 6.5). Then the accumulated production, $APd(T)$, over the interval $[0, T]$ is given by:

$$APd(T) = \rho \cdot \sum_m \varepsilon_m \cdot AST_m(T) \tag{6.12}$$

Formula 6.12 also allows the calculation of an equivalent production time, $T_{eq}(T)$, at 100% of capacity over the time interval $[0, T]$: $APd(T) = \rho \cdot T_{eq}(T)$. This leads to:

$$T_{eq}(T) = \frac{APd(T)}{\rho} = \sum_m \varepsilon_m \cdot AST_m(T) \tag{6.13}$$

1 An availability model can be, for example, an availability Markov graph as described in Chap. 31.


Equivalent production time over [0, T], Teq (T ): production time at 100% of capacity giving the same production over [0, T] as the production given by all the states with a production efficiency not equal to 0. The maximum expected production over [0, T ] being ρ.T , the average productivity over [0, T ] can be calculated as: A Pd(T ) = Pdy(T ) = ρ.T

 m

εm .AST m (T ) Teq (T ) = T T

(6.14)

Formula 6.14 can also be obtained from formula 6.11 by averaging the instantaneous productivity over [0, T]:

$$Pdy(T) = \frac{1}{T}\int_0^T Pdy(t)\,dt \qquad (6.15)$$

Formula 6.15 is similar to that obtained for the average availability in formula 6.8. Again, the average productivity $Pdy(T)$ and the average availability $\bar{A}(T)$ are two faces of a more general concept which can be called the average efficiency $\varepsilon_S(T)$ of the item under consideration:

Average item efficiency, $\varepsilon_S(T)$: ratio of the accumulated service to the accumulated planned service over a specified period of time.

This makes the link with the estimation of the availability given in Chap. 4 (formula 4.36) and makes it possible to propose an equivalent definition for the average availability:

Average availability, $\bar{A}(T)$: average efficiency related to an availability model, i.e. ratio of the accumulated up time during a specified period of time to this specified period of time.

This also makes the link with the average production availability as defined in ISO 20815 (2018):

(Average) production availability or average productivity, $Pdy(T)$: ratio of production to planned production, or any other reference level, over a specified period of time (ISO 20815 (2018), 3.1.46).
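Formulae 6.12 to 6.14 can be illustrated with a short sketch. It assumes that AST_m(T) is the (mean) accumulated time spent in state E_m over [0, T]. The efficiencies, the accumulated times, the one-year horizon and the production rate are hypothetical values, and the function names are chosen here for illustration.

```python
def equivalent_production_time(efficiencies, ast):
    """T_eq(T) = sum_m eps_m * AST_m(T)  (formula 6.13)."""
    return sum(e * a for e, a in zip(efficiencies, ast))


def accumulated_production(rho, efficiencies, ast):
    """APd(T) = rho * sum_m eps_m * AST_m(T)  (formula 6.12)."""
    return rho * equivalent_production_time(efficiencies, ast)


def average_productivity(efficiencies, ast, horizon):
    """Pdy(T) = T_eq(T) / T  (formula 6.14)."""
    return equivalent_production_time(efficiencies, ast) / horizon


# Hypothetical inputs: four states observed over one year (T = 8760 h).
eps = [1.0, 0.6, 0.4, 0.0]             # assumed state efficiencies
ast_h = [8000.0, 400.0, 300.0, 60.0]   # assumed time spent in each state, in h
T = 8760.0                             # horizon, in h
rho = 120.0                            # assumed production rate, in m3/h

teq = equivalent_production_time(eps, ast_h)   # hours at 100% of capacity
apd = accumulated_production(rho, eps, ast_h)  # m3 produced over [0, T]
pdy = average_productivity(eps, ast_h, T)      # dimensionless ratio
print(f"Teq = {teq:.0f} h, APd = {apd:.0f} m3, Pdy = {pdy:.4f}")
```

Passing the horizon explicitly, rather than summing the accumulated times, keeps the sketch valid even if the list of states is not exhaustive.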

6.1.3 Generalization of the Efficiency Concept

In the previous example of the production system, only the income due to the production has been taken into consideration but, of course, when a component has failed, it has to be repaired and this costs money. Let $\alpha$ denote the income (e.g. in € per m³ of production) and $\beta$ the cost of the maintenance (e.g. in € per hour of maintenance). The revenue $\eta_m$ per hour related to each state is the difference between the production income and the cost of the maintenance, which leads to:

– $E_1$: $\eta_1 = 120 \cdot \alpha$.
– $E_2$: $\eta_2 = 72 \cdot \alpha - \beta$.
– $E_3$: $\eta_3 = 48 \cdot \alpha - \beta$.
– $E_4$: $\eta_4 = 0 - 2 \cdot \beta$.

For example, for state $E_2$ the income per hour is 72 m³ × $\alpha$ €/m³ and the loss is equal to $\beta$ € per hour of repair for the repair of one pump. To obtain the actual efficiencies $\varepsilon'_m$ of the states, the above values have to be divided by $\eta_1$:

– $E_1$: $\varepsilon'_1 = 120 \cdot \alpha / (120 \cdot \alpha) = 100\%$.
– $E_2$: $\varepsilon'_2 = (72 \cdot \alpha - \beta)/(120 \cdot \alpha) = \varepsilon_2 - \beta/\eta_1$.
– $E_3$: $\varepsilon'_3 = (48 \cdot \alpha - \beta)/(120 \cdot \alpha) = \varepsilon_3 - \beta/\eta_1$.
– $E_4$: $\varepsilon'_4 = (0 \cdot \alpha - 2 \cdot \beta)/(120 \cdot \alpha) = \varepsilon_4 - 2\beta/\eta_1$.

This leads to the general formula where $\varepsilon'_m$ is the actual efficiency of state $E_m$, $\varepsilon_m$ the efficiency without losses, $\beta_m$ the costs linked to state $E_m$ and $\eta_1$ the maximum possible income in the perfect state:

$$\varepsilon'_m = \varepsilon_m - \beta_m/\eta_1 \qquad (6.16)$$

Contrary to the previous example, the efficiency can now be positive or negative, but it can be used exactly as explained in the previous subsection. This simple efficiency concept can be implemented each time the service provided by a state is proportional to the time spent in this state. It makes it possible to handle availability and productivity in the same mathematical framework and it is particularly well adapted to the Markovian approach (see Chap. 31). It is also well adapted to the Petri net approach (see Chap. 33) which, furthermore, makes it possible to implement more sophisticated models taking into account other parameters such as non-constant state efficiencies, cost of spare parts, travel costs of the maintenance team, mobilisation of maintenance tools, penalties due to contract-related production shortfalls, problems due to overly long stops (e.g. hydrate formation in subsea pipelines), etc. This is useful each time the conventional reliability/availability framework is not sufficient to cover the needs of a given study.
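The efficiencies with maintenance costs (formula 6.16) can be sketched numerically. The production levels 120, 72, 48 and 0 m³/h and the number of repairs in progress per state come from the example above. The values of α and β are hypothetical, as is the function name `state_efficiencies`.

```python
def state_efficiencies(production, repairs, alpha, beta):
    """Efficiency of each state including maintenance costs (formula 6.16).

    production: hourly production of each state, perfect state first (m3/h)
    repairs:    number of repairs in progress in each state
    alpha:      income per m3 produced
    beta:       cost per hour of one repair
    """
    eta_1 = production[0] * alpha  # maximum possible income per hour
    return [(q * alpha - n * beta) / eta_1
            for q, n in zip(production, repairs)]


production_m3_per_h = [120.0, 72.0, 48.0, 0.0]  # states E1..E4
repairs_in_progress = [0, 1, 1, 2]              # from the eta_m list above
alpha = 10.0    # hypothetical income, in euros per m3
beta = 120.0    # hypothetical maintenance cost, in euros per hour

eff = state_efficiencies(production_m3_per_h, repairs_in_progress, alpha, beta)
for name, e in zip(["E1", "E2", "E3", "E4"], eff):
    print(f"{name}: {e:+.0%}")
```

With these assumed values, state E4 gets a negative efficiency: nothing is produced while two repairs are being paid for, which is the situation described in the paragraph above.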

6.2 From Conventional Safety to Functional Safety

6.2.1 Generalities About Protection Layers and Safety Systems

Figure 6.3 illustrates the typical organization of the various layers of protection that allow an industrial process to be operated in safe conditions:

• The first protection layer is the basic process command-control system (BPCS), which continuously monitors the operating parameters and maintains the process in a nominal and safe zone of operation. The BPCS is thus the first safety system in itself.
• The second one takes over when the first layer fails. It aims to prevent the occurrence of incidents or accidents. It comprises mechanical safety devices (e.g. relief valves), generates alarms in order to trigger safety actions from operators, and also comprises safety systems expected to perform specific required safety actions (e.g. to prevent overpressure in tanks).
• The third one takes over when the second one fails. It aims to mitigate the consequences of the incidents or accidents. Again, it comprises mechanical protections (e.g. merlons to limit the impact of explosions) and safety systems (e.g. to extinguish a fire).
• The last one takes over when all the protection layers have failed and an accident with major consequences has occurred. It involves the intervention of public authorities (fire fighter mobilisation, evacuation of the surroundings of the site, etc.) and this is outside the scope of this book.

Fig. 6.3 Organization of a process with its layers of protection (labels: process as protected system, command-control (BPCS), safety systems, protection layers for prevention, mitigation of consequences, emergency plan)

Safety systems are found in several places in the above organization and the following definition can be given:

Safety system: system provided to ensure safe operation of a process or to limit the consequences of anticipated incidents or accidents.

Safety systems have only two states:

– they are able to perform the safety action as required (up state);
– they are not able to perform the safety action as required (down state).

Contrary to the production systems analysed in Sect. 6.1, there is no room between the up and down states for degraded states: safety systems are typical binary items.

Fig. 6.4 Example of a conventional overpressure protection system (labels: flow in, tank, flow out, relief valve, flare)

Therefore, the Boolean approaches (Part 3) are very effective for modelling them and performing probabilistic calculations. For a long time, safety systems were made of mechanical devices (e.g. relief valves, Watt regulator) based only on physical phenomena. An example of such an overpressure protection system is illustrated in Fig. 6.4: when the pressure reaches a pre-set level, the relief valve opens and the pressure decreases. In this figure, the protected system (the tank) is in grey and the protection system (the relief valve) in black. Although conventional safety systems are still in use, they tend to be replaced more and more often by the so-called safety-related systems (see IEC 61508 (2010)) or safety instrumented systems (see IEC 61511 (2016)), which are introduced in Sect. 6.2.3.

6.2.2 Classification of Safety Systems and Impact of Faults

According to Fig. 6.3, the safety systems can be split into two categories:

1. the safety systems belonging to the BPCS (which is actually the first protection layer), which operate on a continuous basis (continuous mode of operation);
2. the safety systems belonging to the other protection layers, which operate only when a demand occurs due to a failure of a previous layer of protection (demand mode of operation).

This distinction is important with regard to failure detection:

– when a complete failure occurs in a safety system operating in continuous mode, the fault reveals itself by the immediate loss of the safety function;
– when a complete failure occurs in a safety system operating in demand mode, the fault may or may not reveal itself but the loss of the safety function will have an impact only if the fault is not repaired before an actual demand occurs.


This makes a big difference between the two operating modes: in continuous mode of operation, a complete failure may lead immediately to a problem (demand on another safety system or hazardous event) whereas, in demand mode of operation, there is some chance of detecting the complete failure and performing maintenance actions in due time, before a problem occurs.

More generally, the distinction between revealed and non-revealed (hidden) failures (see Sect. 4.4.3.3) is of utmost importance. Within the context of safety systems, the failures can be classified with regard to their capacity to be detected:

1. self-revealed failures;
2. hidden failures revealed by diagnostic tests performed in a loop and at relatively high frequency (every second, minute, …);
3. hidden failures revealed by periodic tests performed at low frequency (every month, year, 5 years, …);
4. hidden failures not detectable unless an actual demand occurs.

When a failure occurs in a safety system, the probability of not satisfying a demand for a safety action increases and, therefore, the exposure of the protected process to danger also increases. This is illustrated in Fig. 6.5, for the four types of failures identified above, where a grey scale has been used to visualize the degree of danger. The danger is maximum as long as the fault remains hidden (dark grey) and decreases when the fault has been detected (light grey) because it is possible to undertake compensating measures to mitigate the effect of this fault until it is repaired.

Fig. 6.5 Danger exposure of the protected system according to the fault detection time of a safety system (four panels: self-revealed failure, online diagnostic, test, non-detectable failure; each shows the OK/KO state, the fault detection time FDT, the repair/restoration time RT and the resulting danger exposure)

The following definition is a generalization of the definition of compensating measure provided in IEC 61511 (2016):

Compensating measure: temporary implementation of planned and documented methods for managing risks during any period of maintenance or process operation when it is known that the performance of a safety system is degraded.

Various compensating measures can be undertaken: strengthening the operator surveillance (e.g. to be ready to manually trigger the process shutdown), starting a back-up system, automatically shutting down the process, etc. One compensating measure often undertaken is to shut down the protected system immediately because the danger disappears and the safety system becomes useless. In any case, when the compensating measures have been undertaken, the risk is generally considered to be very low, as illustrated in Fig. 6.6.

Fig. 6.6 Danger exposure when compensating measures are undertaken

Of course, the faster the compensating measures are undertaken, the more effective they are in preventing hazardous events. Nevertheless, it has to be noted that if a compensating measure (e.g. the process shutdown) is safe with regard to a given fault, this is not necessarily the case with regard to another one, and this should be considered by the reliability analysts dealing with safety systems.

Another important parameter to take into account with regard to the success of a safety action is that it is completed in due time. This is the concept of process safety time. A definition derived from that given in IEC 61508 (2010) is provided hereafter:

Process safety time, PST: period of time between a failure of the process that has the potential to give rise to a hazardous event and the time by which the safety action has to be completed to prevent the hazardous event occurring.

With regard to the examples given in Figs. 6.4 and 6.9, the full opening of the relief valve or the full closure of the SDV has to be completed within the PST. This introduces a constraint with regard to the success of the safety action: if it is not completed within the PST, then a hazardous event may occur. Nevertheless, this constraint depends on the dynamics of the protected system, which can be slow (several minutes or hours are available) or fast (only a few seconds are available).

Figure 6.7 illustrates what happens to the protected system when it generates a demand for a safety action. On the left-hand side, the safety action is completed within the process safety time and the protected system reaches a safe state. On the right-hand side, the safety action is not completed within the process safety time and a hazardous event occurs (or a demand on another safety system is produced).

Fig. 6.7 Illustration of the process safety time

If TTS is the time needed to reach a safe state when undertaking a compensating measure and θ the time to the next demand, θ + PST − TTS is the time available (see Fig. 6.8) to start the compensating measure so that the safe state is reached before a hazardous event occurs:

– for a safety system operating in demand mode, this generally leaves enough time to react because θ is normally large;
– for a safety system operating in continuous mode, the failure of the safety system can itself produce a demand for a safety action. Therefore, θ = 0, the time to react is very short and the compensating measure can succeed only if it is triggered automatically.

Fig. 6.8 Available time to undertake compensating measures

Another important distinction between the failures related to safety systems is their impact with regard to the safety action (see Sect. 4.4.3.6):

– unsafe failure: detrimental impact on the safety action;
– safe failure: beneficial impact on the safety action.

It has to be remarked that a failure is not intrinsically safe or unsafe. This characteristic is not even really linked to the safety system itself: it is linked to the impact on the protected system. Therefore, the safe versus unsafe nature of a failure is a systemic characteristic related to the overall system comprising both the protection system (the safety system) and the protected system (the process itself).
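The time budget θ + PST − TTS discussed earlier in this section can be sketched as follows. The numerical values are hypothetical and expressed in seconds, θ is assumed to be counted from the instant the fault is detected, and the function names are illustrative only.

```python
def time_available(theta, pst, tts):
    """Time left to start a compensating measure: theta + PST - TTS."""
    return theta + pst - tts


def can_reach_safe_state(theta, pst, tts, start_delay):
    """True if a measure started after start_delay still ends in time."""
    return 0.0 <= start_delay <= time_available(theta, pst, tts)


# Hypothetical values in seconds, for illustration only.
PST = 30.0   # process safety time
TTS = 20.0   # time needed to reach the safe state

# Continuous mode: the failure itself creates the demand, so theta = 0.
margin_continuous = time_available(0.0, PST, TTS)
# Demand mode: the next demand is assumed to arrive one hour later.
margin_demand = time_available(3600.0, PST, TTS)

# A fast automatic reaction versus a slow manual one, in continuous mode.
automatic_ok = can_reach_safe_state(0.0, PST, TTS, start_delay=2.0)
manual_ok = can_reach_safe_state(0.0, PST, TTS, start_delay=60.0)
print(margin_continuous, margin_demand, automatic_ok, manual_ok)
```

Under these assumptions, the continuous-mode margin is only a few seconds, which is consistent with the remark that the compensating measure can then succeed only if it is triggered automatically.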

6.2.3 Safety Instrumented Systems

Figure 6.9 illustrates a simple safety instrumented system (SIS) implementing the same safety function as that implemented by the conventional safety system presented in Fig. 6.4. Again, the protected system is presented in grey and the protection system in black.

Fig. 6.9 Example of a simple instrumented overpressure protection system (labels: tank with flow in and flow out, pressure transmitter PT, logic solver LS, valve SDV)

In this safety system, the single relief valve is replaced by:

– an instrument (sensor) which measures the parameter of interest: the pressure transmitter measuring the pressure inside the tank;
– a decision means (logic solver) which takes the decision to close the valve according to the set point and the information received from the sensor;
– a final element which actually performs the safety action: the closure of the valve in order to stop the pressure increase.

The trio sensor + logic solver + final element is typical of an instrumented system as defined in IEC 61511:

Instrumented system: system composed of sensors (e.g. pressure, flow, temperature transmitters), logic solvers (e.g. programmable controllers, distributed control systems, discrete controllers), and final elements (e.g. control valves, motor control circuits) (IEC 61511 (2016), definition 3.2.36.1).

This leads to the following definition of an SIS:

Safety instrumented system, SIS: instrumented system devoted to safety.

The principle is very different between conventional and safety instrumented systems:

– the functioning of a conventional safety system is based on physical properties;
– the functioning of an SIS is based on measurement and software calculations.

The reliability of the conventional system can be obtained directly from field feedback (e.g. data collection of relief valve failures) whereas it has to be proved by detailed reliability analyses for SISs. Therefore, it is far more difficult to prove the reliability of an SIS than that of a conventional safety system.

It has to be noted that the use of sensors and logic solvers makes it possible to automatically detect a large part of the failures and to automatically trigger the compensating measures introduced in Sect. 6.2.2. The tests performed to reveal the failures of type 2 described in Sect. 6.2.2 are called "diagnostic tests". As these failures are quickly and automatically detected, compensating measures can be quickly undertaken. This is the assumption made in functional safety standards, where these failures are gathered with the self-revealed failures and considered to be safe. This may be questionable and will be discussed in Chap. 36.
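The chain sensor, logic solver and final element described in this section can be sketched as a toy model. This is only an illustration of the principle, not a representation of any real SIS: the class and function names, the pressure readings and the set point are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShutdownValve:
    """Final element: stays closed once it has been tripped."""
    closed: bool = False

    def close(self):
        self.closed = True


def logic_solver(pressure, set_point, valve):
    """Close the valve when the measured pressure reaches the set point."""
    if pressure >= set_point:
        valve.close()
    return valve.closed


# Hypothetical pressure transmitter readings (bar) and set point (bar).
readings = [8.0, 9.5, 10.2, 9.0]
SET_POINT = 10.0

valve = ShutdownValve()
trip_index = None  # index of the reading that triggered the safety action
for i, pressure in enumerate(readings):
    if logic_solver(pressure, SET_POINT, valve) and trip_index is None:
        trip_index = i

print(f"valve closed: {valve.closed}, tripped at reading {trip_index}")
```

The valve is modelled as latching: once closed it stays closed even if the pressure falls back below the set point, which is a simplifying assumption of this sketch.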


Due to the difficulty of proving that the probability of failure is acceptable, the design and the analysis of safety instrumented systems raise very specific problems. This is why specific standards have been developed to help design safety systems. The mother standard is IEC 61508 (2010). It is very general and various sectoral standards have been developed on its basis: for example, IEC 61511 (2016) is specifically devoted to the process sector, IEC 61513 (2011) to the nuclear sector, ISO 26262 (2011) to the automotive sector, ISO/TR 12489 (2013) to probabilistic calculations, etc. All these standards belong to the functional safety standards. Functional safety is defined as follows in IEC 61511 (2016):

Functional safety: part of the overall safety relating to the process and the BPCS which depends on the correct functioning of the SIS and other protection layers.

Safety integrity is the core concept used in the functional safety standards. It is defined in the mother standard IEC 61508 (2010) as the probability of […] satisfactorily performing the specified safety functions under all the stated conditions within a stated period of time. With this definition, the safety integrity is the reliability, R(t), of the safety system. This is very surprising, as it is far from capturing the meaning of this concept as it is used in the functional safety standards. Therefore, this definition should be discarded and replaced by that given in IEC 61511 (2016), which is equivalent to the following one:

Safety integrity: dependability of a safety instrumented system.

A lot of specific concepts are introduced in the functional safety standards: safety instrumented functions (SIF), safety integrity requirements, safety integrity levels (SIL), safe and dangerous failures, detected and undetected failures, etc. They are introduced and discussed in detail in Chap. 36.

6.3 Overview of Probabilistic Models

To finish the introductory part of this book, the various probabilistic models described in detail in the other parts are illustrated in Fig. 6.10, which is adapted from Fig. 3.2 developed in Chap. 3.

Fig. 6.10 Overview of probabilistic models (labels: analytical approaches, Taylor expansion, simplified formulae, static models, Boolean approaches (RBD, FT, ET), specific formulae, dynamic models, Markovian approaches, Monte Carlo simulation, behavioural approaches, Markov graphs, Petri nets, state-transition models (finite state automata), generic tools, state of the art)

• At the beginning, when computers were not available, calculations were undertaken with simplified specific formulae. Nowadays, the use of generic computer codes implementing rigorous models is preferred, but the use of simplified formulae nevertheless remains popular for calculations related to safety instrumented systems (see Chap. 36).
• The Boolean approaches have been developed since the sixties. They form a whole family of approaches: reliability block diagrams (RBDs), fault trees (FTs), event trees (ETs), etc. This family is widely used to model the logic relationships between the states of an analysed system. It makes it possible to model binary systems made of binary components (i.e. with only two states) and for this reason proves to be very effective for modelling safety systems. It is also very popular as it can be used from qualitative, semi-quantitative and purely quantitative points of view. The Boolean approaches are described in Part 3 of this book.
• The Markovian approach is the oldest technique. It was developed well before its use in the reliability engineering field. It makes it possible to model dynamic relationships which are beyond the capacity of Boolean approaches. Unfortunately, it involves a number of states which quickly increases with the number of components comprised within the modelled system. Therefore, it proves to be very effective for modelling small complex systems. It is not limited to binary items and can then be implemented to model production systems with several production levels. The Markovian approach is described in Chap. 31.
• Boolean and Markovian approaches can be combined to model systems made of dynamic (e.g. repaired) components. In this case, the functioning/dysfunctioning logic is provided by a Boolean model and the probabilistic inputs (availabilities or unavailabilities) by Markovian models. This is in fact the basis for the development of simplified formulae and, more interestingly, this leads to the RBD or FT-driven Markov processes which have proven to be very effective for handling safety systems. In this case, small Markov models are used to calculate the block availabilities of RBDs or the primary event unavailabilities of FTs.
• The behavioural models have been developed to model the behaviour of dynamic systems. The Markov models mentioned above belong to them but they are the only ones which can be processed in an analytical way (e.g. by calculating formulae). For the other behavioural models, it is necessary to switch to Monte Carlo simulation, which is described in Chap. 32. The Petri nets, which have proven to be very flexible, easy to use and well adapted to support Monte Carlo simulation, have been selected among the behavioural models. Like the Markovian models, they are not limited to binary items and prove to be very effective for modelling production systems. Petri nets are described in Chap. 33.


• Boolean and Petri net models can also be combined. Again, the Boolean model provides the functioning/dysfunctioning logic and the Petri nets the dynamic behaviour of the components comprised within the analysed system. This provides an easy way to implement dynamic RBDs or dynamic FTs (see Chap. 27). As a particular case of dynamic RBDs, the RBD-driven Petri nets prove to be very effective for handling safety systems. Using flow diagrams instead of RBDs makes it possible to model production systems (see Chap. 35).

The set of the above-mentioned models makes it possible to handle most of the industrial problems encountered when dealing with safety and production systems. Except for the simplified formulae, they are implemented in the GRIF-workshop software package, the free demo version of which (GRIF 2020) is used to perform all the probabilistic calculations and to draw the curves illustrating this book.
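The combination of Boolean and Markovian models mentioned in the list above can be sketched in a few lines. Each block availability comes from a two-state (up/down) Markov model with constant failure and repair rates, for which the steady-state availability is μ/(λ + μ). The block availabilities are then combined through series and parallel RBD logic. The block structure, the rates and the independence of the blocks are assumptions made for this illustration, and the sketch is not related to the GRIF-workshop package.

```python
from math import prod


def steady_state_availability(failure_rate, repair_rate):
    """Two-state Markov model: long-run probability of the up state."""
    return repair_rate / (failure_rate + repair_rate)


def series(availabilities):
    """All blocks are needed (blocks assumed independent)."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one block is needed (blocks assumed independent)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical rates, per hour.
a_pump = steady_state_availability(failure_rate=1e-3, repair_rate=0.1)
a_valve = steady_state_availability(failure_rate=1e-4, repair_rate=0.1)

# Hypothetical RBD: two redundant pumps in parallel, in series with a valve.
a_system = series([parallel([a_pump, a_pump]), a_valve])
print(f"pump {a_pump:.6f}, valve {a_valve:.6f}, system {a_system:.6f}")
```

Treating the blocks as independent implicitly assumes, among other things, that each component has its own repair resources. When this does not hold, the dependency has to be modelled explicitly, which is where the Markovian or Petri net approaches described above come into play.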

References

GRIF-workshop (2020) Funded and developed by TOTAL, http://grif-workshop.fr/. Accessed September 2020
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61511 Ed. 2.0 (2016) Functional safety. Safety instrumented systems for the process safety sector (3 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61513 (2011) Nuclear power plants - Instrumentation and control important to safety - General requirements for systems. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 20815 Ed. 2.0 (2018) Petroleum, petrochemical and natural gas industries. Production assurance and reliability management. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 26262 (2011) Road vehicles - Functional safety. International Organization for Standardization (ISO), Geneva, Switzerland

Part II

Risk Identification and Qualitative Analyses

Chapter 7

The Inductive Approaches

7.1 Need of the Inductive Approach

As explained in Sect. 3.4.2, the principle of the inductive (cause to effects, bottom-up) approach is to consider potential causes at subsystem/item level and to look for the effects of these causes at the overall system level. As such, the use of these methods is the only way to gain a thorough understanding of the elementary failure modes of the system. Several inductive methods have been designed over the years for that purpose. They can be ranked in two categories:

• Inductive methods designed for performing hazard identification only. A hazard is a potential source of harm and harm is a physical injury or damage to persons, property and environment (IEC 60050-903 2013). The word "potential" should be stressed as nothing unwanted can occur as long as the hazard remains potential.
• Inductive methods designed for performing hazard identification and analysis of the consequences of the failure mode of each item on the system.

All but one inductive method fall into the first category. Only the failure mode and effects analysis (FMEA) belongs to the second category, knowing that it is not often used as a hazard identification method.

7.2 Objectives of Inductive Methods

The efficiency of inductive methods does not rely on underlying mathematical models, such as Boolean formulae (fault trees, see Chap. 16; reliability block diagrams, see Chap. 15) or ordinary differential equations (Markov graphs, see Chap. 31), but on their systematic and rigorous implementation. The advanced methods mentioned in brackets above are also systematic and rigorous, but they use the outputs of the inductive methods as inputs. So, these advanced methods could provide erroneous results if the inductive methods were not properly implemented beforehand.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_7

The objectives of all hazard identification methods are to:

• Identify all the hazards that are relevant during the intended use and possible misuse of the system, and during the interactions with the outside. The purpose of this identification is not only to put a name on the hazard but rather to describe its characteristics.
• Describe the chain of events leading from the hazard to an event considered as final and unwanted.
• Rank the severity of the final event.
• Propose measures to eliminate or mitigate this final event.

As its name suggests, the objectives of the FMEA are the following:

• To identify the effects on the system of the failure modes of each item belonging to the system.
• To identify the detection means of each failure mode in order to be able to quickly undertake compensatory/mitigating measures.
• To rank the failure modes according to their severity.
• To identify the most critical failure modes for the FMECA.

It is to be stressed that, before analysing how a system fails, it is of utmost importance to understand how it works. Beyond their efficiency in analysing the impact of individual component failures, all these approaches provide powerful means to deepen the analyst's knowledge of how a system works and fails. This is invaluable with regard to further analysis of industrial systems needing to take combinations of failures into consideration.

Implementing such methods is time-consuming and expensive in human resources. So, although it is not an objective of any of these methods, it would be regrettable to miss the opportunity to carefully analyse each event of the chain of events in order to identify the existing means for detecting it as soon as possible and to propose additional means whenever necessary.

7.3 Overview of the Main Inductive Methods

Only the main inductive methods used by the authors of this book are described in this part. Three other inductive methods are presented in Chap. 11: the What-if? method, the checklist method and the HAZID method. Many other methods exist, see e.g. IEC 31010 (2019).

Fig. 7.1 Inductive approaches of PHA, HAZOP and FMEA methods (one column per method)
PHA: working section; hazardous element; event causing hazardous situation; hazardous situation; event causing potential accident; potential accident; event early detection; accident preventive measures.
HAZOP: node; physical parameter; deviation; causes; consequences; event early detection; safeguards.
FMEA: item; failure mode; causes; consequences; failure mode detection; compensating provisions.

7.3.1 Similar Approaches

The three main inductive methods, the preliminary hazard analysis (PHA), the hazard and operability (HAZOP) study and the FMEA, are systematic methods sharing the same approach (Fig. 7.1):

• The system under study is broken down into specific elements¹:
– Convenient working sections for the PHA method.
– Nodes for the HAZOP method.
– Items for the FMEA method.
• For each of these specific elements:
– The hazardous elements (of each working section) are identified and then the events (triggering events) causing each hazardous element to become a hazardous situation are identified for the PHA method.
– For each physical parameter (e.g. flow) characterizing each node, the possible deviations (e.g. more flow) are considered in turn for the HAZOP method.
– Failure modes (of each item) are determined for the FMEA method.
• Causes are determined:
– In two steps for the PHA method: characterization of the hazardous situation and identification of the events (triggering events) causing each hazardous situation to become a potential accident (i.e. the hazardous situation becomes a condition of the accident).
– For the deviations for the HAZOP method.
– For the failure modes for the FMEA method.
• Consequences are determined:
– For the combination of hazardous situation and triggering event for the PHA method (these consequences are named potential accident).
– For the deviations of physical parameters for the HAZOP method.
– For the failure modes of items for the FMEA method.
• Proposals are made whenever necessary to reduce the causes or the consequences. They are called:
– Accident prevention measures for the PHA method.
– Safety measures for the HAZOP method.
– Compensating provisions for the FMEA method.
• Means of detection (alarms, parameter monitoring, etc.) of each event of the chain of events leading to the final consequences should be systematically identified and recorded. This activity is referred to as "failure mode detection" for the FMEA method. Proposals are made if nothing was planned for this purpose.

¹ The specific wording of each method is used.
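The common structure shared by the three methods can be sketched as a single worksheet record. The field names, the `VOCABULARY` table and the example rows below are hypothetical and are not a worksheet format prescribed by any of the methods. Only the method-specific wording is taken from the list above.

```python
from dataclasses import dataclass, field

# Method-specific wording, taken from the list above.
VOCABULARY = {
    "PHA": {"element": "working section", "initiator": "hazardous element",
            "measures": "accident prevention measures"},
    "HAZOP": {"element": "node", "initiator": "deviation",
              "measures": "safety measures"},
    "FMEA": {"element": "item", "initiator": "failure mode",
             "measures": "compensating provisions"},
}


@dataclass
class WorksheetRow:
    """One line of an inductive-analysis worksheet (illustrative layout)."""
    method: str
    element: str
    initiator: str
    causes: list = field(default_factory=list)
    consequences: list = field(default_factory=list)
    detection: list = field(default_factory=list)
    measures: list = field(default_factory=list)


def rows_without_detection(rows):
    """Rows for which a means of detection still has to be proposed."""
    return [row for row in rows if not row.detection]


# Hypothetical example rows.
rows = [
    WorksheetRow("FMEA", "pump P1", "fails to start",
                 causes=["motor winding failure"],
                 consequences=["loss of one production train"],
                 detection=["low flow alarm"],
                 measures=["start the stand-by pump"]),
    WorksheetRow("HAZOP", "tank inlet line", "more flow",
                 causes=["control valve stuck open"],
                 consequences=["tank overpressure"]),
]

todo = rows_without_detection(rows)
for row in todo:
    words = VOCABULARY[row.method]
    print(f"{row.method}: no detection recorded for {words['initiator']} "
          f"'{row.initiator}' on {words['element']} '{row.element}'")
```

Keeping one record type for all three methods reflects the shared approach of Fig. 7.1. A real study would normally use the dedicated worksheet of each method.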

7.3.2 Area of Implementation

Inductive methods were designed over the years in several industrial sectors according to their own needs. An overview of their origin and use is given below:

• The PHA method (Chap. 8) was designed for the needs of the aeronautic and military industries. It is now used in any industry.
• The HAZOP method (Chap. 9) was designed for the needs of the process sector (e.g. chemical industry and oil and gas industry) and it is still used only in these industries.
• The FMEA method (Chap. 10) is the most widely used in all industries.
• The What-if? method as well as the checklist method (Chap. 11) can be used in any industry.
• The HAZID method (Chap. 11) is often used in addition to the HAZOP method for the identification of hazards not caused by process deviations (e.g. load drop).


7.3.3 Study Team

All inductive methods are systematic critical examination methods. Except for the FMEA method, all the inductive methods are carried out by a multidisciplinary team working under the guidance of a team leader. As all the questions raised during the working sessions are to be answered by the members of the team, a wide range of competences should be represented. Although a FMEA is sometimes performed by a group, it is commonly performed by a single reliability analyst, the results being reviewed by experts of the system.

7.3.4 Use Within System Life Cycle

The inductive methods are used throughout the life of a system and updated whenever necessary. However (Gould 2005):
• The PHA method can be implemented early in the project (at the conceptual phase) as it does not require detailed information. This method is not useful when the design is nearly frozen.
• The FMEA method is not suitable for the earliest phases as not enough information is available on the items being analysed.
• The HAZOP method requires the details of the instrumentation and of the valving arrangement, which are not available at the conceptual phase of the project.
• The What-if? method, the checklist method and the HAZID method are suitable for all phases. However, as these methods are not as structured as the other three, lack of information in the early phases can easily prevent useful conclusions from being reached.

References

Gould J (2005) Review of hazard identification techniques, HSL/2005/58. HSE, Sheffield, UK
ISO/IEC 31010 (2019) Risk management - Risk assessment techniques. International Organization for Standardization and International Electrotechnical Commission, Geneva, Switzerland
IEC 60050-903 (2013) Ed. 1.0 International Electrotechnical Vocabulary (IEV) - Part 903: Risk assessment. Geneva, Switzerland

Chapter 8

Preliminary Hazard Analysis (PHA)

8.1 Description of the Method

8.1.1 Presentation of the Method

Several approaches (Boeing 1970) can be used to carry out an initial analysis of the hazards of a system, which is the first safety analysis of that system. When this analysis is implemented using worksheets with specific entries, it is named the preliminary hazard analysis (PHA). It is, by far, the most popular approach. An inductive approach is implemented for identifying and characterizing the potential accidents which can be generated by the existing hazardous elements of the system.

8.1.2 Purposes of the Method

The purposes of the PHA method are to:
• Identify hazardous elements, hazardous situations and potential accidents.
• Determine the significance of their potential effect.
• Establish design and procedural measures to eliminate or to control these identified hazardous situations and potential accidents.

8.1.3 PHA Procedure

A PHA study is performed in three phases:
• Phase 1: preparation of the study.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_8


– Collection of information on the system (mission, phases, drawings, etc.).
– Collection of hazard information on previous and similar systems, including existing checklists.
– Breaking down of the system into convenient working sections.
– Selection of the team leader and team members.
• Phase 2: implementation of the method. The PHA method is implemented, using the form shown in Fig. 8.1 (Boeing 1970), in nine steps.
– For each working section: identification of relevant hazardous elements. A hazardous element is an element of the system which is inherently hazardous. Hazardous elements can be energy sources, phases of the system (e.g. airplane take-off), etc. The hazardous element can be inherently hazardous or hazardous in combination with other elements. Hazardous elements are identified through the use of checklists, from experience, engineering judgment and intuition.
– For each hazardous element, identification of the triggering events causing the hazardous element to become a hazardous situation. These triggering events are circumstances, unwanted events, failures or errors.

Fig. 8.1 Typical PHA worksheet
[Worksheet layout. Header fields: Sub-system / Function; Phase. Columns: Hazardous element; Event causing hazardous situation; Hazardous situation; Event causing potential accident; Potential accident; Effect; Severity class; Accident prevention measures; Validation.]


Fig. 8.2 Sequence of steps of the PHA method
[Diagram: Hazardous element AND Triggering event → Hazardous situation; Hazardous situation AND Triggering event → Potential accident.]

– Description of the situations, named hazardous situations, which could result from the interaction of the system and of each hazardous element in the system. A hazardous situation is any item or function state which, when in a system environment, constitutes a threat or jeopardizes something of value, within or related to that system. Hazardous situations are identified through the use of checklists, from experience, engineering judgment and intuition.
– For each situation, identification of the triggering events which could cause the hazardous situation to become a potential accident. These triggering events are unwanted events or failures.
– Description of potential accidents. The potential accidents which could result from the identified hazardous situations are described. Figure 8.2 shows the sequence of events linking a hazardous element to a potential accident.
– Description of the possible effects of the potential accident.
– Ranking of the severity of the accident using a pre-defined severity ranking.
– Description of accident prevention measures. The aim of this step is to establish the recommended measures to eliminate or to control identified hazardous situations and potential accidents. This includes the verification that detection means (e.g. sensors) are available and, if not, the identification of the new detection means to be installed.
– Recording validated preventive measures and registering the status of the remaining recommended preventive measures.
• Phase 3: results of the study. The results of a PHA study are the documentation of the identified hazardous elements and potential accidents, of the validated recommended measures and of areas requiring further investigation.

Example 8.1
For a railway transportation system, the high voltage of the catenary is a hazardous element. Triggering events for this hazardous element can be (Mortureux 2016):
• Fall of the catenary on the ground. The sequence of events up to a potential accident is given in the figure below.


[Figure: High voltage catenary AND Fall of the catenary on the ground → Catenary on the ground; Catenary on the ground AND People walking along the railways → People walking on the fallen catenary.]

• People walking alongside carrying long metallic ladders. The sequence of events up to a potential accident is given in the figure below.
[Figure: High voltage catenary AND People walking alongside carrying long metallic ladders → Ladder in contact with catenary.]
• People walking on the roof of a wagon (voltage being on). The sequence of events up to a potential accident is given in the figure below.
[Figure: High voltage catenary AND People walking on the roof of a wagon → People in contact with catenary.]

The last two figures show that a single triggering event can be enough to cause an accident.
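The three chains of Example 8.1 can be written down as data, which makes it easy to see which ones need only one triggering event. The following is a minimal sketch, not part of the PHA method itself: the class names `Step` and `Chain` and the helper `describe` are hypothetical, and the text strings are copied from the figures of the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    """One AND gate of Fig. 8.2: a triggering event and the state it leads to."""
    triggering_event: str
    resulting_state: str


@dataclass(frozen=True)
class Chain:
    """A hazardous element followed by one or more triggering events."""
    hazardous_element: str
    steps: Tuple[Step, ...]


ELEMENT = "High voltage catenary"

chains = [
    Chain(ELEMENT, (
        Step("Fall of the catenary on the ground", "Catenary on the ground"),
        Step("People walking along the railways",
             "People walking on the fallen catenary"),
    )),
    Chain(ELEMENT, (
        Step("People walking alongside carrying long metallic ladders",
             "Ladder in contact with catenary"),
    )),
    Chain(ELEMENT, (
        Step("People walking on the roof of a wagon",
             "People in contact with catenary"),
    )),
]


def describe(chain: Chain) -> str:
    """Render a chain as 'element AND [event] -> state AND [event] -> state'."""
    parts = [chain.hazardous_element]
    for step in chain.steps:
        parts.append(f"AND [{step.triggering_event}] -> {step.resulting_state}")
    return " ".join(parts)


# Chains in which a single triggering event is enough to reach the final state.
single_event = [c for c in chains if len(c.steps) == 1]

for chain in chains:
    print(describe(chain))
```

Keeping each AND gate as its own record mirrors the worksheet columns of Fig. 8.1, where the event causing the hazardous situation and the event causing the potential accident are recorded separately.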

8.1.4 Resources for the Method

Hazardous Element Checklists
The relevant hazardous elements can be selected from existing checklists, generic type or industry specific (see Table 8.1), using the experience of designers/users of the system and hazard experience on similar systems. These checklists cannot be considered as definitive as the use of new material/items/procedures is likely to generate new hazardous elements (and hazardous situations). The upper table in Table 8.1 is a checklist for items and the lower table is a checklist for operations. The list of hazards to consider is named the preliminary hazard list (PHL).

Table 8.1 PHA. Example of checklists for hazardous elements

Hazardous element (example 1): Battery; Combustible; Explosive; Heating system; Pressurized vessel; Pump; Reactor; Rotating machine; Tensioned spring

Hazardous element (example 2): Welding; Cleaning; Extreme temperature operations; Propellant loading; Proof test of major components; High energy pressurization; Operational validation

Hazardous Situations Checklists
Checklists are also used for defining the hazardous situations.

Severity Ranking
The assessment of the severity of the potential accident is not obvious as a choice has to be made between the most probable case, the worst conceivable case and the worst credible case. Each case has its pros and cons, so the principle for the selection of the severity should be decided before initiating the study. In addition, in order to have a uniform ranking of the consequences of the accidents, severity tables are always defined before starting the PHA. The one given in Table 8.2 relates to safety only, but consequences on the success of the mission of the system, the property, the environment, etc. can also be ranked in additional tables.

Table 8.2 PHA. Example of Severity Table (railways transportation)

Class   Consequence severity   Effects on travellers
I       Minor                  None
II      Moderate               Small discomfort
III     Serious                Major discomfort
IV      Major                  Light injuries
V       Catastrophic           Death


8.1.5 Comments

It has to be stressed that applying the definitions of Sect. 8.1.3 to an actual system is not easy as they sometimes overlap (Lievens 1976). It is specified in Sect. 8.1.3 that the aim of a PHA study is to establish the recommended measures to eliminate or to control identified hazardous situations and potential accidents. However, the hazardous elements are part of the system and the potential accidents are the final events. So, the true aim (Baudrin et al. 2009) of a PHA study is to establish the recommended measures to eliminate or to control:
• identified triggering events causing hazardous situations,
• identified triggering events causing hazardous situations likely to become potential accidents.
As outlined in Sect. 7.2, the opportunity has to be taken to identify whether (and how) each triggering event and each hazardous situation can be detected (if not already planned). Proposals are to be made accordingly. The PHA method was the first hazard identification method to be specified as part of a system safety program in MIL-STD-882E (2012), first issued in 1969.

8.2 Other Related Approaches

8.2.1 Gross Hazard Analysis

The gross hazard analysis (GHA) is a simplified version of the PHA. The analysis starts with the identification of the hazardous situations, then:
• the causes or combination of causes leading to the hazardous situations are determined;
• the consequences of the hazardous situations are assessed;
• the preventive measures are identified;
• the severity of the consequences is ranked (see Table 8.2).
So, the hazardous elements and the potential accidents are not specifically identified and described. The worksheet in Fig. 8.3 is used for performing a GHA.

8.2.2 Chemical Industry

Reaction between products is the core of the chemical industry. So, the chemical industry designed a specific preliminary hazard analysis method for analysing these reaction hazards (UIC 1980).


Fig. 8.3 Typical worksheet for gross hazard analysis
[Worksheet layout (GHA WORKSHEET). Columns: Hazardous situation; Causes; Consequences; Severity class; Preventive measures; Validation.]

First, detailed and validated information on each product (raw materials, intermediate and final products) used on the plant is collected:
• Physical properties: state at 20 °C (solid, liquid, gaseous), boiling temperature, critical pressure, etc.
• Specific heats of vaporisation, of combustion, etc.
• Flammable characteristics: flammable limits (in air, at the operating conditions), flash point, self-ignition temperature, etc.
• Extinguishing means.
• Resistance to corrosion, shocks, etc.
• Toxicity.
• Etc.
Then, detailed and validated information on each process used on the plant is collected:
• Operating conditions.
• Risks during the reaction: volume of gas produced in case of decomposition, possibility of delay in the start-up of the reaction, etc.
• Measures for controlling the increase of reaction speed in case of temperature increase, pressure increase, etc.
• Measures for preventing erroneous operations, for controlling failure effects, etc.
• Etc.
Finally, the effect of the combination of each product used on the plant with any other product is systematically analysed. Then, the procedure is also applied to the resulting new product. The information is collected through literature survey, reaction screening and experimental testing.


Fig. 8.4 Typical PHA with frequency worksheet
[Worksheet layout ((PROBABILISTIC) PHA WORKSHEET). Header fields: System element or activity; Reference. Columns: Hazard; Hazardous event; Cause; Consequence; Risk (Freq., Cons., RPN); Risk reducing measure; Comment.]

8.2.3 Preliminary Hazard Analysis with Frequencies

The original PHA method evolved with time and, as for several other methods, the frequency of occurrence of a potential accident was introduced: columns were then added to the original form. This PHA with frequencies (sometimes qualified as preliminary risk analysis) method is implemented using forms such as the one shown in Fig. 8.4 (Rausand 2011) in six steps (the system considered to illustrate each step is an oil tank located within a retention basin):
• Identification of hazards. According to ISO/IEC guide 51 (2014), a hazard is a potential source of harm, and a harm is an injury or damage to the health of people, or damage to the property or the environment. Checklists are also commonly used to identify the hazards. For example, the tank filled with oil is a hazard.
• Identification of the hazardous events. For example, the opening of the manual drain valve at the bottom of the tank is a hazardous event.
• Determination of the causes of each hazardous event.


Possible causes of the opening in error of the valve are lack of attention and a poor tag on the drain valve.
• Determination of the consequences of the occurrence of each hazardous event. Possible consequences of the opening of the valve are:
– Release of oil and non-ignited oil pool within retention basin.
– Release of oil and ignited oil pool.
• Assessment of the characteristics of the risks generated by the hazards.
– Determination of the frequency of each consequence.
– Determination of the severity of each consequence.
– Calculation of the risk priority number, RPN, of each consequence. The risk priority number, RPN, is defined as: RPN = S × F where:
1. S is a rating of the severity of the consequence.
2. F is a rating of the frequency of occurrence of the consequence.
The numbers for S and F are determined using rating tables in which the level for each parameter is associated with a description. If the rating scale of each of these two parameters runs from 1 to 10 (10 being in any case associated with the worst value), the overall RPN ranges from 1 to 100.
• Proposal of risk-reducing measures. This part of the study is usually not the most important one: the focus is definitely on the identification of hazards and hazardous events.
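The RPN calculation above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `risk_priority_number` is hypothetical, and the (S, F) ratings given to the two consequences of the drain valve opening are invented for the example, not taken from any rating table.

```python
def risk_priority_number(severity: int, frequency: int, scale_max: int = 10) -> int:
    """Return RPN = S x F after checking both ratings lie on the 1..scale_max scale."""
    for name, rating in (("severity", severity), ("frequency", frequency)):
        if not 1 <= rating <= scale_max:
            raise ValueError(f"{name} rating {rating} is outside 1..{scale_max}")
    return severity * frequency


# Hypothetical (S, F) ratings for the two consequences of the valve opening.
consequences = {
    "Release of oil and non-ignited oil pool within retention basin": (4, 5),
    "Release of oil and ignited oil pool": (9, 2),
}

# Rank the consequences from the highest RPN to the lowest.
ranked = sorted(
    ((risk_priority_number(s, f), name) for name, (s, f) in consequences.items()),
    reverse=True,
)

for rpn, name in ranked:
    print(f"RPN = {rpn:3d}  {name}")
```

With these invented ratings the frequent but less severe consequence ends up with a slightly higher RPN than the rare but severe one, which shows how the product of two ratings can mask a high severity.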

8.3 Use with Other Methods

With or without considering the frequencies of occurrence, the PHA remains a hazard identification method. As such it is used for the identification of systems which deserve to be studied in detail with other modelling methods.

8.4 Worked Example 8.1

The two-phase (liquid/gas) high pressure (HP) separator shown in Fig. 8.5 (FC = fail closed, FO = fail open) processes the liquid (oil and water)-gas mixture coming from the gas field and transfers the effluents to downstream units: liquid to the MP (medium pressure) separator and gas to the HP gas compressors:
• The characteristics (e.g. flowrate) of the incoming flow are set by the choke valve.
• The gas flow is controlled by a pressure control loop made up of a pressure transmitter (PT) acting through the plant control unit on:
– pressure control valve 1 (PV1) on the pipeline to the HP gas compressors;
– PV2 on the flare header.
• The liquid flow is controlled by a level control loop made up of a level transmitter (LT) acting on a level control valve (LV) through the plant control unit.

Fig. 8.5 System used for worked example 8.1
[Diagram labels: Oil / Water / Gas flow; Choke valve; SDV1; Two-phase HP separator; PT; PSH; LT; LSH; PSV; PV1 (Gas flow, To HP gas compressors); PV2 (To flare); LV (Oil / Water flow, To MP separator); SDV2; Plant control unit; PSS control unit; valve fail positions marked FC.]

The HP separator does not withstand the maximum pressure from the field. In case of:
• high level in the separator, a level sensor high (LSH) orders the upstream shutdown valve (SDV1) and downstream SDV2 to close through the process safety system (PSS) control unit;
• high pressure in the separator, a pressure sensor high (PSH) orders SDV1 and SDV2 to close through the PSS control unit.¹
The working section considered is the HP separator with the mixture flowing through it. The hazardous element is then the flow of high-pressure flammable mixture of gas, oil and water.

¹ On a real plant, the BDV shown in Fig. 8.7 is not manually activated but connected to another safety system: the emergency shutdown (ESD) system, not considered for the purpose of worked example 8.1.


PHA WORKSHEET
Phase: on-duty
Sub-system / Function: HP separator
Hazardous element (all rows): High flow, high pressure flammable mixture of gas/oil/water

Row 1
  Event causing hazardous situation: Stop of gas flow out of the separator (HP compressors stop, etc.)
  Hazardous situation: High pressure in the separator
  Event causing potential accident: PSH/SDV1 fail to close; PSV fails to open (PSV or PV2 cannot handle full flow)
  Potential accident: Separator rupture
  Effects: Potential several deaths, oil pool
  S: V
  Accident prevention measures: Install a BDV activated by operator in control room
  Validation: Agreed and implemented: Fig. 8.7

Row 2
  Event causing hazardous situation: Stop of liquid flow out of separator (SDV2 spurious closure, etc.)
  Hazardous situation: High liquid level in the separator
  Event causing potential accident: PSH/SDV1 fail to close
  Potential accident: Liquid to flare, ignited droplets at flare tip level
  Effects: Potential workers burnt if within flare tip area
  S: IV
  Accident prevention measures: Procedure to prevent workers from being under the flare tip in operation
  Validation: Agreed: procedure to be written and implemented

Row 3
  Event causing hazardous situation: High increase of mixture flow in the separator (choke valve fails full open, etc.)
  Hazardous situation: High pressure in the separator
  Event causing potential accident: PSH/SDV1 fail to close; PSV fails to open
  Potential accident: Separator rupture
  Effects: Potential several deaths, oil pool
  S: V
  Accident prevention measures: Install a BDV activated by operator in control room
  Validation: Agreed and implemented: Fig. 8.7

Fig. 8.6 Filled in PHA worksheet for worked example 8.1

Three triggering events can cause the hazardous element to generate hazardous situations:
• stop of gas flow out of the separator;
• stop of liquid flow out of the separator;
• high increase of mixture flow within the separator.
Perform a PHA using the form given in Fig. 8.1 and the severity ranking of Table 8.2.
The filled-in PHA worksheet is given in Fig. 8.6 (“S” in the seventh column means severity class). Figure 8.7 provides the drawing of the HP separator including the accident prevention measure (installation of a blowdown valve, BDV) proposed in Fig. 8.6.
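As a small illustration of how the worksheet results can be exploited, the rows of Fig. 8.6 can be held as records and sorted against the severity classes of Table 8.2. This is a sketch only: the dictionary layout and the names `SEVERITY`, `RANK` and `rows` are assumptions made for the example, while the text values are those of the worked example.

```python
# Severity classes of Table 8.2: class -> (consequence severity, effects on travellers).
SEVERITY = {
    "I": ("Minor", "None"),
    "II": ("Moderate", "Small discomfort"),
    "III": ("Serious", "Major discomfort"),
    "IV": ("Major", "Light injuries"),
    "V": ("Catastrophic", "Death"),
}

# Numeric rank of each class, following the order of the table (I = 1 ... V = 5).
RANK = {cls: position for position, cls in enumerate(SEVERITY, start=1)}

# The three rows of the filled-in worksheet (only the fields needed here).
rows = [
    {"event": "Stop of gas flow out of the separator",
     "accident": "Separator rupture", "S": "V"},
    {"event": "Stop of liquid flow out of the separator",
     "accident": "Liquid to flare, ignited droplets at flare tip level", "S": "IV"},
    {"event": "High increase of mixture flow within the separator",
     "accident": "Separator rupture", "S": "V"},
]

# Find the rows carrying the worst severity class of the study.
worst_rank = max(RANK[row["S"]] for row in rows)
worst_rows = [row for row in rows if RANK[row["S"]] == worst_rank]

for row in worst_rows:
    label = SEVERITY[row["S"]][0]
    print(f"{row['S']} ({label}): {row['event']} -> {row['accident']}")
```

Both class V rows lead to the same potential accident (separator rupture), which is consistent with the single prevention measure recorded for them in the worksheet.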

Fig. 8.7 System of worked example 8.1 including proposed modifications
[Diagram: the system of Fig. 8.5 with a blowdown valve (BDV, marked FO, opened by manual action) added on the line to flare.]

References

Baudrin D, Dadoun M, Desroches A (2009) L’analyse préliminaire des risques. Principes et pratiques, Hermes-Lavoisier, Paris, France
Boeing (1970) System safety analytical technology - preliminary hazard analysis, Boeing D2113072-1 Rev A, Seattle, USA
ISO/IEC guide 51 (2014) Safety aspects - Guidelines for their inclusion in standards. International Organization for Standardization and International Electrotechnical Commission, Geneva, Switzerland
Lievens C (1976) Sécurité des systèmes, Cepadues-Editions, Toulouse, France, pp 123–130
MIL-STD-882E (2012) Standard practice: system safety, US Department of Defense, Washington, USA, pp 44–48
Mortureux Y (2016) Analyse Préliminaire des Risques, SE 4010V1. Techniques de l’ingénieur, Paris, France
Rausand M (2011) Risk assessment: theory, methods, and applications, 1st edn. Wiley and Sons, London, UK, pp 223–232
UIC (1980) Les cahiers de sécurité: les différentes méthodes d’analyse de sécurité dans la conception d’une installation chimique. 1ère méthode : l’analyse préliminaire des risques, UIC (Union des Industries Chimiques (France) Chemical Industrial Association), Paris, France

Chapter 9

Hazard and Operability Study (HAZOP)

9.1 Description of the Method

9.1.1 Presentation of the Method

A hazard and operability (HAZOP) study is carried out by a multidisciplinary team which, acting under the guidance of a team leader, reviews the (continuous or batch) process to discover potential hazards and operability problems. The work sessions, carefully organized by the team leader, are essentially critical examinations of deviations from physical operating parameters (pressure, flow, etc.) using specific pre-defined guide words.

9.1.2 Purposes of the Method

The HAZOP method is based on the assumption that the plant under analysis is in a safe state as long as the physical operating parameters remain within a given range defined by design. Thus, the aim of a HAZOP study is to review all possible deviations (from the design intentions) for identifying whether the plant is protected enough and to propose additional safeguards whenever necessary.

9.1.3 HAZOP Procedure

A HAZOP study is performed in three phases:
• Phase 1: preparation of the study.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_9


– Selection of the team leader.
– Constitution of the multidisciplinary study team.
– Collection of information on the plant: updated piping and instrumentation diagrams (P&IDs), piping schedule, safety valve rating, etc.
– Breakdown of the plant into homogeneous sections (more or less same pressure, more or less same temperature, etc.), called parts or nodes, for which the design intention can be defined without ambiguity.
• Phase 2: implementation of the method. For each node, the study team under the chairmanship of the team leader applies the HAZOP method using the form shown in Fig. 9.1 in several steps.
– Step 1. Select one node.
– Step 1.1. Explain the design intention.
– Step 1.2. Select the first physical operating parameter (e.g. pressure):
– Step 1.2.1. Apply the first guide word (such as “More” in Table 9.1) to the parameter (which gives “more pressure than expected”):
– Identify the possible causes of the deviation.

Fig. 9.1 Typical HAZOP worksheet
[Worksheet layout (HAZOP WORKSHEET). Header field: Part. Columns: DEVIATION; POSSIBLE CAUSES; CONSEQUENCES; SAFEGUARDS; ACTION REQUIRED.]

Table 9.1 HAZOP. Main guide words and physical parameters considered

Parameter       Guide word
Pressure        More; Less
Flow            More; Less (none); Reverse
Temperature     More; Less
Level           More; Less
Concentration   More; Less; Part of; Contamination
N. A.           Other: Start-up; Maintenance; Static electricity; Utility failure; Other than

– Assess the severity of the consequences of the deviation.
– Define the existing safeguards. The safeguards are the actions against the consequences of the deviation (as such they include the detection of the deviation).
– Determine whether an action (for improvement or investigation) is required.
– Define the required action if it is obvious; otherwise recommend that a study be made.
– Step 1.2.2. Apply the second guide word (e.g. “Less”) to the parameter (which gives “less pressure than expected”) and rerun the above step.
– Apply all other guide words and rerun Step 1.2.1.
– Step 1.3. Select the second physical parameter (e.g. flow) and rerun Step 1.2.
– Select all other parameters and rerun Step 1.2 in sequence. The sequence of steps is shown in Fig. 9.2.
– Step 2 to Step n. Select all other nodes and rerun Step 1 in sequence.
• Phase 3: results of the study. The team leader reviews the proposed actions and builds an action plan.


Fig. 9.2 Sequence of steps of the HAZOP method
[Diagram: Physical parameter + Guide word = Deviation. Pressure + More = More Pressure; Pressure + Less = Less Pressure; Flow + More = More Flow; Flow + Less = Less Flow; Flow + Reverse = Reverse Flow; etc. Each deviation is then submitted to the inductive analysis: causes, consequences, existing safeguards, proposals.]
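The combination of parameters and guide words shown in Fig. 9.2 is a simple nested loop. The sketch below is only an illustration, assuming the subset of parameters and guide words that appears in the figure; the names `GUIDE_WORDS`, `deviations` and `blank_worksheet` are hypothetical, and the empty columns follow the worksheet of Fig. 9.1.

```python
# Parameters and applicable guide words, limited to the ones shown in Fig. 9.2.
GUIDE_WORDS = {
    "Pressure": ["More", "Less"],
    "Flow": ["More", "Less", "Reverse"],
}

# Columns of the worksheet of Fig. 9.1 that the team fills in for each deviation.
COLUMNS = ("possible_causes", "consequences", "safeguards", "action_required")


def deviations(table):
    """Yield each deviation as 'guide word + parameter', parameter by parameter."""
    for parameter, guide_words in table.items():
        for guide_word in guide_words:
            yield f"{guide_word} {parameter}"


def blank_worksheet(part, table):
    """Return one empty worksheet row per deviation for the given part (node)."""
    return [
        {"part": part, "deviation": deviation, **{column: "" for column in COLUMNS}}
        for deviation in deviations(table)
    ]


worksheet = blank_worksheet("2-phase HP separator", GUIDE_WORDS)
for row in worksheet:
    print(row["deviation"])
```

Generating the rows up front only prepares the sessions; identifying causes, consequences and safeguards for each deviation remains the work of the study team.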

9.1.4 Resources for the Method

Multidisciplinary Study Team
The main characteristic of the team members is their professional experience. They should cover a large range of disciplines of the plant, so at least the following specialists should be part of the review:
• A process specialist.
• An instrumentation specialist.
• A safety specialist.
• An operation specialist. If the plant is still under design, this specialist should be mobilised from another plant of an equivalent design.

Team leader
The role of the team leader (the chairman, the facilitator) is of vital importance as he must understand what is going on as well as keep the team efficient throughout the HAZOP sessions. One of his main characteristics is that he should not have been closely associated with the project. His job is to:
• Collect the information.
• Prepare the breakdown of the plant.
• Specify the specialities of the members of the team.
• Train the members of the team in using the HAZOP method.
• Lead the working sessions: propose deviations, motivate all members of the team to participate in the discussions, try to reach an agreement between all members in order to take a collective decision (if this is not possible, the disagreement is resolved outside the working session).


• Record the working sessions.
• Issue the report: set of worksheets and list of all actions to be done (or studies to be carried out).
• Perform (if asked to do it) the follow-up of the actions.

Guide words
The main characteristic of the HAZOP method is the use of guide words to generate the possible design deviations and to stimulate the team members to identify their possible causes and consequences and to determine whether the plant is well protected by existing safeguards (IEC 61882 2016). These guide words are MORE, LESS, AS WELL AS, OTHER THAN, etc. Table 9.1 gives some of them and indicates to which physical parameters they can be applied.

9.1.5 Comments

The HAZOP method is mainly devoted to the identification of safety problems (i.e. hazards); operability problems are not often considered. Sometimes:
• The team leader is assisted by a secretary. This aid is no longer mandatory if computer packages are used.
• Experts do not agree on the possible consequences or actions to be implemented. Such discrepancies are not solved during a working session but through an additional study.
As outlined in Sect. 7.2, the opportunity has to be taken to identify whether (and how) each possible cause of each deviation can be detected (if not already planned). Proposals are to be made accordingly.

9.2 Quantified HAZOP

For a long time, the judgment of the team members was considered sufficient to decide whether additional safeguards were needed or not. It is now common practice to assess the frequency, the severity and then the risk of the possible deviations for:
• consequences before considering existing safeguards;
• consequences considering existing safeguards;
• consequences considering existing safeguards and proposed additional safeguards.
This means that, prior to initiating the HAZOP study, the owner of the plant must:


• Rank the severities of the consequences of the deviations on human beings, environment and equipment (at least). As examples:
– Destruction of equipment worth less than 100 k€ can be considered as a light consequence and be ranked as 1.
– Destruction of equipment worth less than 10 M€ (but more than 0.5 M€) can be considered as a major consequence and be ranked as 4.
• Using existing reliability data handbooks, build an item failure probability/human error frequency database. Examples of such data are:
– Frequency of failure of a control loop = 0.1/year.
– Probability of an operator not performing a routine action = 10⁻³/action.
• Design a risk matrix (Chap. 2).
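The three assessments listed above (before safeguards, with existing safeguards, with existing and additional safeguards) can be illustrated with a rough frequency calculation. This is a sketch under strong assumptions, not the method prescribed by the text: it assumes that the safeguards fail independently, so that the frequency of the consequence is the initiating frequency multiplied by the probability of failure of each safeguard. The control loop failure frequency is the one quoted above; the safeguard failure probabilities and the function name `scenario_frequency` are hypothetical.

```python
def scenario_frequency(initiating_frequency_per_year, safeguard_failure_probabilities=()):
    """Frequency (per year) of the consequence, assuming independent safeguards."""
    frequency = initiating_frequency_per_year
    for probability in safeguard_failure_probabilities:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("a probability must lie between 0 and 1")
        frequency *= probability
    return frequency


CONTROL_LOOP_FAILURE_PER_YEAR = 0.1  # value quoted in the text
EXISTING_SAFEGUARDS = [1e-2]         # hypothetical probability of failure on demand
ADDITIONAL_SAFEGUARDS = [1e-1]       # hypothetical probability of failure on demand

before = scenario_frequency(CONTROL_LOOP_FAILURE_PER_YEAR)
with_existing = scenario_frequency(CONTROL_LOOP_FAILURE_PER_YEAR, EXISTING_SAFEGUARDS)
with_additional = scenario_frequency(
    CONTROL_LOOP_FAILURE_PER_YEAR, EXISTING_SAFEGUARDS + ADDITIONAL_SAFEGUARDS
)

print(f"Before safeguards:          {before:.1e} per year")
print(f"With existing safeguards:   {with_existing:.1e} per year")
print(f"With additional safeguards: {with_additional:.1e} per year")
```

Each of the three frequencies would then be combined with the severity rank of the consequence and placed in the risk matrix defined by the owner of the plant.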

9.3 HACCP

A method widely used in the food industry makes use of the principle of HAZOP: the hazard analysis and critical control points (HACCP). It is a systematic preventive approach to food safety which addresses the biological, chemical and physical hazards in production processes that can cause the finished product to be unsafe. Design measures are then proposed to reduce these risks to an acceptable level. The first step of the HACCP method is a hazard analysis which is conducted in a similar way to that of the HAZOP method. Indeed, the hazard analysis consists in asking a group of specialists a series of questions (on the microbial content of the food, on the ingredients used, etc.) which are appropriate to the process under consideration. Its implementation is recommended by ISO 22000 (2018).

9.4 Worked Example 9.1

The system shown in Fig. 8.7 of worked example 8.1 is used for worked example 9.1. Perform a HAZOP study on node “HP separator”. Guide words shown in Table 9.1 are to be used. However, the physical parameter FLOW is not considered as it applies to pipelines only.
The filled-in worksheets are given in Figs. 9.3 and 9.4.


HAZOP WORKSHEET
Part: 2-phase HP separator

Deviation: Pressure More
  Possible causes: Choke valve total failure open; PT/PV1 control loop failure closed
  Consequences: Separator bursting
  Safeguards: PSHH/SDV1 closure; BDV opening; PSV opening
  Action required: None

Deviation: Pressure Less
  Possible causes: BDV spurious opening; PSV spurious opening
  Consequences: Pressure at 1 atmosphere: no problem
  Safeguards: None
  Action required: None

Deviation: Temperature More
  Possible causes: Fire on near-by equipment or unit
  Consequences: Pressure increase
  Safeguards: Water spray on separator; Fire detection initiates blowdown of equipment; See Pressure/More
  Action required: Check location of water spray nozzles

Deviation: Temperature Less
  Possible causes: N.A.
  Consequences: N.A.
  Safeguards: N.A.
  Action required: N.A.

Deviation: Level More
  Possible causes: LT/LV control loop failure closed
  Consequences: Liquid to HP gas compressors and possible to flare
  Safeguards: LSHH/SDV1 closure
  Action required: Check safeguards downstream PV1

Fig. 9.3 Filled in HAZOP worksheet for worked example 9.1. Page 1

HAZOP WORKSHEET
Part: 2-phase HP separator

Deviation: Level Less
  Possible causes: LT/LV failure open
  Consequences: Gas blow-by
  Safeguards: LSLL/SDV2 closure
  Action required: None

Deviation: Utility Instrument Air (IA)
  Possible causes: Instrument Air failure
  Consequences: SDV failure closed; BDV failure open; Separator isolated and depressurized
  Action required: None

Deviation: Utility Electricity
  Possible causes: Blackout or local failure
  Consequences: SDV failure closed; BDV failure open; Separator isolated and depressurized
  Action required: None

Fig. 9.4 Filled in HAZOP worksheet for worked example 9.1. Page 2


9.5 Use with Other Methods

In the process sector, HAZOP studies are the major input for LOPA studies (Chap. 26).

References
IEC 61882 Ed. 2 (2016) Hazard and operability studies (HAZOP studies). Application guide. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 22000 Ed. 1.0 (2018) Food safety management systems. Requirements for any organization in the food chain. International Organization for Standardization (ISO), Geneva, Switzerland

Chapter 10

Failure Mode, Effects (and Criticality) Analysis, FME(C)A

10.1 Description of the Method 10.1.1 Presentation of the Method A failure mode and effects analysis (FMEA) is a systematic inductive method of evaluating e.g. the safety of an item or system, to identify the ways in which it might potentially fail, and the effects of the mode of failure upon the performance of the item or system and on the surrounding environment and personnel.

10.1.2 Purposes of the Method
Within the scope of this book, several reasons to perform a FMEA (see also IEC 60812 2019) can be identified:
• to identify the failure modes having unwanted effects on the system;
• to gain a deep understanding of the way a system runs and fails;
• to improve the design of a system;
• to provide a foundation for other dependability methods.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_10

10.1.3 FMEA Procedure
A FMEA study is performed in three phases:
• Phase 1: preparation of the study.
  – Break down the system into items.
  – Define the relationships of the item with the environment, the way it is to be operated, etc. MIL-Std-1629A (1977) recommends building functional block diagrams for this purpose.
  – Collect information on the item itself: engineering diagrams, maintenance and testing policy, etc.
  – Tailor the FMEA form to the system under study and to the objectives of the study.
• Phase 2: implementation of the method. For each item, the reliability engineer applies the FMEA method using, for example, the form shown in Fig. 10.1 (based on the model given in MIL-Std-1629A 1977) in several steps:
  – Define the function of the item (in relationship with the system it is part of).
  – Identify all possible failure modes.
  – Identify the possible failure causes of each failure mode [1]. This step is vital for the assessment of the relevance of each failure mode.
  – Identify the effects of each failure mode (its severity) on the system. To ease the identification of the effects, this column is often split in up to three levels: local effects, next higher level effects, end effects (effects on the system itself, on near-by systems and on the environment).
  – Rank the severity of each failure mode. The ranking can be made by considering the effect resulting from the failure mode on human beings, environment, equipment, production, operation (maintenance, costs, image).
  – Identify the failure detection method of each failure mode.
  – Whenever possible, identify actions which could eliminate the failure mode or mitigate its effects.
• Phase 3: review and issue the results of the study.
  – Review the FMEA worksheets with experts of the system.
  – Update the FMEA worksheets and build an action plan.

[1] A detailed causal analysis is not part of a FMEA (IEC 60812 2019): see IEC 62740 (2015) for that purpose.

FMEA WORKSHEET
System:
Failure analysis
| Item | Function | Failure mode | Failure causes | Failure effects | S | Failure detection | Compensating provisions |
| --- | --- | --- | --- | --- | --- | --- | --- |

Fig. 10.1 Typical FMEA worksheet
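As an illustration only, the columns of the worksheet of Fig. 10.1 can be held in a small record type. The sketch below is one possible representation, not a format prescribed by MIL-Std-1629A or IEC 60812: the class name `FmeaRow`, its field names and the helper `rows_by_severity` are assumptions introduced here. The two sample rows are taken from the PSH entries of worked example 10.1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FmeaRow:
    """One line of an FMEA worksheet (field names are illustrative)."""
    item: str
    function: str
    failure_mode: str
    failure_causes: List[str]
    failure_effects: str
    severity: str  # severity class S, here "I" (lowest) to "IV" (highest)
    failure_detection: str
    compensating_provisions: str = "None"


def rows_by_severity(rows, order=("IV", "III", "II", "I")):
    """Return the rows sorted from the most severe class to the least severe."""
    rank = {cls: position for position, cls in enumerate(order)}
    return sorted(rows, key=lambda row: rank[row.severity])


worksheet = [
    FmeaRow("PSH", "To issue a signal upon high pressure", "SPO",
            ["Sensing element fails high", "Wrong setting"],
            "Spurious closure of SDV1", "II", "Revealed"),
    FmeaRow("PSH", "To issue a signal upon high pressure", "FTF",
            ["Sensing element fails low", "Wrong setting", "Isolated in error"],
            "No order to close to SDV1", "IV", "Diagnostic test for some causes"),
]

for row in rows_by_severity(worksheet):
    print(row.item, row.failure_mode, row.severity)
# prints "PSH FTF IV" then "PSH SPO II"
```

Sorting by severity class is only one way of preparing the review of Phase 3; any other ordering (by item, by detection method) can be obtained in the same manner.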

10.1.4 Resources for the Method

Failure mode checklist
Most of the failure modes can be identified using the following typical conditions (based on MIL-Std-1629A 1977):
• Failure to operate at a prescribed time.
• Failure to cease operation at a prescribed time.
• Failure during operation.
• Spurious operation.
• Intermittent operation.
• Degraded operational capability.

There are several ways to obtain information about failure modes. The fastest is to use the information given in existing lists, e.g.:
• FMD (2016) provides failure mode/mechanism distributions for electrical, electronic, electromechanical and mechanical parts and assemblies.
• Recommended failure modes for rotating, mechanical, electrical, safety and control (e.g. sensors), subsea production, well completion and drilling items (i.e. items used in the oil and gas industry) are given in Annex B of ISO 14224 (2016).

Severity ranking tables
Severity ranking scales with different numbers of levels (e.g. four, five or six) can be found in the literature. Table 10.2 gives an example of a severity ranking. ISO 14224 (2016) provides examples of failure effect classification in Annex C with respect to safety, environment, production and operating costs.


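Table 10.1 Example of criticality matrix (rows: failure mode likelihood classes; columns: failure mode severity classes)

| Likelihood class | Moderate | Severe | Major | Catastrophic |
| --- | --- | --- | --- | --- |
| High | II | III | IV | IV |
| Medium | I | II | III | IV |
| Low | I | II | III | IV |
| Rare | I | I | II | III |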

10.1.5 Comments Most of the time, a FMEA is performed by a reliability analyst acting alone. However, for newly designed items or complex ones, the implementation of brainstorming sessions with a group of experts is recommended.

10.2 FMEA/FMECA Worksheets
The FMEA worksheet is to be tailored to the objectives of the FMEA. There are therefore several types of FMEA/FMECA worksheets according to:
• The aim of the study, for example:
  – The worksheet of Fig. 10.1 (with its “Failure detection” column) is essential for a safety/reliability study, as the detection of dangerous failure modes is vital (undetected dangerous failure modes are the main contributors to system unavailability).
  – Removing the “Failure detection” column gives a worksheet applicable to a production availability study of a production system (items considered for such a study are running, which means that failures are immediately revealed).
• The type of industrial domain concerned. As an example, FMEA/FMECA performed for electronic systems are focused on failure modes, final effects and detectability.
• Whether or not the FMEA is used with modelling methods: when it is, a criticality analysis is useless, as the ranking of the item failure modes is more accurate using such methods.
There are even methods based on the FMEA method (such as the reverse FMEA used in the car industry) which make almost no use of the standard FMEA columns. Several examples of FMEA worksheets are given in IEC 60812 (2019).


10.3 FMECA
10.3.1 Criticality Analysis
If the likelihood class of occurrence of each failure mode is determined, in addition to the severity of its effect on the system, a criticality analysis can be performed. This criticality analysis consists in ranking the pair (likelihood of occurrence, severity of effects) of each failure mode of each item. The FMEA then becomes a failure mode, effects and criticality analysis (FMECA). For a FMECA, two or three columns must be added to the worksheet:
• Failure mode likelihood of occurrence and criticality category columns if the criticality matrix is used (Sect. 10.3.2).
• Failure mode frequency number, detectability value number and criticality number columns if the risk priority number is used (Sect. 10.3.3).

10.3.2 Use of Criticality Matrix The likelihood classes and the severity classes form the two axes of a matrix, and a criticality rank is allocated to each cell of this matrix (called the criticality matrix or the risk matrix). An example of such a criticality matrix is given in Table 10.1, where the Roman numerals are the criticality ranks. The output of the FMECA is the list of item failure modes with the highest criticality rank.
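The matrix lookup described above can be sketched in a few lines. The matrix values below are transcribed from Table 10.1; the function names and the three sample failure modes (with their likelihood and severity classes) are hypothetical and serve only to show the mechanics.

```python
SEVERITY_CLASSES = ("Moderate", "Severe", "Major", "Catastrophic")

# Criticality ranks of Table 10.1: one row per likelihood class,
# one entry per severity class in the order of SEVERITY_CLASSES.
CRITICALITY_MATRIX = {
    "High":   ("II", "III", "IV", "IV"),
    "Medium": ("I", "II", "III", "IV"),
    "Low":    ("I", "II", "III", "IV"),
    "Rare":   ("I", "I", "II", "III"),
}

RANK_ORDER = {"I": 1, "II": 2, "III": 3, "IV": 4}


def criticality_rank(likelihood, severity):
    """Return the criticality rank of a (likelihood, severity) pair."""
    return CRITICALITY_MATRIX[likelihood][SEVERITY_CLASSES.index(severity)]


def most_critical(failure_modes):
    """Return the names of the failure modes sharing the highest rank.

    failure_modes is an iterable of (name, likelihood, severity) tuples.
    """
    ranked = [(name, criticality_rank(lik, sev)) for name, lik, sev in failure_modes]
    top = max(RANK_ORDER[rank] for _, rank in ranked)
    return [name for name, rank in ranked if RANK_ORDER[rank] == top]


# Hypothetical failure modes, not taken from the worked example.
example_modes = [
    ("valve fails to close", "Rare", "Catastrophic"),
    ("sensor spurious trip", "High", "Moderate"),
    ("pump leak", "Medium", "Major"),
]
print(most_critical(example_modes))
```

With these assumed inputs, the first and third failure modes both reach rank III and are returned together, which reflects the fact that the matrix gives a coarse ranking rather than a strict ordering.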

Table 10.2 Severity ranking for worked example 10.1

| Class S | Description |
| --- | --- |
| IV | Destruction of separator |
| III | Only one safety barrier remaining |
| II | Production shutdown |
| I | Production loss |

10.3.3 Use of Risk Priority Number
An alternative to the classic criticality matrix is the risk priority number (RPN) as explained in IEC 60812 (2019):

RPN = S × F × D

where:
• S is a rating of the severity of a failure mode.
• F is a rating of the frequency of occurrence of a failure mode.
• D is a rating of the detectability of a failure mode.
The detectability number, D, represents the likelihood with which a failure mode is expected to be detected before significant effects occur. The numbers for S, F and D are determined using rating scales in which the level for each parameter is associated with a description. If the rating scale of each of these three parameters runs from 1 to 10 (10 being in any case associated with the worst value), the overall RPN ranges from 1 to 1000.
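A minimal sketch of the RPN calculation, assuming the 1 to 10 rating scale mentioned above, is given below. The function name, the range check and the sample ratings are illustrative assumptions; the rating scales themselves must come from the study.

```python
def risk_priority_number(severity, frequency, detectability, scale=(1, 10)):
    """Return RPN = S x F x D after checking each rating against the scale."""
    low, high = scale
    for name, value in (("S", severity), ("F", frequency), ("D", detectability)):
        if not low <= value <= high:
            raise ValueError(f"{name} = {value} is outside the scale {low}..{high}")
    return severity * frequency * detectability


# Hypothetical ratings for one failure mode.
print(risk_priority_number(7, 3, 5))      # prints 105
# Bounds of the RPN on a 1 to 10 scale.
print(risk_priority_number(1, 1, 1))      # prints 1
print(risk_priority_number(10, 10, 10))   # prints 1000
```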

10.4 Worked Example 10.1
The system is shown in Fig. 8.7 (worked example 8.1). Perform a FMEA study, using the form of Fig. 10.1, for the separator overpressure protection system of the HP separator.
The failure modes for the sensors are selected from the ones given in ISO 14224 (2016):
• Failure to Function on Demand (FTF).
• Spurious Operation (SPO).
The failure modes for the valves are selected from the ones given in ISO 14224 (2016):
• Failure to Close on Demand (FTC) for the SDV and Failure to Open on Demand (FTO) for the BDV and the PSV.
• Spurious Operation (SPO).
• External Leakage-Process medium (ELP).
The ranking of the severity (S) of the final effects is provided in Table 10.2. The filled in worksheets are given in Figs. 10.2, 10.3 and 10.4.

10.5 Use with Other Methods As specified in Sect. 7.1, a FMEA (the only method of its kind) is performed for systems of any size for which the reliability, the safety or the availability is to be assessed. The FMEA method is then used as an input to modelling methods (see Sect. 27.2).

FMEA WORKSHEET
System: Separator overpressure protection
Failure analysis

| Item | Function | Failure mode | Failure causes | Failure effects | S | Failure detection | Compensating provisions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSH | To issue a signal upon High pressure | FTF | Sensing element fails low; Wrong setting; Isolated in error | No order to close to SDV1 | IV | Diagnostic test for some causes | None |
| PSH | | SPO | Sensing element fails high; Wrong setting | Spurious closure of SDV1 | II | Revealed | None |
| SDV1 | To close upon PSH signal | FTC | Valve stuck; Actuator seized; Solenoid valve stuck | HP separator not isolated from upstream flow | IV | Proof test | None |
| SDV1 | | SPO | Spurious action of solenoid valve | Separator shutdown | II | Revealed | |

Fig. 10.2 Worked example 10.1. FMEA worksheet. Page 1

FMEA WORKSHEET
System: Separator overpressure protection
Failure analysis

| Item | Function | Failure mode | Failure causes | Failure effects | S | Failure detection | Compensating provisions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SDV1 | To close upon PSH signal | ELP | Leak at flange level | Flammable gas then potential jet fire | III | Gas detectors | Water curtain on separator |
| BDV | To open upon manual action | FTO | Valve stuck; Actuator seized; Solenoid valve stuck | HP separator not depressurized | III | Proof test | PSV |
| BDV | | SPO | Major leakage of utility fluid; Spurious action of solenoid valve | Separator shutdown | II | Revealed | None |
| BDV | | ELP | Leak at flange level | Flammable gas then potential jet fire | III | Gas detectors | Water curtain on separator |

Fig. 10.3 Worked example 10.1. FMEA worksheet. Page 2

FMEA WORKSHEET
System: Separator overpressure protection
Failure analysis

| Item | Function | Failure mode | Failure causes | Failure effects | S | Failure detection | Compensating provisions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSV | To open if the pressure exceeds set value | FTO | Spring failure; Wrong setting | HP separator not depressurized | III | Proof test | BDV |
| PSV | | SPO | Spring failure; Wrong setting | Separator shutdown | II | Revealed | |
| PSV | | ELP | Leak at flange level | Flammable gas then potential jet fire | III | Gas detectors | Water curtain on separator |

Fig. 10.4 Worked example 10.1. FMEA worksheet. Page 3

References
FMD-2016 (2016) Failure Mode/Mechanism Distributions. Quanterion Solutions Inc., Utica, NY, USA
IEC 62740 Ed. 1.0 (2015) Root cause analysis (RCA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60812 Ed. 3.0 (2019) Failure modes and effects analysis (FMEA and FMECA). International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
MIL-Std-1629A (1977) Procedures for performing a failure mode, effects and criticality analysis. US Department of Defense, Washington, USA

Chapter 11

Other Inductive Methods

11.1 Checklists
A checklist is commonly defined as an aid listing typical “things to do” to compensate for poor human memory and low attention. The checklist method, however, is an experience-based approach: for the purpose of hazard identification, checklists are lists of generic or specific hazards gathered mainly from past experience. Different checklists are used for each stage of a project, starting with checklists of material properties and process characteristics and terminating with checklists for auditing operations. It also has to be stressed that there are two types of checklists:
• Checklists for checking the design.
• Checklists focused on hazard identification (already used in Chap. 8).
A checklist study is performed as follows:
• Gather the team in charge of the study.
• Perform a systematic review of the checklist. The team uses the checklist to stimulate thoughts and decides whether hazards may occur on the system under study.
• Record conclusions (the answers to the questions are not simply “yes” or “no”) and register the recommendations that arise.
Table 11.1 provides an example of a checklist for structural events (i.e. events impacting the mechanical structure) on offshore platforms (Spouge 1999) and Table 11.2 an example of a checklist for nuclear power plants (SSG-3 2010). Most of the checklists available in the literature look like those of Tables 11.1 and 11.2, i.e. they are mainly a set of words with few relationships between them. So, prior to initiating any checklist study, the team should review the existing checklists, update them and structure them so as to perform the study on a sound basis.

Table 11.1 Generic hazard checklist for structural events on offshore platforms
Structural events:
• Structural failure due to fatigue
• Extreme weather
• Earthquake
• Bridge collapse
• Crane collapse
• Disintegration of rotating equipment

Table 11.2 Ground based natural hazards for nuclear power plants
Ground based natural hazards:
• Soil frost
• Volcanic phenomena
• Avalanche
• Above water landslide
• External fire
• Seismic hazards
• Karsts

11.2 What-If?
The assessment is made by raising questions beginning with “What-if?” to identify hazards, hazardous situations or specific event sequences that could produce unwanted consequences (CCPS 2008). As safety engineers found the What-if? approach not rigorous enough, by the end of the 1960s they started to develop a new method based on the same principle of critical examination: the HAZOP method. A What-if? study is performed as follows:
• Gather the team in charge of the study.
• The team leader raises questions beginning with What-if? or other forms of initiating questions such as “How could?” (Spouge 1999).
• The team uses the question checklist to stimulate thoughts (brainstorming). Whenever necessary, the questions may be divided into specific areas of investigation such as electrical safety. Each area is then addressed by a team of knowledgeable people.
• Register conclusions on specific table formats.
The results are often presented in a table format as shown in Table 11.3. As the result of the analysis relies heavily on the way the questions are raised, the formulation of the questions is based on the experience of the team members.

Table 11.3 Typical worksheet for recording the conclusions of a What-if? session
WHAT-IF? worksheet
| What if? | Causes | Consequences | Safeguards | Actions |
| --- | --- | --- | --- | --- |

Table 11.4 Typical worksheet for recording the conclusions of a HAZID session
HAZID worksheet
| Generic hazard | Event | Consequences | Mitigating factors | Recommendations |
| --- | --- | --- | --- | --- |

11.3 HAZID
The hazard identification (HAZID) method is a systematic review of the possible causes and consequences of hazardous events. Like a HAZOP, a HAZID involves experts of the system under study and is performed during working sessions. During them, the stress is put on the way the events could occur rather than whether they could occur and on the description of the consequences. Conclusions are recorded on specific table formats such as the one in Table 11.4. Like PHA (Chap. 7) and What-if? (Sect. 11.2) methods, the HAZID method is performed using checklists, e.g. for process hazards:
• Unignited process releases.
• Ignited process releases: fire, explosion, heat, smoke, etc.
• Toxic process releases.
• Flaring.
• Venting.
• Draining.
• Sampling.

11.4 Additional Methods
Many other methods can be found in the literature, but most of them are not commonly used. These methods can be brand new or, more often:
• An extension of methods already described, such as the structured What-if? technique (SWIFT). For this method, the What-if? brainstorming sessions are structured using special checklists (Rausand 2020 and CCPS 2008).
• The use of methods not originally planned for the same purpose, e.g. concept safety reviews and design reviews.

References
CCPS Ed. 3 (2008) Guidelines for hazard evaluation procedures. Center for Chemical Process Safety. Wiley-Interscience, USA, pp 175–209
Rausand M (2020) Risk assessment: theory, methods, and applications, 1st edn. Wiley and Sons, London, UK
Spouge J (1999) A guide to quantitative risk assessment for offshore installations. CMPT, Aberdeen, UK
SSG-3 (2010) Development and application of level 1 probabilistic safety assessment for nuclear power plants, Specific Safety Guide, pp 155–164, IAEA Safety standards series. IAEA, Vienna

Chapter 12

Comparison of Inductive Approaches

12.1 Strengths and Weaknesses of Inductive Approaches The analysis of the strengths and weaknesses of the inductive methods presented in Chaps. 8, 9, 10 and 11 is based on CCPS (2008), Gould (2005) and Spouge (1999). However, all these inductive methods share the same weakness: they do not consider multiple failure events. To a greater or lesser extent, all the inductive approaches presented in this chapter rely on the experience and expertise of the team leader. However, this dependency on the team leader is counterbalanced by the fact that the decisions are taken mainly by the members of the working team.

12.1.1 PHA
The main strengths are:
• Analysis easy to perform.
• Analysis can be performed at a very early stage of the design as the level of detail requested is low.
• Analysis made by a multidisciplinary team.
The main weaknesses are:
• Not all causes are identified as the level of detail requested is low.
• Mainly focused on major hazards.
The need of a competent team leader can be considered as a strength and as a weakness.

12.1.2 HAZOP
The main strengths are:
• It is a systematic and comprehensive technique. Detailed procedures do exist for performing a HAZOP. Existing software packages greatly facilitate the smooth running of the working sessions, more specifically when performing a quantified HAZOP (use of data base for assessing the frequencies and the severities of the consequences).
• Although fully qualitative in its early years, HAZOP is now a semi-quantitative method.
• Fits well to process sector needs.
• Analysis made by a multidisciplinary team which is a guarantee for more tuned decisions.
The main weaknesses are:
• Time-consuming and expensive method.
• Experienced analysts (team leader and members of the team) are needed to produce all possible causes and consequences of the deviations, as well as to produce realistic recommendations.
• Hazards caused by more than one deviation cannot be identified.
The need of a competent team leader can be considered as a strength and as a weakness.

12.1.3 FMEA/FMECA
The main strengths are:
• Systematic review of all the failure modes of all items of a system.
• Format easy to adapt to any new item in any industry.
• Semi-quantitative assessment (FMECA).
The main weaknesses are:
• Quickly time-consuming as any failure mode of any item of the system is to be reviewed.
• Cannot handle combinations of failures.
• Difficult to find appropriate failure modes for new items.

12.1.4 Checklists
The main strengths are:
• Technique easy to apply as the underlying principle is simple.
• The checklists ensure that known problems are already fully explored.
The main weaknesses are:
• The assessment is relevant as long as the checklists used are relevant.
• Checklists are mainly based on field feedback and, as such, they are not easy to apply to novel processes.

12.1.5 What-If?
The main strengths are:
• Method easy to apply.
• Can be easily adapted to different sectors of industries.
The main weaknesses are:
• Quickly time-consuming as any item of the system is to be analysed.
• As the principle of the method is simple, experience is required for asking the appropriate questions.

12.1.6 HAZID
The main strengths are:
• Fast and flexible method.
• Covers low-frequency events.
The main weaknesses are:
• Experienced analysts (team leader and members of the team) are needed.
• Guide words must be developed for each plant.

12.2 Synthesis
Table 12.1 provides a synthesis of the analysis made in the above pages. Each approach is assessed versus characteristics considered as major. The meaning of the symbols used in Table 12.1 is as follows:
• XX: characteristic fully valid.
• X: characteristic partially valid.
• NA: Not Applicable.

Table 12.1 Comparison of main inductive approaches

| Characteristic | Easy to implement | Structured approach | Ability to identify new hazards | Industry specific | In-depth analysis |
| --- | --- | --- | --- | --- | --- |
| HAZID | X | NA | X | NA | X |
| What-if? | XX | NA | X | NA | X |
| Checklists | XX | NA | NA | NA | X |
| FMEA/FMECA | X | XX | XX | NA | XX |
| HAZOP | X | XX | XX | XX | XX |
| PHA | X | XX | XX | NA | X |

References
CCPS Ed. 3 (2008) Guidelines for hazard evaluation procedures. Center for Chemical Process Safety. Wiley-Interscience, USA
Gould J (2005) Review of hazard identification techniques, HSL/2005/58. HSE, Sheffield, UK
Spouge J (1999) A guide to quantitative risk assessment for offshore installations. CMPT, Aberdeen, UK

Part III

Modelling of Static Systems. Boolean Approaches

Chapter 13

The Family of Boolean Approaches

The reliability assessment of safety and production systems is based on the analysis of events, and of combinations of events, which are likely to lead to unwanted situations. The natural basic mathematical framework is therefore Boolean algebra, which is precisely devoted to handling events. This leads to a whole family of approaches using Boolean algebra as mathematical background (Vinuesa et al. 2016). They are described in this part of the book:
• Reliability block diagram, which is probably the oldest model ever used by engineers (Chap. 15);
• Fault tree, which is the only top-down approach used within the safety and dependability framework (Chap. 16);
• Cause consequence diagram, event tree, LOPA, bowtie, which deal with event sequences (Chap. 26);
• Belief network, which is devoted to dependent events modelling (Chap. 27).
These approaches share several features:
• analytical approaches based on models represented by an underlying Boolean (i.e. logic) formula;
• static in nature (i.e. basically, the time is not modelled);
• graphic representations specific to each of the approaches (this makes them user-friendly when building the models and discussing the results);
• possible use for qualitative, semi-quantitative and quantitative analyses.
Except for the belief networks, which are devoted to dependent events through the use of conditional probabilities, all the other approaches deal basically with independent events (i.e. the probability of occurrence of an event does not depend on the probability of occurrence of the other events in the model) through the use of ordinary probabilities.


It has to be noted that, even if these approaches are static in nature, the time can be easily introduced for time-dependent probabilistic calculations provided that:
• the model does not change when time elapses and,
• the evolution of an event as a function of time does not depend on the evolutions of the other events in the model.
Under the above assumptions, these approaches can be combined with Markovian models (see Chaps. 27 and 32) used to provide the probability inputs. This is particularly useful for exact availability/unavailability and failure frequency calculations. It also allows good approximate reliability/unreliability calculations to be performed. When dynamic dependencies (i.e. time-dependent dependencies) between events have to be considered, several approaches can be extended to dynamic versions (e.g. dynamic reliability block diagrams, dynamic fault trees, dynamic event trees, see Chap. 27). This extension can be done in combination with Petri nets (see Chap. 33), and the analytical calculations then have to be replaced by Monte Carlo simulation (see Chap. 32), which is more appropriate to take dependencies into account.

Reference
Vinuesa C, Folleau C, Clavé N, Cacheux P-J (2016) GRIF-BOOL: risk assessment with an analysis tool handling multiple types of Boolean modeling. 20ème Congrès national de l’Institut pour la Maîtrise des Risques (Lambda-Mu 20), France

Chapter 14

Mathematical Framework

14.1 Notion of Events and Boolean Algebra
Basically, a Boolean set is made of two values, {true, false} or {1, 0}, and a Boolean variable (also called logic variable) takes its values in this set. This fits very well the description of events which have only two values: {realized, not realized} or {occurred, not occurred}. This is illustrated by the Venn diagrams (Venn 1880; Wikipedia Venn 2020) in Fig. 14.1: the left-hand side represents the state space (i.e. the certain event) $\Omega$, which is split between an event $A$ and its complementary event $\overline{A}$. These constitute a Boolean set $\{A, \overline{A}\}$ because the two values entirely cover the state space $\Omega$. If this set is related to the state of a given item A, and if a is the Boolean variable representing the item state, then event $A$ = “A in good functioning state” is equivalent to a = 1 and event $\overline{A}$ = “A in faulty state” is equivalent to a = 0.

Fig. 14.1 Basic Boolean concepts: events and NOT operator

Boolean algebra comprises three basic operators, negation (NOT), disjunction (OR) and conjunction (AND), which can be used with both events and variables.

Negation, NOT: event $\overline{A}$ = NOT A is the complementary event of A. Depending on authors and applications, several other notations are currently used, such as ¬A, ∼A and even -A. This operator appears in Fig. 14.1 along with the complementary event $\overline{A}$, which is the negation of event A. When used with a Boolean variable, NOT a (or $\overline{a}$, ¬a, ∼a, -a) is true when a is false and vice versa. Beware that the notation -a does not denote a subtraction operator.

Disjunction (or union), OR: event C = A OR B is realized if A is realized or if B is realized. This can also be noted C = A ∪ B or C = A + B and is illustrated in Fig. 14.2 for three different cases: general case, B included in A, and A and B incompatible. When used with Boolean variables, c = a OR b (or c = a ∪ b, c = a + b) is true if a is true or if b is true.

Fig. 14.2 Basic Boolean concepts: disjunction (OR)

Conjunction (or intersection), AND: event C = A AND B is realized if both A and B are realized. This can also be noted C = A ∩ B or C = A · B and is illustrated in Fig. 14.3 for three different cases: general case, B included in A, and A and B incompatible. When used with Boolean variables, c = a AND b (or c = a ∩ b, c = a · b) is true if a and b are true at the same time.

Fig. 14.3 Basic Boolean concepts: conjunction (AND)

It has to be noted that the terms union and intersection belong to the general set theory vocabulary, whereas the terms disjunction and conjunction are more specific and are used when dealing with Boolean variables.

The OR operator has properties similar to those of ordinary addition and the AND operator properties similar to those of ordinary multiplication. This is why, as said above, they are often represented by the signs “+” and “·”. Associated with a set of Boolean variables, these operators provide the structure of an algebra, which is why it is referred to as Boolean algebra. This leads to the following properties:

Neutral elements: the impossible event, $\emptyset$, and the certain event, $\Omega$, play the same role of neutral elements as “0” and “1” in ordinary algebra. This leads to:

$$A \cup \emptyset = A \ \text{and}\ a + 0 = a; \qquad A \cap \Omega = A \ \text{and}\ a \cdot 1 = a \tag{14.1}$$

Commutativity:

$$A \cup B = B \cup A \ \text{and}\ a + b = b + a; \qquad A \cap B = B \cap A \ \text{and}\ a \cdot b = b \cdot a \tag{14.2}$$

Associativity:

$$(A \cup B) \cup C = A \cup (B \cup C) \ \text{and}\ (a + b) + c = a + (b + c); \qquad (A \cap B) \cap C = A \cap (B \cap C) \ \text{and}\ (a \cdot b) \cdot c = a \cdot (b \cdot c) \tag{14.3}$$

Distributivity:

$$(A \cup B) \cap C = (A \cap C) \cup (B \cap C) \ \text{and}\ (a + b) \cdot c = a \cdot c + b \cdot c; \qquad A \cup (B \cap C) = (A \cup B) \cap (A \cup C) \ \text{and}\ a + (b \cdot c) = (a + b) \cdot (a + c) \tag{14.4}$$

In addition to the common properties above, Boolean algebra has several other important specific properties:

Idempotence:

$$A \cup A = A \ \text{and}\ a + a = a; \qquad A \cap A = A \ \text{and}\ a \cdot a = a \tag{14.5}$$

Applied to the certain event, $\Omega$, this implies:

$$\Omega \cup \Omega = \Omega \ \text{and}\ 1 + 1 = 1; \qquad \Omega \cap \Omega = \Omega \ \text{and}\ 1 \cdot 1 = 1 \tag{14.6}$$

Absorption:

$$A \cup (A \cap B) = A \ \text{and}\ a + a \cdot b = a; \qquad A \cap (A \cup B) = A \ \text{and}\ a \cdot (a + b) = a \tag{14.7}$$

Complementarity:

$$A \cup \overline{A} = \Omega \ \text{and}\ a + \overline{a} = 1; \qquad A \cap \overline{A} = \emptyset \ \text{and}\ a \cdot \overline{a} = 0 \tag{14.8}$$

Involution:

$$\text{NOT}(\text{NOT}\ A) \equiv \overline{\overline{A}} = A \ \text{and}\ \overline{(\overline{a})} = a \tag{14.9}$$

De Morgan’s Laws: this is an important feature which allows formulae to be inverted:

$$\overline{A \cup B} = \overline{A} \cap \overline{B} \ \text{and}\ \overline{(a + b)} = \overline{a} \cdot \overline{b}; \qquad \overline{A \cap B} = \overline{A} \cup \overline{B} \ \text{and}\ \overline{(a \cdot b)} = \overline{a} + \overline{b} \tag{14.10}$$
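Since a Boolean variable takes only two values, each of the identities on variables can be checked by enumerating the truth table. The following sketch does so for distributivity, idempotence, absorption, complementarity, involution and De Morgan's laws. The function names `NOT`, `OR`, `AND` and `check_laws` are introduced here for illustration, and the encoding of OR as a maximum and AND as a minimum over {0, 1} is one convenient choice among others.

```python
from itertools import product


def NOT(a):
    return 1 - a


def OR(a, b):
    return max(a, b)


def AND(a, b):
    return min(a, b)


# Each law is written as a predicate over three Boolean variables a, b, c.
LAWS = {
    "distributivity of AND over OR":
        lambda a, b, c: AND(OR(a, b), c) == OR(AND(a, c), AND(b, c)),
    "distributivity of OR over AND":
        lambda a, b, c: OR(a, AND(b, c)) == AND(OR(a, b), OR(a, c)),
    "idempotence":
        lambda a, b, c: OR(a, a) == a and AND(a, a) == a,
    "absorption":
        lambda a, b, c: OR(a, AND(a, b)) == a and AND(a, OR(a, b)) == a,
    "complementarity":
        lambda a, b, c: OR(a, NOT(a)) == 1 and AND(a, NOT(a)) == 0,
    "involution":
        lambda a, b, c: NOT(NOT(a)) == a,
    "De Morgan":
        lambda a, b, c: NOT(OR(a, b)) == AND(NOT(a), NOT(b))
        and NOT(AND(a, b)) == OR(NOT(a), NOT(b)),
}


def check_laws():
    """Evaluate every law on the 8 combinations of (a, b, c)."""
    return {
        name: all(law(a, b, c) for a, b, c in product((0, 1), repeat=3))
        for name, law in LAWS.items()
    }


for name, holds in check_laws().items():
    print(f"{name}: {holds}")
```

Such an exhaustive check is only practical for a handful of variables, as the number of combinations doubles with each additional variable.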

Logic functions are functions of Boolean variables (i.e. logic variables) which, themselves, have Boolean values (i.e. logic values). They are in the background of all the Boolean approaches (e.g. reliability block diagrams or fault trees), and the above mathematics allows the needed manipulations to be carried out when such approaches are implemented. It is therefore important to be aware of these properties, which is why they have been recalled above. This chapter would not be complete without introducing the exclusive disjunction (exclusive OR), which is illustrated in Fig. 14.4. Exclusive disjunction (or exclusive union), ⊕: event C = A ⊕ B is realized if A is realized or if B is realized, but not both A and B. This is illustrated in Fig. 14.4 for three different cases: general case, B included in A, and A and B incompatible. When used with Boolean variables, c = a ⊕ b is true if a is true or if b is true and if a ∩ b is false.

Fig. 14.4 Extended Boolean concepts: exclusive OR operator


This operator is often used, but it is not one of the basic operators of Boolean algebra, which is why it is described separately. When implemented in a model, it is likely to introduce non-monotony (or non-coherence), which makes the probabilistic calculations more difficult than in the basic case (see Chap. 18).

14.2 Bases for Time-Independent Probabilistic Calculations

14.2.1 Probability of the Disjunction (Union) of Events

The bases of probabilistic calculations are developed hereafter; more information can be found in Page (1989) or Ruegg (1995), which focus on engineering-oriented probabilistic calculations. According to the basic properties of probability, the probability of the certain event, Ω, is equal to 1 and that of the impossible event, ∅, is equal to 0: Pr(Ω) = 1 and Pr(∅) = 0

(14.11)

When events A and B are mutually exclusive (also named disjoint or incompatible) (see Fig. 14.5 top left), the intersection of A and B is empty: A ∩ B = ∅ and then Pr(A ∩ B) = 0. In this case, Pr(A ∪ B) is simply equal to the sum, Pr(A) + Pr(B), of the probabilities of A and B. In the general case (see Fig. 14.5 top right), the intersection of A and B is not empty: A ∩ B ≠ ∅ and then Pr(A ∩ B) ≠ 0. In this case, this sum counts the probability of A ∩ B twice: once with Pr(A) and once with Pr(B). Therefore Pr(A ∩ B) has to be subtracted once to obtain the exact result, which leads to the general formula:

Fig. 14.5 Four ways to consider event C = A ∪ B


Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

(14.12)

The presence of the term Pr(A ∩ B) in this formula has been a source of calculation difficulties for decades because, when the formula is extended to more than two events, the number of such terms increases quickly (see the Sylvester-Poincaré formula in Chap. 20) and the calculation becomes intractable without approximations. Therefore, the idea is to replace A ∪ B by an equivalent set of disjoint events, like the two examples provided in Fig. 14.5 (bottom right), in order to make the probabilities of the intersections equal to 0:

$A \cup B \equiv A \cup (B \cap \overline{A}) \Rightarrow \Pr(A \cup B) = \Pr(A) + \Pr(B \cap \overline{A})$

(14.13)

$A \cup B \equiv (A \cap B) \cup (A \cap \overline{B}) \cup (B \cap \overline{A}) \Rightarrow \Pr(A \cup B) = \Pr(A \cap B) + \Pr(A \cap \overline{B}) + \Pr(B \cap \overline{A})$

(14.14)

What is described above can be used to transform simple logic functions but to deal with large logic functions related to industrial size systems, this has been generalized and systematized by the binary decision diagrams (BDDs) which are described in Chap. 21.
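As a small numerical illustration of formulae 14.12–14.14, the Python sketch below computes Pr(A ∪ B) in four ways from a table of elementary outcomes. The probabilities in `OUTCOMES` are arbitrary assumptions chosen for the example, and the helper `pr` is hypothetical; only the three formulae come from the text.

```python
import math

# Hypothetical probabilities of the four elementary outcomes
# (A occurs?, B occurs?). They are illustrative and only need to sum to 1.
OUTCOMES = {
    (True, True): 0.10,
    (True, False): 0.20,
    (False, True): 0.30,
    (False, False): 0.40,
}

def pr(event):
    """Probability of an event given as a predicate on (a, b)."""
    return sum(p for (a, b), p in OUTCOMES.items() if event(a, b))

# Direct evaluation of Pr(A u B).
p_union = pr(lambda a, b: a or b)

# Formula 14.12: Pr(A) + Pr(B) - Pr(A n B).
p_14_12 = pr(lambda a, b: a) + pr(lambda a, b: b) - pr(lambda a, b: a and b)

# Formula 14.13: two disjoint events, A and (B n not A).
p_14_13 = pr(lambda a, b: a) + pr(lambda a, b: b and not a)

# Formula 14.14: three disjoint events.
p_14_14 = (pr(lambda a, b: a and b)
           + pr(lambda a, b: a and not b)
           + pr(lambda a, b: b and not a))

print(p_union, p_14_12, p_14_13, p_14_14)
```

With disjoint events no correction term is needed, which is the property that binary decision diagrams exploit on a larger scale.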

14.2.2 Probability of the Conjunction (Intersection) of Events

To complete the calculation of formulae 14.12–14.14, the probabilities of the event intersections (e.g. $\Pr(A \cap B)$, $\Pr(B \cap \overline{A})$, etc.) still have to be calculated, and this depends on whether the events are independent or not.

14.2.2.1 Independent Events

This is the simplest case, and the independency between the basic events is one of the main assumptions of most of the Boolean models (e.g. reliability block diagrams, Chap. 15, fault trees, Chap. 16, event trees, Chap. 26). In this case, the occurrence of one basic event does not depend on the occurrence of the others and vice versa, which leads to simple calculations. For example, when A and B are independent, Pr(A ∩ B) is simply the product of the individual probabilities of A and B: Pr(A ∩ B) = Pr(A) · Pr(B)

(14.15)

14.2.2.2 Dependent Events and Bayes’ Theorem

This is the general case, and the dependency between events is the main assumption of the belief network approach, which is the Boolean approach specifically devoted to dependent events (see Chap. 27). In this case, the probability of occurrence of one basic event depends on the occurrence of the others and vice versa, which leads to the notion of conditional probability. For example, when A and B are dependent, the conditional probability of occurrence of A given that B has occurred is denoted Pr(A|B) and, provided that Pr(B) ≠ 0, it is equal to (see Bayes (1763), Wikipedia Bayes (2020), Wikipedia CP (2020)): Pr(A|B) = Pr(A ∩ B)/Pr(B)

(14.16)

The dependency between A and B does not imply a causal relationship between A and B and the conditional probability of B given A has occurred can be defined in a similar way: Pr(B|A) = Pr(A ∩ B)/Pr(A)

(14.17)

This allows Pr(A ∩ B) to be calculated from the conditional probabilities: Pr(A ∩ B) = Pr(A|B) · Pr(B) = Pr(B|A) · Pr(A)

(14.18)

Therefore, – when A and B are independent, Pr(A ∩ B) = Pr(A) · Pr(B) and this implies: Pr(A|B) = Pr(A) and Pr(B|A) = Pr(B)

(14.19)

– when A and B are disjoint (incompatible), Pr(A ∩ B) = 0 and this implies: Pr(A|B) = Pr(B|A) = 0

(14.20)


It has to be noted that the above formula 14.18 leads to the Bayes’ theorem (Bayes 1763) which gives the relationship between Pr(A|B) and Pr(B|A): Pr(A|B) = Pr(B|A) · Pr(A)/Pr(B)

(14.21)

This formula can be extended to any number of events.
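Formulae 14.16–14.21 can be illustrated with a short Python sketch. The function names and the numerical values below are assumptions made for the example only; the sketch simply checks that Bayes' theorem (14.21) returns the same Pr(A|B) as the definition (14.16).

```python
import math

def conditional(p_joint, p_condition):
    """Formulae 14.16/14.17: Pr(X|Y) = Pr(X n Y) / Pr(Y), with Pr(Y) > 0."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have a non-zero probability")
    return p_joint / p_condition

def bayes(p_b_given_a, p_a, p_b):
    """Formula 14.21: Pr(A|B) = Pr(B|A) . Pr(A) / Pr(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values for two dependent events A and B.
p_a, p_b, p_a_and_b = 0.3, 0.4, 0.1

p_a_given_b = conditional(p_a_and_b, p_b)   # definition 14.16
p_b_given_a = conditional(p_a_and_b, p_a)   # definition 14.17
p_a_given_b_bayes = bayes(p_b_given_a, p_a, p_b)

print(p_a_given_b, p_a_given_b_bayes)

# A and B are dependent here: Pr(A n B) differs from Pr(A) . Pr(B) (see 14.19).
dependent = not math.isclose(p_a_and_b, p_a * p_b)
```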

14.3 Introduction to Time-Dependent Calculations

Details about the time-dependent probabilistic calculations are provided in Chap. 22 for reliability block diagrams and fault trees; the aim of the present chapter is only to give a broad overview of this important topic. Let us consider the simple formula Pr(A ∩ B) = Pr(A) · Pr(B). In this formula, Pr(A) and Pr(B) are simple point values between 0 and 1. If the probabilities of A and B are time-dependent, they can be written Pr(A, t) and Pr(B, t).

Pr(A, t) and Pr(B, t) time-independent: the probabilities of A and B evolve independently from each other as time elapses. This implies that, for a given instant t_i, the probabilities Pr(A, t_i) and Pr(B, t_i) are independent point values exactly as Pr(A) and Pr(B) were in the time-independent case. Therefore, it is possible to write: Pr(A ∩ B, t_i) = Pr(A, t_i) · Pr(B, t_i)

(14.22)

This can easily be generalized to any logic formula involving a set of events (A_k). The time-independent probabilistic calculations performed with the point values Pr(A_k) remain valid at any instant t_i, provided that these values are replaced by the point values Pr(A_k, t_i). This can be applied to any of the Boolean approaches in order to perform quantitative probabilistic calculations, and it is particularly effective for availability/unavailability calculations when Boolean models are combined with Markovian models providing the probability inputs (see Chap. 27).

Pr(A, t) and Pr(B, t) time-dependent: the probabilities of A and B do not evolve independently from each other as time elapses. The simple calculations proposed above are then no longer valid. In particular, the reliability/unreliability calculations, which involve systemic time-dependencies between the events, belong to this case. The corresponding analytical calculations, which require the use of the Birnbaum importance factors (see Chaps. 22 and 24), are far more complicated than for availability/unavailability calculations. They are not tractable without accepting some approximations which, fortunately, are rather good when the systems are reliable.


In the general case of time-dependent events, no analytical calculations are tractable and they must be replaced by Monte Carlo simulation on the dynamic versions of reliability block diagrams, fault trees or event trees. As already mentioned above, the combination with Petri nets has proven to be very useful to do that (see Chaps. 26 and 27).

References

Bayes T (1763) An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53:370–418. London, UK
Page LB (1989) Probability for engineering with Applications to reliability. Computer Science Press, ISBN 0–7167–8187–5. New York, USA
Ruegg A (1995) Probabilités et Statistiques. Collection “Méthodes mathématiques pour l’Ingénieur”. Presses Polytechniques Romandes, Switzerland
Venn J (1880) On the diagrammatic and mechanical representation of proposition and reasoning. The London, Edinburgh and Dublin Philosophical magazine and journal of science. London, UK
Wikipedia Venn (2020): https://en.wikipedia.org/wiki/Venn_diagram. Accessed September 2020
Wikipedia Bayes (2020): https://en.wikipedia.org/wiki/Bayes’theorem . Accessed September 2020
Wikipedia CP (2020): https://en.wikipedia.org/wiki/Conditional_probability. Accessed September 2020

Chapter 15

Reliability Block Diagrams (RBDs)

15.1 History and Introduction to Reliability Block Diagrams

Which engineer analysing a system has not, at one time or another, quickly sketched some boxes and links to represent the components of this system and the relationships between them, either to make up their mind or to discuss with colleagues? This simple and popular approach dates back to the dawn of time and, unfortunately, the literature has not recorded the names of the clever people who invented it. When used to represent the logic relationship between the up states (see Chap. 4) of the components and the up state of the overall system, this approach leads to the reliability block diagrams (RBDs) described in this chapter.

The location of the RBD approach among the other approaches used to assess the reliability and safety of industrial systems is presented in Fig. 15.1: it is an analytical approach belonging to the Boolean family. The RBD approach therefore aims to model the logic structure of systems and is one of the ways to represent the logic function linking the states of a system to the states of its components. More precisely, an RBD models the logic links between the success state (up state) of a system and the success states (up states) of its components. It is drafted by using the symbols and structures presented in Sect. 15.2 and it is based on the following fundamental assumptions:

• the system, represented by an RBD, has only two states: for example, up/down or success/failed;
• the system is divided into individual parts (e.g. components, equipment, groups of components), represented by blocks (see Sect. 15.2), which also have only two states: for example, up/down or success/failed;

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_15


Fig. 15.1 Location of the RBD approach among the various probabilistic models [figure labels: probabilistic models; analytical approaches; Monte Carlo simulation; static models; dynamic models; Taylor expansion; simplified formulae; specific formulae; Boolean approaches (RBD, FT, ET); Markovian approaches (Markov graphs); behavioural approaches (Petri nets); state-transition model (finite state automata); generic tools; state of the art]

• thanks to the various logic structures and logic links presented in Sect. 15.2, an RBD models the logic linking the success state of the system to the success states of the blocks;
• each block behaves independently from the others.

This leads to directed acyclic graphs (DAGs) which, therefore, include no loops and have at least one input and one output. A given RBD thus embeds a logic formula giving the state of its output as a function of the states of its blocks.

From a formal point of view, the RBDs are equivalent to the fault trees (FTs) described in Chap. 16. RBDs and FTs are dual approaches: the RBDs model the up state (i.e. the success/good functioning) of the modelled system as a function of the up states of its components, whereas the FTs model the opposite, i.e. the down state (i.e. fault) of the system as a function of the down states of its components. This implies that RBDs and FTs have exactly the same mathematical properties and can produce exactly the same results: this is why Chaps. 17–25, devoted to mathematical developments, are common to both RBDs and FTs. When the above assumptions are fulfilled, these analytical developments can be used to perform qualitative, semi-quantitative and quantitative calculations on both RBD and FT models.

The difference comes from the way the models are built (bottom-up for RBDs and top-down for FTs) and from the level of the probabilities which are calculated (close to 1 for RBDs and close to 0 for FTs). Probabilistic approximations are therefore easier with FTs. RBDs are less flexible than FTs but easier to use for discussion with non-reliability engineers.


It has to be noted that an RBD is basically a static model implementing probabilities of success which are independent of time. Introducing the time variable (e.g. to consider probabilities of success evolving with time) raises some issues which should be fully understood by the users:

• sequential events: they are outside the scope of Boolean models, even if in some cases this can be overcome by approximation or by introducing composite blocks behaving independently of all the other blocks (see Sect. 15.2);
• availability or frequency calculations: in this case, independency implies that the repairs of the repaired blocks are also independent. This implies in turn that each of them has its own repair team. Ignoring this fact is likely to lead to non-conservative (i.e. optimistic) results;
• reliability calculations: in this case, the independency between blocks no longer exists because a failed block can be repaired only if the system has remained in up state since the beginning and is still operating when the failure occurs (see Chap. 22). This introduces a so-called systemic dependency between all the blocks of the RBD, which prevents exact calculations.

Therefore, when the fundamental assumptions are fulfilled, the RBD approach can be used for qualitative analysis and availability/frequency calculations. Reliability calculations can be done without approximation only if the RBD comprises only non-repaired blocks (because in this case availability and reliability are the same thing). When the RBD comprises repaired blocks, the reliability cannot be calculated without approximations which, fortunately, are rather good when the failures are quickly detected and repaired.

It seems paradoxical to give the name “reliability block diagram” to an approach which, except in the very simple case of non-repaired blocks, is not actually able to calculate the system reliability in a straightforward way. This is, in fact, a legacy from the times when “reliability” was used as an umbrella term encompassing all the activities devoted to improving the good functioning of systems. This meaning is still widely used in everyday language and in expressions like “reliability engineering”, whose scope is larger than reliability alone as defined (as an ability or a measure) in the IEV 192. It is also this general meaning which has been used in the title of this book.

15.2 Graphical Symbols and Basic RBD Structures

An RBD is a directed acyclic graph, which means that the links are oriented from input to output. Conventionally, this orientation is from left to right, as presented on the left-hand side of Fig. 15.2. The right-hand side of Fig. 15.2 presents the elementary brick of every RBD, that is to say the block which gives its name to the approach. A block generally represents the up state of a given component (or of an equipment or a group of components considered as a whole) and, to make the reading easier, the following notations are used hereafter:

• A (upright upper-case letter) represents both component A and the block modelling this component;
• A (italic upper-case letter) represents the up state of component A and, by assimilation, the state of the related block;
• a (italic lower-case letter) represents the logic variable associated with the state of component A and, by assimilation, the state of the related block.

Fig. 15.2 Directed links and elementary block of a reliability block diagram

The single block illustrated in Fig. 15.2 constitutes the simplest RBD which can be built. If the input, I, represents an event which is true, the state of the corresponding output, O, is simply O = A. This can be represented by the logic formula o = a.

The basic RBDs are built by using the two structures illustrated in Fig. 15.3:

• the series structure (left-hand side): in such a structure the modelled system is in up state if all the blocks are in up state. This is equivalent to the conjunction introduced in Chap. 14. For the example in the figure, the state of the system S1 is given by S1 = O = A ∩ B or by s1 = o = a · b in the form of a logic equation;
• the parallel structure (right-hand side): in such a structure the modelled system is in up state if at least one of the blocks is in up state. This is equivalent to the disjunction introduced in Chap. 14. For the example in Fig. 15.3, the state of the system S2 is given by S2 = O = A ∪ B or by s2 = o = a + b in the form of a logic equation.
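The two basic structures can be sketched in a few lines of Python. The function names below are illustrative, and the probability helpers assume independent blocks, so that formulae 14.15 and 14.12 from Chap. 14 apply; the numerical values are arbitrary.

```python
import math

def series(*blocks):
    """Series structure: up if all blocks are up (s = a . b . ...)."""
    return all(blocks)

def parallel(*blocks):
    """Parallel structure: up if at least one block is up (s = a + b + ...)."""
    return any(blocks)

def pr_series(*probs):
    """Probability of success of a series structure of independent blocks."""
    return math.prod(probs)

def pr_parallel(*probs):
    """Probability of success of a parallel structure of independent blocks:
    one minus the probability that every block is down."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# Hypothetical probabilities of success of blocks A and B.
p_a, p_b = 0.9, 0.8

print(series(True, False), parallel(True, False))   # False True
print(pr_series(p_a, p_b), pr_parallel(p_a, p_b))
```

For two blocks, `pr_parallel` gives the same value as Pr(A) + Pr(B) − Pr(A) · Pr(B), i.e. formula 14.12 combined with 14.15.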

Fig. 15.3 Basic structures used to build a reliability block diagram


Fig. 15.4 Repeated events and NOT operator

Compared to the three operators of the Boolean algebra (see Chap. 14), the NOT operator is missing from the basic RBD structures. This is normal as, in common use, the blocks model only components in up state. Nevertheless, it can be useful in some cases, and the symbols presented in Fig. 15.4 can then be used.

A repeated block is a block which is used in several places of an RBD. It can be used as it is (direct state) or in the complementary state (inverted state). This is useful, for example, to represent conditions which are fulfilled for one part of the RBD and not fulfilled for another part of the same RBD. Therefore, it is necessary to identify clearly which blocks are repeated and in which state. Fig. 15.4 proposes some symbols for this purpose:

• on the left-hand side, a block A repeated in the direct state is drafted with a double-line box. In this case, all other blocks A appearing in the RBD are simple copies of this repeated block;
• in the middle of the figure, two different drawings are proposed to represent a block A repeated in the inverted state. At the top is the drawing proposed in the IEC 61078 Ed. 3 (2016) standard and at the bottom the drawing used in the GRIF software package (GRIF-Workshop 2020), developed under the lead of one of the authors and the free version of which is used to perform the calculations presented in this book. In this case, this is a copy of the repeated block A but in the state $\overline{A}$ instead of A.

The inverted blocks are a kind of NOT operator, but another one is needed to invert the logic value of the output of blocks or structures appearing within the RBD. This is illustrated on the right-hand side of Fig. 15.4, where a common NOT gate is used for this purpose. Even if they are not commonly used, the repeated blocks, the inverted blocks and the NOT operators are very useful to extend the modelling power of the basic RBDs and do not change their nature of Boolean models.
Nevertheless, this makes the calculations more complicated and can also introduce incoherent behaviours (see Chap. 18) which, when dealing with large systems, can be solved only by using the binary decision diagrams (BDDs) described in Chap. 21.

Complementary graphical symbols are presented in Fig. 15.5: the majority vote logic m/n and a structure comprising a common block.

Fig. 15.5 Useful complementary RBD structures

The majority vote logic m/n (m out of n) is a logic structure widely used as a complement to the series and parallel structures: its output is true when at least m of its n inputs are true. It follows that its output is false when at least (n − m + 1) of its inputs are false, and therefore an m/n system with regard to success is an (n − m + 1)/n system with regard to fault. Hence, a 2/3 with regard to success is also a 2/3 with regard to fault. This symmetry between success and fault explains why this configuration is popular and widely used in instrumented systems: a 2/3 system needs two components out of three in up state to be in up state, but also two components in down state to be in down state.

When a block is common to several paths of an RBD, this is difficult to model just by using series and parallel structures, and a structure with a common block can be implemented instead. Note that arrows are indicated in the figure in order to remove any ambiguity about their direction (which can arise in complicated configurations with several common blocks). The Boolean formula corresponding to the example in Fig. 15.5 is the following: S3 = (A1 ∩ B1) ∪ [D ∩ (B1 ∪ B2)] ∪ (A2 ∩ B2)

(15.1)

In the above formula, block D appears only once but the formula can be written in a different way: S3 = [(A1 ∪ D) ∩ B1 ] ∪ [(A2 ∪ D) ∩ B2 ]

(15.2)
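The equivalence of formulae 15.1 and 15.2, and the success/fault duality of the m/n vote, can be checked by enumerating all block states. The Python sketch below is only an illustration: the function names are hypothetical, and the middle term of formula 15.1 is taken as D ∩ (B1 ∪ B2), which is the reading under which the two formulae agree.

```python
from itertools import product

def s3_common_block(a1, a2, b1, b2, d):
    """Formula 15.1: block D appears once."""
    return (a1 and b1) or (d and (b1 or b2)) or (a2 and b2)

def s3_repeated_block(a1, a2, b1, b2, d):
    """Formula 15.2: block D is repeated twice."""
    return ((a1 or d) and b1) or ((a2 or d) and b2)

# Compare both formulae over the 2**5 combinations of block states.
equivalent = all(
    s3_common_block(*states) == s3_repeated_block(*states)
    for states in product([False, True], repeat=5)
)

def m_out_of_n(m, inputs):
    """Majority vote m/n: true when at least m of the inputs are true."""
    return sum(inputs) >= m

def duality_holds(m, n):
    """Check that m/n for success is (n - m + 1)/n for fault."""
    for ups in product([False, True], repeat=n):
        downs = [not up for up in ups]
        system_down = not m_out_of_n(m, ups)
        if system_down != m_out_of_n(n - m + 1, downs):
            return False
    return True

print(equivalent, duality_holds(2, 3))
```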

Formula 15.2 can be represented by parallel and series structures, but block D then appears twice, which shows the usefulness of the repeated blocks introduced above.

Useful structures for handling large RBDs are proposed in Fig. 15.6. When an RBD is too large to be drafted on a single page, they allow it to be drafted on several pages:

• the sub-RBD on the left-hand side of Fig. 15.6, which models an individual part of an overall RBD and which is itself an RBD, here modelling a subsystem made of two redundant components A and B;
• the transfer gates on the right-hand side of Fig. 15.6, which allow an output somewhere in an RBD to be connected to an input located in another place of the same RBD.

Fig. 15.6 Other useful complementary RBD structures to split large RBDs

Sub-RBDs are useful when developing an RBD (see the examples proposed in Sect. 15.3) and, like transfer gates, they are also useful to split large RBDs into smaller ones. This is illustrated in Fig. 15.7, where the original RBD at the top is split into 3 parts at the bottom.

When a subsystem has a binary behaviour but cannot be analysed in more detail in terms of RBDs, a composite block can be implemented. This is the case for the composite block F given in Fig. 15.8: component D is normally running and component E is in standby position. When D fails, E is started to ensure the continuity of the operations and, when D is repaired, E goes back to standby position. The sequential aspects and the dependencies between D and E cannot be described within the Boolean framework, but this can easily be done by using, e.g., a Markov graph. Therefore, the composite block F can be handled as a simple block within an RBD, while its probability of success is calculated by the Markov graph modelling D and E in cold standby operation (see Chap. 31).

Fig. 15.7 Example of use of transfer gates and sub-RBDs

Fig. 15.8 Symbol for composite blocks


This is useful when a system comprising dependent components can be split into independent functions modelled by ordinary sub-RBDs or as composite blocks. In particular, this leads to the RBD-driven Markov models (see Chap. 27) which use the RBDs to model the logic structure of the modelled system and small Markovian models to compute the availabilities of the ordinary blocks as well as the availabilities of the composite blocks.

15.3 Building an RBD from Simple Examples

A simple pumping system is modelled in Fig. 15.9. It comprises a valve V1 in series with two redundant pumping trains: train 1, made of two 50% capacity pumps (P1 and P2) flowing in parallel through a valve (V2), and train 2, made of a 100% capacity pump (P3) and a valve (V3). The corresponding RBD can be built by undertaking the following steps:

• identification of the global function to be achieved by the system: providing a pumping capacity of 100% from the input to the output of the pumping system;
• identification of the subfunctions participating in the global function:
– connection of the pumping system to the upstream source (not represented), performed by the isolation valve V1;
– pumping function, performed by the two redundant trains;
• drafting of the RBD according to the above analysis.

These two functions are in series, as represented by node 1 in the RBD on the right-hand side of Fig. 15.9. Trains 1 and 2 are redundant because each of them has a pumping capacity of 100%. Therefore, the RBD in Fig. 15.9 can be detailed by modelling this redundancy by a parallel structure with two branches, one related to train 1 and the other to train 2. This has been done in the RBD on the left-hand side of Fig. 15.10 (node 2).

Fig. 15.9 Example of an RBD modelling a simple pumping system (1/2)


Fig. 15.10 Example of an RBD modelling a simple pumping system (2/2)

The next step is to detail the sub-RBDs related to trains 1 and 2. This is done on the right-hand side of Fig. 15.10:

• the pumping capacities of P1 and P2 are only 50%; to be in up state, train 1 therefore needs to have P1, P2 and V2 in up state at the same time to provide a pumping capacity of 100%, and this is modelled by a series structure (nodes 3 and 4);
• the pumping capacity of P3 being 100%, train 2 needs to have P3 and V3 in up state at the same time to be in up state, and this is modelled by a series structure (node 5).

It has to be noted that pumps P1 and P2, which are physically in parallel (left-hand side of Fig. 15.9), are in series from a functional logic point of view (right-hand side of Fig. 15.10) because the failure of one of them is sufficient to fail train 1. This illustrates the fact that, even if the structure of an RBD is often close to that of the modelled physical system, this is not always the case.

As the RBD on the right-hand side of Fig. 15.10 is detailed at the block (i.e. component) level, this is the final RBD related to the pumping system and it embeds the logic formula linking the up state of the whole pumping system to the up states of its components. The correspondence between the RBD nodes and the operators (·, +) of the logic formula is indicated at the top of the right-hand side of Fig. 15.10. This leads to the following formula for the state of the pumping system: S = V1 ∩ [(P1 ∩ P2 ∩ V2) ∪ (P3 ∩ V3)]

(15.3)

Using logic variables leads to the logic formula provided in Fig. 15.10: s = v1 · [(p1 · p2 · v2) + (p3 · v3)]

(15.4)
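As an illustration, the logic formula of the pumping system can be coded directly and used to compute a probability of success. The Python sketch below assumes independent blocks and uses arbitrary probabilities; the function names are hypothetical. Because no block is repeated in the formula, the series and parallel rules can be applied step by step, and the result is cross-checked by summing over all block states.

```python
import math
from itertools import product

def pumping_system_up(v1, p1, p2, v2, p3, v3):
    """Logic formula of the pumping system: s = v1 . [(p1 . p2 . v2) + (p3 . v3)]."""
    return v1 and ((p1 and p2 and v2) or (p3 and v3))

def pumping_system_probability(pv1, pp1, pp2, pv2, pp3, pv3):
    """Probability of success, assuming independent blocks."""
    train1 = pp1 * pp2 * pv2                       # series structure
    train2 = pp3 * pv3                             # series structure
    trains = train1 + train2 - train1 * train2     # parallel structure
    return pv1 * trains                            # V1 in series with the trains

def probability_by_enumeration(structure, probs):
    """Sum the probabilities of all block-state combinations where the system is up."""
    total = 0.0
    for states in product([False, True], repeat=len(probs)):
        if structure(*states):
            weight = 1.0
            for up, p in zip(states, probs):
                weight *= p if up else 1.0 - p
            total += weight
    return total

# Hypothetical probabilities of success for V1, P1, P2, V2, P3, V3.
probs = [0.99, 0.95, 0.95, 0.98, 0.90, 0.98]

p_formula = pumping_system_probability(*probs)
p_enumeration = probability_by_enumeration(pumping_system_up, probs)
print(p_formula, p_enumeration)
```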

Fig. 15.11 illustrates another example, which is related to safety instead of production: a typical safety instrumented system made of 3 sensors, one logic solver and 2 safety valves, designed to prevent overpressure in a protected tank. Again, the corresponding RBD can be built by undertaking the following steps:


Fig. 15.11 Example of an RBD modelling a typical safety instrumented system

• identification of the global function to be achieved by the system: protect the tank from overpressure by closing one of the safety valves V1 or V2;
• identification of the subfunctions participating in the global function:
– overpressure detection by at least 2 of the sensors (pressure transmitters);
– logic treatment of the signals coming from the sensors by the logic solver, which triggers a demand to close the valves if at least 2 sensors out of 3 detect an overpressure;
– closure of at least one valve when the demand occurs;
• drafting of the RBD according to the above analysis.

According to the above analysis, three functions have been identified (pressure detection by the sensors, signal processing by the logic solver and inlet flow shut-off by the safety valves) which are necessary and sufficient to achieve the overall safety function. As the failure of any of them leads to the failure of the safety function, these three functions have to be organized in series, as shown in Fig. 15.11. The next step is to develop the sub-RBDs related to these 3 functions, i.e. a 2/3 majority vote for the sensors, a single block for the logic solver and a parallel structure for the valves (because the closure of only one of them is sufficient to prevent the overpressure). Finally, the RBD drafted on the right-hand side of the figure is obtained.

In this RBD, a 2/3 majority vote structure is used and it is necessary to establish the logic formula related to this structure, which is in up state if at least 2 of the 3 sensors are in up state. There are 3 possibilities, S1 ∩ S2, S1 ∩ S3 and S2 ∩ S3, which can be represented by a parallel structure of three series structures: S2/3 = (S1 ∩ S2) ∪ (S1 ∩ S3) ∪ (S2 ∩ S3)

(15.5)

Using logic variables gives: s2/3 = s1 · s2 + s1 · s3 + s2 · s3

(15.6)


Then, the formula giving the up state S of the safety system as a function of the up states of the various blocks is the following: S = S2/3 ∩ L ∩ (V1 ∪ V2 )

(15.7)

Using logic variables gives: s = s2/3 · l · (v1 + v2 )

(15.8)
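Formulae 15.6 and 15.8 can be illustrated by the following Python sketch, in which the function names and the numerical values are assumptions. Since each sensor variable appears twice in formula 15.6, the three product terms are not independent and their probabilities cannot simply be combined with the series and parallel rules; the sketch therefore sums over all block states, which is exact for independent blocks but only practical for small systems.

```python
import math
from itertools import product

def two_out_of_three(s1, s2, s3):
    """Formula 15.6: s2/3 = s1 . s2 + s1 . s3 + s2 . s3."""
    return (s1 and s2) or (s1 and s3) or (s2 and s3)

def sis_up(s1, s2, s3, l, v1, v2):
    """Formula 15.8: s = s2/3 . l . (v1 + v2)."""
    return two_out_of_three(s1, s2, s3) and l and (v1 or v2)

def probability_by_enumeration(structure, probs):
    """Sum the probabilities of all block-state combinations where the system is up."""
    total = 0.0
    for states in product([False, True], repeat=len(probs)):
        if structure(*states):
            weight = 1.0
            for up, p in zip(states, probs):
                weight *= p if up else 1.0 - p
            total += weight
    return total

# Hypothetical probabilities of success (identical sensors, identical valves).
p_sensor, p_solver, p_valve = 0.9, 0.99, 0.95

p_vote = probability_by_enumeration(two_out_of_three, [p_sensor] * 3)
p_system = probability_by_enumeration(
    sis_up, [p_sensor] * 3 + [p_solver] + [p_valve] * 2
)
print(p_vote, p_system)
```

For identical sensors, `p_vote` agrees with the well-known closed form 3p² − 2p³ for a 2/3 vote.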

15.4 Tie and Cut Set Identification

15.4.1 Electrical Analogy

A typical component with a binary behaviour is an electrical switch, which leads to the idea of considering a block as an electrical switch (see Fig. 15.12) which is:

• closed to model the up state of the corresponding component;
• open to model the down state of the corresponding component.

If the block is imagined as being included in a virtual electric circuit (in grey in Fig. 15.12), the circuit is closed and the lamp is switched on (it lights) when the block is in up state, and the circuit is cut and the lamp is switched off (it does not light) when the block is in down state.

15.4.2 Concept of Minimal Cut and Tie Sets

The electrical analogy described above allows any RBD to be transformed into an electrical circuit whose analysis consists in identifying the combinations of up and down states of the blocks which close or cut the resulting circuit.

Fig. 15.12 Block seen as an electrical switch


Fig. 15.13 Example of success paths (tie sets)

Fig. 15.13 gives two examples related to the RBD in Fig. 15.10 where the virtual electric circuit is closed and which correspond to configurations, Ts1 and Ts2, where the system is in up state. From the point of view of the system state, such configurations are called success paths and, from the point of view of the electrical analogy, they are called tie sets. Each of these configurations contains blocks in up state sufficient to lead to the up state of the modelled system. The difference between them is the following:

• Ts1: except for V1, the opening of P1, P2, V2, P3 or V3 does not open the circuit. Then not all the closed switches are necessary to close the circuit (and the blocks in up state are sufficient but not all necessary to lead to the system up state). This is an ordinary success path (tie set).
• Ts2: the opening of V1, P1, P2 or V3 opens the circuit. Then all the closed switches are necessary to close the circuit (and all the blocks in up state are necessary and sufficient to lead to the system up state). This is a minimal success path (tie set).

Using the Boolean algebra properties leads to representing the system up state as the union of the minimal tie sets of the corresponding RBD: $S = \bigcup_i Ts_i$

(15.9)

Fig. 15.14 gives two examples where the virtual electric circuit is open and which correspond to configurations, Cs1 and Cs2 , where the system is in down state. From the point of view of the system state such configurations are called failure paths and from the point of view of the electrical analogy they are called cut sets. Each of these configurations contains blocks in down state sufficient to lead to the down state of the modelled system. The difference between them is the following:

Fig. 15.14 Example of failure paths (cut sets)


• Cs1: except for V1, the closing of P3 or V3 does not close the circuit. Therefore, not all the open switches are necessary to open the circuit (the blocks in down state are sufficient, but not all of them are necessary, to lead to the system down state). This is an ordinary failure path (cut set).
• Cs2: the closing of P1 or P3 closes the circuit. Therefore, all the open switches are necessary to open the circuit (the blocks in down state are necessary and sufficient to lead to the system down state). This is a minimal failure path (cut set).

Using the Boolean algebra properties, the system down state can be represented as the union of the minimal cut sets of the corresponding RBD:

$$\overline{S} = \bigcup_j Cs_j \qquad (15.10)$$

Gathering formulae 15.9 and 15.10 gives the following equality:

$$\bigcup_i Ts_i = \overline{\bigcup_j Cs_j} \qquad (15.11)$$

15.5 RBD Representation by Tie and Cut Sets

The minimal tie and cut sets can be obtained from the logic formulae related to the considered RBD. For the example above, the minimal tie sets are the following:

$$S = (V_1 \cap P_1 \cap P_2 \cap V_2) \cup (V_1 \cap P_3 \cap V_3) \qquad (15.12)$$

The above formula provides one minimal tie set of order 3, (V1 ∩ P3 ∩ V3), and one minimal tie set of order 4, (V1 ∩ P1 ∩ P2 ∩ V2). This gives an equivalent representation, with minimal tie sets, of the RBD presented in Fig. 15.10. It is shown in Fig. 15.15 where it has to be noted that, with such a representation, block V1 appears twice.

Fig. 15.15 Equivalent RBD represented by minimal tie sets

The minimal cut sets can be obtained simply by applying De Morgan's law to formula 15.12:

$$\overline{S} = \overline{(V_1 \cap P_1 \cap P_2 \cap V_2) \cup (V_1 \cap P_3 \cap V_3)}$$

This gives:

$$\overline{S} = (\overline{V_1} \cup \overline{P_1} \cup \overline{P_2} \cup \overline{V_2}) \cap (\overline{V_1} \cup \overline{P_3} \cup \overline{V_3})$$

And, finally, the minimal cut sets are obtained by developing the formula:

$$\overline{S} = \overline{V_1} \cup (\overline{P_1} \cap \overline{P_3}) \cup (\overline{P_1} \cap \overline{V_3}) \cup (\overline{P_2} \cap \overline{P_3}) \cup (\overline{P_2} \cap \overline{V_3}) \cup (\overline{V_2} \cap \overline{P_3}) \cup (\overline{V_2} \cap \overline{V_3}) \qquad (15.13)$$

Fig. 15.16 Equivalent RBD represented by minimal cut sets

This provides 7 minimal cut sets, one of order 1 and six of order 2. Applying De Morgan's law again gives the following equivalent formula:

$$S = V_1 \cap (P_1 \cup P_3) \cap (P_1 \cup V_3) \cap (P_2 \cup P_3) \cap (P_2 \cup V_3) \cap (V_2 \cup P_3) \cap (V_2 \cup V_3) \qquad (15.14)$$

This gives an equivalent representation, with minimal cut sets, of the RBD presented in Fig. 15.10. It is shown in Fig. 15.16 where, except for event V1, all the other events are repeated 2 or 3 times.
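The passage from minimal tie sets to minimal cut sets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`minimize`, `minimal_cut_sets`, `system_up`) are hypothetical and do not come from the book, and the brute-force expansion is practical only for small RBDs. The sketch picks one block from each minimal tie set (which is what developing the De Morgan form of formula 15.12 amounts to) and then discards the non-minimal combinations.

```python
from itertools import product

# Minimal tie sets of the pumping-system RBD, taken from formula 15.12.
MINIMAL_TIE_SETS = [
    frozenset({"V1", "P1", "P2", "V2"}),
    frozenset({"V1", "P3", "V3"}),
]
ALL_BLOCKS = frozenset().union(*MINIMAL_TIE_SETS)


def minimize(sets):
    """Keep only the sets that do not strictly contain another set (absorption)."""
    unique = set(sets)
    return {s for s in unique if not any(other < s for other in unique)}


def minimal_cut_sets(tie_sets):
    """Choose one failed block in every tie set, then remove non-minimal choices."""
    candidates = (frozenset(choice) for choice in product(*tie_sets))
    return minimize(candidates)


def system_up(up_blocks, tie_sets):
    """The system is up when all the blocks of at least one tie set are up."""
    return any(ts <= up_blocks for ts in tie_sets)


cut_sets = minimal_cut_sets(MINIMAL_TIE_SETS)
for cs in sorted(cut_sets, key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

For this example the sketch returns the single block V1 plus the six pairs obtained by combining one of P1, P2, V2 with one of P3, V3, in line with formula 15.13.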

15.6 Associated Exercises

Two exercises related to Chap. 15 are proposed in Chap. 29:

• Exercise 15.1: build the RBD related to an overpressure protection system;
• Exercise 15.2: identify the tie sets related to the above system.

References

GRIF-workshop (2020) BFiab module. Funded and developed by TOTAL. https://grif-workshop.fr/. Accessed Sept 2020

IEC 61078 Ed. 3 (2016) Reliability block diagrams. International Electrotechnical Commission (IEC), Geneva, Switzerland

Chapter 16

Fault Tree Analysis (FTA)

16.1 History and Introduction to Fault Tree Analysis

Unlike the reliability block diagrams, whose origin remains unknown, the origin of the fault tree is clearly identified. FT analysis has been developed since 1961 within the framework of the Minuteman project of the US Air Force: Watson (1961), Haasl (1965). Its history is described, for example, in Ericson (1999). Thanks to its original way of thinking, deductive (top-down) when the other approaches were mainly inductive (bottom-up), the use of FTs was adopted immediately and quickly spread to the aeronautics, space and nuclear industries and finally throughout the whole industry. The FT approach now belongs to the battery of tools currently used to assess the reliability and safety of industrial systems and it is standardized (IEC 61025 Ed. 3.0 in progress).

The location of the FT approach among the other approaches is presented in Fig. 16.1: this is an analytical approach belonging to the Boolean family. Therefore, the FT approach aims to model the logic structure of systems and is one of the ways of representing the logic function linking the state of a system to the states of its components. More precisely, an FT models the logic links between the faulty state (down state) of a system and the faulty states (down states) of its components. It is drafted by using the symbols and structures presented in Sect. 16.2 and it is based on the following fundamental assumptions:

• the system, represented by an FT, has only two states: for example, up/down or success/failed;
• the system is divided into individual parts (e.g. components, equipment, groups of components) which also have only two states (e.g. up/down or success/failed) and whose down states are represented by primary events (also called leaves) (see Sect. 16.2);

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_16


Fig. 16.1 Location of the FT approach among the various probabilistic models

• thanks to the various logic symbols and logic links presented in Sect. 16.2, an FT models the logic linking the down state of the system (top event of the FT) to the down states of the components (primary events of the FT);
• each primary event behaves independently from the others.

This leads to directed acyclic graphs which, therefore, include no loops and have as many inputs as primary events and at least one output, which is the top event of the FT. A given FT thus embeds a logic formula which makes it possible to calculate the state of the top event as a function of the states of the primary events (leaves of the FT).

From a formal point of view, FTs are equivalent to the reliability block diagrams (RBDs) described in Chap. 15. FTs and RBDs are dual approaches: FTs model the down state (i.e. the fault/malfunctioning) of the modelled system as a function of the down states of its components whereas RBDs model the opposite, i.e. the up state (i.e. success) of the system as a function of the up states of its components. This implies that FTs and RBDs have exactly the same mathematical properties and can produce exactly the same results: this is why Chaps. 17–25, devoted to mathematical developments, are common to both FTs and RBDs. When the above assumptions are fulfilled, analytical calculations can be undertaken to perform qualitative, semi-quantitative and quantitative analyses on both FT and RBD models.

The difference comes from the way the models are built (top-down for FTs and bottom-up for RBDs) and from the level of the probabilities which are calculated (close to 0 for FTs and close to 1 for RBDs). On the one hand, this makes probabilistic approximations easier with FTs; on the other hand, RBDs are less flexible than FTs but easier to use for discussion with non-reliability engineers.


It has to be noted that an FT is basically a static model implementing probabilities of failure independent of time. Introducing the time variable (e.g. to consider probabilities of failure evolving with time) leads to some issues which should be fully understood by the users:

• sequential events: they are outside the scope of Boolean models even if, in some cases, this can be overcome by approximation or by introducing composite primary events behaving independently of all the other primary events (see Sect. 16.2);
• unavailability or frequency calculations: in this case, independence implies that the repairs of the repaired components are also independent. This implies in turn that each of them has its own repair team. Ignoring this fact is likely to lead to non-conservative (i.e. optimistic) results;
• unreliability calculations: in this case, the independence between components no longer exists because this calculation implies considering that a failure can be repaired only if the system has remained in up state the whole time and remains operating when the failure occurs (see Chap. 22). This introduces a so-called systemic dependency between all the primary events of the FT and prevents exact calculations from being performed.

Therefore, when the fundamental assumptions are fulfilled, the FT approach can be used for qualitative analysis and availability/frequency calculations. Reliability calculations can be done without approximation if the FT comprises only non-repaired primary events (because in this case unavailability and unreliability are the same thing). When the FT comprises repaired failures, the unreliability cannot be calculated without approximations which, fortunately, are rather good when the failures are quickly detected and repaired.

16.2 Graphical Symbols and Basic FT Symbols

An FT is a directed acyclic graph, which means that the links are oriented from input to output. Conventionally this orientation is from bottom to top, as presented on the left-hand side of Fig. 16.2. This figure also presents the top event, which is located at the top of the FT, and the intermediate event, which can be located anywhere in the tree in order to explain the meaning of an intermediate output.

Fig. 16.2 Directed links, top event and intermediate event

The main inputs of an FT are called primary events. They are the leaves of the FT and are generally used to represent the fault (i.e. the down state) of a given component (or of an equipment or a group of components considered as a whole). They are the dual of the blocks, which represent up states instead of down states, used in RBD models.

Fig. 16.3 Main primary events (FT leaves): basic event (L1), elementary event (L2) and event to be developed (L3)

As for the RBDs (see Chap. 15), and in order to make the reading easier and to ensure consistency between the RBD and FT chapters, the following notations are implemented:

• the leaves of an FT are represented by L1, L2, ..., Ln;
• an upright upper-case letter represents a component (equipment, group of components) (e.g. B for the basic event or E for the elementary event in Fig. 16.3);
• the down state of this component is modelled by a primary event and noted by the same letter in italic with a bar at the top (e.g. $\overline{B}$ for the basic event or $\overline{E}$ for the elementary event in Fig. 16.3);
• the logic variable associated with the down state of this component is noted by the same letter in italic and lower-case (e.g. b for the basic event or e for the elementary event in Fig. 16.3).

Therefore, the output of a primary event is equal to the state of this primary event, e.g. $O = \overline{B}$ for the basic event or $O = \overline{E}$ for the elementary event. This can also be represented by the logic formula o = b or o = e.

As shown in Fig. 16.3, several symbols are used to draft the primary events in order to provide information about their meanings:

• Basic event: represented by a circle, this is an event which cannot be analysed in more detail.
• Elementary event: represented by a diamond, this is an undeveloped event which could be analysed in more detail but for which this is not considered useful. This can be because reliability data are readily available at that level or because a more detailed analysis is likely to bring no further information.
• Event to be developed: represented by a double diamond, this is an event which is not developed at this stage of the FT building but which must be developed at a further stage. This notation is used as a reminder for the analyst (e.g. reliability engineer) drafting the FT.

Other symbols used to draw primary events are illustrated in Fig. 16.4:

• House event: represented by a house, this is an event which is expected to occur with certainty during the life of the considered system. This is often an external event.

Fig. 16.4 Other primary events (FT leaves): house event, condition and dormant (latent) fault

• Condition: represented by a rounded box (see Fig. 16.4), this is a conditioning event used to validate a part of an FT. It is generally used in relationship with an IF gate (see Fig. 16.7).
• Dormant (or latent) fault: represented by a diamond with a double line, this is an event which is not detected when it happens. It can be revealed when a demand occurs on the related component or when proof tests or inspections are performed in order to detect it before the demand occurs. Bugs in computer programs are dormant faults.

It has to be noted that an intermediate event has been placed at the top of each of the primary events. This is strongly recommended in order to identify clearly the meaning of the primary events and to understand accurately what is modelled by the FT. This is also very useful for external readers or as a reminder for the analyst of what was done some time ago.

The basic FTs are built by using the two logic gates illustrated in Fig. 16.5:

• OR gate: this is the disjunction introduced in Chap. 14. For the example in the figure, the state at the output of an OR gate is equal to the union of its inputs, i.e. O = I1 ∪ I2 ∪ … ∪ In, or o = i1 + i2 + … + in in the form of a logic equation. This is the dual representation of the series structure introduced in the RBDs (see Chap. 15). The output of an OR gate represents a down state if at least one of its inputs represents a down state.
• AND gate: this is the conjunction introduced in Chap. 14. For the example in the figure, the state at the output of an AND gate is equal to the intersection of its inputs, i.e. O = I1 ∩ I2 ∩ … ∩ In, or o = i1 · i2 · … · in in the form of a logic equation. This is the dual representation of the parallel structure introduced in the RBDs (see Chap. 15). The output of an AND gate represents a down state if all of its inputs represent down states.

Fig. 16.5 Basic logic gates used to build a fault tree: OR gate (disjunction) and AND gate (conjunction)
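The duality between the two basic FT gates and the series and parallel RBD structures can be checked exhaustively on a small case. The following Python sketch is an illustration only: the function names are hypothetical, a down state is encoded as `True` on the FT side and an up state as `True` on the RBD side.

```python
from itertools import product


def or_gate(*inputs_down: bool) -> bool:
    """FT OR gate: the output is down as soon as at least one input is down."""
    return any(inputs_down)


def and_gate(*inputs_down: bool) -> bool:
    """FT AND gate: the output is down only when all the inputs are down."""
    return all(inputs_down)


def series_rbd(*blocks_up: bool) -> bool:
    """Series RBD structure: up only when all the blocks are up."""
    return all(blocks_up)


def parallel_rbd(*blocks_up: bool) -> bool:
    """Parallel RBD structure: up as soon as at least one block is up."""
    return any(blocks_up)


def duality_holds(n: int) -> bool:
    """Check OR <-> series and AND <-> parallel over all 2**n input states."""
    for downs in product([False, True], repeat=n):
        ups = tuple(not d for d in downs)
        if or_gate(*downs) != (not series_rbd(*ups)):
            return False
        if and_gate(*downs) != (not parallel_rbd(*ups)):
            return False
    return True


print(duality_holds(3))
```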

Again, it has to be noted that an intermediate event has been placed at the top of each of the logic gates. This is strongly recommended to identify clearly what is exactly modelled by the sub-FT below this logic gate. Compared to the three operators of the Boolean algebra (see Chap. 14), the NOT operator is missing from the basic FT logic gates. This is normal as, in common use, the primary events model only components in down state. Nevertheless, it can be useful to invert events (see Fig. 16.6) and logic gates (see Fig. 16.9). When drafting an FT, the same primary event can appear in several places and, in addition, a repeated primary event can be true in one part of the FT and false

Fig. 16.6 Repeated primary events (direct and inverted)

Fig. 16.7 Useful complementary FT logic gates: k/n majority vote gate, IF gate and transfer gates

Fig. 16.8 Example of use of transfer gates and sub-FTs

in another part of the same FT. This implies introducing the concept of repeated primary events, which can be used as they are (direct state) or in their complementary states (inverted states). Therefore, it is useful to identify clearly the primary events which are repeated and in which state they are repeated. Fig. 16.6 proposes some symbols for this purpose:

• repeated primary event A: used as a reference, it is drafted by using a thick line to box it (see the left-hand side of the figure);
• direct copy of event A: it can be drafted in the same way as the reference repeated event but sometimes it is useful to make the difference between the reference and the copy. In that case, it can be drafted by using a grey background (see the middle of the figure);
• inverted copy of event A (i.e. $\overline{A}$): it can be drafted by using a NOT gate as proposed in the middle of Fig. 16.6 (see also Fig. 16.9). As for inverted blocks (see Chap. 15), a double line with a dotted line in the inner circle (see the right-hand side of the figure) can also be used.

Complementary graphical symbols are presented in Fig. 16.7:

• Majority vote logic k/n: this is a logic structure widely used as a complement to the OR and AND logic gates: its output is true when at least k of its n inputs are true. It follows that its output is false when at least (n − k + 1) of its inputs are false and, therefore, a k/n system with regard to faults is an (n − k + 1)/n system with regard to success. In particular, a 2/3 with regard to success is also a 2/3 with regard to faults. This symmetry between success and faults explains why this is a popular configuration widely used in instrumented systems: a 2/3 system needs two components out of three in up state to be in up state but also two components in down state to be in down state. For safety systems, that means that two dangerous failures are needed to inhibit the safety action but also that two safe failures are needed to trigger a spurious safety action. This is why it is used to minimize spurious failures.
• IF logic gate: this is a logic gate which validates its output when the condition is true and inhibits its output when the condition is false.
• Transfer gates: they allow large FTs to be split into several sub-FTs (see Fig. 16.8).

As shown in Fig. 16.8, two output transfer gates (Tg1 and Tg2) have been used on the left-hand side to identify two sub-fault trees (Sub-tree 1 and Sub-tree 2) which in turn are used as inputs of the main fault tree on the right-hand side. The intermediate events have been removed to simplify the figure and obtain a very simple example. When handling large fault trees, this makes it possible to build a sub-fault tree for each subsystem of the whole system under study. This is useful to split the whole FT between several pages and to keep control throughout the FT building process.

As said above, the NOT operator is not in common use when using FT models. Nevertheless, the inverted copy of a repeated event illustrated in Fig. 16.6 embeds a NOT operator which can be explicitly represented by the NOT gates shown on the

Fig. 16.9 NOT gate and its use with primary events or other logic gates (NOT gate, inverted event, NAND gate, NOR gate)

Fig. 16.10 Exclusive OR gate and its equivalent presentation with OR, AND and NOT gates

left-hand side of Fig. 16.9. Therefore, the inverted event illustrated in this figure is an equivalent representation of the inverted copy of a repeated event illustrated in Fig. 16.6. The NOT gate can also be used to extend the logic gates used to build FTs:

• NOR gate: this is an ordinary OR gate with the inversion of its output by a NOT gate. This gives $O = \overline{I_1 \cup I_2 \cup \ldots \cup I_n} = \overline{I_1} \cap \overline{I_2} \cap \ldots \cap \overline{I_n}$;
• NAND gate: this is an ordinary AND gate with the inversion of its output by a NOT gate. This gives $O = \overline{I_1 \cap I_2 \cap \ldots \cap I_n} = \overline{I_1} \cup \overline{I_2} \cup \ldots \cup \overline{I_n}$.

The introduction of the NOT gate is also the opportunity to talk about the exclusive OR logic gate (XOR), which is illustrated in Fig. 16.10:

• XOR gate: this is a special OR gate with only two inputs. Its output is true if one and only one of its inputs is true. This gives $O = (I_1 \cap \overline{I_2}) \cup (\overline{I_1} \cap I_2)$. Therefore, the XOR gate can be represented by a combination of ordinary OR, AND and NOT gates as illustrated in Fig. 16.10. In fact, each of the inputs (I1 and I2) is repeated twice, once in the direct state and once in the inverted state.

The repeated primary events and the NOT gate (and the resulting NOR, NAND and XOR gates) are very useful to extend the modelling power of the basic FTs without changing their nature of Boolean models. Nevertheless, this generally makes the calculations more complicated and can introduce incoherent behaviours (see Chap. 18) which can be solved by using the binary decision diagrams (BDDs) described in Chap. 27.

When a subsystem has a binary behaviour but cannot be analysed in more detail in terms of FTs, it can be input in the FT in the form of an event to be developed

Fig. 16.11 Event to be developed modelling a binary subsystem not developable in a sub-FT

(double diamond). This is the same philosophy as that developed in Chap. 15 for the composite blocks. For example, the cold standby redundancy illustrated in Fig. 16.11 cannot be modelled by a sub-FT involving components D and E, but system F, which has only two states (up and down), can be modelled as a whole by a primary event of an FT. Although the sequential aspects and the dependencies between D and E cannot be described within the Boolean framework, they can easily be described by using e.g. a Markov graph. Therefore, the event to be developed F can be handled as a simple primary event within an FT while its probability of failure (down state) is calculated by the Markov graph modelling D and E in cold standby operation. This leads to the FT-driven Markov models (see Chap. 27) which use FTs to model the logic structure of the modelled system and Markovian models to compute the probabilities of the primary events.
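As a rough illustration of how the probability of such a composite primary event might be obtained, the Python sketch below treats the simplest possible cold standby case. Everything specific in it is an assumption made for the example and not taken from the text: D and E are identical with a constant failure rate `lam`, they are not repaired, E cannot fail while in standby and the switch-over is perfect and instantaneous. Under these assumptions the three-state Markov graph has a well-known closed-form solution, which is compared with a crude explicit Euler integration of the same graph and with the result an AND gate would give if D and E were running simultaneously.

```python
from math import exp


def cold_standby_closed_form(lam: float, t: float) -> float:
    """Probability that F is down at t (D then E have both failed)."""
    return 1.0 - exp(-lam * t) * (1.0 + lam * t)


def cold_standby_markov(lam: float, t: float, steps: int = 20_000) -> float:
    """Same probability from the three-state Markov graph (explicit Euler).

    State 0: D running, E in standby. State 1: D failed, E running.
    State 2: D and E failed (F down).
    """
    p0, p1, p2 = 1.0, 0.0, 0.0
    dt = t / steps
    for _ in range(steps):
        flow01 = lam * p0 * dt  # D fails and E takes over
        flow12 = lam * p1 * dt  # E fails and F goes down
        p0, p1, p2 = p0 - flow01, p1 + flow01 - flow12, p2 + flow12
    return p2


def active_parallel(lam: float, t: float) -> float:
    """AND-gate result if D and E were independent and both running from t = 0."""
    return (1.0 - exp(-lam * t)) ** 2


lam, t = 1e-3, 1000.0  # hypothetical values (per hour, hours)
print(cold_standby_closed_form(lam, t), cold_standby_markov(lam, t), active_parallel(lam, t))
```

The value returned for F would then be used as the probability of the corresponding leaf in the FT, exactly as for an ordinary primary event.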

16.3 Building an FT of Simple Examples

The simple pumping system already modelled by using an RBD in Chap. 15 is illustrated in Figs. 16.11 and 16.12. It comprises a valve V1 in series with two redundant pumping trains. Train 1 is made of two 40 m³/h capacity pumps (P1 and P2) flowing in parallel through a valve (V2) and train 2 is made of an 80 m³/h capacity pump (P3) and a valve (V3).

The first step when building an FT is to identify the global function to be achieved by the system. In this example, this is to provide a pumping capacity of at least 80 m³/h. This implies that the system is faulty (i.e. in down state) when the flow at the system output drops below 80 m³/h. This event becomes the starting point for developing the FT: it is the top event of the FT (see Fig. 16.12 on the right-hand side). The

Fig. 16.12 Example of an FT modelling a simple pumping system (1/5)

occurrence of this event represents the whole system failure (i.e. the system down state or the system fault) $\overline{S}$. The top event is now the unwanted effect which is going to be analysed in more detail by a top-down (effect ⇒ cause or deductive) process in order to find its causes. At this analysis level, the system can be split into two parts (see Fig. 16.13 on the left) and this points out that the top event may happen when valve V1 has failed closed or when the rest of the system is not able to produce at least 80 m³/h. This leads to the identification of the primary event (leaf L1) “V1 failed close” and of the intermediate event “Flow rate

$$P_{3/4}(p) = 1 - P_{2/4}(1 - p) \qquad (19.19)$$

And, more generally:

$$P_{m/n}(p) = 1 - P_{(n-m+1)/n}(1 - p) \qquad (19.20)$$

Again, the above formulae are easy to use when the inputs of the majority vote nodes are individual blocks or when the inputs of the majority vote gates are primary events. They can also be used when the inputs are intermediate RBD nodes or intermediate FT events, provided that these inputs are independent. As soon as the inputs are not independent, a more general procedure such as the one based on binary decision diagrams (see Chap. 21) is more effective.
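Formula 19.20 can be checked numerically with a short Python sketch. The reading of $P_{m/n}(p)$ used here is an assumption, since its definition is not reproduced above: it is taken to be the probability that at least m out of n independent inputs, each true with the same probability p, are true. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(m: int, n: int, p: float) -> float:
    """Probability that at least m of n independent inputs (each true with
    probability p) are true: a binomial tail sum."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(m, n + 1))


p = 0.1  # hypothetical input probability
left = prob_at_least(3, 4, p)
right = 1.0 - prob_at_least(2, 4, 1.0 - p)  # right-hand side of formula 19.19
print(left, right)
```

With this reading, "at least m inputs true" is the complement of "at least (n − m + 1) inputs false", which is why both sides agree up to rounding.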

19.3 Sylvester-Poincaré Formula

As shown in Chaps. 15 and 16, the Boolean formula embedded by a (coherent) RBD can be represented by the union of its minimal tie sets and the Boolean formula embedded by a (coherent) FT by the union of its minimal cut sets. Because it lends itself to approximations with low probabilities, the union of the minimal cut sets was the basis for probabilistic calculations for decades, until the binary decision diagrams described in Chap. 21 were introduced. Let us consider a system S for which the minimal cut sets ($Cs_i$) have been identified by using an RBD or an FT model. Then its probability of failure can be written:

$$\Pr(\overline{S}) = \Pr\Big(\bigcup_i Cs_i\Big) \qquad (19.21)$$

For two minimal cut sets, formula 14.12 (see Chap. 14) gives the following result: Pr (Csi ∪ Cs j ) = Pr (Csi ) + Pr (Cs j ) − Pr (Csi ∩ Cs j )

(19.22)

The generalization to more than two elements is known as the Sylvester-Poincaré formula (Pagès and Gondran 1986; Schneeweiss 1989). English and French people do not agree about who invented it first but, in any case, it is an application of the inclusion–exclusion principle introduced by de Moivre (1718) (Wikipedia IE principle 2020; Wikipedia de Moivre 2020). It can be written as follows:

$$\Pr\Big(\bigcup_i Cs_i\Big) = \sum_i \Pr(Cs_i) - \sum_{i<j} \Pr(Cs_i \cap Cs_j) + \sum_{i<j<k} \Pr(Cs_i \cap Cs_j \cap Cs_k) - \cdots \qquad (19.23)$$
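A direct, brute-force implementation of formula 19.23 is sketched below in Python. It is an illustration under stated assumptions rather than the procedure used in the book: the components are assumed independent, so that the probability of an intersection of cut sets is the product of the failure probabilities of all the components involved; the cut sets are those of the pumping system of Chap. 15 (formula 15.13); and the failure probability of 0.01 per component is made up for the example. The function name `union_probability` is hypothetical.

```python
from itertools import combinations
from math import prod


def union_probability(cut_sets, q):
    """Inclusion-exclusion over the cut sets.

    cut_sets: list of sets of component names.
    q: dict giving the failure probability of each (independent) component.
    """
    total = 0.0
    for order in range(1, len(cut_sets) + 1):
        sign = 1.0 if order % 2 == 1 else -1.0
        for group in combinations(cut_sets, order):
            components = set().union(*group)  # components failed in the intersection
            total += sign * prod(q[c] for c in components)
    return total


PUMPING_CUT_SETS = [
    {"V1"},
    {"P1", "P3"}, {"P1", "V3"},
    {"P2", "P3"}, {"P2", "V3"},
    {"V2", "P3"}, {"V2", "V3"},
]
q = {name: 0.01 for name in ("V1", "P1", "P2", "V2", "P3", "V3")}  # made-up values

exact = union_probability(PUMPING_CUT_SETS, q)
first_order = sum(prod(q[c] for c in cs) for cs in PUMPING_CUT_SETS)
print(exact, first_order)
```

The number of terms grows as 2 to the power of the number of cut sets, which is why the expansion is usually truncated in practice; keeping only the first sum gives a conservative value when the probabilities are small.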

0 and $U_B(0) = 0$ chosen for drafting this curve lead to the maximum reached by $U_{Sp}(t)$.

In the above simple examples, no particular assumptions have been made about how the availabilities of the RBD blocks or the unavailabilities of the FT primary events have been established. Therefore, any technique able to provide time-dependent availabilities or unavailabilities can be used for this purpose.

Fig. 22.5 Principle of availability calculations from an FT

The availability and the unavailability of a component $C_i$ can be calculated using a state-transition model with two states (up and down) as illustrated in Fig. 22.6: $A_{C_i}(t)$ = Pr($C_i$ in up state at t) and $U_{C_i}(t)$ = Pr($C_i$ in down state at t). In this model, the transition from the up state to the down state occurs through a time-dependent transition rate $\lambda_{C_i}(t)$, which is generally related to a failure, and the transition from the down state to the up state through a time-dependent transition rate $\mu_{C_i}(t)$, which is generally related to a repair. The initial conditions can be taken into

account and, in Fig. 22.6, $P_0$ represents the probability for $C_i$ to be in up state at time t equal to 0 and $(1 - P_0)$ the probability for $C_i$ to be in down state at the same time. Therefore, the RBD or FT approaches can be combined with any technique able to handle the state-transition model presented in Fig. 22.6 (e.g. analytical formulae, Markovian models, Petri nets): the selected technique provides the probabilistic inputs of the RBD or the FT, which handle them as ordinary point probabilistic values. Then the availability can be obtained as the probability of success of an RBD and the unavailability as the top event probability of an FT: the Sylvester-Poincaré formula, BDDs or even Monte Carlo simulation can be used for this purpose.

Fig. 22.6 General principle of availability calculations of a block or of a primary event

22.2.2 RBD and FT-Driven Markov Processes

When the blocks or the primary events are independent over the time of interest, the most popular combination of techniques associates the simplest state-transition models, i.e. the Markov processes (see Chap. 31), with RBDs or FTs:

• RBD-driven Markov processes associate Markov processes and RBDs.
• FT-driven Markov processes associate Markov processes and FTs.

In both cases:

• independent individual Markov processes provide the input probabilities of RBDs or FTs;
• RBDs or FTs provide, in turn, the logic allowing the corresponding probability to be calculated at the system level.

The curves presented above in Figs. 22.4 and 22.5 have been, in fact, obtained by implementing these techniques: the corresponding RBD-driven Markov process is presented on the left-hand side of Fig. 22.7 and the corresponding FT-driven Markov process is presented on the right-hand side of the same figure. Both of them share, in the middle of the figure, the same individual availability Markov processes to calculate the availability/unavailability of A and B. The values of the constant

Fig. 22.7 RBD and FT-driven Markov processes related to Figs. 22.4 and 22.5

failure and repair rates as well as the initial conditions have been chosen to highlight the transitional period before the asymptotic values of $A_{Sp}(t)$ and $U_{Sp}(t)$ are reached. In this simple case, the Markov graphs could be replaced by equivalent analytical formulae (Pagès and Gondran 1986). For component A, the initial condition $P_{0,A}$ is lower than 1 and the equivalent formulae are the following (see Sect. 31.4.1):

$$A_A(t) = \frac{\mu_A}{\lambda_A + \mu_A} - \frac{\mu_A (1 - P_{0,A}) - \lambda_A P_{0,A}}{\lambda_A + \mu_A}\, e^{-(\lambda_A + \mu_A) t} \qquad (22.3)$$

$$U_A(t) = \frac{\lambda_A}{\lambda_A + \mu_A} + \frac{\mu_A (1 - P_{0,A}) - \lambda_A P_{0,A}}{\lambda_A + \mu_A}\, e^{-(\lambda_A + \mu_A) t} \qquad (22.4)$$

For component B, the initial condition $P_{0,B}$ is equal to 1 and the equivalent formulae are simpler:

$$A_B(t) = \frac{\mu_B}{\lambda_B + \mu_B} + \frac{\lambda_B}{\lambda_B + \mu_B}\, e^{-(\lambda_B + \mu_B) t} \qquad (22.5)$$

$$U_B(t) = \frac{\lambda_B}{\lambda_B + \mu_B} \left[ 1 - e^{-(\lambda_B + \mu_B) t} \right] \qquad (22.6)$$
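Formulae 22.3 to 22.6 translate directly into code. The Python sketch below is given as an illustration: the function names and all numerical values are hypothetical, and the system Sp is assumed to be the parallel arrangement of A and B, i.e. an AND gate over the two primary events on the FT side, as the AND gate in Fig. 22.7 suggests.

```python
from math import exp


def availability(t: float, lam: float, mu: float, p0: float = 1.0) -> float:
    """A(t) of a two-state component with constant failure rate lam, constant
    repair rate mu and probability p0 of being up at t = 0 (formulae 22.3, 22.5)."""
    total = lam + mu
    return mu / total - (mu * (1.0 - p0) - lam * p0) / total * exp(-total * t)


def unavailability(t: float, lam: float, mu: float, p0: float = 1.0) -> float:
    """U(t) = 1 - A(t) (formulae 22.4, 22.6)."""
    return 1.0 - availability(t, lam, mu, p0)


def system_unavailability(t: float, comp_a: tuple, comp_b: tuple) -> float:
    """FT-driven calculation: AND gate over independent components A and B."""
    return unavailability(t, *comp_a) * unavailability(t, *comp_b)


COMP_A = (1e-3, 1e-1, 0.5)  # (lam, mu, p0), made-up values with P0,A < 1
COMP_B = (2e-3, 5e-2, 1.0)  # made-up values with P0,B = 1

for t in (0.0, 10.0, 50.0, 200.0):
    print(t, system_unavailability(t, COMP_A, COMP_B))
```

The individual formulae play the role of the small Markov graphs and the product plays the role of the FT logic, which is the division of work described in this section.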

Therefore, whether the small Markov graphs in Fig. 22.7 or the analytical formulae are used does not make any difference: in both cases RBD/FT-driven Markov processes are implemented. This is the basis for most of the probabilistic calculations performed with RBDs or FTs.

The same approach can be implemented for calculating the availability or the unavailability of periodically tested systems. In this case, the multiphase Markov processes developed in Sect. 31.5.4 can be used to calculate the inputs of the RBDs or FTs. This is illustrated in Fig. 22.8 for the small RBD modelling system Sp with two periodically tested components A and B. Both components are tested with the same test interval τ, but the tests of B are staggered by a time interval θ in order to avoid testing A and B at the same time: this makes it possible to increase the frequency of detection of

Fig. 22.8 Principle of availability calculations of a periodically tested system from an RBD

common cause failures, to prevent human errors on both components at the same time and, finally, to improve the average availability (see Chap. 36). As shown in Fig. 22.8, this leads to typical saw-tooth curves both for the availability inputs $A_A(t)$ and $A_B(t)$ and for the availability output $A_{Sp}(t)$. For a better view, the scale has been adapted for $A_{Sp}(t)$.

The unavailability of the system can be calculated in exactly the same way using the dual FT. This is illustrated in Fig. 22.9 where the unavailability inputs $U_A(t)$ and $U_B(t)$ are calculated using multiphase Markov models. Again, $U_A(t)$ and $U_B(t)$ as well as the FT output $U_{Sp}(t)$ are typical saw-tooth curves oriented in the opposite direction to those obtained with the dual RBD.

Fig. 22.9 Principle of availability calculations of a periodically tested system from an FT

As in the previous case, equivalent analytical formulae can also be elaborated but they are rather difficult to establish and use. This is true especially if all the parameters (probability of failure due to the tests, human error, test coverage, test duration, etc.) discussed in Sect. 31.5.4.4 are taken into consideration. Therefore, in this case it may be better to use directly the multiphase Markov processes as described in Sect. 31.5.4.4.

Beyond the simple blocks or primary events, the RBD/FT-driven Markov processes can also be used to model composite blocks or composite primary events (sub-fault trees). An example is given in Fig. 22.10 for a system made of two redundant components A and B with B operated in standby position: A is the component normally operated and, when it fails, component B is started. When B is started,

22.2 Availability/Unavailability Calculations

293

Fig. 22.10 Modelling composite blocks or primary events by Markov processes

this is successful with a probability (1-γ ) and unsuccessful with the probability γ . The overall system reaches the down state X¯ when A and B are in down state at the same time. When, after a failure, B is repaired, it goes back to the standby position if A is in up state. The dependency between A and B cannot be modelled by an RBD or an FT but, provided that the composite block/primary event is independent of the other blocks/primary events, it can be used exactly as an ordinary block/primary event with regards to the availability/unavailability calculations. When dependencies imply only few events which can be gathered into composite blocks/composite primary events, the above technique can be effectively used, provided that the dependencies can be modelled using Markov processes. In any case, when RBD/FT-driven Markov processes are undertaken, the result is a Markov process modelling a system made of independent components. The interest is that its size increases linearly with the number of blocks or primary events. Therefore, it is drastically smaller compared to this of the equivalent ordinary Markov process which increases exponentially. Beyond the use for probabilistic calculations on RBDs or FTs, these approaches are also very effective to prevent the combinatorial explosion of the number of states of Markov graphs when the number of modelled components increases and when they are reasonably independent. In the general case, the calculation of availability or unavailability is not really possible by hand and the use of one software package is needed to perform the needed numerical calculations. In the present chapter, all calculations are performed using the RBD and FT modules of the GRIF Workshop software package (GRIF-Workshop 2020).
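As a rough illustration of the size argument above, the following sketch compares the total number of states to handle when each of n independent components gets its own small Markov graph with the number of states of the equivalent single Markov graph. The function names and the assumption of two states (up/down) per component are ours, chosen for illustration only.

```python
def states_rbd_ft_driven(n_components, states_per_component=2):
    """Total number of states when each component has its own small Markov graph."""
    return n_components * states_per_component


def states_ordinary_markov(n_components, states_per_component=2):
    """Number of states of a single Markov graph covering all combinations."""
    return states_per_component ** n_components


# Linear growth versus exponential growth with the number of components.
for n in (2, 5, 10, 20):
    print(n, states_rbd_ft_driven(n), states_ordinary_markov(n))
```

With 20 two-state components, the separate graphs total 40 states whereas the single graph would need more than a million.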

22.3 Average Availability/Unavailability Calculations

22.3.1 Average Over a Given Interval [0, T]

Beyond the instantaneous values of the availability $A_S(t)$ or unavailability $U_S(t)$ of a system, the average values $A_S^{avg}(T_1, T_2)$ and $U_S^{avg}(T_1, T_2)$ are also of big interest because they can rather easily be estimated from statistics performed from the field feedback (see Chap. 31.8). Mathematically speaking, these average values are defined over an interval $[T_1, T_2]$ by the following formulae:

$$A_S^{avg}(T_1, T_2) = \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} A_S(\tau) \cdot d\tau \tag{22.7}$$

$$U_S^{avg}(T_1, T_2) = \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} U_S(\tau) \cdot d\tau \tag{22.8}$$

Over an interval $[0, T]$ the average values $A_S^{avg}(0, T)$ and $U_S^{avg}(0, T)$ are simply noted $A_S^{avg}(T)$ and $U_S^{avg}(T)$. Contrary to what is sometimes believed, combining average values does not give an average value. Consequently, combining the average availabilities/unavailabilities of the components of a system through an RBD or an FT does not give the average availability/unavailability of the system: the integrals in formulae 22.7 and 22.8 actually have to be calculated. Numerically speaking, and as illustrated in Fig. 22.11, this can be done by averaging the curves obtained for the instantaneous availability or unavailability. To obtain accurate average results, the number of points calculated to draw the curves has to be large enough. In addition, for periodically tested systems, if $T_{st}$ is an instant of test, the instantaneous availability/unavailability has to be calculated at $T_{st}^-$, just before the test, and at $T_{st}^+$, just after the test, in order to properly catch the peaks within the averaging operation (see Fig. 22.12).

Fig. 22.11 Average availability/unavailability calculations from instantaneous curves

Fig. 22.12 Bad and good patterns of calculated values to properly catch the peaks

According to the IEC 61508 (2010) standard, the calculation of the average unavailability is mandatory when dealing with safety instrumented systems (see Chap. 36). In this standard, the average unavailability is (improperly) named PFDavg (probability of failure on demand average).
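The warning above about combining averages can be checked numerically. The sketch below is only an illustration under our own assumptions: two independent components tested at the same instants, instantaneous and perfect tests and repairs, constant failure rates with hypothetical values, and a single test interval [0, τ] so that no saw-tooth peak has to be caught. The parallel system is down when both components are down, so $U_S(t) = U_A(t) \cdot U_B(t)$.

```python
import math


def average(f, t1, t2, n=10_000):
    """Trapezoidal estimate of (1 / (t2 - t1)) * integral of f over [t1, t2]."""
    h = (t2 - t1) / n
    total = 0.5 * (f(t1) + f(t2)) + sum(f(t1 + k * h) for k in range(1, n))
    return total * h / (t2 - t1)


# Hypothetical data: failure rates per hour and test interval in hours.
LAM_A, LAM_B, TAU = 1e-5, 2e-5, 8760.0


def u_a(t):
    return 1.0 - math.exp(-LAM_A * t)


def u_b(t):
    return 1.0 - math.exp(-LAM_B * t)


def u_s(t):
    # Parallel system: unavailable only when A and B are both unavailable.
    return u_a(t) * u_b(t)


avg_of_system = average(u_s, 0.0, TAU)                               # formula 22.8
product_of_avgs = average(u_a, 0.0, TAU) * average(u_b, 0.0, TAU)    # not an average

print(avg_of_system, product_of_avgs, avg_of_system / product_of_avgs)
```

In this example the product of the component averages underestimates the true average unavailability of the system by roughly a quarter, which is why the integral of the system curve has to be computed.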

22.3.2 Asymptotic Availability or Unavailability

A particular case of average availability/unavailability arises when the instantaneous availability/unavailability reaches an asymptotic value when the time goes to infinity. This generally happens when the failures and repairs of each component of the considered system form an alternating renewal process (Cox 1962). In this case, the process reaches a steady state where the probability for the component to leave the up state (e.g. by failure) is equal to the probability to reach the up state (e.g. by repair). In a renewal process, the form of the failure and repair distributions does not matter and the state-transition model in Fig. 22.6 or the Markov processes in Fig. 22.7 constitute particular cases of such processes. These asymptotic values can be observed in formulae 22.3 to 22.6. When the time goes to infinity, they give:

$$A_A^{as} = A_A(\infty) = \frac{\mu_A}{\lambda_A + \mu_A} \quad \text{and} \quad U_A^{as} = U_A(\infty) = \frac{\lambda_A}{\lambda_A + \mu_A} \tag{22.9}$$

$$A_B^{as} = A_B(\infty) = \frac{\mu_B}{\lambda_B + \mu_B} \quad \text{and} \quad U_B^{as} = U_B(\infty) = \frac{\lambda_B}{\lambda_B + \mu_B} \tag{22.10}$$

These asymptotic values are reached after two or three times the MORT (mean overall repair time) of the considered components (e.g. $3 \times 1/\mu_A$ for component A). Hence, when repairs are quickly achieved, the asymptotic values are quickly reached and the availability or unavailability then becomes a probabilistic point value. When all the availabilities related to the blocks of an RBD or the unavailabilities related to the primary events of an FT converge toward asymptotic values (e.g. when they are modelled by renewal processes), then the long-term average values over an interval $[0, \infty)$ are given by:

$$A_S^{avg}(\infty) = \lim_{T \to \infty} \frac{1}{T} \int_0^T A_S(\tau) \cdot d\tau = \lim_{T \to \infty} \frac{1}{T}\, A_S^{as} \cdot T = A_S^{as} \tag{22.11}$$

$$U_S^{avg}(\infty) = \lim_{T \to \infty} \frac{1}{T} \int_0^T U_S(\tau) \cdot d\tau = \lim_{T \to \infty} \frac{1}{T}\, U_S^{as} \cdot T = U_S^{as} \tag{22.12}$$
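The speed of convergence toward the asymptotic values of formula 22.9 can be checked with a short sketch. Formula 22.3 is not reproduced in this section, so the code assumes the classical solution of a two-state Markov process with constant failure rate λ and repair rate μ, starting in up state at t = 0; the numerical values are hypothetical.

```python
import math


def availability(lam, mu, t):
    """Two-state Markov availability, item in up state at t = 0 (assumed form)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


LAM_A, MU_A = 1e-4, 1e-1            # hypothetical rates, per hour

a_as = MU_A / (LAM_A + MU_A)        # asymptotic availability (formula 22.9)
u_as = LAM_A / (LAM_A + MU_A)       # asymptotic unavailability (formula 22.9)

# Fraction of the initial gap A(0) - A_as = U_as remaining after k repair times.
for k in (1, 2, 3):
    t = k / MU_A
    remaining = (availability(LAM_A, MU_A, t) - a_as) / u_as
    print(f"t = {k}/mu: {remaining:.1%} of the initial gap remains")
```

Under these assumptions the gap decays as exp[−(λ + μ)·t], so less than 5 % of it is left after three mean repair times, in line with the rule of thumb given above.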

The convergence is fast when the failures are quickly detected and repaired. Then, for an interval $[T_1, T_2]$ far enough from the transient period or over a large interval $[0, T]$, the impact of the transient period is negligible. In this case, the time-dependent availability/unavailability calculations of a system can be reduced to single point calculations from the asymptotic values $A_{C_i}^{as}$ or $U_{C_i}^{as}$ of its components. This is illustrated in Fig. 22.13 for system Sp when the availability or unavailability of components A and B have converged toward their asymptotic values.

Fig. 22.13 Asymptotic availability and unavailability of system Sp

It has to be noted that, whereas the average availability/unavailability exists in any case, the asymptotic availability/unavailability exists only in the particular case where the components reach steady states in which the probabilities of being in a given state become constant. Hence, systems made of periodically tested components, which have no steady states, have no asymptotic availability/unavailability. Nevertheless, when a regular test pattern is implemented and when the failure coverage and the repairs are perfect (all failures are detected and the components are as good as new after repair), the average availability/unavailability between the test intervals tends to a limit when the number of test intervals increases. This is shown in Fig. 22.14 where the average availability $A_S^{avg}(0, t)$ and unavailability $U_S^{avg}(0, t)$ have been drawn in solid lines. They get closer and closer to a limit curve when the number of test intervals increases. In the end, when the number of intervals goes to infinity, the average availability/unavailability within an interval converges toward a limit value which depends on the failure and repair rates and also on the test pattern.

Fig. 22.14 Convergence of availability and unavailability of a simple periodically tested system

The convergence observed for a periodically tested system is not of the same nature as that due to the steady states of the components, but it can be used to calculate the average availability/unavailability of a periodically tested system just by considering a test interval (or a group of test intervals) far enough from the origin and where the curves have converged. Nevertheless, as soon as the test coverage is not perfect, there are non-covered faults which are never detected. For this reason, the average availability continuously decreases and the average unavailability continuously increases, and the above convergence is no longer observed. In the general case, the calculation of the average availability or unavailability is not really feasible by hand and RBD or FT software packages are needed to perform the numerical calculations.

22.4 Failure Frequency and Derived Parameters

22.4.1 Average Failure Frequency, Number of Failures and MTBF

The average failure frequency $w_S^{avg}(T_1, T_2)$ of a system over an interval $[T_1, T_2]$ is a useful probabilistic indicator for repaired systems which can fail and be repaired several times within this time interval. It makes it possible to calculate the expected number of failures $n_f(T_1, T_2)$ to be observed over this interval (see Chap. 4):

$$n_f(T_1, T_2) = w_S^{avg}(T_1, T_2) \cdot (T_2 - T_1) \tag{22.13}$$

And the mean time between failures MTBF(T) over an interval $[0, T]$ is directly obtained through this expected number of failures:

$$\mathrm{MTBF}(T) = \frac{T}{n_f(0, T)} = \frac{T}{w_S^{avg}(0, T) \cdot T} = \frac{1}{w_S^{avg}(0, T)} \tag{22.14}$$

The classical MTBF (see Chap. 4) is obtained when T goes to infinity:

$$\mathrm{MTBF} = \lim_{T \to \infty} \mathrm{MTBF}(T) \tag{22.15}$$

In addition, the average failure frequency can be statistically estimated from the field feedback thanks to the number of failures $n_{of}(T_1, T_2)$ actually observed over a period of time $[T_1, T_2]$:

$$w_S^{avg}(T_1, T_2) \approx n_{of}(T_1, T_2)/(T_2 - T_1) \tag{22.16}$$
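Formulae 22.13, 22.14 and 22.16 amount to simple arithmetic, sketched below. The function names and the field data (4 failures observed over two years of operation) are hypothetical and only serve as an illustration.

```python
def average_failure_frequency(n_observed, t1, t2):
    """Statistical estimate of the average failure frequency (formula 22.16)."""
    return n_observed / (t2 - t1)


def expected_failures(w_avg, t1, t2):
    """Expected number of failures over [t1, t2] (formula 22.13)."""
    return w_avg * (t2 - t1)


def mtbf(w_avg):
    """Mean time between failures over the interval used for w_avg (formula 22.14)."""
    return 1.0 / w_avg


# Hypothetical field feedback: 4 failures over two years, expressed in hours.
T1, T2 = 0.0, 17520.0
N_OBSERVED = 4

w_avg = average_failure_frequency(N_OBSERVED, T1, T2)
print(w_avg, expected_failures(w_avg, T1, T2), mtbf(w_avg))
```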

Mathematically speaking, the average failure frequency over an interval $[T_1, T_2]$ is the average value of the instantaneous failure frequency $w_S(t)$:

$$w_S^{avg}(T_1, T_2) = \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} w_S(\tau) \cdot d\tau \tag{22.17}$$

When dealing with safety instrumented systems operated in high demand mode (see Chap. 36), it is mandatory to calculate $w_S^{avg}(0, T)$ over the period of interest. In this case, the average failure frequency is (improperly) named probability of failure per hour, PFH. As for availability/unavailability calculations, when all the components of a system are modelled by renewal processes (e.g. Markov processes like in Fig. 22.7), the instantaneous failure frequency $w_S(t)$ reaches an asymptotic value $w_S^{as}$:

$$w_S^{avg}(\infty) = \lim_{T \to \infty} \frac{1}{T} \int_0^T w_S(\tau) \cdot d\tau = \lim_{T \to \infty} \frac{1}{T}\, w_S^{as} \cdot T = w_S^{as} \tag{22.18}$$

Again, the convergence is fast when the failures are quickly detected and repaired. Then, for an interval $[T_1, T_2]$ far enough from the transient period or over a large interval $[0, T]$, the impact of the transient period is negligible. In this case, the average failure frequency of a system can be reduced to a single point calculation from the asymptotic values $w_{C_i}^{as}$ of its components. Therefore, exactly as the instantaneous availability/unavailability was needed first to calculate the average availability/unavailability, the instantaneous failure frequency is needed first to calculate the average failure frequency. This is the purpose of the following section.

22.4.2 Instantaneous Failure Frequency/Birnbaum Importance Factor

22.4.2.1 Principle of Instantaneous Failure Frequency Calculations

The instantaneous failure frequency mentioned above is also named unconditional failure intensity (see Sect. 4.7) and it is defined as follows: probability per unit of time, $w_{C_i}(t)$, that item $C_i$ fails between t and t + dt, provided that it performs as required at 0. The number of failures which can occur between t and t + dt being either 0 (no failure occurs) or 1 (one failure occurs), $w_{C_i}(t) \cdot dt$ represents the expected number of failures occurring over [t, t + dt], and this is why the integral of $w_{C_i}(t)$ over $[T_1, T_2]$ provides the expected number of failures over this interval (see formulae 22.13 and 22.17). Let us note $w_{S,i}(t)$ the contribution of component $C_i$ to the instantaneous system failure frequency. As it is not possible for a system S to fail from several independent component failures occurring within an infinitesimal increment of time dt, the instantaneous overall system failure frequency is simply the sum of the contributions of the various components $C_i$:

$$w_S(t) = \sum_i w_{S,i}(t) \tag{22.19}$$

Therefore, the problem is reduced to the calculation by RBDs or FTs of the contributions $w_{S,i}(t)$, and this is far more difficult than the calculations that have been performed until now. Let us consider a component $C_i$ belonging to a system S. For the system to fail at time t due to the failure of $C_i$:

1. S must be in a critical state with regards to $C_i$ (i.e. S will fail if $C_i$ fails);
2. $C_i$ must fail at t.

The probability that S is in a critical state with regards to $C_i$ is called the Birnbaum importance factor of $C_i$ with regards to S. This importance factor is also known as the marginal importance factor and noted $\mathrm{MIF}_S(C_i, t)$. The failure frequency of component $C_i$ is $w_{C_i}(t)$. Then the contribution $w_{S,i}(t)$ can be written as:

$$w_{S,i}(t) = \mathrm{MIF}_S(C_i, t) \cdot w_{C_i}(t) \tag{22.20}$$

But $w_{C_i}(t)$ is equal to the availability $A_{C_i}(t)$ of $C_i$ multiplied by its conditional failure intensity (Vesely failure rate), $\lambda_{V,C_i}$ (see Chap. 4):

$$w_{S,i}(t) = \mathrm{MIF}_S(C_i, t) \cdot A_{C_i}(t) \cdot \lambda_{V,C_i} \tag{22.21}$$

Gathering all the above results leads to the overall instantaneous system failure frequency:

$$w_S(t) = \sum_i \mathrm{MIF}_S(C_i, t) \cdot A_{C_i}(t) \cdot \lambda_{V,C_i} \tag{22.22}$$

$A_{C_i}(t)$ and $\lambda_{V,C_i}$ are specific to component $C_i$. They constitute input data for the blocks/primary events and can be calculated (e.g. by Markov processes) independently of RBDs or FTs. For example, when component $C_i$ is modelled by a Markov process as in Fig. 22.7, its availability $A_{C_i}(t)$ is given by formula 22.3 and its conditional failure intensity (Vesely failure rate) is simply its failure rate, $\lambda_{V,C_i} = \lambda_{C_i}$, so that its unconditional failure intensity (failure frequency) is:

$$w_{C_i}(t) = \lambda_{C_i} \cdot A_{C_i}(t) \tag{22.23}$$

Then, it remains to calculate the Birnbaum importance factors which depend, themselves, on the logic structure embedded in the considered RBDs/FTs.

22.4.2.2 Birnbaum Importance Factor Calculations

More precisely, the Birnbaum importance factor is the probability that the state of component $C_i$ is critical with regards to the state of S. This is the probability that:

• S and $C_i$ being in up state, S goes to the down state when $C_i$ goes to the down state;
• S and $C_i$ being in down state, S goes to the up state when $C_i$ goes to the up state.

Like the Vesely-Fussell importance factor encountered for the semi-quantitative analysis, the Birnbaum importance factor is one of the various importance factors which can be calculated from RBDs/FTs (see Chap. 24). Mathematically speaking, the Birnbaum importance factor is defined as the partial derivative of the system availability/unavailability with respect to the availability/unavailability of component $C_i$:

$$\mathrm{MIF}_S(C_i, t) = \frac{\partial A_S(t)}{\partial A_{C_i}(t)} = \frac{\partial U_S(t)}{\partial U_{C_i}(t)} \tag{22.24}$$

This seems complicated but, fortunately, the calculations are rather simple when conditional probabilities are considered: let us note $U_{S|C_i}(t)$ the unavailability of system S given that component $C_i$ is in up state and $U_{S|\bar{C}_i}(t)$ the unavailability of the system given that $C_i$ is in down state. As $C_i$ is in down state with probability $U_{C_i}(t)$ and in up state with probability $1 - U_{C_i}(t)$, the total probability theorem makes it possible to write $U_S(t)$ as:

$$U_S(t) = U_{S|\bar{C}_i}(t) \cdot U_{C_i}(t) + U_{S|C_i}(t) \cdot [1 - U_{C_i}(t)] \tag{22.25}$$

That is to say: $U_S(t) = [U_{S|\bar{C}_i}(t) - U_{S|C_i}(t)] \cdot U_{C_i}(t) + U_{S|C_i}(t)$. This formula being of the form y = a · x + b, its derivative with respect to $U_{C_i}(t)$ gives $\mathrm{MIF}_S(C_i, t)$:

$$\mathrm{MIF}_S(C_i, t) = \frac{\partial U_S(t)}{\partial U_{C_i}(t)} = U_{S|\bar{C}_i}(t) - U_{S|C_i}(t) \tag{22.26}$$

Then, the Birnbaum importance factors of each primary event of an FT (or of each block of an RBD) can be directly calculated from conditional probabilities which are easily obtained when binary decision diagrams are implemented (see Chap. 21).
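The conditional-probability route can be sketched without any BDD machinery when the system unavailability is available as a function of the component unavailabilities. In the illustrative code below (our own naming, independent components assumed), conditioning on a component being down or up is done by forcing its unavailability to 1 or 0; a simple parallel system and a simple series system of two components serve as checks.

```python
def birnbaum(system_unavailability, u, i):
    """MIF of component i: U_S given i down minus U_S given i up."""
    given_down = list(u)
    given_down[i] = 1.0           # component i forced into down state
    given_up = list(u)
    given_up[i] = 0.0             # component i forced into up state
    return system_unavailability(given_down) - system_unavailability(given_up)


def u_parallel(u):
    """Two components in parallel: the system is down when both are down."""
    return u[0] * u[1]


def u_series(u):
    """Two components in series: the system is down when at least one is down."""
    return 1.0 - (1.0 - u[0]) * (1.0 - u[1])


u = [0.1, 0.2]                    # hypothetical unavailabilities of A and B
print(birnbaum(u_parallel, u, 0))  # A is critical when B is down
print(birnbaum(u_series, u, 0))    # A is critical when B is up
```

For the parallel system the factor of A is the unavailability of B, and for the series system it is the availability of B, which matches the interpretation of the Birnbaum factor as the probability of the states in which the component is critical.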

22.4.2.3 Application to a System Organized in 2 Out of 3 Logic

As the Birnbaum importance factor may appear a little esoteric, it may be useful to clarify its meaning by considering a typical example: the system $S_{2/3}$ made of three components A, B and C organized in 2 out of 3 logic, analysed in Chap. 21 and illustrated in Fig. 21.15, is used for this purpose. This figure gives the conditional probabilities $\Pr(\bar{S}_{2/3}|\bar{B})$ and $\Pr(\bar{S}_{2/3}|B)$ related to component B of system $S_{2/3}$ (see formulae 21.14 and 21.15). Then $\mathrm{MIF}_{S_{2/3}}(B, t)$ can be calculated as:

$$\mathrm{MIF}_{S_{2/3}}(B, t) = [1 - P_A(t)] \cdot P_C(t) + P_A(t) - P_A(t) \cdot P_C(t)$$

This gives $\mathrm{MIF}_{S_{2/3}}(B, t) = P_A(t) + P_C(t) - 2 \cdot P_A(t) \cdot P_C(t)$, which is the probabilistic formula corresponding to an exclusive OR gate ⊕. Then, finally:

$$\mathrm{MIF}_{S_{2/3}}(B, t) = \Pr(A \oplus C, t) = \Pr[(\bar{A} \cap C) \cup (A \cap \bar{C}), t] \tag{22.27}$$

The state $A \oplus C$ can be written $(A \oplus C) \cap (B \cup \bar{B})$ and this leads to the two following states:

1. $(A \oplus C) \cap B$: this is an up state which gathers the critical system up states where the failure of B leads to the overall system failure;
2. $(A \oplus C) \cap \bar{B}$: this is a down state which gathers the critical system down states where the repair of B leads to the overall system repair.

Therefore, $\mathrm{MIF}_{S_{2/3}}(B, t)$ is clearly the probability of the system states which are critical with regards to B, as announced above. As A, B and C are independent, the probability of the critical up state above is equal to:

$$\Pr[(A \oplus C) \cap B, t] = \mathrm{MIF}_{S_{2/3}}(B, t) \cdot A_B(t) \tag{22.28}$$

If $\lambda_{V,B}$ is the Vesely failure rate (conditional failure intensity) of B, then the contribution of B to the failure frequency of $S_{2/3}$ is given by:

$$w_{S_{2/3}}(B, t) = \mathrm{MIF}_{S_{2/3}}(B, t) \cdot A_B(t) \cdot \lambda_{V,B} \tag{22.29}$$

The failure frequency of B being equal to $w_B(t) = A_B(t) \cdot \lambda_{V,B}$, the contribution of B is obtained as:

$$w_{S_{2/3}}(B, t) = \mathrm{MIF}_{S_{2/3}}(B, t) \cdot w_B(t) \tag{22.30}$$

The contributions of A and C can be calculated in the same way and finally the instantaneous failure frequency of $S_{2/3}$ is obtained as:

$$w_{S_{2/3}}(t) = w_{S_{2/3}}(A, t) + w_{S_{2/3}}(B, t) + w_{S_{2/3}}(C, t) \tag{22.31}$$

Finally, the unconditional failure intensity (instantaneous failure frequency) can be calculated from RBDs and FTs from the conditional probabilities which can be easily handled when BDDs are implemented for the probabilistic calculations. When the instantaneous failure frequency is calculated, then its average can be calculated in the same way as average availability and unavailability have been calculated in the previous sections.
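The 2 out of 3 calculation can be sketched as follows, using the exclusive-OR form of the Birnbaum factor (formula 22.27) and summing the three contributions (formulae 22.29 and 22.31). The code is an illustration under our own assumptions: independent components, constant failure and repair rates with hypothetical values, and components taken at their asymptotic unavailability λ/(λ + μ).

```python
def mif_2oo3(u_x, u_y):
    """Birnbaum factor of one component of a 2oo3 system: exactly one of the
    two other components (unavailabilities u_x, u_y) is down (formula 22.27)."""
    return u_x + u_y - 2.0 * u_x * u_y


def w_2oo3(u, lam_v):
    """Instantaneous failure frequency of a 2oo3 system (formulae 22.29, 22.31).

    u     -- unavailabilities (U_A, U_B, U_C) at the instant considered
    lam_v -- Vesely failure rates of A, B and C
    """
    total = 0.0
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        # contribution = MIF * availability * Vesely failure rate
        total += mif_2oo3(u[j], u[k]) * (1.0 - u[i]) * lam_v[i]
    return total


LAM, MU = 1e-3, 1e-1                     # hypothetical rates, per hour
u_as = LAM / (LAM + MU)                  # asymptotic unavailability of each component
w_as = w_2oo3([u_as] * 3, [LAM] * 3)
print(w_as)
```

For three identical components the sum reduces to 6 · λ · U · (1 − U)², which gives a quick consistency check of the implementation.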

22.4.2.4 Examples of Curves Related to Failure Frequency and Its Average

The failure frequency and its average are illustrated in Fig. 22.15 for two cases of initial conditions for the repaired components A and B of system Sp. The unavailability (or the availability) as well as the failure frequency reach asymptotic values which are independent of the initial conditions. In Fig. 22.15, the asymptotic values are reached at time T but the impact of the transient period has not completely vanished over [0, T] and the average values still depend on the initial conditions. Note: the case with $P_{0,A} = P_{0,B} = 0.0$ is analysed in Fig. 31.13 of the chapter dealing with Markov processes modelling. The unavailability has been drawn in Fig. 22.15 for the fault tree but the frequency curves are identical for both the FT and the dual RBD.

Fig. 22.15 Unavailability and failure frequency of system Sp when A and B are repaired items

As for availability/unavailability calculations, there is no such asymptotic failure frequency for periodically tested systems. Nevertheless, when the test pattern is regular, $w_S^{avg}(t)$ reaches a limit value, but this limit is not related to a steady state of the system. This is illustrated in Fig. 22.16 where the instantaneous failure frequency $w_{Sp}(t)$ of system Sp made of periodically tested components A and B is drawn in dotted line and the average failure frequency $w_{Sp}^{avg}(t)$ is drawn in solid line. The curves are the same for RBDs and FTs. It can also be noted that the curve of the average failure frequency is relatively similar to that drawn in grey on the right-hand side of Fig. 22.15 when the repaired components A and B are in up state at t = 0.

Fig. 22.16 Unavailability and failure frequency of a simple parallel system

22.4.3 Combination of Sub-FTs for Unavailability and Frequency Calculations

22.4.3.1 Preliminary Considerations

When several FTs have been built separately, the question arises of how to combine them, and perform probabilistic calculations on the overall FT, without drawing this overall FT involving the sub-FTs. This is possible only when the individual FTs to be combined are independent from each other:

• Unavailability calculations: the unavailabilities of the individual FTs can be combined as usual through logic gates (see Sect. 22.3).
• Frequency calculations: formulae are available to combine the unavailabilities and failure frequencies of the individual FTs.
• Reliability calculations: generally not possible due to the systemic dependencies introduced when such calculations are performed (see Sect. 22.5).

Therefore, only the failure frequency calculation case needs to be analysed further and this is done hereafter.

22.4.3.2 Failure Frequency of Individual FTs Linked by an AND Gate

Let us consider two individual fault trees, FT1 and FT2, related to two independent subsystems S1 and S2 and characterized by the following unavailabilities and failure frequencies:

• S1: $U_1(t) = \Pr(\bar{S}_1, t)$ and $w(\bar{S}_1, t)$.
• S2: $U_2(t) = \Pr(\bar{S}_2, t)$ and $w(\bar{S}_2, t)$.

The failure frequency of the overall system Sp made of S1 and S2 operated in parallel (i.e. $\bar{S}_p = \bar{S}_1 \cap \bar{S}_2$) is given by the following formula:

$$w(\bar{S}_p, t) = \Pr(\bar{S}_2, t) \cdot w(\bar{S}_1, t) + \Pr(\bar{S}_1, t) \cdot w(\bar{S}_2, t) \tag{22.32}$$

This calculation is illustrated in Fig. 22.17 and the formula is self-explanatory. The top event $\bar{S}_p$ occurs at time t:

• if the system related to FT2 is already failed, with probability $U_2(t) = \Pr(\bar{S}_2, t)$, and the system related to FT1 fails at this time, with frequency $w(\bar{S}_1, t)$;
• if the system related to FT1 is already failed, with probability $U_1(t) = \Pr(\bar{S}_1, t)$, and the system related to FT2 fails at this time, with frequency $w(\bar{S}_2, t)$.

Finally, the failure frequency related to the output of an AND gate can be calculated from the probabilities of failure and failure frequencies of its inputs (provided that they are independent) without having to compute the overall resulting fault tree. In addition, the above formula splits the failure frequency into two cases according to whether S1 fails first or S2 fails first. This is very useful to calculate, for example, the failure frequency of system Sp when it fails only if S1 has failed before S2. This can be modelled using a priority AND gate (PAND), which is a logic gate commonly used to model dynamic fault trees (see Sect. 22.6). As illustrated in Fig. 22.18, when using such a dynamic logic gate, the event with the maximum priority (i.e. $\bar{S}_1$) is placed on the left-hand side. A typical case where the order of occurrence of the events matters is when FTs are used to calculate the occurrence frequency of event sequences modelled by the sequential approaches described in Chap. 26 and Sect. 27.4.

Fig. 22.17 Failure frequency calculation by combining two sub-FTs used as input of an AND gate
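Formula 22.32 and its split into the two orders of failure can be sketched as below. The function name, the returned tuple and the numerical inputs are hypothetical; the two subsystems are assumed independent, as required above.

```python
def and_gate_frequency(u1, w1, u2, w2):
    """Failure frequency at the output of an AND gate (formula 22.32).

    u1, u2 -- unavailabilities of subsystems S1 and S2 at time t
    w1, w2 -- failure frequencies of S1 and S2 at time t
    Returns (total, s1_failed_first): the second value is the part of the
    frequency where S1 is already failed when S2 fails (PAND case).
    """
    s2_failed_first = u2 * w1     # S2 already down, S1 fails at t
    s1_failed_first = u1 * w2     # S1 already down, S2 fails at t
    return s2_failed_first + s1_failed_first, s1_failed_first


# Hypothetical inputs, as would be produced by two separate sub-FTs.
w_and, w_pand = and_gate_frequency(u1=0.01, w1=1e-4, u2=0.02, w2=2e-4)
print(w_and, w_pand)
```

With these inputs each order of failure contributes 2e-6 per hour, so the priority AND frequency is half of the AND frequency; with other inputs the split is generally uneven.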

Fig. 22.18 Failure frequency calculation when S1 has to fail before S2

Fig. 22.19 Application to a simple event tree (see Chap. 26 and Sect. 27.4)

The case of event sequences is illustrated in Fig. 22.19 where a simplified event tree is presented on the left-hand side and the resulting FT on the right-hand side:

• If the safety layer is in up state (not failed) when the initiating event occurs, then the situation remains under control.
• If the safety layer is in down state (failed) when the initiating event occurs, then a hazardous situation occurs (sequence in bold black line).

The FT corresponding to the sequence in bold black line is similar to the FT illustrated in Fig. 22.18. Due to the convention attached to PAND gates, the failure of the safety layer is drawn on the left-hand side and the occurrence of the initiating event on the right-hand side. This approach opens the way to calculating the frequencies of the sequences developed from initiating events identified when cause consequence diagrams, event trees or bowtie models are implemented (see Chap. 27, Sect. 4). This approach is generalized in Vinuessa et al. (2019). Such calculations are widely used because the dynamic aspects can be handled in an analytical way and Monte Carlo simulation is not needed. Bear in mind that it works well (i.e. provides exact results) only when the initiating event and the safety layer failure are independent events. However, a satisfactory approximation can be provided using the adapted FT-model proposed in Innal et al. (2014).

22.4.3.3 Failure Frequency of Individual FTs Linked by an OR Gate

Let us consider, as above, two individual fault trees, FT1 and FT2, related to two independent subsystems S1 and S2 and characterized by the following unavailabilities and failure frequencies:

• S1: $U_1(t) = \Pr(\bar{S}_1, t)$ and $w(\bar{S}_1, t)$.
• S2: $U_2(t) = \Pr(\bar{S}_2, t)$ and $w(\bar{S}_2, t)$.

The failure frequency of the overall system Ss made of S1 and S2 operated in series (i.e. $\bar{S}_s = \bar{S}_1 \cup \bar{S}_2$) is given by the following formula:

$$w(\bar{S}_s, t) = w(\bar{S}_1, t) + w(\bar{S}_2, t) - w(\bar{S}_p, t) \tag{22.33}$$

This calculation is illustrated in Fig. 22.20 and the formula is self-explanatory:

• The top event $\bar{S}_s$ occurs at time t if the system related to FT1 fails or if the system related to FT2 fails. Then the upper bound of the failure frequency is equal to $w(\bar{S}_1, t) + w(\bar{S}_2, t)$.
• This sum also counts the failures of S1 occurring while S2 is already failed, and the failures of S2 occurring while S1 is already failed, which are not new failures of Ss. These cases are exactly those calculated in the previous subsection, $w(\bar{S}_p, t)$.
• Then $w(\bar{S}_p, t)$ has to be subtracted once.

Finally, the failure frequency related to the output of an OR gate can be calculated from the probabilities of failure and failure frequencies of its inputs (provided that they are independent) without having to compute the overall resulting fault tree. Again, when the inputs are not independent, a satisfactory approximation can be provided using the adapted FT-model proposed in Innal et al. (2014).

Fig. 22.20 Failure frequency calculation by combining two sub-FTs used as input of an OR gate
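Formula 22.33 can be sketched in the same spirit. The function name and the inputs are hypothetical and the two subsystems are assumed independent. The sketch also checks an equivalent reading of the formula: a failure of one subsystem is a failure of the series system only if the other subsystem is in up state at that time.

```python
def or_gate_frequency(u1, w1, u2, w2):
    """Failure frequency at the output of an OR gate (formula 22.33)."""
    w_and = u2 * w1 + u1 * w2          # formula 22.32 for the same inputs
    return w1 + w2 - w_and


# Hypothetical inputs, as would be produced by two separate sub-FTs.
U1, W1, U2, W2 = 0.01, 1e-4, 0.02, 2e-4
w_or = or_gate_frequency(U1, W1, U2, W2)

# Equivalent form: each failure counts only when the other subsystem is up.
w_or_check = (1.0 - U2) * W1 + (1.0 - U1) * W2
print(w_or, w_or_check)
```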


22.5 Reliability Calculations

22.5.1 General Case

Figure 22.21 illustrates the general case of a reliability chronogram: the system remains in up state until it fails at time $t_1$ and after that it remains in down state over $[t_1, \infty)$ because, according to the reliability definition, R(t) = Pr(t < TTF), only the occurrence of its first failure is of interest. Therefore, the only way to obtain the system down state by a logic combination of its component down states is to consider that the components are also stuck in down state after this first failure. As illustrated in Fig. 22.21, this implies that a component failure can be repaired only if the system failure has not occurred before or if it does not provoke the system failure by itself. In addition, the repair in progress of a component is immediately stopped when the system fails. Then, the chronogram related to a component (e.g. A in Fig. 22.21) is a hybrid between an availability chronogram, before the system failure at $t_1$, and a reliability chronogram, after $t_1$. Such a chronogram has no relationship with the availability or with the reliability functions of this component. In addition, it changes according to the value of the random variable TTF which, in turn, depends on what happens on all the components of the system. This shows that the calculation of the reliability/unreliability of a system requires taking into account a systemic dependency which involves all the components of the considered system. This makes the calculations rather difficult and the users of reliability block diagrams should not be misled by the term reliability used to name this approach. It comes from the time when reliability was used as an umbrella term to identify the whole dependability domain, as it is named nowadays. In fact, RBDs and FTs are more effective for availability/unavailability than for reliability/unreliability calculations.
Fig. 22.21 Reliability chronograms of the states of system Sp (Fig. 22.1) according to the states of its components A and B

Whether the considered system is made of non-repaired or repaired components makes a big difference with regards to the above considerations. While reliability calculations are straightforward for systems made of non-repaired components, only approximations based on the failure frequency (see Sect. 22.4) and the Vesely failure rate (see Chap. 4) are available for systems comprising repaired components. The result is that the RBD and FT approaches, generally considered to be very simple techniques, are perhaps those involving the most complicated mathematics when reliability/unreliability calculations are undertaken. In the sections hereafter, the simpler case of systems made of non-repaired components is analysed first, before investigating the general case involving repaired components.

22.5.2 Systems Made of Non-repaired Items

22.5.2.1 Characteristics of Systems Made of Non-repaired Items

As explained in Chap. 4, the term non-repaired is used instead of non-repairable because it is more general:

Non-repaired item: non-repairable item or repairable item which is not actually repaired when it fails.

When a non-repaired item is considered (e.g. A or B at the top of Fig. 22.22), its probability to be in up state at a given instant t is equal to its probability to be in up state continuously over the whole interval [0, t]: its reliability (respectively unreliability) is then identical to its instantaneous availability (respectively instantaneous unavailability) and, for example, $A_A(t) = R_A(t) = \Pr(A, t)$. When a system is made of non-repaired items, it is itself a non-repaired item (warning: the converse is not true because, for reliability calculations, every system is a non-repaired system). Then, the availability chronogram of the parallel system Sp at the bottom of Fig. 22.22 is identical to its reliability chronogram, its reliability $R_{Sp}(t)$ is equal to its instantaneous availability $A_{Sp}(t)$, and its unreliability $F_{Sp}(t)$ is equal to its instantaneous unavailability $U_{Sp}(t)$.

Fig. 22.22 Chronograms of the states of system Sp (Fig. 22.1) when A and B are non-repaired items

Therefore, in the case of systems made of non-repaired components, and only in this case, the reliability of a system can be calculated by combining the reliabilities of its components (using an RBD) and its unreliability by combining the unreliabilities of its components (using an FT).

22.5.2.2 Reliability/Availability Calculations

The simple state-transition model proposed in the middle of Fig. 22.23 represents the general behaviour of a binary item evolving from the up state to the down state through a time-dependent failure rate $\lambda_{Ci}(t)$. The reliability of such an item is given by the following general formula:

$$R_{Ci}(t) = \exp\left[-\int_0^t \lambda_{Ci}(u)\,du\right] \qquad (22.34)$$

The item being non-repaired, this is also its instantaneous availability, $A_{Ci}(t) \equiv R_{Ci}(t)$. In the same way, its unreliability is its instantaneous unavailability, $U_{Ci}(t) \equiv F_{Ci}(t) = 1 - R_{Ci}(t)$.

As shown in Sect. 22.2, the instantaneous availability/unavailability of a system can be calculated from the instantaneous availability/unavailability of its components (provided that the components are independent). Due to the identity between reliability and availability explained above, the reliability of a system made of several non-repaired items can be calculated from an RBD by combining the individual reliabilities/availabilities of its components calculated using formula 22.34. In the same way, the unreliability of the same system can be calculated from an FT by combining the individual unreliabilities/unavailabilities of its components.

A particular and very popular case occurs when the failure rates are constant (exponential law). In this case $\lambda_{Ci}(t) \equiv \lambda_{Ci}\ \forall t$. This leads to the following formula:

$$U_{Ci}(t) \equiv F_{Ci}(t) = 1 - \exp(-\lambda_{Ci} \cdot t) \qquad (22.35)$$

When $\lambda_{Ci} \cdot t \ll 1$, the above formula can be approximated by:

$$U_{Ci}(t) \equiv F_{Ci}(t) \approx \lambda_{Ci} \cdot t \qquad (22.36)$$

Fig. 22.23 Time-dependent event related to a non-repaired item modelled with a failure rate


Fig. 22.24 Reliability and unreliability of a system Sp made of two non-repaired components

Formula 22.36 is very simple, which is why it is widely used by reliability engineers to perform probabilistic calculations, especially when they are done by hand. However, the general formula 22.34 has to be used when the failure rates are not constant (Weibull law, log-normal law, Erlang law, etc.). Figure 22.24 illustrates the reliability and the unreliability of a system made of two non-repaired items modelled with formulae like 22.35 for the RBD and 22.36 for the FT. It has to be noted that a periodically tested system is not made of non-repaired components and cannot be modelled in this way.
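As a minimal numerical sketch of formulae 22.35 and 22.36 and of their combination for the parallel system Sp, the following Python fragment may help. The function names and the failure rates are illustrative assumptions; they are not the values used to draw Fig. 22.24.

```python
import math

def unreliability_exp(lam, t):
    """Formula 22.35: F(t) = 1 - exp(-lam * t) for a constant failure rate."""
    return 1.0 - math.exp(-lam * t)

def unreliability_approx(lam, t):
    """Formula 22.36: F(t) ~ lam * t, acceptable only when lam * t << 1."""
    return lam * t

def parallel_unreliability(f_a, f_b):
    """FT view (AND gate): the parallel system fails when both items have failed."""
    return f_a * f_b

def parallel_reliability(r_a, r_b):
    """RBD view: the parallel system works if at least one item works."""
    return 1.0 - (1.0 - r_a) * (1.0 - r_b)

# Hypothetical failure rates (per hour), chosen only for illustration.
LAM_A, LAM_B = 1.0e-3, 5.0e-4

for t in (10.0, 100.0, 1000.0):
    f_a = unreliability_exp(LAM_A, t)
    f_b = unreliability_exp(LAM_B, t)
    f_s = parallel_unreliability(f_a, f_b)
    r_s = parallel_reliability(1.0 - f_a, 1.0 - f_b)
    print(f"t={t:7.1f} h  F_A={f_a:.4e}  lam_A*t={unreliability_approx(LAM_A, t):.4e}  "
          f"F_Sp={f_s:.4e}  R_Sp={r_s:.6f}")
```

Since $\lambda t \ge 1 - e^{-\lambda t}$ for all $t \ge 0$, the linear approximation always overestimates the unreliability, and the overestimate grows quickly once $\lambda t$ is no longer small.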

22.5.2.3 Failure Density and Failure Rate

The failure density $f_{Ci}(t)$ of an item is the derivative of the failure distribution $F_{Ci}(t)$ (i.e. the unreliability of this item). It is defined as follows (see Chap. 4): conditional probability per unit of time, $f_{Ci}(t)$, that item Ci fails for the first time between t and t + dt, provided that it performs as required at 0.

This is very close to the unconditional failure intensity (failure frequency) $w_{Ci}(t)$, and the difference lies only in the conditions to be taken into account when the item fails:

• performing as required at 0 for $w_{Ci}(t)$;
• performing continuously as required over [0, t] for $f_{Ci}(t)$.

When the system is made of non-repaired items, this is exactly the same thing and therefore:

$$f_{Ci}(t) \equiv w_{Ci}(t) \qquad (22.37)$$

Then, the development in Sect. 22.4 about the instantaneous failure frequency can be used to calculate the failure intensity of a system made of non-repaired items. On the other hand, according to formula 4.50 in Chap. 4, $f(t) = R(t) \cdot \lambda(t)$. This allows the system failure rate to be calculated from the reliability and the failure density:

$$\lambda_S(t) = \frac{f_S(t)}{R_S(t)} \qquad (22.38)$$

Fig. 22.25 Failure density and failure rate of system Sp with two non-repaired components

The failure density and the failure rate of system Sp made of two redundant non-repaired components are drawn in Fig. 22.25. The failure density evolves like the failure frequency in the general case (i.e. when repaired items are involved): it increases first from zero to a maximum and then it reaches an asymptotic value. The difference is that this asymptotic value is equal to zero because the system can fail only once: after the system failure has occurred, it cannot occur again. Then, when the probability of failure increases, the failure frequency decreases. At the limit, when the probability of failure goes to one, the failure frequency goes to zero.

It has to be noted that, in this case, the average failure frequency (i.e. the average value of $f_S(t)$) is meaningless because it can be made as low as wanted just by increasing the time interval of interest. Therefore, the PFH of a safety instrumented system cannot be calculated using $f_S(t)$ (which does not take the repairs into account) instead of $w_S(t)$ (which does take the repairs into account).

The system failure rate also increases from zero to a maximum value and, when enough time has elapsed to allow the components with the higher failure rates to fail, it reaches an asymptotic value which is the smallest failure rate of the components: if $\lambda_B < \lambda_A$ then the asymptote is $\lambda_B$ and it is reached, in the example, after about $6/\lambda_B = 6 \times MTTF_B$ (i.e. 1200 h if $\lambda_B = 5.0 \times 10^{-3}\ \mathrm{h}^{-1}$).

In any case, when analysing systems made of non-repaired items, the failure density or the system failure rate are of little interest. They have been described here for the sake of completeness of the book.
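The behaviour described for Fig. 22.25 can be reproduced with a short sketch. For two independent non-repaired items in parallel with constant failure rates, $R_S(t) = e^{-\lambda_A t} + e^{-\lambda_B t} - e^{-(\lambda_A+\lambda_B)t}$, the failure density is $-dR_S/dt$ and formula 22.38 gives the failure rate. The value of $\lambda_B$ is the one quoted in the text; the value of $\lambda_A$ and the function names are assumptions made for illustration only.

```python
import math

def system_reliability(lam_a, lam_b, t):
    """Reliability of two independent non-repaired items in parallel."""
    return (math.exp(-lam_a * t) + math.exp(-lam_b * t)
            - math.exp(-(lam_a + lam_b) * t))

def system_failure_density(lam_a, lam_b, t):
    """Failure density f_S(t) = -dR_S/dt, obtained analytically."""
    return (lam_a * math.exp(-lam_a * t) + lam_b * math.exp(-lam_b * t)
            - (lam_a + lam_b) * math.exp(-(lam_a + lam_b) * t))

def system_failure_rate(lam_a, lam_b, t):
    """Formula 22.38: lambda_S(t) = f_S(t) / R_S(t)."""
    return (system_failure_density(lam_a, lam_b, t)
            / system_reliability(lam_a, lam_b, t))

LAM_A = 1.0e-2   # assumed value (per hour), not given in the text
LAM_B = 5.0e-3   # value quoted in the text (per hour)

for t in (0.0, 100.0, 300.0, 600.0, 1200.0):
    print(f"t={t:7.1f} h  f_S={system_failure_density(LAM_A, LAM_B, t):.4e}  "
          f"lambda_S={system_failure_rate(LAM_A, LAM_B, t):.4e}")
```

Under these assumptions the failure density starts at zero and eventually decays to zero, while the failure rate climbs towards $\lambda_B$, the smaller of the two rates, and is within about 1% of it at 1200 h.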

22.5.3 Systems Made of Repaired Items

22.5.3.1 Characteristics of Systems Made of Repaired Items

The reliability of a system made of repaired items belongs to the general case presented above in Sect. 22.5.1 with the chronograms illustrated in Fig. 22.22.


It has to be noted that, due to the reliability calculation constraints, the system itself is always a non-repaired item even if its components are repaired items. Contrary to the case of systems made of non-repaired items, where the reliability or unreliability is straightforwardly calculated by RBDs or FTs, it is no longer possible to proceed in the same way when repaired items are involved. In fact, exact reliability calculations are not even possible at all. Fortunately, under some conditions, rather good approximations can be obtained using the Vesely failure rate (conditional failure intensity) instead of the true failure rate.

22.5.3.2 Failure Frequency and Vesely Failure Rate

The approximated failure density $f_{Sp}(t)$ and the approximated failure rate $\lambda_{V,Sp}(t)$ of system Sp made of two redundant repaired components are illustrated in Fig. 22.26. The exact values calculated by a multiphase Markov graph (see Chap. 31) are drawn in dotted lines. The failure density $f_{Sp}(t)$ has the normal shape of a failure density function: it starts from zero, goes to a maximum value and then decreases until it reaches its asymptotic value, which is zero. The Vesely failure rate $\lambda_{V,Sp}(t)$ of this system also reaches an asymptotic value: the difference with the non-repaired case is that the convergence is now very fast, as it is governed by the largest mean overall repair time, $MORT_{max}$, of the components. The shorter $MORT_{max}$, the faster the convergence, which is obtained after two or three times this $MORT_{max}$ (e.g. about 30 h if $MORT_{max}$ = 10 h). Again, the exact values are drawn in dotted lines.

Figure 22.27 is similar to Fig. 22.26 but the components are periodically tested. In this example, components A and B have the same failure and repair rates and also the same test interval, but component B is tested in the middle of the test interval of component A. The exact values calculated by a multiphase Markov graph (see Chap. 31) are drawn in dotted lines: the behaviours of the failure density and of the Vesely failure rate are similar to the repaired case but with peaks due to the performance of the tests.

Fig. 22.26 Approximated failure density and Vesely failure rate of system Sp with repaired components


Fig. 22.27 Approximated failure density and Vesely failure rate of system Sp made of two redundant periodically tested components

22.5.3.3 Reliability Calculations

As previously said, it is not possible to perform exact reliability/unreliability calculations using RBDs/FTs with systems made of repaired items. Fortunately, the Vesely failure rate provides rather good approximations, and the faster the repairs, the better the approximations:

$$R_{Ci}(t) = \exp\left[-\int_0^t \lambda_{Ci}(u)\,du\right] \approx \exp\left[-\int_0^t \lambda_{V,Ci}(u)\,du\right] \qquad (22.39)$$

In Fig. 22.28, the approximated reliability/unreliability of the system made of the two redundant repaired components A and B has been calculated using the Vesely failure rate (see Fig. 22.26) instead of the true failure rate (which is not accessible with RBDs or FTs). The exact values obtained by a Markovian model (see Chap. 31) have been drawn in dotted lines. At the top of the figure, a rather large mean overall repair time (100 h) has been considered in order to clearly visualize the difference but, in spite of that, the approximations are pretty good and, above all, they are conservative (i.e. the approximated reliability is lower than the exact one and the approximated unreliability is greater than the exact one). The same curves have been drawn at the bottom of the figure with mean overall repair times of 10 h: approximated and exact values are almost superposed.

Fig. 22.28 Approximated reliability and unreliability of system Sp made of repaired components

The same procedure has been applied to the system made of periodically tested components. The approximated reliability/unreliability have been calculated using the Vesely failure rate (see Fig. 22.27) and the results presented in Fig. 22.29 are similar to those in Fig. 22.28, except that the reliability/unreliability curves are continuous but not differentiable at the instants of the tests. Again, the exact values obtained by a Markovian model (see Chap. 31) have been drawn in dotted lines and again the approximation is pretty good and conservative.

Fig. 22.29 Approximated reliability and unreliability of system Sp made of periodically tested components

Some attentive readers could object that, when a periodically tested item fails, the failure is detected, on average, after half the test interval and that the restoration time (half the test interval + item MORT) is very long, which contradicts what has been said above about the need for quick repairs for good approximations. This would be forgetting that a periodically tested item is a hybrid between a non-repaired and a repaired item. When the MORTs are small, such items in fact spend most of their time as non-repaired items and only short periods of time as repaired items just after the tests. This explains why exact and approximated values are close together.
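The conservative behaviour of formula 22.39 for immediately repaired components can be checked with a rough sketch. It assumes two identical, independent components with constant failure rate λ and repair rate μ = 1/MORT; the value of λ, the time horizon and all function names are hypothetical. The exact reliability comes from a small Markov graph with an absorbing failed state, integrated here with a basic Runge-Kutta scheme, whereas the Vesely failure rate is taken as the system failure frequency divided by the system availability.

```python
import math

def exact_reliability(lam, mu, t_end, dt=0.1):
    """Exact R(t_end) of two identical redundant repaired items.

    Markov graph with an absorbing 'system failed' state:
    p0 = both items up, p1 = one item up and one under repair.
    """
    def deriv(p0, p1):
        return (-2.0 * lam * p0 + mu * p1,
                2.0 * lam * p0 - (lam + mu) * p1)

    p0, p1 = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        # Classical fourth-order Runge-Kutta step.
        a0, a1 = deriv(p0, p1)
        b0, b1 = deriv(p0 + 0.5 * dt * a0, p1 + 0.5 * dt * a1)
        c0, c1 = deriv(p0 + 0.5 * dt * b0, p1 + 0.5 * dt * b1)
        d0, d1 = deriv(p0 + dt * c0, p1 + dt * c1)
        p0 += dt * (a0 + 2.0 * b0 + 2.0 * c0 + d0) / 6.0
        p1 += dt * (a1 + 2.0 * b1 + 2.0 * c1 + d1) / 6.0
    return p0 + p1

def vesely_rate(lam, mu, t):
    """Vesely failure rate of the parallel system: w_S(t) / A_S(t)."""
    q = lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))  # item unavailability
    w_item = lam * (1.0 - q)           # item failure frequency
    w_sys = 2.0 * w_item * q           # an item fails while the other one is down
    return w_sys / (1.0 - q * q)       # divided by the system availability

def vesely_reliability(lam, mu, t_end, dt=0.1):
    """Right-hand side of formula 22.39 (trapezoidal integration)."""
    integral = 0.0
    for k in range(int(round(t_end / dt))):
        integral += 0.5 * dt * (vesely_rate(lam, mu, k * dt)
                                + vesely_rate(lam, mu, (k + 1) * dt))
    return math.exp(-integral)

LAM = 1.0e-3     # assumed failure rate (per hour)
T_END = 1000.0   # assumed time horizon (hours)

for mort in (100.0, 10.0):
    mu = 1.0 / mort
    r_exact = exact_reliability(LAM, mu, T_END)
    r_approx = vesely_reliability(LAM, mu, T_END)
    print(f"MORT={mort:5.1f} h  exact R={r_exact:.6f}  Vesely R={r_approx:.6f}")
```

With these assumed values the approximated reliability stays slightly below the exact one, and the gap shrinks markedly when the MORT goes from 100 h to 10 h, which is the trend described for Fig. 22.28. Periodically tested components are not covered by this sketch.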

22.6 Dynamic Fault Trees

When the primary events of an FT are not independent as time elapses, the ordinary FTs described above are no longer usable but can be superseded by dynamic fault trees (DFT) (Bobbio et al. 2004; Distefano 2005; Simeu-Abazi et al. 2011). Examples of such dynamic dependencies have already been described:

• in Fig. 22.10, where a composite primary event has been implemented to model two components operated in cold redundancy;
• in Fig. 22.18 and Fig. 22.19, where a priority AND gate (PAND) has been implemented to combine an initiating event with the failure of a protection layer in order to calculate the failure frequency of the modelled system.


Many dynamic gates can be introduced to model dynamic fault trees but the main ones are the PAND gate already mentioned and the sequence gate (SEQ). They are illustrated in Fig. 22.30. The dynamic dependencies between the primary events S1 and S2 are the following:

• PAND: the output becomes “true” only if the input on the right-hand side (i.e. S2) becomes “true” when the input on the left-hand side (i.e. S1) is already “true”.
• SEQ: the output becomes “true” when both inputs are “true” but the occurrence of the input on the right-hand side (i.e. S2) is inhibited as long as the input on the left-hand side (i.e. S1) is not “true”.

When the inputs are primary events with Markovian behaviours, these dynamic gates can be replaced by composite blocks modelled by Markov graphs like the one proposed in Fig. 22.31 to model a PAND gate or the one proposed in Fig. 22.32 to model a SEQ gate. Non-repaired items A and B have been chosen because more assumptions would be needed to define the behaviour of PAND and SEQ gates in the case of repaired items. The corresponding Markov graphs are rather different and help to understand the difference between the two types of dynamic gates.

Fig. 22.30 PAND and SEQ gates to build dynamic fault trees

Fig. 22.31 Markov process modelling a PAND gate with primary events as inputs

Fig. 22.32 Markov process modelling a SEQ gate with primary events as inputs

The above models belong to the FT-driven Markov processes (see Sect. 22.2.2) which, to some extent, allow dynamic fault trees to be calculated in specific cases (Markovian behaviour of primary events and dynamic gates with primary events as inputs). However, in the general case, this does not cover all the modelling needs and Monte Carlo simulations have to be undertaken. In this case, Petri nets (see Chap. 33) prove to be very useful to close the gap.
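For the specific case of two non-repaired items A and B with constant failure rates, the small Markov graphs associated with the PAND and SEQ gates can be solved in closed form. The sketch below is only an illustration under that assumption: the function names and numerical values are hypothetical, and the SEQ gate is read as "B can only start failing once A has failed".

```python
import math

def pand_probability(lam_first, lam_second, t):
    """PAND output at t: the first input fails before the second, both by t.

    Integral over u in [0, t] of
    lam_first * exp(-lam_first * u) * (exp(-lam_second * u) - exp(-lam_second * t)).
    """
    total = lam_first + lam_second
    return (lam_first / total * (1.0 - math.exp(-total * t))
            - math.exp(-lam_second * t) * (1.0 - math.exp(-lam_first * t)))

def seq_probability(lam_a, lam_b, t):
    """SEQ output at t: B is inhibited until A has failed (sum of two delays)."""
    if math.isclose(lam_a, lam_b):
        return 1.0 - math.exp(-lam_a * t) * (1.0 + lam_a * t)
    return 1.0 - ((lam_b * math.exp(-lam_a * t) - lam_a * math.exp(-lam_b * t))
                  / (lam_b - lam_a))

LAM_A, LAM_B, T = 1.0e-3, 2.0e-3, 500.0   # assumed values (per hour, hours)

p_ab = pand_probability(LAM_A, LAM_B, T)   # A fails first, then B
p_ba = pand_probability(LAM_B, LAM_A, T)   # B fails first, then A
p_and = (1.0 - math.exp(-LAM_A * T)) * (1.0 - math.exp(-LAM_B * T))
print(f"PAND(A then B)={p_ab:.6f}  PAND(B then A)={p_ba:.6f}  AND={p_and:.6f}")
print(f"SEQ(A then B)={seq_probability(LAM_A, LAM_B, T):.6f}")
```

The two PAND orderings add up to the ordinary AND gate, which illustrates that a PAND gate only retains one of the two failure sequences, whereas the SEQ gate changes the behaviour of B itself.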

22.7 Associated Exercises

Four exercises related to Chap. 22 are proposed in Chap. 29:

• Exercise 22.1: calculate the PFDavg (average unavailability), the PFH (average failure frequency) and the unreliability (probability of failure) of an overpressure protection system.
• Exercise 22.2: same as Exercise 22.1, with partial and full stroking tests of safety valves.
• Exercise 22.3: extend Exercise 22.1 to model common cause failures on sensors, valves and logic solvers.
• Exercise 22.4: extend Exercise 22.1 to model the staggering of the tests of the safety valves.

References

Bobbio A, Codetta-Raiteri AD (2004) Parametric fault trees with dynamic gates and repair boxes. RAMS 2004. IEEE, USA, pp 459–465
Cox DR (1962) Renewal theory. Methuen & Co. Ltd., London. Reprinted: Chapman and Hall, London, 1982
Distefano S (2005) System dependability and performances: techniques, methodologies and tools. Doctoral thesis, University of Messina, Italy
GRIF-Workshop (2020) Boolean module. Funded and developed by TOTAL, https://grif-workshop.fr/. Accessed Sep 2020
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
Innal F et al (2014) Probability and frequency calculations related to protection layers revisited. Journal of Loss Prevention in the Process Industries 31:56–69. Elsevier
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer
Simeu-Abazi Z, Lefebvre A, Derain J-P (2011) A methodology of alarm filtering using dynamic fault tree. Reliab Eng Syst Safety (RESS) 96(2):257–266. Elsevier
Vinuessa C, Folleau C, Doux F, Collas S (2019) New frequencies assessment method for safety analysis. In: Proceedings of the ESREL2019. Hanover, Germany

Chapter 23

CCF Modelling with FTs and RBDs

23.1 Introduction

The common cause failures (CCFs) are described and analysed in Chap. 5. As they are the root causes of dependent events, it may seem contradictory to use Boolean models (like fault trees or reliability block diagrams, which are designed to handle independent events) to deal with them. Among the various categories identified in Chap. 5, this obviously excludes dynamic dependencies, but the apparent paradox is solved as many CCF categories can be modelled as independent primary events (FTs) or independent blocks (RBDs).

The relationships between CCFs and Boolean models have already been analysed in Chap. 17. The aim of this chapter is to develop in more detail what has been proposed in Sect. 17.2 and to explain which kinds of CCFs can be modelled by implementing Boolean approaches, how this can be undertaken and what the limitations of this approach are, in particular when the items impacted by a CCF are repaired after being failed by this CCF.

For the purpose of this chapter, the CCFs are split between:

• tangible and non-tangible CCFs and
• CCFs implying the failure of the impacted items with a probability equal to 1 (interpreted as lethal shocks) and those implying the failure of the impacted items with a probability lower than 1 (interpreted as non-lethal shocks).

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_23


23.2 Modelling Tangible CCFs

23.2.1 Introduction of Tangible CCFs in RBD and FT Models

23.2.1.1 CCF as a Simple Individual Event or as a Lethal Shock

As explained in Chap. 5, the tangible CCFs are the results of causes which can be clearly identified. Therefore, they can be handled like any other events within Boolean models. This is the case, for example, of events like loss of power, command-control failure, pipe plugging, fire, flooding, etc. which impact several items at the same time and can be clearly identified as individual events.

Let us consider the fault tree (Fig. 16.16) developed for the pumping system described in Chap. 16 (Fig. 16.12). As explained in that chapter, the primary events have not been developed further for the sake of simplicity; however, this could be useful, in particular, with regard to common cause failures. Let us consider, for example, that three pumps are supplied by the same source of electrical power, which constitutes a common cause of failure $C_{cf}$. In this case, a given pump $P_i$ is in the up state when the pump is in the up state independently of the CCF, $P_i^{Ind}$, and when the common cause failure has not occurred, $\overline{C_{cf}}$. Reciprocally, this given pump is in the down state, $\overline{P_i}$, when the pump is in the down state independently of the CCF, $\overline{P_i^{Ind}}$, or when the common cause failure has occurred, $C_{cf}$. This can be written as follows:

$$P_i = P_i^{Ind} \cap \overline{C_{cf}} \;\Leftrightarrow\; \overline{P_i} = \overline{P_i^{Ind}} \cup C_{cf} \qquad (23.1)$$

The above formula leads to the sub-RBD (left-hand side) and the sub-FT (right-hand side) illustrated in Fig. 23.1, which explains how such a tangible CCF can be introduced as:

• a repeated block within an RBD (serial logic structure);
• a repeated primary event within an FT (OR gate).

Fig. 23.1 Tangible CCF modelling by RBDs and FTs

This leads to the RBD illustrated on the left-hand side of Fig. 23.2, where each single block related to one pump of the RBD provided in Fig. 15.10 (Chap. 15) is replaced by a sub-RBD like the one illustrated on the left-hand side of Fig. 23.1. The RBD obtained models, for example, the pumping system including pumps subject to failures due to loss of power. This RBD is rather difficult to calculate by hand but, after logic simplification, the equivalent RBD drawn on the right-hand side of the same figure is easier to calculate (and should be preferred for this purpose). However, when BDDs (see Chap. 21) are implemented, the repeated blocks do not matter and the two representations do not make any difference. Then, the development illustrated in Fig. 23.1 (left-hand side) provides a general way to introduce tangible CCFs within RBD models.

In the same way, each primary event related to one pump can be replaced by the sub-FT including the CCF within the FT provided in Fig. 16.16 (Chap. 16). This leads to the FT modelling the pumping system including pumps subject to failures due to loss of power, which is presented in Fig. 23.3.

Fig. 23.2 Introduction of pump CCF into the RBD illustrated in Fig. 15.10 and simplification

Fig. 23.3 Introduction of pump CCF into the FT illustrated in Fig. 16.16 and simplification


Fig. 23.4 Equivalent Markov graph (see Chap. 31) modelling block and primary event CCFs

Again, the FT obtained on the left-hand side can be simplified into an equivalent FT presented on the right-hand side, which is easier to calculate by hand. Again, when BDDs are implemented, this simplification is not useful and the development illustrated in Fig. 23.1 (right-hand side) provides a general way to introduce tangible CCFs within FT models.

Until now, only the static aspects of the CCFs have been considered and the above developments (RBDs and FTs) are only valid with blocks or primary events implying constant probabilities, which can be easily extended to non-repaired items with time-dependent probabilities (e.g. described by failure rates), as explained in Chap. 22. This includes the CCFs related to a failure described by a failure rate: in this case, the impacted items go to the down state as soon as the CCF occurs and it is as if they were suffering a shock leading immediately to their failure. This is why this kind of CCF can be assimilated to a lethal shock (Atwood 1986), see Sect. 23.3.2.

When the CCF occurs, two situations have to be considered with regard to the states of the impacted items:

1. the functioning of the items is only inhibited when the CCF is present (e.g. loss of power supply);
2. the items actually fail upon the occurrence of the CCF.

In the first case, the impacted items run again when the cause of the CCF disappears (e.g. is repaired). Therefore, the occurrence of the CCF can be modelled, for example, by a failure rate, $\lambda_{Ccf}$, and its repair by a repair rate, $\mu_{Ccf}$. This has been done with the simple Markov graph (see Chap. 31) represented in Fig. 23.4, which can be used to calculate the probability of the corresponding block of an RBD or primary event of an FT. In the second case, the impacted items actually have to be repaired before running again: modelling such repairs is generally not possible with Boolean models and the problem is analysed in more detail in Sect. 23.4.
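The role of the repeated event can be made concrete with a small sketch. It assumes a hypothetical pair of redundant pumps (not the full pumping system of Fig. 16.12) sharing one power supply, with the CCF modelled by the two-state Markov graph of Fig. 23.4 (rates $\lambda_{Ccf}$ and $\mu_{Ccf}$). All names and numerical values are assumptions chosen for illustration.

```python
import math

def unreliability(lam, t):
    """Non-repaired item with a constant failure rate (formula 22.35)."""
    return 1.0 - math.exp(-lam * t)

def repaired_ccf_probability(lam_ccf, mu_ccf, t):
    """Probability that the common cause is present at t (two-state Markov graph)."""
    total = lam_ccf + mu_ccf
    return lam_ccf / total * (1.0 - math.exp(-total * t))

def pump_failed(q_ind, q_ccf):
    """OR gate of formula 23.1: pump down on independent failure OR on the CCF."""
    return 1.0 - (1.0 - q_ind) * (1.0 - q_ccf)

def two_pumps_failed_exact(q1, q2, q_ccf):
    """Both pumps down, obtained by conditioning on the repeated CCF event."""
    return q_ccf + (1.0 - q_ccf) * q1 * q2

def two_pumps_failed_naive(q1, q2, q_ccf):
    """Deliberately wrong: treats the repeated event as two independent events."""
    return pump_failed(q1, q_ccf) * pump_failed(q2, q_ccf)

# Hypothetical parameters (per hour) and time (hours).
LAM_PUMP, LAM_CCF, MU_CCF, T = 1.0e-4, 1.0e-5, 1.0e-1, 1000.0

q = unreliability(LAM_PUMP, T)
c = repaired_ccf_probability(LAM_CCF, MU_CCF, T)
print(f"q_pump={q:.4e}  q_ccf={c:.4e}")
print(f"exact={two_pumps_failed_exact(q, q, c):.4e}  "
      f"naive={two_pumps_failed_naive(q, q, c):.4e}")
```

The conditioning step is, in essence, what a BDD does on the repeated variable. The naive product is always lower than or equal to the exact value (the difference is $c(1-c)(1-q_1)(1-q_2)$), so ignoring the repetition would be optimistic.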

23.2.1.2 CCF as Non-lethal Shock

The analysis of lethal shocks above leads to the idea of analysing what happens when an impacted item goes to the down state only with a given probability (e.g. a transitory overvoltage opening the circuit breakers protecting the pumps). In this case, the common cause failure can be assimilated to a non-lethal shock which can be characterised by the following parameters (Atwood 1986):

• ρ: occurrence rate.
• $\gamma_i$: conditional probability of failure of item i upon the non-lethal shock.

This model deals with time-dependent probabilities for the CCFs and constant probabilities for the impacted items. It is more difficult to model than a lethal shock with Boolean models because it introduces a dynamic dependency between the CCF and the impacted items, which fail with the probability $\gamma_i$ and survive with the probability $(1 - \gamma_i)$. Modelling such a dependency is, normally, outside the scope of Boolean models. However, this is possible in particular cases.

Non-repaired non-lethal shock: A non-repaired non-lethal shock is modelled in Fig. 23.5 by a simple Markov graph (Chap. 31). Such a non-lethal shock occurs only once: its probability of occurrence between t and t + dt is equal to $\rho \cdot e^{-\rho t}\,dt$ and the probability that an impacted item i fails between t and t + dt is equal to $\gamma_i \cdot \rho \cdot e^{-\rho t}\,dt$. Then, the probability, $Pr_i^{nLc}(t)$, that the item is in the down state at a given time t due to the occurrence of this non-lethal shock is equal to $\gamma_i \cdot \int_0^t \rho \cdot e^{-\rho \tau}\,d\tau$, and:

$$Pr_i^{nLc}(t) = \gamma_i \cdot \left(1 - e^{-\rho t}\right) \qquad (23.2)$$

Fig. 23.5 Equivalent Markov graph (see Chap. 31) and FT related to a non-lethal non-repaired shock

$Pr_i^{nLc}(t)$ is obtained as the product of the probability of occurrence, $1 - e^{-\rho t}$, of the CCF over [0, t] with the probability of failure $\gamma_i$ of the impacted item upon this CCF. Therefore, $Pr_i^{nLc}(t)$ can be modelled with a logic AND gate in a fault tree and also by a parallel structure in an RBD. The equivalent FT is illustrated in the middle of Fig. 23.5 and the curve related to $Pr_i^{nLc}(t)$ on the right-hand side of the same figure. It has to be noted that, in the long run, $Pr_i^{nLc}(t)$ reaches an asymptotic value $Pr_i^{nLc}(\infty) = \gamma_i$. This approach allows non-repaired non-lethal shocks impacting non-repaired items to be modelled.
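Formula 23.2 can be checked numerically against the integral from which it is derived. The sketch below is only an illustration; the function names and the values of $\gamma_i$ and ρ are assumptions.

```python
import math

def non_lethal_down_probability(gamma, rho, t):
    """Formula 23.2: AND of 'shock occurred over [0, t]' and 'item fails on shock'."""
    return gamma * (1.0 - math.exp(-rho * t))

def non_lethal_down_probability_numeric(gamma, rho, t, steps=10000):
    """Midpoint-rule integration of gamma * rho * exp(-rho * tau) over [0, t]."""
    dt = t / steps
    return sum(gamma * rho * math.exp(-rho * (k + 0.5) * dt) * dt
               for k in range(steps))

GAMMA, RHO = 0.3, 1.0e-3   # assumed conditional probability and shock rate (per hour)

for t in (100.0, 1000.0, 10000.0):
    closed = non_lethal_down_probability(GAMMA, RHO, t)
    numeric = non_lethal_down_probability_numeric(GAMMA, RHO, t)
    print(f"t={t:8.1f} h  closed form={closed:.6f}  numeric={numeric:.6f}")
```

For large t the probability tends to $\gamma_i$, the asymptote mentioned above.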


Repaired non-lethal shock: In real life, the failures leading to non-lethal shocks are going to be repaired, and this implies that the non-lethal shock may occur several times over an interval [0, t]. Such a repaired non-lethal shock is modelled by the simple Markov graph on the left-hand side of Fig. 23.6 (warning: in this model the items are non-repaired, see Sect. 23.4 for considerations about repaired items). Similar but more complicated calculations than above lead to the same conclusion: the Markov graph on the left-hand side of the figure is equivalent to the FT in the middle of the figure, which combines the probability of the non-lethal shock at time t with the probability that the item fails due to the shock. In this case, and when the cause of the non-lethal shock is rapidly repaired, $Pr_i^{nLc}(t)$ quickly reaches an asymptotic value $Pr_i^{nLc}(\infty) \approx \gamma_i \cdot \rho / \mu_{nLc}$.

Fig. 23.6 Equivalent Markov graph (see Chap. 31) and FT related to a non-lethal repaired shock

Therefore, and again, a repaired non-lethal shock impacting non-repaired items can be modelled with a logic AND gate in an FT or a parallel structure in an RBD. This is illustrated in Fig. 23.7:

• a repeated block within an RBD (parallel logic structure on the left-hand side);
• a repeated primary event within an FT (AND gate on the right-hand side).

Fig. 23.7 Shock model CCF modelling by RBDs and FTs

In these models, the probability of failure $\gamma_i$ is related to one of the items impacted by the non-lethal shock and the non-lethal shock itself (nLc) can be modelled (non-repaired or repaired) as explained above.

This is implemented in Fig. 23.8, within the RBD provided in Fig. 15.10 (Chap. 15), where each single block related to a pump has been replaced by a sub-RBD like the one illustrated on the left-hand side of Fig. 23.7. This leads to an RBD modelling the pumps subject to non-lethal shocks (e.g. an overvoltage of short duration).

Fig. 23.8 Introduction of non-lethal shock on pumps into the RBD illustrated in Fig. 15.10

Contrary to the RBD established for a lethal shock, no logic simplification is available. Fortunately, this does not matter when BDDs (see Chap. 21) are implemented and the development illustrated in Fig. 23.7 (left-hand side) provides a general way to introduce tangible non-lethal shocks within RBD models.

In the same way, each primary event related to pumps can be replaced in the FT provided in Fig. 16.16 (Chap. 16) by a sub-FT including the non-lethal shock. This leads to the FT modelling the pumps subject to non-lethal shocks (e.g. an overvoltage) presented in Fig. 23.9. Again, the FT obtained cannot be simplified but this does not matter when BDDs are implemented. Then, the development illustrated in Fig. 23.7 (right-hand side) provides a general way to introduce non-lethal shocks within FT models.

Again, and similarly to the lethal shocks above, two situations have to be considered with regard to the states of the impacted items when the non-lethal shock occurs:

1. the functioning of the items is only inhibited when the CCF happens (e.g. cutting off their power supplies by the spurious opening of their individual protective circuit breakers set too low in case of a fugitive overvoltage);
2. the item actually fails upon the occurrence of the CCF (e.g. when it actually burns with a given probability $\gamma_i$ when an overvoltage occurs).

Fig. 23.9 Introduction of non-lethal shock on pumps into the FT illustrated in Fig. 16.16

In the first case, the impacted items run again as soon as the shock disappears (or after the very short time needed to close the circuit breakers). In the second case, the impacted items actually have to be repaired before running again: modelling such repairs is generally not possible with Boolean models and the problem is analysed in more detail in Sect. 23.4.
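The Boolean treatment of a repaired non-lethal shock (first case above, items only inhibited) can be sketched as follows. The shock is represented by a two-state Markov graph with rates ρ and $\mu_{nLc}$, each pump carries its own independent "failure on shock" event of probability γ, and the shock itself is the repeated event. The two-pump redundancy, the function names and all values are assumptions used only to illustrate the logic of Fig. 23.7.

```python
import math

def shock_present(rho, mu, t):
    """Probability that the repaired non-lethal shock is present at time t."""
    total = rho + mu
    return rho / total * (1.0 - math.exp(-total * t))

def item_down_on_shock(gamma, rho, mu, t):
    """AND gate: shock present at t AND item fails upon the shock."""
    return gamma * shock_present(rho, mu, t)

def two_pumps_failed(q_ind, gamma, s):
    """Both pumps down, conditioning on the repeated 'shock present' event.

    q_ind: independent failure probability of each pump,
    gamma: conditional failure probability of each pump upon the shock,
    s:     probability that the shock is present.
    """
    q_given_shock = 1.0 - (1.0 - q_ind) * (1.0 - gamma)   # OR of the two causes
    return s * q_given_shock ** 2 + (1.0 - s) * q_ind ** 2

# Hypothetical parameters (rates per hour).
GAMMA, RHO, MU, Q_IND = 0.3, 1.0e-4, 1.0e-1, 1.0e-2

for t in (1.0, 10.0, 100.0, 1000.0):
    s = shock_present(RHO, MU, t)
    print(f"t={t:7.1f} h  Pr_nLc={item_down_on_shock(GAMMA, RHO, MU, t):.4e}  "
          f"both pumps down={two_pumps_failed(Q_IND, GAMMA, s):.4e}")
```

With μ much larger than ρ, the item probability settles after a few multiples of $1/\mu_{nLc}$ at about $\gamma \cdot \rho/\mu_{nLc}$. Setting γ = 1 in `two_pumps_failed` gives back the lethal shock expression, and γ = 0 removes the CCF contribution.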

23.3 Modelling Non-tangible CCFs

According to Chap. 5, non-tangible CCFs are those which are difficult to identify and qualify but which do exist and cannot be ignored without risking over-optimistic evaluations. Several approximate approaches have been introduced to deal with such CCFs; the beta-factor model (IEC 61508-6 2010) and the shock model (Atwood 1986) are analysed hereafter.


23.3.1 Beta-Factor Model

This approach applies to a set of similar impacted items with the same failure rate, λ. According to Chap. 5, the beta-factor model consists in estimating the CCF failure rate, $\lambda_{Ccf}$, affecting these items as a percentage, β, of the total failure rate (e.g. 5%) and the independent failure rate of each item, $\lambda_{Ind}$, as the complement:

$$\lambda_{Ccf} = \beta \cdot \lambda \qquad (23.3)$$

$$\lambda_{Ind} = (1 - \beta) \cdot \lambda \qquad (23.4)$$

Then the failure rate of each item is obtained as:

$$\lambda = \lambda_{Ind} + \lambda_{Ccf} \qquad (23.5)$$

Coming back to Fig. 23.1, the independent failure rate, $\lambda_{Ind}$, can be used to calculate the independent probability of failure, $\overline{P_i^{Ind}}(t)$, or of success, $P_i^{Ind}(t)$, and the CCF failure rate, $\lambda_{Ccf}$, can be used to calculate the probability of occurrence of the CCF, $P_{Ccf}(t)$, or of non-occurrence, $\overline{P_{Ccf}}(t)$. Therefore, the beta-factor model is simply an adaptation of the process used to model tangible CCFs (lethal shocks) to non-tangible CCFs impacting similar items.

It has to be noted that, when dealing with time-independent CCFs, a similar approach can be applied to a set of similar impacted items with the same constant failure probability, γ, which can be split between $\gamma_{Ind}$ and $\gamma_{Ccf}$ such that:

$$\gamma_{Ccf} = \beta \cdot \gamma \qquad (23.6)$$

$$\gamma_{Ind} = (1 - \beta) \cdot \gamma \qquad (23.7)$$

and:

$$\gamma = \gamma_{Ind} + \gamma_{Ccf} \qquad (23.8)$$

Then, $\gamma_{Ind}$ can play the role of $\overline{P_i^{Ind}}$ and $\gamma_{Ccf}$ that of $P_{Ccf}$ in the sub-FT on the right-hand side of Fig. 23.1. In the same way, $(1 - \gamma_{Ind})$ can play the role of $P_i^{Ind}$ and $(1 - \gamma_{Ccf})$ that of $\overline{P_{Ccf}}$ in the sub-RBD on the left-hand side of Fig. 23.1.
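As a rough illustration of formulae 23.3 to 23.5, the fragment below splits a failure rate with a beta factor and evaluates a hypothetical pair of redundant, non-repaired, similar items using the OR structure of Fig. 23.1. The function names, the redundancy and the numerical values are assumptions.

```python
import math

def split_beta(lam, beta):
    """Formulae 23.3 and 23.4: returns (lam_ind, lam_ccf)."""
    return (1.0 - beta) * lam, beta * lam

def item_unreliability(lam, beta, t):
    """One item: independent failure OR common cause failure."""
    lam_ind, lam_ccf = split_beta(lam, beta)
    f_ind = 1.0 - math.exp(-lam_ind * t)
    f_ccf = 1.0 - math.exp(-lam_ccf * t)
    return 1.0 - (1.0 - f_ind) * (1.0 - f_ccf)

def redundant_pair_unreliability(lam, beta, t):
    """Two similar items in parallel, the CCF being a repeated event."""
    lam_ind, lam_ccf = split_beta(lam, beta)
    f_ind = 1.0 - math.exp(-lam_ind * t)
    f_ccf = 1.0 - math.exp(-lam_ccf * t)
    return f_ccf + (1.0 - f_ccf) * f_ind ** 2

LAM, T = 1.0e-3, 500.0   # assumed failure rate (per hour) and time (hours)

for beta in (0.0, 0.05, 0.10):
    print(f"beta={beta:4.2f}  item F={item_unreliability(LAM, beta, T):.6f}  "
          f"pair F={redundant_pair_unreliability(LAM, beta, T):.6f}")
```

The unreliability of a single item does not depend on β, in line with formula 23.5, whereas the unreliability of the redundant pair increases with β in this example.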


23.3.2 Shock Model

According to Chap. 5, the shock model (Atwood 1986) is another technique to handle non-tangible CCFs when dealing with a set of similar impacted items with the same failure rate, λ. Three parameters are involved:

• ω: occurrence rate for lethal shocks.
• ρ: occurrence rate for non-lethal shocks.
• γ: conditional probability of failure of each item, given a non-lethal shock.

As analysed in Chap. 5, ω and ρ can be estimated as percentages of the total failure rate λ. The lethal shock has already been analysed in Sects. 23.2.1.1 and 23.3.1 and the non-lethal shock in Sect. 23.2.1.2. Then the shock model is a combination of the various cases analysed above. When both lethal and non-lethal shocks are considered, the techniques proposed in Figs. 23.1 and 23.7 can be combined to obtain RBDs and FTs modelling both of them.

23.4 Considerations with Regards to Item Repairs

As shown above, the proposed approaches properly model the occurrence and repair of the CCFs themselves as well as their impacts with regards to item failures, but they do not take into account the repairs of the individual items which actually fail upon CCF impacts. They work very well when the impacted items are only inhibited when the CCF occurs, as is the case, for example, when the power supply is lost: the impacted items restart as soon as the power is restored. Therefore, and except if the items are subject to failures to start on demand (which is another problem), the items do not have to be repaired and the lethal shock model is relevant. This is different in the case of, e.g., flooding, where the impacted items actually have to be repaired after the CCF has occurred: this cannot be properly modelled with the lethal shock models described above.

It is the same when non-lethal shocks are considered. For example, if some protection circuit breakers are set too low, they can spuriously open (with a given constant probability $\gamma$) when a fugitive overvoltage occurs. In this case, some of the impacted items are lost and some others are not affected. When the overvoltage disappears, the items can immediately be restarted. On the other hand, in the case of a high overvoltage, some items can burn and they do actually have to be repaired.

To see what happens, the repaired items impacted by CCFs can be modelled by Markov graphs like those proposed in Fig. 23.10:

• On the left-hand side is modelled a lethal shock impacting an item being actually repaired.


Fig. 23.10 Markov graph related to lethal and non-lethal shocks and repaired items

• On the right-hand side is modelled a non-lethal shock impacting an item being actually repaired.

In order to simplify, it has been supposed in both cases that the failed item can be repaired even if the common cause of failure is still present. The problem with such models is that the parameters related to the CCF and those related to the impacted item are interlinked in such a way that it is not possible to split them into independent parts (i.e. a part related to the common cause failure and another one specific to a given impacted item). Therefore, it is not possible to model them by using the approaches proposed in Fig. 23.1 or in Fig. 23.7. Of course, the same limitations apply when implementing the beta-factor or the shock models for the non-tangible CCFs.

Nevertheless, with regards to the lethal shocks, some approximations could be considered, for example when only one repair team is available:

• Series structure: after a lethal shock, all the impacted items plus the CCF itself have to be repaired. Then, the mean overall repair time, $MORT_{eq}$, can be estimated as the sum of the overall repair times of all the items ($1/\mu_i$) plus that of the CCF itself ($1/\mu_{Ccf}$). This leads to $MORT_{eq} \approx \frac{1}{\mu_{Ccf}} + \sum_i \frac{1}{\mu_i}$ and to replacing the repair rate, $\mu_{Ccf}$, by an equivalent repair rate $\mu_{eq} \approx 1/MORT_{eq}$.
• Parallel structure: after a lethal shock, at least one of the impacted items plus the CCF itself have to be repaired. The mean overall repair time, $MORT_{eq}$, can be estimated as the minimum of the sum of the overall repair time of one of the items plus that of the CCF itself. This leads to $MORT_{eq} \approx \min_i \left( \frac{1}{\mu_{Ccf}} + \frac{1}{\mu_i} \right)$ and to replacing the repair rate, $\mu_{Ccf}$, by an equivalent repair rate $\mu_{eq} \approx 1/MORT_{eq}$.
• Majority vote structure m out of n: after a lethal shock, at least m of the impacted items plus the CCF itself have to be repaired. The mean overall repair time, $MORT_{eq}$, is the minimum of the sum of the overall repair times of m of the items plus that of the CCF itself: for a 2 out of 3 system (two items to repair), this leads to $MORT_{eq} \approx \min_{i,\, j \neq i} \left( \frac{1}{\mu_{Ccf}} + \frac{1}{\mu_i} + \frac{1}{\mu_j} \right)$ and to replacing the repair rate, $\mu_{Ccf}$, by an equivalent repair rate $\mu_{eq} \approx 1/MORT_{eq}$.

When the impacted items are similar, the formulae above are simpler but, in the case of non-lethal shocks or when several repair teams are available, the equivalent repair rate, $\mu_{eq}$, is not easy to calculate: it has to be done on a case-by-case basis and the impact of the approximation may be difficult to estimate. When Boolean models are implemented, it is generally not possible to model items that are repaired after having been failed by the occurrence of a common cause failure and, in this case, Markov graphs or Petri nets should be used instead.
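The single-repair-team approximations above can be sketched as a small helper. The function name and the `items_to_repair` parameter are hypothetical; the sketch relies on the fact that the minimum, over all subsets of a given size, of the summed repair times is obtained with the shortest individual repair times.

```python
def equivalent_repair_rate(mu_ccf, mu_items, items_to_repair):
    """Approximate equivalent repair rate with a single repair team.

    mu_ccf: repair rate of the common cause itself.
    mu_items: repair rates of the impacted items.
    items_to_repair: number of items that must be repaired before the
        system is restored (all of them for a series structure, one for
        a parallel structure).
    """
    if not 1 <= items_to_repair <= len(mu_items):
        raise ValueError("items_to_repair must be between 1 and the number of items")
    # The minimum over all subsets of that size is reached with the
    # shortest individual mean repair times.
    shortest = sorted(1.0 / mu for mu in mu_items)[:items_to_repair]
    mort_eq = 1.0 / mu_ccf + sum(shortest)
    return 1.0 / mort_eq

# Illustrative repair rates (per hour), assumptions only.
rates = [0.5, 0.25, 0.2]
mu_series = equivalent_repair_rate(0.1, rates, len(rates))
mu_parallel = equivalent_repair_rate(0.1, rates, 1)
mu_two_items = equivalent_repair_rate(0.1, rates, 2)
print(mu_series, mu_parallel, mu_two_items)
```

As stated in the text, this only covers lethal shocks with one repair team; other cases need a dedicated Markov graph or Petri net.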

23.5 Lineage CCFs

According to Chap. 5, the lineage dependencies are linked to common causes impacting in the same way one or several probabilistic parameters (e.g. the failure rate) of items belonging to the same family (e.g. same design, manufacturer, provider). This impacts, for example, the probability of getting a bad batch or a good batch of components when provisioning similar items to be installed in an industrial system.

Such CCFs require uncertainty calculations (see Chap. 38) performed through Monte Carlo simulations (see Chap. 32). This is a general approach where the parameter of interest (e.g. the failure rate) is considered as a random variable governed by a probabilistic distribution. The items belonging to the same lineage are of good, medium or bad quality at the same time and this implies, for example, that their failure rates have the same value (i.e. are correlated). Therefore, during the Monte Carlo simulation, instead of firing as many random numbers as related items, a single random number is fired and used to simulate the failure rate for all the items of the family.

The impact of the lineage CCFs is to broaden the confidence interval of the probabilistic calculations undertaken for data uncertainty propagation (see Chap. 38). For Boolean models, this is analysed in Chap. 25.
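The effect of firing a single random number for a whole family can be illustrated with a minimal Monte Carlo sketch. Everything specific here is an assumption made for illustration: the lognormal uncertainty on the failure rate, its parameters, the mission time and the two-item parallel, non-repaired system.

```python
import math
import random

def interval_width(samples, low=0.05, high=0.95):
    """Width of the empirical [5%, 95%] interval of a list of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return ordered[int(high * (n - 1))] - ordered[int(low * (n - 1))]

def simulate(n_trials, lineage, seed=1):
    """Sample the failure probability of two parallel, non-repaired items."""
    rng = random.Random(seed)
    t = 1000.0                            # hypothetical mission time (h)
    mu, sigma = math.log(1.0e-5), 1.0     # hypothetical lognormal uncertainty
    results = []
    for _ in range(n_trials):
        lam1 = rng.lognormvariate(mu, sigma)
        # Lineage CCF: the same draw is reused for every item of the family.
        lam2 = lam1 if lineage else rng.lognormvariate(mu, sigma)
        p1 = 1.0 - math.exp(-lam1 * t)
        p2 = 1.0 - math.exp(-lam2 * t)
        results.append(p1 * p2)
    return results

width_independent = interval_width(simulate(20000, lineage=False))
width_lineage = interval_width(simulate(20000, lineage=True))
print(width_independent, width_lineage)
```

With these assumed inputs, the interval obtained with a shared draw should be noticeably wider than the one obtained with independent draws, which is the broadening effect described above.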

23.6 Use of Minimal Cut Sets

As analysed in the previous chapters, the minimal cut sets obtained from RBDs or FTs are very useful for qualitative analyses (Chap. 17), semi-quantitative analyses (Chap. 20) and even probabilistic calculations (Chap. 19). In particular, Chap. 17 explains how minimal cut sets can be used to identify common cause failure candidates which may have an impact on the probability of system failure.

The basic probabilistic calculation of a minimal cut set simply consists in multiplying the probabilities of occurrence of the events involved in the minimal cut set, because they are considered to be independent. Therefore, the dynamic aspects between CCFs and impacted events cannot be handled by such calculations. Nevertheless, any minimal cut set can be represented by a fault tree involving a logic AND gate. This implies that any common cause failure can be introduced in a minimal cut set in exactly the same way as has been done in Figs. 23.3 or 23.9. Therefore, anything explained above for FTs (and for RBDs), including the limitations, applies to individual minimal cut sets.

For example, let us consider one of the minimal cut sets, $P_1 \cap P_2$, identified for the pumping system in Chap. 17. Introducing a CCF into this minimal cut set leads to:

$P_1 \cap P_2 = (P_1^{Ind} \cup Ccf) \cap (P_2^{Ind} \cup Ccf) = Ccf \cup (P_1^{Ind} \cap P_2^{Ind})$    (23.9)

Two new minimal cut sets are obtained: $(P_1^{Ind} \cap P_2^{Ind})$ and $Ccf$. For performing probabilistic calculations, let $P_{P1}$, $P_{P2}$ and $P_{Ccf}$ be the probabilities of occurrence of $P_1^{Ind}$, $P_2^{Ind}$ and $Ccf$:

• $P_1^{Ind} \cap P_2^{Ind}$ does not present any calculation problem and can be calculated as usual (Chap. 20): $\Pr(P_1^{Ind} \cap P_2^{Ind}) = P_{P1} \cdot P_{P2}$.
• $Ccf$ presents the same calculation difficulties as those identified in the chapters above: $\Pr(Ccf)$ depends, in particular, on the CCF nature:
  – lethal shock: $\Pr(Ccf) = P_{Ccf}$
  – non-lethal shock: $\Pr(Ccf) = \gamma^2 \cdot P_{Ccf}$

When time-dependent probabilities are considered, the calculation of $\Pr(Ccf, t)$ is more difficult as it also depends on different assumptions:

• CCF repaired or not repaired;
• P1 and P2 actually repaired or only inhibited after being impacted by the CCF;
• one or several repair teams available;
• CCF repaired first or if it does not matter;
• etc.

All these different cases can be modelled by simple Markov graphs but this has to be done, on a case-by-case basis, by a cautious detailed analysis of the effects of the CCF. This is an application of the FT/RBD-driven Markov graphs (see Chap. 27). In addition, Chap. 31 provides some examples of Markov graphs modelling common cause failures and which could be used for this purpose.
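For the static (time-independent) case, the probabilities of the two minimal cut sets discussed in this section can be sketched as follows. The function name, the argument names and the numerical values are hypothetical; the time-dependent cases still require the Markov-based treatment mentioned above.

```python
def cut_set_probabilities(p_p1, p_p2, p_ccf, gamma=None):
    """Static probabilities of the two cut sets obtained after introducing a CCF.

    gamma=None models a lethal shock; otherwise gamma is the conditional
    probability of failure of each item given a non-lethal shock.
    """
    independent = p_p1 * p_p2
    ccf = p_ccf if gamma is None else gamma ** 2 * p_ccf
    return independent, ccf

# Illustrative probabilities only (assumptions).
ind_lethal, ccf_lethal = cut_set_probabilities(0.01, 0.02, 0.001)
ind_nonlethal, ccf_nonlethal = cut_set_probabilities(0.01, 0.02, 0.001, gamma=0.5)

# The two lethal-shock cut sets share no event, so their union can be
# evaluated exactly under the independence assumption.
p_union_lethal = 1.0 - (1.0 - ind_lethal) * (1.0 - ccf_lethal)
print(ind_lethal, ccf_lethal, ccf_nonlethal, p_union_lethal)
```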


23.7 Associated Exercises

Three exercises related to Chap. 23 are proposed in Chap. 29. They are shared with Chaps. 20, 22 and 31:

• Exercise 20.4: identify the minimal cut sets related to an overpressure protection system which are subject to CCFs and calculate the impact of CCFs by using a beta factor of 5%.
• Exercise 22.3: model the CCFs of an overpressure protection system with a fault tree and calculate its PFDavg (average unavailability), PFH (average failure frequency) and unreliability (probability of failure).
• Exercise 31.2: introduce the CCF on the pumps of a pumping system into a Markov graph.

References

Atwood CL (1986) The binomial failure rate common cause model. Technometrics 28(2). American Statistical Association and American Society for Quality, USA

IEC 61508-6 Ed. 2.0 (2010) Functional safety of electrical/electronic/programmable electronic safety-related systems, Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3. International Electrotechnical Commission (IEC), Geneva, Switzerland

Chapter 24

Critical States and Importance Factors

24.1 Critical and Non-critical States

24.1.1 Minterms and Exclusive and Inclusive Cofactors

24.1.1.1 Minterm-Based Representations

The notions of critical and non-critical states are core concepts that help to understand the meaning of many importance factors. The example of Boolean formulae involving the states of three components A, B and C may be helpful to clarify the following development. Their constituent components being binary (up and down states), each of those Boolean functions involves $2^3 = 8$ different states:

$\pi = \{A \cdot B \cdot C,\ \bar{A} \cdot B \cdot C,\ A \cdot \bar{B} \cdot C,\ A \cdot B \cdot \bar{C},\ \bar{A} \cdot \bar{B} \cdot C,\ \bar{A} \cdot B \cdot \bar{C},\ A \cdot \bar{B} \cdot \bar{C},\ \bar{A} \cdot \bar{B} \cdot \bar{C}\}$    (24.1)

This leads to the definition of the concept of minterms as follows:

Minterm: product of all the Boolean variables involved in a Boolean function, each in its direct or complementary form.

It has to be noted that, the minterms being mutually exclusive, the minterm-based representation of a Boolean function is very effective from a probabilistic calculation point of view. In the above example, the set $\pi$ contains eight different minterms corresponding to eight different states, and any Boolean function $S$ of three variables implies a partition of the above states into two classes (up and down states) in order to define $S$ and $\bar{S}$ respectively.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_24


Monotone Boolean functions. Examples $S_1$ and $S_2$: These Boolean functions involve each variable in only one form (direct or opposite).

– $S_1 = A + B + C$ can be written as follows:
  $S_1 = A \cdot B \cdot C + \bar{A} \cdot B \cdot C + A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot C + \bar{A} \cdot B \cdot \bar{C} + A \cdot \bar{B} \cdot \bar{C}$ (i.e. the union of the minterms belonging to $\pi$, with at least A, B or C in up state)
  $\bar{S}_1 = \bar{A} \cdot \bar{B} \cdot \bar{C}$ (i.e. the minterm belonging to $\pi$, with A, B and C in down state)
– $S_2 = A \cdot B + A \cdot C + B \cdot C$ can be written as follows:
  $S_2 = A \cdot B \cdot C + \bar{A} \cdot B \cdot C + A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C}$ (i.e. the union of the minterms belonging to $\pi$, with at least two items in up state)
  $\bar{S}_2 = \bar{A} \cdot \bar{B} \cdot \bar{C} + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot C$ (i.e. the union of the minterms belonging to $\pi$, with at least two items in down state).

Non-monotone Boolean functions. Example $S_3$: This Boolean function involves a biform variable ($A$ and its opposite $\bar{A}$).

– $S_3 = A \cdot B + \bar{A} \cdot \bar{C}$ can be written as follows:
  $S_3 = A \cdot B \cdot C + A \cdot B \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot \bar{C}$ (i.e. the union of the minterms belonging to $\pi$, satisfying $S_3$)
  $\bar{S}_3 = A \cdot \bar{B} \cdot C + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot C + \bar{A} \cdot \bar{B} \cdot C$ (i.e. the union of the minterms belonging to $\pi$, satisfying $\bar{S}_3$).

A glance at the two above minterm-based representations shows that $S_3$ is actually non-monotone: when in the up state $\bar{A} \cdot \bar{B} \cdot \bar{C}$, the repair of component C leads to the down state $\bar{A} \cdot \bar{B} \cdot C$, and when in the down state $\bar{A} \cdot \bar{B} \cdot C$, the failure of component C leads to the up state $\bar{A} \cdot \bar{B} \cdot \bar{C}$.
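The minterm-based representations above can be reproduced by brute-force enumeration. The sketch below is illustrative only: the helper names are hypothetical, `1` stands for an up state, `~` marks a complemented variable, and $S_3$ is taken as $A \cdot B + \bar{A} \cdot \bar{C}$, which is the reading consistent with the worked examples of this chapter.

```python
from itertools import product

NAMES = ("A", "B", "C")

def s1(a, b, c):
    return bool(a or b or c)

def s2(a, b, c):
    return bool((a and b) or (a and c) or (b and c))

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def minterms(func):
    """List the states (1 = up, 0 = down) in which the system is up."""
    return [bits for bits in product((1, 0), repeat=3) if func(*bits)]

def show(bits):
    """Write a minterm with '~' marking a complemented (down) variable."""
    return "·".join(n if b else "~" + n for n, b in zip(NAMES, bits))

def is_monotone(func):
    """True if repairing a single component never moves the system from up to down."""
    for bits in product((0, 1), repeat=3):
        for k in range(3):
            if bits[k] == 0:
                repaired = bits[:k] + (1,) + bits[k + 1:]
                if func(*bits) and not func(*repaired):
                    return False
    return True

for name, func in (("S1", s1), ("S2", s2), ("S3", s3)):
    print(name, "=", " + ".join(show(m) for m in minterms(func)),
          "| monotone:", is_monotone(func))
```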


Minterm containing a given variable $C_k$: As $C_k \cdot C_k = C_k$ and $C_k \cdot \bar{C}_k = 0$, the Boolean formula $C_k \cdot S$ retains all the minterms of $S$ containing $C_k$. Example:

– $S_2 = A \cdot B \cdot C + \bar{A} \cdot B \cdot C + A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C}$
– $A \cdot S_2 = A \cdot B \cdot C + A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C}$.

Then the minterms of $C_k \cdot S$ are the minterms of $S$ containing $C_k$ and, if $C_k \cdot X$ is a given minterm of $S$, it is also a minterm of $C_k \cdot S$.

24.1.1.2 Conditional Boolean Functions from Minterm-based Representations

The minterm-based representation of Boolean formulae is useful as it allows conditional Boolean functions to be defined easily.

Conditional function $(S|C_k)$: Boolean function $S$ given that the variable $C_k$ is true (i.e. equal to 1). As $\bar{C}_k = 0$, the minterms of $S$ containing $\bar{C}_k$ disappear and $(S|C_k)$ is obtained from the union of the minterms containing $C_k$ (i.e. from $C_k \cdot S$). Applying $C_k = 1$ to these minterms leads to:

$S \cdot C_k = C_k \cdot (S|C_k)$    (24.2)

Examples:

• Function $S_1$
  – $(S_1|A) = B \cdot C + \bar{B} \cdot C + B \cdot \bar{C} + \bar{B} \cdot \bar{C} = 1$ (certain state);
  – $(\bar{S}_1|A) = 0$ (impossible state);
  – $(S_1|\bar{A}) = B \cdot C + \bar{B} \cdot C + B \cdot \bar{C} = B + C$;
  – $(\bar{S}_1|\bar{A}) = \bar{B} \cdot \bar{C}$.
• Function $S_2$
  – $(S_2|A) = B \cdot C + \bar{B} \cdot C + B \cdot \bar{C} = B + C$;
  – $(\bar{S}_2|A) = \bar{B} \cdot \bar{C}$;
  – $(S_2|\bar{A}) = B \cdot C$;
  – $(\bar{S}_2|\bar{A}) = \bar{B} \cdot C + B \cdot \bar{C} + \bar{B} \cdot \bar{C} = \bar{B} + \bar{C}$.
• Function $S_3$
  – $(S_3|A) = B \cdot C + B \cdot \bar{C} = B$;
  – $(\bar{S}_3|A) = \bar{B}$;
  – $(S_3|\bar{A}) = B \cdot \bar{C} + \bar{B} \cdot \bar{C} = \bar{C}$;
  – $(\bar{S}_3|\bar{A}) = C$.
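A conditional function with regards to A can be obtained by simply fixing the value of A and enumerating the remaining variables. In this hedged sketch, a conditional function is represented by the set of (B, C) states that satisfy it; the helper names are hypothetical and $S_3$ is again assumed to be $A \cdot B + \bar{A} \cdot \bar{C}$.

```python
from itertools import product

def s2(a, b, c):
    return bool((a and b) or (a and c) or (b and c))

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def complement(func):
    """Return the Boolean function describing the down states of func."""
    return lambda a, b, c: not func(a, b, c)

def conditional(func, value_of_a):
    """Return (S | A = value) as the set of (B, C) states that satisfy it."""
    return {(b, c) for b, c in product((0, 1), repeat=2) if func(value_of_a, b, c)}

s2_given_a = conditional(s2, 1)                          # B + C
s2_bar_given_not_a = conditional(complement(s2), 0)      # not B + not C
s3_given_a = conditional(s3, 1)                          # B
s3_given_not_a = conditional(s3, 0)                      # not C
print(s2_given_a, s2_bar_given_not_a, s3_given_a, s3_given_not_a)
```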

24.1.1.3 Exclusive Cofactor

For monotone and non-monotone Boolean functions (related respectively to coherent and non-coherent RBDs/FTs), the above conditional Boolean functions lead to the definition of the exclusive cofactor (Dutuit et al. 2000), as follows:

Exclusive cofactor at logic function level, $S^{\#}_{C_k}$: The exclusive cofactor $S^{\#}_{C_k}$ of function $S$ with regards to its variable $C_k$ is given by the product of the conditional function $(S|C_k)$ by the conditional function $(\bar{S}|\bar{C}_k)$:

$S^{\#}_{C_k} = (S|C_k) \cdot (\bar{S}|\bar{C}_k)$    (24.3)

As shown in Table 24.1, the above definition leads to the definition of three other exclusive cofactors. According to Table 24.1 this implies:

$S^{\#}_{C_k} = \bar{S}^{\#}_{\bar{C}_k}$ and $\bar{S}^{\#}_{C_k} = S^{\#}_{\bar{C}_k}$    (24.4)

These exclusive cofactors are defined in any case ($S$ monotone or not with regards to $C_k$) and are independent of $C_k$ and $\bar{C}_k$. When function $S$ is monotone, $S^{\#}_{\bar{C}_k}$ and $\bar{S}^{\#}_{C_k}$ are equal to zero:

$\bar{S}^{\#}_{C_k} = S^{\#}_{\bar{C}_k} = 0$    (24.5)

Exclusive cofactor at system level, $\mathrm{S}^{\#}_{\mathrm{C}_k}$: The exclusive cofactor $\mathrm{S}^{\#}_{\mathrm{C}_k}$ of system S with regards to its component $\mathrm{C}_k$ is given by the union of the exclusive cofactors $S^{\#}_{C_k}$ and $S^{\#}_{\bar{C}_k}$:

$\mathrm{S}^{\#}_{\mathrm{C}_k} = S^{\#}_{C_k} + S^{\#}_{\bar{C}_k}$    (24.6)

Due to the equalities shown in Formula (24.4), the exclusive cofactor $\mathrm{S}^{\#}_{\mathrm{C}_k}$ is also equal to:

$\mathrm{S}^{\#}_{\mathrm{C}_k} = \bar{S}^{\#}_{\bar{C}_k} + \bar{S}^{\#}_{C_k}$    (24.7)

Table 24.1 The four exclusive cofactors at logic function level


Therefore, there is only one exclusive cofactor at system level whereas there are four exclusive cofactors at logic function level. It has to be noticed that this cofactor is written with upright letters (item) instead of italic letters (item state).

The exclusive cofactors $\mathrm{S}^{\#}_{\mathrm{C}_k}$ and $S^{\#}_{C_k}$ can be used to represent the critical states of a system S with regards to the state of its component $\mathrm{C}_k$ (see Sect. 24.1.2). They involve the calculation of conditional functions which are easy to handle with Boolean models when BDDs are implemented. Therefore, they are very effective to deal with critical states when FTs or BDDs are used.

To end the current subsection about exclusive cofactors, it has to be noted that the exclusive cofactor has been used with a different formulation by other authors: Andrews and Beeson (2003), Zaitseva and Levashenko (2013), Zaitseva et al. (2015) and Aliee et al. (2017).

Examples: Applying Formula (24.3) (i.e. Table 24.1) successively to $S_1$, $S_2$ and $S_3$ with regards to $C_k = A$ gives:

– $S^{\#}_{1,A} = \bar{S}^{\#}_{1,\bar{A}} = (S_1|A) \cdot (\bar{S}_1|\bar{A}) = 1 \cdot \bar{B} \cdot \bar{C} = \bar{B} \cdot \bar{C} \equiv A \cdot (\bar{B} \cdot \bar{C}) + \bar{A} \cdot (\bar{B} \cdot \bar{C})$
– $S^{\#}_{2,A} = \bar{S}^{\#}_{2,\bar{A}} = (S_2|A) \cdot (\bar{S}_2|\bar{A}) = (C + B \cdot \bar{C}) \cdot (\bar{B} \cdot C + \bar{C}) = B \cdot \bar{C} + \bar{B} \cdot C$
– $S^{\#}_{3,A} = \bar{S}^{\#}_{3,\bar{A}} = (S_3|A) \cdot (\bar{S}_3|\bar{A}) = B \cdot C$.

Doing the same successively to $S_1$, $S_2$ and $S_3$ with regards to $C_k = \bar{A}$ gives:

– $S^{\#}_{1,\bar{A}} = \bar{S}^{\#}_{1,A} = (S_1|\bar{A}) \cdot (\bar{S}_1|A) = (B \cdot C + \bar{B} \cdot C + B \cdot \bar{C}) \cdot 0 = 0$
– $\bar{S}^{\#}_{2,A} = S^{\#}_{2,\bar{A}} = (\bar{S}_2|A) \cdot (S_2|\bar{A}) = (\bar{B} \cdot \bar{C}) \cdot (B \cdot C) = 0$
– $S^{\#}_{3,\bar{A}} = \bar{S}^{\#}_{3,A} = (S_3|\bar{A}) \cdot (\bar{S}_3|A) = \bar{B} \cdot \bar{C} \neq 0$.

These results show that $S_1$ and $S_2$ are monotone with regards to the states of A, contrary to $S_3$ which is non-monotone with regards to the states of A. Continuing with $S_3$ with regards to $C_k = \bar{B}$ and $C_k = C$ gives:

– $S^{\#}_{3,\bar{B}} = \bar{S}^{\#}_{3,B} = (\bar{S}_3|B) \cdot (S_3|\bar{B}) = (\bar{A} \cdot C) \cdot (\bar{A} \cdot \bar{C}) = 0$
– $S^{\#}_{3,C} = \bar{S}^{\#}_{3,\bar{C}} = (\bar{S}_3|\bar{C}) \cdot (S_3|C) = (A \cdot \bar{B}) \cdot (A \cdot B) = 0$.

These results show that $S_3$ is monotone with regards to the states of B and of C. The meaning of the above results is graphically explained in Sect. 24.1.2.
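The exclusive cofactors of the examples can be cross-checked by enumeration, following Formula (24.3) as the product of $(S|C_k)$ by the complement function conditioned on the opposite state of $C_k$. This is a sketch under assumptions: cofactors are represented as sets of states of the two other variables, the helper names are hypothetical, and $S_3$ is read as $A \cdot B + \bar{A} \cdot \bar{C}$.

```python
from itertools import product

def s1(a, b, c):
    return bool(a or b or c)

def s2(a, b, c):
    return bool((a and b) or (a and c) or (b and c))

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def restrict(func, k, value, negate=False):
    """States of the two other variables for which func (or its complement,
    if negate is True) holds when variable k is fixed to value."""
    states = set()
    for rest in product((0, 1), repeat=2):
        bits = rest[:k] + (value,) + rest[k:]
        if bool(func(*bits)) != negate:
            states.add(rest)
    return states

def exclusive_cofactor(func, k, value=1):
    """(S | Ck = value) intersected with (not S | Ck = 1 - value)."""
    return restrict(func, k, value) & restrict(func, k, 1 - value, negate=True)

for name, func in (("S1", s1), ("S2", s2), ("S3", s3)):
    print(name, "wrt A:", exclusive_cofactor(func, 0, 1),
          "wrt not A:", exclusive_cofactor(func, 0, 0))
```

An empty set corresponds to a cofactor equal to zero, which is how the monotone behaviour of $S_1$ and $S_2$ with regards to A shows up in the output.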

24.1.1.4 Inclusive Cofactor

In the same way as the exclusive cofactor has been introduced to deal with critical states, the inclusive cofactor can be introduced to deal with non-critical states (see Sect. 24.1.3). Again, this is done by using the conditional Boolean functions identified in Sect. 24.1.1.2.


Table 24.2 The four inclusive cofactors at function level

Inclusive cofactor at logic function level, $S^{*}_{C_k}$: The inclusive cofactor $S^{*}_{C_k}$ of function $S$ (monotone or not) with regards to its variable $C_k$ (direct or complemented) is given by the product of the conditional function $(S|C_k)$ by the conditional function $(S|\bar{C}_k)$:

$S^{*}_{C_k} = (S|C_k) \cdot (S|\bar{C}_k)$    (24.8)

As shown in Table 24.2, the above definition leads to the definition of three other inclusive cofactors. According to Table 24.2 this implies:

$S^{*}_{C_k} = S^{*}_{\bar{C}_k}$ and $\bar{S}^{*}_{C_k} = \bar{S}^{*}_{\bar{C}_k}$    (24.9)

Like the exclusive cofactors, the inclusive cofactors are independent of both $C_k$ and $\bar{C}_k$. The inclusive cofactors can be used to represent the non-critical states of a system S with regards to the state of its component $\mathrm{C}_k$ (see Sect. 24.1.3). Again, they involve the calculation of conditional functions which are easy to handle with Boolean models when BDDs are implemented. Therefore, they are very effective to deal with non-critical states when FTs or BDDs are used.

Examples: Applying Formula (24.8) (i.e. Table 24.2) successively to $S_1$, $S_2$ and $S_3$ with regards to $C_k = A$ gives:

– $S^{*}_{1,A} = S^{*}_{1,\bar{A}} = (S_1|A) \cdot (S_1|\bar{A}) = 1 \cdot (C + B \cdot \bar{C}) = B + C$
– $S^{*}_{2,A} = S^{*}_{2,\bar{A}} = (S_2|A) \cdot (S_2|\bar{A}) = (C + B \cdot \bar{C}) \cdot (B \cdot C) = B \cdot C$
– $S^{*}_{3,A} = S^{*}_{3,\bar{A}} = (S_3|A) \cdot (S_3|\bar{A}) = B \cdot \bar{C}$.
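The inclusive cofactor examples can be checked in the same enumerative way, as the intersection of the two conditional functions of Formula (24.8). The sketch below uses hypothetical helper names, represents a cofactor with regards to A by the set of (B, C) states satisfying it, and assumes $S_3 = A \cdot B + \bar{A} \cdot \bar{C}$.

```python
from itertools import product

def s1(a, b, c):
    return bool(a or b or c)

def s2(a, b, c):
    return bool((a and b) or (a and c) or (b and c))

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def conditional(func, k, value):
    """States of the two other variables for which func holds when
    variable k is fixed to value."""
    states = set()
    for rest in product((0, 1), repeat=2):
        bits = rest[:k] + (value,) + rest[k:]
        if func(*bits):
            states.add(rest)
    return states

def inclusive_cofactor(func, k):
    """(S | Ck) intersected with (S | not Ck): the system is up whatever Ck."""
    return conditional(func, k, 1) & conditional(func, k, 0)

incl_s1_a = inclusive_cofactor(s1, 0)   # B + C
incl_s2_a = inclusive_cofactor(s2, 0)   # B·C
incl_s3_a = inclusive_cofactor(s3, 0)   # B·(not C)
print(incl_s1_a, incl_s2_a, incl_s3_a)
```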

24.1.2 Critical States

The notion of critical states (Vesely 1970; Pagès and Gondran 1986; Vaurio 2010; Kovalenko et al. 1997), also known as boundary states (Singh 1981), is defined and illustrated hereafter in three different ways.

24.1.2.1 Definitions

• Literal definition: Let S be a system and $\mathrm{C}_k$ one of its components. S is said to be in a critical state with regards to $\mathrm{C}_k$ if any change of the state (up to down or down to up) of $\mathrm{C}_k$ causes a change (similar or opposite) of the state of S. This defines two critical states at system level:
  – $\mathrm{crit}(\mathrm{S}, C_k)$: system S is in critical state with regards to variable $C_k$,
  – $\mathrm{crit}(\mathrm{S}, \bar{C}_k)$: system S is in critical state with regards to variable $\bar{C}_k$.

• Minterm-based definition: Let $S$ (resp. $\bar{S}$) be the Boolean function indicating that system S is in an up state (resp. down state) and $C_k$ (resp. $\bar{C}_k$) be the variable representing the up state (resp. down state) of component $\mathrm{C}_k$. Let $X$ be a product of variables of $S$ containing neither $C_k$ nor $\bar{C}_k$.
  Two critical up states are defined as follows:
  – $\mathrm{crit}(S, C_k) = \{$minterm $C_k \cdot X$ such that $C_k \cdot X \in S$ and $\bar{C}_k \cdot X \in \bar{S}\}$: Boolean function $S$ critical with regards to the Boolean variable $C_k$
  – $\mathrm{crit}(S, \bar{C}_k) = \{$minterm $\bar{C}_k \cdot X$ such that $\bar{C}_k \cdot X \in S$ and $C_k \cdot X \in \bar{S}\}$: Boolean function $S$ critical with regards to the Boolean variable $\bar{C}_k$.
  Two critical down states are defined as follows:
  – $\mathrm{crit}(\bar{S}, C_k) = \{$minterm $C_k \cdot X$ such that $C_k \cdot X \in \bar{S}$ and $\bar{C}_k \cdot X \in S\}$: Boolean function $\bar{S}$ critical with regards to the Boolean variable $C_k$
  – $\mathrm{crit}(\bar{S}, \bar{C}_k) = \{$minterm $\bar{C}_k \cdot X$ such that $\bar{C}_k \cdot X \in \bar{S}$ and $C_k \cdot X \in S\}$: Boolean function $\bar{S}$ critical with regards to the Boolean variable $\bar{C}_k$.

• Exclusive cofactor-based definition: This third definition emanates from the minterm-based definition and it is illustrated with regards to $\mathrm{crit}(S, C_k)$ as follows:
  – As a given minterm $C_k \cdot X$ of $S$ is also a minterm of $S \cdot C_k$, which is equal to $C_k \cdot (S|C_k)$ (see Sects. 24.1.1.1 and 24.1.1.2), it can be claimed that $X$ belongs to the conditional Boolean function $(S|C_k)$.
    Example: $A \cdot (\bar{B} \cdot C)$ is a critical minterm of $S_2$ with $X = \bar{B} \cdot C$. Then, $A \cdot S_2 = A \cdot B \cdot C + A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C}$ and $A \cdot (\bar{B} \cdot C)$ is a minterm of $A \cdot S_2$. And $(\bar{B} \cdot C)$ belongs to $(S_2|A) = B + C = B \cdot C + \bar{B} \cdot C + B \cdot \bar{C}$.
  – In the same way, one finds that $X$ also belongs to $(\bar{S}|\bar{C}_k)$ and thus to $(S|C_k) \cdot (\bar{S}|\bar{C}_k)$, i.e. to $S^{\#}_{C_k}$.
    Example: $(\bar{B} \cdot C)$ belongs to $S^{\#}_{2,A} = \bar{S}^{\#}_{2,\bar{A}} = (S_2|A) \cdot (\bar{S}_2|\bar{A}) = B \cdot \bar{C} + \bar{B} \cdot C$.

The above result holds for any minterm $X_i$ related to $\mathrm{crit}(S, C_k)$. Therefore $S^{\#}_{C_k}$ can be considered as the union of the critical states (up and down) of system S and can be expressed as follows:

$S^{\#}_{C_k} = \mathrm{crit}(S, C_k) + \mathrm{crit}(\bar{S}, \bar{C}_k)$    (24.10)

Examples:

$S^{\#}_{2,A} = B \cdot \bar{C} + \bar{B} \cdot C = A \cdot (B \cdot \bar{C} + \bar{B} \cdot C) + \bar{A} \cdot (B \cdot \bar{C} + \bar{B} \cdot C) = \mathrm{crit}(S_2, A) + \mathrm{crit}(\bar{S}_2, \bar{A})$

$S^{\#}_{3,A} = B \cdot C = A \cdot (B \cdot C) + \bar{A} \cdot (B \cdot C) = \mathrm{crit}(S_3, A) + \mathrm{crit}(\bar{S}_3, \bar{A})$

Formula (24.10) can be written:

$S^{\#}_{C_k} = \bar{S}^{\#}_{\bar{C}_k} = \sum_i C_k \cdot X_i + \sum_i \bar{C}_k \cdot X_i = \sum_i X_i$    (24.11)

Moreover, an interesting property of any Boolean function $S$ is illustrated hereafter with respect to any of its variables:

$S^{\#}_{C_k} = \bar{S}^{\#}_{\bar{C}_k} = \mathrm{crit}(S, C_k)|C_k = \mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k$    (24.12)

$S^{\#}_{\bar{C}_k} = \bar{S}^{\#}_{C_k} = \mathrm{crit}(S, \bar{C}_k)|\bar{C}_k = \mathrm{crit}(\bar{S}, C_k)|C_k$    (24.13)

The two above equivalences lead to the exclusive cofactor-based definition of critical states of system S with regards to its component $\mathrm{C}_k$.

The minterm-based and exclusive cofactor-based definitions lead to four critical states defined at function level whereas the literal definition leads to only two critical states at system level. This is because each of the critical states defined at system level encompasses two critical states defined at function level:

$\mathrm{crit}(\mathrm{S}, C_k) = \mathrm{crit}(S, C_k) + \mathrm{crit}(\bar{S}, C_k)$    (24.14)

$\mathrm{crit}(\mathrm{S}, \bar{C}_k) = \mathrm{crit}(S, \bar{C}_k) + \mathrm{crit}(\bar{S}, \bar{C}_k)$    (24.15)

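Therefore, and according to Table 24.3:

$\mathrm{crit}(\mathrm{S}, C_k) = C_k \cdot S^{\#}_{C_k} + C_k \cdot \bar{S}^{\#}_{C_k} = C_k \cdot S^{\#}_{C_k} + C_k \cdot S^{\#}_{\bar{C}_k} = C_k \cdot \mathrm{S}^{\#}_{\mathrm{C}_k}$    (24.16)

$\mathrm{crit}(\mathrm{S}, \bar{C}_k) = \bar{C}_k \cdot S^{\#}_{\bar{C}_k} + \bar{C}_k \cdot \bar{S}^{\#}_{\bar{C}_k} = \bar{C}_k \cdot \bar{S}^{\#}_{C_k} + \bar{C}_k \cdot \bar{S}^{\#}_{\bar{C}_k} = \bar{C}_k \cdot \mathrm{S}^{\#}_{\mathrm{C}_k}$    (24.17)

Table 24.3 The four critical states at function level

|             | $S$                                                                  | $\bar{S}$                                                                        |
| $C_k$       | $\mathrm{crit}(S, C_k) = C_k \cdot S^{\#}_{C_k}$                     | $\mathrm{crit}(\bar{S}, C_k) = C_k \cdot \bar{S}^{\#}_{C_k}$                     |
| $\bar{C}_k$ | $\mathrm{crit}(S, \bar{C}_k) = \bar{C}_k \cdot S^{\#}_{\bar{C}_k}$   | $\mathrm{crit}(\bar{S}, \bar{C}_k) = \bar{C}_k \cdot \bar{S}^{\#}_{\bar{C}_k}$   |

It has to be noted that, when dealing with monotone functions (coherent systems), $\bar{S}^{\#}_{C_k} = S^{\#}_{\bar{C}_k} = 0$, and then:

$\mathrm{crit}(\mathrm{S}, C_k) = C_k \cdot S^{\#}_{C_k} = \mathrm{crit}(S, C_k)$    (24.18)

$\mathrm{crit}(\mathrm{S}, \bar{C}_k) = \bar{C}_k \cdot \bar{S}^{\#}_{\bar{C}_k} = \mathrm{crit}(\bar{S}, \bar{C}_k)$    (24.19)

24.1.2.2 Examples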

Several examples are given hereafter. It has to be noted that some of the above formulae, applied to coherent or non-coherent systems, can lead to null results. As mentioned previously, this means that the critical states concerned by those formulae do not exist.

Coherent systems $S_1$ and $S_2$

Critical states of $S_1$ with regards to component A:

• critical up states:
  – $\mathrm{crit}(S_1, A) = A \cdot (\bar{B} \cdot \bar{C})$
  – $\mathrm{crit}(S_1, \bar{A}) = 0$
• critical down states:
  – $\mathrm{crit}(\bar{S}_1, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot \bar{C})$
  – $\mathrm{crit}(\bar{S}_1, A) = 0$.

Then:
$\mathrm{crit}(\mathrm{S}_1, A) = \mathrm{crit}(S_1, A) = A \cdot (\bar{B} \cdot \bar{C})$
$\mathrm{crit}(\mathrm{S}_1, \bar{A}) = \mathrm{crit}(\bar{S}_1, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot \bar{C})$.

Critical states of $S_2$ with regards to component A:

• critical up states:
  – $\mathrm{crit}(S_2, A) = A \cdot \bar{B} \cdot C + A \cdot B \cdot \bar{C} = A \cdot (\bar{B} \cdot C + B \cdot \bar{C})$
  – $\mathrm{crit}(S_2, \bar{A}) = 0$
• critical down states:
  – $\mathrm{crit}(\bar{S}_2, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot C + B \cdot \bar{C})$
  – $\mathrm{crit}(\bar{S}_2, A) = 0$.

Then:
$\mathrm{crit}(\mathrm{S}_2, A) = \mathrm{crit}(S_2, A) = A \cdot (\bar{B} \cdot C + B \cdot \bar{C})$
$\mathrm{crit}(\mathrm{S}_2, \bar{A}) = \mathrm{crit}(\bar{S}_2, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot C + B \cdot \bar{C})$.

It has to be noted that, in accordance with the monotone status of $S_1$ and $S_2$, the critical states $\mathrm{crit}(S_i, \bar{A})$ and $\mathrm{crit}(\bar{S}_i, A)$, with $i = 1, 2$, do not exist (i.e. are equal to 0). Similar expressions can be obtained for the up and down critical states of the same systems with regards respectively to components B and C.

Non-coherent system $S_3$

Critical states of $S_3$ with regards to component A:

• critical up states:
  – $\mathrm{crit}(S_3, A) = A \cdot (B \cdot C)$
  – $\mathrm{crit}(S_3, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot \bar{C})$
• critical down states:
  – $\mathrm{crit}(\bar{S}_3, A) = A \cdot (\bar{B} \cdot \bar{C})$
  – $\mathrm{crit}(\bar{S}_3, \bar{A}) = \bar{A} \cdot (B \cdot C)$.

Then:
$\mathrm{crit}(\mathrm{S}_3, A) = \mathrm{crit}(S_3, A) + \mathrm{crit}(\bar{S}_3, A) = A \cdot (B \cdot C + \bar{B} \cdot \bar{C})$
$\mathrm{crit}(\mathrm{S}_3, \bar{A}) = \mathrm{crit}(S_3, \bar{A}) + \mathrm{crit}(\bar{S}_3, \bar{A}) = \bar{A} \cdot (\bar{B} \cdot \bar{C} + B \cdot C)$.

Critical states of $S_3$ with regards to component B:

• critical up states:
  – $\mathrm{crit}(S_3, B) = B \cdot (A \cdot C + A \cdot \bar{C}) = B \cdot (A)$
  – $\mathrm{crit}(S_3, \bar{B}) = \bar{B} \cdot (0) = 0$
• critical down states:
  – $\mathrm{crit}(\bar{S}_3, B) = B \cdot (0) = 0$
  – $\mathrm{crit}(\bar{S}_3, \bar{B}) = \bar{B} \cdot (A \cdot C + A \cdot \bar{C}) = \bar{B} \cdot (A)$.

Then:
$\mathrm{crit}(\mathrm{S}_3, B) = \mathrm{crit}(S_3, B) = B \cdot A$
$\mathrm{crit}(\mathrm{S}_3, \bar{B}) = \mathrm{crit}(\bar{S}_3, \bar{B}) = \bar{B} \cdot A$.

Critical states of $S_3$ with regards to component C:

• critical up states:
  – $\mathrm{crit}(S_3, C) = C \cdot (0) = 0$
  – $\mathrm{crit}(S_3, \bar{C}) = \bar{C} \cdot (\bar{A} \cdot B + \bar{A} \cdot \bar{B}) = \bar{C} \cdot (\bar{A})$
• critical down states:
  – $\mathrm{crit}(\bar{S}_3, C) = C \cdot (\bar{A} \cdot B + \bar{A} \cdot \bar{B}) = C \cdot (\bar{A})$
  – $\mathrm{crit}(\bar{S}_3, \bar{C}) = \bar{C} \cdot (0) = 0$.

Then:
$\mathrm{crit}(\mathrm{S}_3, \bar{C}) = \mathrm{crit}(S_3, \bar{C}) = \bar{C} \cdot \bar{A}$
$\mathrm{crit}(\mathrm{S}_3, C) = \mathrm{crit}(\bar{S}_3, C) = C \cdot \bar{A}$.

It has to be noted that, in accordance with the non-monotone status of $S_3$, the critical states $\mathrm{crit}(S_3, \bar{A})$ and $\mathrm{crit}(\bar{S}_3, A)$ do exist (i.e. they are not equal to zero). In addition:

• $\mathrm{crit}(S_3, \bar{B}) = \mathrm{crit}(\bar{S}_3, B) = 0$, which indicates a coherent behaviour of system $S_3$ with regards to component B;
• $\mathrm{crit}(S_3, C) = \mathrm{crit}(\bar{S}_3, \bar{C}) = 0$, which also indicates a coherent behaviour of system $S_3$ with regards to component C but in the opposite way: the system cannot be in up state if C is in up state and vice versa.
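The four function-level critical states can also be obtained directly from the literal definition, by flipping the state of one component in every state of the system and checking whether the system state changes. The following sketch is only illustrative: names are hypothetical, states are (A, B, C) tuples with `1` for up, and $S_3$ is assumed to be $A \cdot B + \bar{A} \cdot \bar{C}$.

```python
from itertools import product

def s1(a, b, c):
    return bool(a or b or c)

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def critical_states(func, k):
    """Sort the critical states with regards to component k into four classes.

    Keys are (system state, state of component k), e.g. ("up", 1) holds the
    up states of the system that are critical with component k up.
    """
    classes = {("up", 1): [], ("up", 0): [], ("down", 1): [], ("down", 0): []}
    for bits in product((0, 1), repeat=3):
        flipped = bits[:k] + (1 - bits[k],) + bits[k + 1:]
        if func(*bits) != func(*flipped):
            key = ("up" if func(*bits) else "down", bits[k])
            classes[key].append(bits)
    return classes

crit_s1_a = critical_states(s1, 0)
crit_s3_a = critical_states(s3, 0)
crit_s3_c = critical_states(s3, 2)
print(crit_s1_a)
print(crit_s3_a)
print(crit_s3_c)
```

Empty lists correspond to critical states that do not exist, which is the case of two of the four classes for a coherent system such as $S_1$.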

Even if the first feature has been already unveiled with the above developments on systems S1 , S2 and S3 , it deserves to be explained as follows: “When dealing with a coherent system, the system failure can only be caused by a component failure. Hence a component in a coherent system can only be failure-critical. However, when dealing with a non-coherent system, the system failure can be caused, not only by a component failure, but also by a component repair. Thus, a component in a non-coherent system can be failure-critical or repair-critical. These two criticalities must be considered separately because any component can exist in only one state at any time” (Andrews and Beeson 2003).

24.1.3 Non-critical States

As for critical states, several formulations can be used to define the non-critical states of a system with regards to each of its components.

• Literal definition: Let S be a system and $\mathrm{C}_k$ one of its components. S is said to be in a non-critical state with regards to $\mathrm{C}_k$ if any change of the state (up to down or down to up) of $\mathrm{C}_k$ causes no change (i.e. up to up or down to down) of the state of S.

• Minterm-based definition: Let $S$ (resp. $\bar{S}$) be the Boolean function indicating that system S is in an up state (resp. down state) and $C_k$ (resp. $\bar{C}_k$) be the variable representing the up state (resp. down state) of component $\mathrm{C}_k$. Let $X$ be a product of variables of $S$ containing neither $C_k$ nor $\bar{C}_k$.
  This leads to the definition of two non-critical up states:
  – $\overline{\mathrm{crit}}(S, C_k) = \{$minterm $C_k \cdot X$ such that $C_k \cdot X \in S$ and $\bar{C}_k \cdot X \in S\}$
  – $\overline{\mathrm{crit}}(S, \bar{C}_k) = \{$minterm $\bar{C}_k \cdot X$ such that $\bar{C}_k \cdot X \in S$ and $C_k \cdot X \in S\}$.
  And two non-critical down states:
  – $\overline{\mathrm{crit}}(\bar{S}, C_k) = \{$minterm $C_k \cdot X$ such that $C_k \cdot X \in \bar{S}$ and $\bar{C}_k \cdot X \in \bar{S}\}$
  – $\overline{\mathrm{crit}}(\bar{S}, \bar{C}_k) = \{$minterm $\bar{C}_k \cdot X$ such that $\bar{C}_k \cdot X \in \bar{S}$ and $C_k \cdot X \in \bar{S}\}$.

• Inclusive cofactor-based definition: This third definition is new. It emanates from the minterm-based definition of non-critical states and it has been inspired by the exclusive cofactor-based definition of the critical states (see Sect. 24.1.2). It is illustrated with regards to $\overline{\mathrm{crit}}(S, C_k)$ as follows:

$S^{*}_{C_k} = \overline{\mathrm{crit}}(S, C_k)|C_k = \overline{\mathrm{crit}}(S, \bar{C}_k)|\bar{C}_k = (S|C_k) \cdot (S|\bar{C}_k)$    (24.20)

Let us illustrate the notions of both inclusive cofactor and non-critical states by applying the above definition to systems $S_1$, $S_2$ and $S_3$ with respect to some of their respective components:

• Non-critical states of $S_1$ with regards to component A:
  – $\overline{\mathrm{crit}}(S_1, A) = A \cdot S^{*}_{1,A} = A \cdot (S_1|A) \cdot (S_1|\bar{A}) = A \cdot (1) \cdot (B + C) = A \cdot (B + C)$
  – $\overline{\mathrm{crit}}(S_1, \bar{A}) = \bar{A} \cdot S^{*}_{1,\bar{A}} = \bar{A} \cdot (S_1|\bar{A}) \cdot (S_1|A) = \bar{A} \cdot (B + C)$.

• Non-critical states of $S_2$ with regards to component B:
  – $\overline{\mathrm{crit}}(S_2, B) = B \cdot S^{*}_{2,B} = B \cdot (S_2|B) \cdot (S_2|\bar{B}) = B \cdot (A + C) \cdot (A \cdot C) = B \cdot (A \cdot C)$
  – $\overline{\mathrm{crit}}(S_2, \bar{B}) = \bar{B} \cdot S^{*}_{2,\bar{B}} = \bar{B} \cdot (S_2|\bar{B}) \cdot (S_2|B) = \bar{B} \cdot (A \cdot C) \cdot (A + C) = \bar{B} \cdot (A \cdot C)$.

• Non-critical states of $S_3$ with regards to component C:
  – $\overline{\mathrm{crit}}(S_3, C) = C \cdot S^{*}_{3,C} = C \cdot (S_3|C) \cdot (S_3|\bar{C}) = C \cdot (A \cdot B) \cdot (\bar{A} + B) = C \cdot (A \cdot B)$
  – $\overline{\mathrm{crit}}(S_3, \bar{C}) = \bar{C} \cdot S^{*}_{3,\bar{C}} = \bar{C} \cdot (S_3|\bar{C}) \cdot (S_3|C) = \bar{C} \cdot (\bar{A} + B) \cdot (A \cdot B) = \bar{C} \cdot (A \cdot B)$.

• Non-critical states of $S_3$ with regards to component A:
  – $\overline{\mathrm{crit}}(S_3, A) = A \cdot S^{*}_{3,A} = A \cdot (S_3|A) \cdot (S_3|\bar{A}) = A \cdot (B) \cdot (\bar{C}) = A \cdot (B \cdot \bar{C})$
  – $\overline{\mathrm{crit}}(S_3, \bar{A}) = \bar{A} \cdot S^{*}_{3,\bar{A}} = \bar{A} \cdot (S_3|\bar{A}) \cdot (S_3|A) = \bar{A} \cdot (\bar{C}) \cdot (B) = \bar{A} \cdot (B \cdot \bar{C})$.

24.1.4 Link Between Critical and Non-critical States

Gathering all the possible critical and non-critical states allows the Boolean equation $S$ related to a system S to be written as:

$S = \mathrm{crit}(S, C_k) + \mathrm{crit}(S, \bar{C}_k) + \overline{\mathrm{crit}}(S, C_k) + \overline{\mathrm{crit}}(S, \bar{C}_k)$    (24.21)

And the conditional states $S|C_k$ and $S|\bar{C}_k$ can be calculated as follows:

$S|C_k = \mathrm{crit}(S, C_k)|C_k + \overline{\mathrm{crit}}(S, C_k)|C_k$    (24.22)

$S|\bar{C}_k = \mathrm{crit}(S, \bar{C}_k)|\bar{C}_k + \overline{\mathrm{crit}}(S, \bar{C}_k)|\bar{C}_k$    (24.23)

Similar calculations undertaken with regards to $\bar{S}$ lead to:

$\bar{S} = \mathrm{crit}(\bar{S}, C_k) + \mathrm{crit}(\bar{S}, \bar{C}_k) + \overline{\mathrm{crit}}(\bar{S}, C_k) + \overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)$    (24.24)

$\bar{S}|\bar{C}_k = \mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k + \overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)|\bar{C}_k$    (24.25)

$\bar{S}|C_k = \mathrm{crit}(\bar{S}, C_k)|C_k + \overline{\mathrm{crit}}(\bar{S}, C_k)|C_k$    (24.26)

The above formulae are valid in any case (monotone and non-monotone Boolean functions) but they can be simplified in the case of monotone functions as:

$\mathrm{crit}(S, \bar{C}_k) = \mathrm{crit}(\bar{S}, C_k) = 0$    (24.27)

A Boolean function $S$ is a sum of minterms which is the union of all its critical and non-critical states. Therefore, the product $S \cdot C_k$ removes all the minterms containing $\bar{C}_k$ and only keeps the minterms related to the states of $S$ which are critical or non-critical with regards to $C_k$. This leads to the property which is given below:

$S \cdot C_k = \mathrm{crit}(S, C_k) + \overline{\mathrm{crit}}(S, C_k)$    (24.28)
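Property (24.28) can be checked numerically by building the critical and non-critical parts from the exclusive and inclusive cofactors and comparing their union with $(S|C_k)$, i.e. with $S \cdot C_k$ once the common factor $C_k$ has been removed. This is a sketch with hypothetical helper names, with $S_3$ assumed to be $A \cdot B + \bar{A} \cdot \bar{C}$.

```python
from itertools import product

def s1(a, b, c):
    return bool(a or b or c)

def s2(a, b, c):
    return bool((a and b) or (a and c) or (b and c))

def s3(a, b, c):
    # Assumed reading of S3: A·B + (not A)·(not C)
    return bool((a and b) or (not a and not c))

def restrict(func, k, value, negate=False):
    """States of the two other variables for which func (or its complement,
    if negate is True) holds when variable k is fixed to value."""
    states = set()
    for rest in product((0, 1), repeat=2):
        bits = rest[:k] + (value,) + rest[k:]
        if bool(func(*bits)) != negate:
            states.add(rest)
    return states

def check_property(func, k):
    """Check that the critical and non-critical parts partition (S | Ck)."""
    s_given_ck = restrict(func, k, 1)
    critical = s_given_ck & restrict(func, k, 0, negate=True)   # exclusive cofactor
    non_critical = s_given_ck & restrict(func, k, 0)            # inclusive cofactor
    disjoint = not (critical & non_critical)
    return disjoint and (critical | non_critical) == s_given_ck

results = {(f.__name__, k): check_property(f, k)
           for f in (s1, s2, s3) for k in range(3)}
print(results)
```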


24.1.5 Graphical Synthesis of the Concepts

Presented in a mathematical form, the above concepts may seem rather complicated. This is why a graphical synthesis of the fundamental concepts useful for better understanding and computation of importance factors is presented in this subsection.

24.1.5.1 Analysis of the Boolean Function S2

Function $S_2 = A \cdot B + A \cdot C + B \cdot C$ has been introduced in Sect. 24.1.1 and Fig. 24.1 illustrates the two states $S_2$ and $\bar{S}_2$ of this function through its minterms.

• Critical states of S2 with regards to component C:
– $\mathrm{crit}(S_2, C) = A \cdot \bar{B} \cdot C + \bar{A} \cdot B \cdot C$, which is a failure-critical state wrt (with regards to) $S_2$;
– $\mathrm{crit}(\bar{S}_2, \bar{C}) = A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C}$, which is a repair-critical state wrt $\bar{S}_2$.

This leads to $S^{\#}_{2,C} = \bar{S}^{\#}_{2,\bar{C}} = A \cdot \bar{B} + \bar{A} \cdot B$, which is the sum $\mathrm{crit}(S_2, C) + \mathrm{crit}(\bar{S}_2, \bar{C})$, i.e. the above minterms where $C$ and $\bar{C}$ have been removed.

Let us consider the cofactor $\mathrm{S}^{\#}_{2,C}$ of system S2 with regards to its component C. It is defined by the sum of the exclusive cofactors related to the failure of S when C changes state (see Sect. 24.1.1.3):

$$\mathrm{S}^{\#}_{2,C} = S^{\#}_{2,C} + S^{\#}_{2,\bar{C}} = A \cdot \bar{B} + \bar{A} \cdot B \tag{24.29}$$

This cofactor is equal to the union of all the critical minterms related to C:

Fig. 24.1 Graphic analysis of the Boolean function S2 with regards to component C


$$\mathrm{S}^{\#}_{2,C} = A \cdot \bar{B} \cdot C + \bar{A} \cdot B \cdot C + A \cdot \bar{B} \cdot \bar{C} + \bar{A} \cdot B \cdot \bar{C} = A \cdot \bar{B} + \bar{A} \cdot B$$

It has to be noted that $\mathrm{S}^{\#}_{2,C} = S^{\#}_{2,C} + S^{\#}_{2,\bar{C}} = \bar{S}^{\#}_{2,\bar{C}} + \bar{S}^{\#}_{2,C}$. It is therefore also the sum of the exclusive cofactors related to a repair of S: it describes the transition from $S_2$ to $\bar{S}_2$ (i.e. a failure of S2) as well as the transition from $\bar{S}_2$ to $S_2$ (i.e. a repair of S2). Therefore, both failures and repairs of S2 due to a change of state of C can be described by the single cofactor $\mathrm{S}^{\#}_{2,C}$. As a reminder, this cofactor is written with upright letters (system) instead of italic letters (Boolean variables). This result is essential to compute the marginal importance factor (MIF), also called the Birnbaum importance factor (see Sect. 24.2.3).

• Non-critical states of S2 with regards to component C:
– $\overline{\mathrm{crit}}(S_2, C) = A \cdot B \cdot C$, which is a failure-non-critical state wrt $S_2$;
– $\overline{\mathrm{crit}}(S_2, \bar{C}) = A \cdot B \cdot \bar{C}$, which is a repair-non-critical state wrt $S_2$.

This leads to $S^{*}_{2,C} = S^{*}_{2,\bar{C}} = A \cdot B$, which is the sum $\overline{\mathrm{crit}}(S_2, C) + \overline{\mathrm{crit}}(S_2, \bar{C})$, i.e. the above minterms where $C$ and $\bar{C}$ have been removed.

– $\overline{\mathrm{crit}}(\bar{S}_2, C) = \bar{A} \cdot \bar{B} \cdot C$, which is a failure-non-critical state wrt $\bar{S}_2$;
– $\overline{\mathrm{crit}}(\bar{S}_2, \bar{C}) = \bar{A} \cdot \bar{B} \cdot \bar{C}$, which is a repair-non-critical state wrt $\bar{S}_2$.

This leads to $\bar{S}^{*}_{2,C} = \bar{S}^{*}_{2,\bar{C}} = \bar{A} \cdot \bar{B}$, which is the sum $\overline{\mathrm{crit}}(\bar{S}_2, C) + \overline{\mathrm{crit}}(\bar{S}_2, \bar{C})$, i.e. the above minterms where $C$ and $\bar{C}$ have been removed.
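The classification of the minterms of S2 with regards to component C can be checked by brute force. The following is only a minimal sketch: the helper names (`classify`, `minterm`) and the `~` marker for a failed component are illustrative choices, not notation from the text. A state is taken as critical when flipping the component alone flips the system.

```python
from itertools import product

NAMES = ["A", "B", "C"]

def s2(a, b, c):
    """Up-state function of the 2-out-of-3 system: S2 = A·B + A·C + B·C."""
    return (a and b) or (a and c) or (b and c)

def classify(f, target):
    """Sort every minterm into critical / non-critical, up / down w.r.t. `target`."""
    k = NAMES.index(target)
    groups = {"crit_up": set(), "crit_down": set(),
              "noncrit_up": set(), "noncrit_down": set()}
    for state in product([True, False], repeat=len(NAMES)):
        flipped = state[:k] + (not state[k],) + state[k + 1:]
        up = f(*state)
        critical = up != f(*flipped)  # flipping the component flips the system
        key = ("crit_" if critical else "noncrit_") + ("up" if up else "down")
        groups[key].add(state)
    return groups

def minterm(state):
    """Render a state as a product term, '~' marking a failed component."""
    return "·".join(n if v else "~" + n for n, v in zip(NAMES, state))

groups = classify(s2, "C")
for key, states in groups.items():
    print(key, sorted(minterm(s) for s in states))
```

The two critical up states have C up and the two critical down states have C down, which is what is expected for a monotone function.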

24.1.5.2 Analysis of the Boolean Function S3

Function $S_3 = A \cdot B + \bar{A} \cdot \bar{C}$ has been introduced in Sect. 24.1.1 and Fig. 24.2 illustrates the two states $S_3$ and $\bar{S}_3$ of this function through its minterms.

• Critical states of S3 with regards to component A:
– $\mathrm{crit}(S_3, A) = A \cdot B \cdot C$, which is a failure-critical state wrt $S_3$;
– $\mathrm{crit}(\bar{S}_3, \bar{A}) = \bar{A} \cdot B \cdot C$, which is a repair-critical state wrt $\bar{S}_3$.

This leads to $S^{\#}_{3,A} = \bar{S}^{\#}_{3,\bar{A}} = B \cdot C$, which is the sum $\mathrm{crit}(S_3, A) + \mathrm{crit}(\bar{S}_3, \bar{A})$, i.e. the above minterms where $A$ and $\bar{A}$ have been removed.

– $\mathrm{crit}(S_3, \bar{A}) = \bar{A} \cdot \bar{B} \cdot \bar{C}$, which is a repair-critical state wrt $S_3$;
– $\mathrm{crit}(\bar{S}_3, A) = A \cdot \bar{B} \cdot \bar{C}$, which is a failure-critical state wrt $\bar{S}_3$.

This leads to $S^{\#}_{3,\bar{A}} = \bar{S}^{\#}_{3,A} = \bar{B} \cdot \bar{C}$, which is the sum $\mathrm{crit}(S_3, \bar{A}) + \mathrm{crit}(\bar{S}_3, A)$, i.e. the above minterms where $A$ and $\bar{A}$ have been removed. It has to be noted that these critical states exist because of the non-coherence of S3.


Fig. 24.2 Graphic analysis of the Boolean function S3 with regards to component A

As for system S2 above, let us consider the cofactor $\mathrm{S}^{\#}_{3,A}$ of system S3 with regards to its component A, defined by the sum of the exclusive cofactors related to the failure of S when A changes state:

$$\mathrm{S}^{\#}_{3,A} = S^{\#}_{3,A} + S^{\#}_{3,\bar{A}} = B \cdot C + \bar{B} \cdot \bar{C} \tag{24.30}$$

This cofactor is equal to the union of all the critical minterms related to A:

$$\mathrm{S}^{\#}_{3,A} = A \cdot B \cdot C + \bar{A} \cdot B \cdot C + \bar{A} \cdot \bar{B} \cdot \bar{C} + A \cdot \bar{B} \cdot \bar{C} = B \cdot C + \bar{B} \cdot \bar{C}$$

Again, $\mathrm{S}^{\#}_{3,A} = S^{\#}_{3,A} + S^{\#}_{3,\bar{A}} = \bar{S}^{\#}_{3,\bar{A}} + \bar{S}^{\#}_{3,A}$, which is also the sum of the exclusive cofactors related to a repair of S. Therefore, this cofactor also allows the marginal importance factor (MIF) to be computed in the case of non-coherent systems (see Sect. 24.2.3).

• Non-critical states of S3 with regards to component A:
Figure 24.2 illustrates two non-critical states, $A \cdot B \cdot \bar{C}$ and $\bar{A} \cdot B \cdot \bar{C}$, with regards to the failure of S3 when A changes state (fails or is repaired) and two non-critical states, $A \cdot \bar{B} \cdot C$ and $\bar{A} \cdot \bar{B} \cdot C$, with regards to the repair of S3 when A changes state (fails or is repaired). The analysis is similar to that already done for the non-critical states of system S2 (see Fig. 24.1).

Figure 24.3 illustrates the critical and non-critical states of S3 with regards to component B. As S3 is monotone with regards to the states of B, the analysis is similar to that already done for the critical and the non-critical states of system S2 (see Fig. 24.1).


Fig. 24.3 Graphic analysis of the Boolean function S3 with regards to component B

Once again, the following equality holds:

$$\mathrm{S}^{\#}_{3,B} = S^{\#}_{3,B} + S^{\#}_{3,\bar{B}} = \bar{S}^{\#}_{3,\bar{B}} + \bar{S}^{\#}_{3,B} = A \tag{24.31}$$

And this is the union of the critical states with regards to component B:

$$\mathrm{S}^{\#}_{3,B} = A \cdot B \cdot C + A \cdot B \cdot \bar{C} + A \cdot \bar{B} \cdot C + A \cdot \bar{B} \cdot \bar{C} = A$$

Figure 24.4 illustrates the critical and non-critical states of S3 with regards to component C. As for component B, S3 is also monotone with regards to the states of C, but in the opposite way (S3 fails when C is repaired and vice versa). However, the analysis is similar to that already done for the critical and the non-critical states of system S2 (see Fig. 24.1).

Fig. 24.4 Graphic analysis of the Boolean function S3 with regards to component C

Once again, the following equality holds:

$$\mathrm{S}^{\#}_{3,C} = S^{\#}_{3,C} + S^{\#}_{3,\bar{C}} = \bar{S}^{\#}_{3,\bar{C}} + \bar{S}^{\#}_{3,C} = \bar{A} \tag{24.32}$$

And this is the union of the critical states with regards to component C:

$$\mathrm{S}^{\#}_{3,C} = \bar{A} \cdot B \cdot C + \bar{A} \cdot \bar{B} \cdot C + \bar{A} \cdot B \cdot \bar{C} + \bar{A} \cdot \bar{B} \cdot \bar{C} = \bar{A}$$

The above analyses of S2 and S3 confirm that the critical states and the cofactors related to a given system, coherent or not, are two strongly linked concepts.

24.2 Importance Factors

24.2.1 Generalities About Importance Factors

Ranking the components according to their impacts on the probability of success or failure of the system in which they are included is very useful for the in-depth analysis of this system. When RBDs/FTs are implemented, this can be done by using one or several of the importance factors (Kuo and Zhu 2012; Dutuit and Rauzy 2014) which have been developed for this purpose.

Importance factor: probabilistic indicator capturing a specific impact at system level of an event at component level.

Various impacts can be considered and, therefore, several importance factors can be defined. They are generally defined within the time-independent framework (i.e. for constant probabilities of blocks or primary events) and regardless of the coherence or non-coherence of the related RBDs/FTs. Nevertheless, they can be extended to time-dependent probabilities (e.g. in the same way as probabilities of success/failure have been extended to availability/unavailability in Chap. 22) and some of them (or some calculations of them) are specifically devoted to coherent systems.

The aim of this subsection is to present the main importance factors related to coherent RBDs/FTs, explain their meanings and describe how they can be practically calculated when RBDs/FTs are used. However, some brief clues are given about the importance factors related to non-coherent RBDs/FTs. Importance factors of dynamic RBDs/FTs require the use of Monte Carlo simulation: this is rather complicated and beyond the scope of the following subsections (see for instance Ou and Dugan 2000; Zhang et al. 2010).

It has to be noted that two different importance factors are referred to as Vesely-Fussell importance factors in the literature of interest. They are normally used in different contexts. The first one is defined from minimal cut sets (see Sect. 24.2.2) whereas the second one, which is also named diagnostic importance factor (DIF), is not (see Sect. 24.2.5). To avoid any ambiguity, the name Vesely-Fussell importance factor is kept for the first one, which is the most often used, and the name DIF is retained for the second one.

24.2.2 Vesely-Fussell Importance Factor

The Vesely-Fussell importance factor has already been encountered in Chap. 20 for proceeding to semi-quantitative analyses using the minimal cut sets obtained from FTs or RBDs, and it is one of the oldest importance factors introduced for that purpose. As mentioned above, it should not be mixed up with the DIF (see Sect. 24.2.5), which has also been introduced by Vesely and Fussell.

Let us consider the following notations:
• $\mathrm{Cs}_n$: any minimal cut set leading to the system S down state;
• $\mathrm{Cs}_i^{k}$: any minimal cut set containing the failure, $\bar{C}_k$, of $C_k$;
• $\mathrm{Cs}_j^{-k}$: any minimal cut set not containing $\bar{C}_k$.

For a component $C_k$ belonging to a system S, $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$ is originally defined as follows in Fussell (1975) on the basis of Vesely (1970) and confirmed in Vesely et al. (1981) and Vesely (1996):

$$\mathrm{VF}_S^{\Sigma}(\bar{C}_k) = \frac{\sum_i \Pr(\mathrm{Cs}_i^{k})}{\sum_n \Pr(\mathrm{Cs}_n)} \tag{24.33}$$

In the notation $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$, the letter $\Sigma$ is a reminder of the way this importance factor is calculated. In the same way, another importance factor can be introduced to measure the impact of the minimal cut sets not containing $\bar{C}_k$:

$$\mathrm{IF}_S^{\Sigma,-k} = \frac{\sum_j \Pr(\mathrm{Cs}_j^{-k})}{\sum_n \Pr(\mathrm{Cs}_n)} \tag{24.34}$$

As $\sum_n \Pr(\mathrm{Cs}_n) = \sum_i \Pr(\mathrm{Cs}_i^{k}) + \sum_j \Pr(\mathrm{Cs}_j^{-k})$ with $n = i + j$, $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$ can therefore be calculated from the minimal cut sets not containing $\bar{C}_k$:

$$\mathrm{VF}_S^{\Sigma}(\bar{C}_k) = 1 - \frac{\sum_j \Pr(\mathrm{Cs}_j^{-k})}{\sum_n \Pr(\mathrm{Cs}_n)} = 1 - \mathrm{IF}_S^{\Sigma,-k} \tag{24.35}$$
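As an illustration of Eqs. (24.33) to (24.35), the sketch below evaluates both ratios on a small example. The 2-out-of-3 cut sets and the failure probabilities are hypothetical values chosen for the illustration, and the components are assumed to be independent so that the probability of a cut set is the product of the failure probabilities of its components.

```python
from math import prod

# Hypothetical failure probabilities of three independent components.
q = {"A": 0.01, "B": 0.02, "C": 0.05}
# Minimal cut sets of a 2-out-of-3 system (any two failures bring it down).
cut_sets = [{"A", "B"}, {"A", "C"}, {"B", "C"}]

def pr_cut(cs):
    """Probability of a minimal cut set under the independence assumption."""
    return prod(q[c] for c in cs)

def vf_sigma(k):
    """Eq. (24.33): share of the cut set probabilities involving component k."""
    total = sum(pr_cut(cs) for cs in cut_sets)
    return sum(pr_cut(cs) for cs in cut_sets if k in cs) / total

def if_sigma_minus(k):
    """Eq. (24.34): share of the cut set probabilities not involving component k."""
    total = sum(pr_cut(cs) for cs in cut_sets)
    return sum(pr_cut(cs) for cs in cut_sets if k not in cs) / total

for k in sorted(q):
    print(k, round(vf_sigma(k), 4), round(if_sigma_minus(k), 4))
```

With these values the component with the highest failure probability, C, obtains the highest factor, and the two ratios add up to 1 for every component, as stated by Eq. (24.35).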

According to the above formula, the Vesely-Fussell importance factor can be calculated from the minimal cut sets containing $\bar{C}_k$ as well as from the minimal cut sets not containing $\bar{C}_k$. Being based on minimal cut sets, it is usable only for coherent RBDs/FTs.

Contrary to some other importance factors developed hereafter, which are related to specific system states, the Vesely-Fussell importance factor $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$ is merely a simple and pragmatic indicator, easy to understand and to use by hand for simple RBDs/FTs. As it takes into account both the probability of failure of $C_k$ and the order of the minimal cut sets in which it is involved, it provides rather good rankings of the impact of the component failures on the overall system failure $\bar{S}$. This is why it is very popular and widely used.

Nevertheless, when the size of the models (FTs or RBDs) increases, the number of minimal cut sets becomes large (hundreds, thousands or even millions of minimal cut sets) and the above formula becomes very difficult to implement or use without truncations. This is why a sounder definition has been proposed (see for instance Misra 1992), based on the fact that the original definition is built on the first terms of the Sylvester-Poincaré formula (see Chap. 19), which are approximations of $\Pr(\cup_i \mathrm{Cs}_i^{k})$ and of $\Pr(\cup_n \mathrm{Cs}_n)$. Replacing the approximated probabilities by the non-approximated probabilities leads to:

$$\mathrm{VF}_S(\bar{C}_k) = \frac{\Pr(\cup_i \mathrm{Cs}_i^{k})}{\Pr(\cup_n \mathrm{Cs}_n)} = \frac{\Pr(\cup_i \mathrm{Cs}_i^{k})}{\Pr(\bar{S})} \equiv \Pr[(\cup_i \mathrm{Cs}_i^{k})\,|\,\bar{S}] \tag{24.36}$$

Although $\Pr(\bar{S})$ is easy to calculate in the above formula, the calculation of $\Pr(\cup_i \mathrm{Cs}_i^{k})$ is, unfortunately, rather difficult and time-consuming. This is why attempts have been made to replace this formula by something easier to calculate. Let us consider the following simplified notations (where $\cap$ is replaced by $\cdot$):
• $Y = \cup_j \mathrm{Cs}_j^{-k}$: all minimal cut sets not containing $\bar{C}_k$;
• $\bar{C}_k \cdot X = \cup_i \mathrm{Cs}_i^{k}$: all minimal cut sets containing $\bar{C}_k$;
• $\bar{S} = \bar{C}_k \cdot X + Y$: all minimal cut sets.

This implies that $\bar{S}|C_k = Y$. Then $\Pr(Y) = \Pr(\bar{S}|C_k)$ and $\mathrm{IF}_S^{-k}$ can be calculated as:

$$\mathrm{IF}_S^{-k} = \frac{\Pr(\cup_j \mathrm{Cs}_j^{-k})}{\Pr(\cup_n \mathrm{Cs}_n)} = \frac{\Pr(Y)}{\Pr(\bar{S})} = \frac{\Pr(\bar{S}|C_k)}{\Pr(\bar{S})} \tag{24.37}$$

As $\Pr(\bar{S}|C_k)$ is easy to calculate when BDDs are implemented, $\mathrm{IF}_S^{-k}$ is also easy to calculate from RBDs/FTs and it measures the impact of the minimal cut sets not containing the failure, $\bar{C}_k$, of component $C_k$.


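With the above notations, $\mathrm{VF}_S(\bar{C}_k)$ can be calculated as:

$$\mathrm{VF}_S(\bar{C}_k) = \frac{\Pr(\bar{C}_k \cdot X)}{\Pr(\bar{S})} \tag{24.38}$$

Unfortunately, $\Pr(\bar{S}) = \Pr(\bar{C}_k \cdot X) + \Pr(Y) - \Pr(\bar{C}_k \cdot X \cdot Y)$ in this case and, contrary to what happens with the calculation of $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$ above, $\mathrm{VF}_S(\bar{C}_k)$ is no longer the complement to 1 of $\mathrm{IF}_S^{-k}$. Therefore, it is more difficult to calculate. Dividing the development of $\Pr(\bar{S})$ by $\Pr(\bar{S})$ leads to:

$$\frac{\Pr(\bar{C}_k \cdot X)}{\Pr(\bar{S})} + \frac{\Pr(Y)}{\Pr(\bar{S})} - \frac{\Pr(\bar{C}_k \cdot X \cdot Y)}{\Pr(\bar{S})} = \mathrm{VF}_S(\bar{C}_k) + \mathrm{IF}_S^{-k} - \frac{\Pr(\bar{C}_k \cdot X \cdot Y)}{\Pr(\bar{S})} = 1$$

Then $\mathrm{VF}_S(\bar{C}_k) = 1 - \mathrm{IF}_S^{-k} + \frac{\Pr(\bar{C}_k \cdot X \cdot Y)}{\Pr(\bar{S})}$ and this implies:

$$\mathrm{VF}_S(\bar{C}_k) \ge 1 - \mathrm{IF}_S^{-k} \tag{24.39}$$

Finally, $1 - \mathrm{IF}_S^{-k}$ provides a lower bound of $\mathrm{VF}_S(\bar{C}_k)$ and this lower bound is easy to calculate. Coming back to the development of $\Pr(\bar{S})$ leads to:

$$\begin{aligned}
\Pr(\bar{S}) &= \Pr(\bar{C}_k \cdot X) + \Pr(Y) - \Pr(\bar{C}_k \cdot X) \cdot \Pr(Y|X) \\
&= \Pr(\bar{C}_k \cdot X) \cdot [1 - \Pr(Y|X)] + \Pr(Y) \\
&= \Pr(\bar{C}_k \cdot X) \cdot \Pr(\bar{Y}|X) + \Pr(Y)
\end{aligned}$$

Then $\Pr(\bar{C}_k \cdot X) = \frac{\Pr(\bar{S}) - \Pr(Y)}{\Pr(\bar{Y}|X)}$ and dividing by $\Pr(\bar{S})$ gives $\mathrm{VF}_S(\bar{C}_k)$ as:

$$\mathrm{VF}_S(\bar{C}_k) = \frac{\Pr(\bar{S}) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S}) \cdot \Pr(\bar{Y}|X)} \tag{24.40}$$

This formula provides an exact expression of $\mathrm{VF}_S(\bar{C}_k)$ but it is very difficult to calculate because $\Pr(\bar{Y}|X)$ remains hard to obtain. Fortunately, bounding $\Pr(\bar{Y}|X)$ from above gives lower bounds of $\mathrm{VF}_S(\bar{C}_k)$:

– $\Pr(\bar{Y}|X)$ is lower than 1 in any case: $\mathrm{VF}_S(\bar{C}_k) \ge \frac{\Pr(\bar{S}) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S})}$. This provides the same approximation already found above: $\mathrm{VF}_S(\bar{C}_k) \ge 1 - \mathrm{IF}_S^{-k}$.

– $\Pr(\bar{Y}|X)$ is also lower than $\Pr(\bar{Y})$ and this provides another lower bound, less pessimistic than the previous one: $\mathrm{VF}_S(\bar{C}_k) \ge \frac{\Pr(\bar{S}) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S}) \cdot \Pr(\bar{Y})} = \frac{\Pr(\bar{S}) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S}) \cdot [1 - \Pr(\bar{S}|C_k)]}$.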


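Replacing $\Pr(\bar{S})$ by $\Pr(\bar{C}_k) \cdot \Pr(\bar{S}|\bar{C}_k) + \Pr(C_k) \cdot \Pr(\bar{S}|C_k)$ in the numerator of the first lower bound above gives:

$$\begin{aligned}
\frac{\Pr(\bar{S}) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S})}
&= \frac{\Pr(\bar{C}_k) \cdot \Pr(\bar{S}|\bar{C}_k) + \Pr(C_k) \cdot \Pr(\bar{S}|C_k) - \Pr(\bar{S}|C_k)}{\Pr(\bar{S})} \\
&= \frac{\Pr(\bar{C}_k) \cdot \Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k) \cdot [1 - \Pr(C_k)]}{\Pr(\bar{S})} \\
&= \frac{\Pr(\bar{C}_k) \cdot \Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k) \cdot \Pr(\bar{C}_k)}{\Pr(\bar{S})} \\
&= \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})} \cdot [\Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k)] \\
&= \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})} \cdot \mathrm{MIF}_S(\bar{C}_k) = \mathrm{CIF}_S(\bar{C}_k) \quad \text{(see 24.2.3 and 24.2.4.1)}
\end{aligned}$$

According to the above equation, the impact of the minimal cut sets which do not involve component $C_k$ is the complement to 1 of $\mathrm{CIF}_S(\bar{C}_k)$:

$$\mathrm{IF}_S^{-k} = 1 - \mathrm{CIF}_S(\bar{C}_k) \tag{24.41}$$

The same equation allows the second lower bound to be calculated as:

$$\mathrm{VF}_S(\bar{C}_k)_{\inf} = \frac{\mathrm{CIF}_S(\bar{C}_k)}{1 - \Pr(\bar{S}|C_k)} \tag{24.42}$$

and this leads to:

$$\mathrm{CIF}_S(\bar{C}_k) \le \mathrm{VF}_S(\bar{C}_k)_{\inf} \le \mathrm{VF}_S(\bar{C}_k) \tag{24.43}$$

An upper bound can be found by comparing $\mathrm{VF}_S(\bar{C}_k)$ to the diagnostic importance factor, $\mathrm{DIF}_S(C_k) = \Pr(\bar{C}_k|\bar{S})$ (see 24.2.5, Eq. 24.70). Then:

$$\mathrm{DIF}_S(C_k) = \frac{\Pr(\bar{C}_k \cdot \bar{S})}{\Pr(\bar{S})} = \frac{\Pr[\bar{C}_k \cdot (\bar{C}_k \cdot X + Y)]}{\Pr(\bar{S})} = \frac{\Pr(\bar{C}_k \cdot X + \bar{C}_k \cdot Y)}{\Pr(\bar{S})}$$

Comparing $\mathrm{VF}_S(\bar{C}_k) = \frac{\Pr(\bar{C}_k \cdot X)}{\Pr(\bar{S})}$ to $\mathrm{DIF}_S(C_k) = \frac{\Pr(\bar{C}_k \cdot X + \bar{C}_k \cdot Y)}{\Pr(\bar{S})}$ demonstrates that $\mathrm{DIF}_S(C_k)$ is greater than or equal to $\mathrm{VF}_S(\bar{C}_k)$ and provides an upper bound to the Vesely-Fussell importance factor. Finally, gathering all the results about lower and upper bounds gives the following framing:

$$\mathrm{CIF}_S(\bar{C}_k) \le \mathrm{VF}_S(\bar{C}_k)_{\inf} \le \mathrm{VF}_S(\bar{C}_k) \le \mathrm{DIF}_S(C_k) \tag{24.44}$$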
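The framing of Eq. (24.44) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: a hypothetical 2-out-of-3 system with independent components and made-up failure probabilities, with every probability obtained by exhaustive enumeration of the states rather than through BDDs.

```python
from itertools import product
from math import prod

q = {"A": 0.01, "B": 0.02, "C": 0.05}            # hypothetical failure probabilities
cut_sets = [{"A", "B"}, {"A", "C"}, {"B", "C"}]  # minimal cut sets, 2-out-of-3 system
names = sorted(q)

def pr(event):
    """Exact probability of `event(failed)` by summing over all 2^n states."""
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        failed = {n for n, b in zip(names, bits) if b}
        if event(failed):
            total += prod(q[n] if n in failed else 1 - q[n] for n in names)
    return total

def down(failed, sets=cut_sets):
    """True when at least one of the given cut sets is fully failed."""
    return any(cs <= failed for cs in sets)

def bounds(k):
    p_s = pr(down)
    with_k = [cs for cs in cut_sets if k in cs]
    without_k = [cs for cs in cut_sets if k not in cs]
    vf = pr(lambda f: down(f, with_k)) / p_s      # Eq. (24.36)
    p_y = pr(lambda f: down(f, without_k))        # Pr(Y), i.e. Pr(system down | k up)
    cif = (p_s - p_y) / p_s                       # 1 - IF^{-k}, Eq. (24.41)
    vf_inf = cif / (1 - p_y)                      # Eq. (24.42)
    dif = pr(lambda f: k in f and down(f)) / p_s  # Eq. (24.70)
    return cif, vf_inf, vf, dif

for k in names:
    print(k, [round(v, 5) for v in bounds(k)])
```

In this particular example the upper bound is reached: every down state in which a given component is failed already contains a minimal cut set involving that component, so the DIF and the Vesely-Fussell factor coincide.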

It has to be noted that these bounds are related to $\mathrm{VF}_S(\bar{C}_k)$ and not to $\mathrm{VF}_S^{\Sigma}(\bar{C}_k)$, which depends on how many minimal cut sets have been kept to calculate it. It also has to be noted that $\mathrm{VF}_S(\bar{C}_k)$ should not be mixed up with the diagnostic importance factor DIF (see Sect. 24.2.5 hereafter), as is sometimes done in the literature. Finally, in the case of non-coherent systems, an extended version of the Vesely-Fussell importance factor has been proposed by Beeson and Andrews (2003) (see Sect. 24.2.3.3.2).

24.2.3 Birnbaum Importance Factor (MIF)

The Birnbaum importance factor is also known as the marginal importance factor (MIF). Like the Vesely-Fussell importance factor, the Birnbaum importance factor (Andrews and Beeson 2003; Birnbaum 1969) has already been encountered in previous chapters: it is the key for the frequency and approximated reliability calculations of repaired systems analysed in Chap. 22. The idea of this indicator is to calculate the sensitivity of the output probability at system level to the input probabilities at component level:
• for an FT, it is an indicator of the sensitivity of the probability of system down state (failure), $\Pr(\bar{S})$, to a slight variation of the probability of its component down states (failures), $\Pr(\bar{C}_k)$;
• for an RBD, it is an indicator of the sensitivity of the probability of system up state (success), $\Pr(S)$, to a slight variation of the probability of its component up states (success), $\Pr(C_k)$.

24.2.3.1 MIF Calculations from FTs

When coherent FTs are considered, the Birnbaum importance factor is developed from down states and it is defined as follows:

$$\mathrm{MIF}_S(\bar{C}_k) = \frac{\partial \Pr(\bar{S})}{\partial \Pr(\bar{C}_k)} \tag{24.45}$$

Therefore, when $\bar{S}$ represents the failure of a system, $\mathrm{MIF}_S(\bar{C}_k)$ measures the impact of a change in the probability of failure of component $C_k$, $\Pr(\bar{C}_k)$, on the probability of system failure, $\Pr(\bar{S})$. The Shannon decomposition of $\Pr(\bar{S})$ allows this partial derivative to be calculated:

$$\begin{aligned}
\Pr(\bar{S}) &= \Pr(\bar{S}|\bar{C}_k) \cdot \Pr(\bar{C}_k) + \Pr(\bar{S}|C_k) \cdot \Pr(C_k) \\
&= \Pr(\bar{S}|\bar{C}_k) \cdot \Pr(\bar{C}_k) + \Pr(\bar{S}|C_k) \cdot [1 - \Pr(\bar{C}_k)] \\
&= [\Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k)] \cdot \Pr(\bar{C}_k) + \Pr(\bar{S}|C_k)
\end{aligned}$$

Then, $\frac{\partial \Pr(\bar{S})}{\partial \Pr(\bar{C}_k)} = \Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k)$ and $\mathrm{MIF}_S(\bar{C}_k)$ can be calculated as the difference between two conditional probabilities:

$$\mathrm{MIF}_S(\bar{C}_k) = \Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k) \tag{24.46}$$

When the FT is used to calculate the system unavailability $U_S(t)$, $\Pr(\bar{S}|\bar{C}_k)$ is equal to the conditional unavailability given that component $C_k$ is in down state, $U_{S|\bar{C}_k}(t)$, and $\Pr(\bar{S}|C_k)$ is equal to the conditional unavailability given that component $C_k$ is in up state, $U_{S|C_k}(t)$. Then, the marginal importance factor at a given instant $t$ is given by:

$$\mathrm{MIF}_S(\bar{C}_k, t) = U_{S|\bar{C}_k}(t) - U_{S|C_k}(t) \tag{24.47}$$

$\mathrm{MIF}_S(\bar{C}_k)$ can also be calculated from an RBD by using the following formula:

$$\mathrm{MIF}_S(\bar{C}_k, t) = A_{S|C_k}(t) - A_{S|\bar{C}_k}(t) \tag{24.48}$$

Therefore, when binary decision diagrams (BDDs) are used to perform FT probabilistic calculations, the importance factor $\mathrm{MIF}_S(\bar{C}_k)$ is very easy to calculate through conditional probability calculations (see Chap. 21).
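The equivalence between the partial derivative of Eq. (24.45) and the difference of conditional probabilities of Eq. (24.46) can be illustrated numerically. The following sketch assumes a hypothetical fault tree (top event = A AND (B OR C)) with independent components and made-up failure probabilities; the conditional probabilities are obtained by forcing the failure probability of the component to 1 or 0, and exhaustive enumeration stands in for the BDD calculation.

```python
from itertools import product
from math import prod

NAMES = ["A", "B", "C"]

def system_down(failed):
    """Hypothetical fault tree: top event = A AND (B OR C)."""
    return "A" in failed and ("B" in failed or "C" in failed)

def pr_down(q):
    """Probability of the top event by enumeration of all component states."""
    total = 0.0
    for bits in product([True, False], repeat=len(NAMES)):
        failed = {n for n, b in zip(NAMES, bits) if b}
        if system_down(failed):
            total += prod(q[n] if n in failed else 1 - q[n] for n in NAMES)
    return total

def mif(q, k):
    """Eq. (24.46): Pr(system down | k down) - Pr(system down | k up)."""
    return pr_down({**q, k: 1.0}) - pr_down({**q, k: 0.0})

def mif_numeric(q, k, h=1e-6):
    """Eq. (24.45): central finite difference of Pr(system down) w.r.t. q[k]."""
    return (pr_down({**q, k: q[k] + h}) - pr_down({**q, k: q[k] - h})) / (2 * h)

q = {"A": 0.01, "B": 0.02, "C": 0.05}  # hypothetical failure probabilities
for k in NAMES:
    print(k, mif(q, k), mif_numeric(q, k))
```

Because the system failure probability is linear in each individual component probability, the finite difference agrees with the conditional difference up to rounding errors.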

24.2.3.2 MIF Calculations from RBDs

The same mathematical development made for down states of coherent FTs can also be done with regards to up states of coherent RBDs. In this case, the Birnbaum importance factor is defined as follows:

$$\mathrm{MIF}_S(C_k) = \frac{\partial \Pr(S)}{\partial \Pr(C_k)} \tag{24.49}$$

$\mathrm{MIF}_S(C_k)$ can also be calculated by using the Shannon decomposition:

$$\mathrm{MIF}_S(C_k) = \Pr(S|C_k) - \Pr(S|\bar{C}_k) \tag{24.50}$$

When the RBD is used to calculate the system availability $A_S(t)$, $\Pr(S|C_k)$ is equal to the conditional availability given that component $C_k$ is in up state, $A_{S|C_k}(t)$, and $\Pr(S|\bar{C}_k)$ is equal to the conditional availability given that component $C_k$ is in down state, $A_{S|\bar{C}_k}(t)$. Then, the marginal importance factor at a given instant $t$ is given by:

$$\mathrm{MIF}_S(C_k) = A_{S|C_k}(t) - A_{S|\bar{C}_k}(t) \tag{24.51}$$

$\mathrm{MIF}_S(C_k)$ can also be calculated from an FT by using the following formula:

$$\mathrm{MIF}_S(C_k) = U_{S|\bar{C}_k}(t) - U_{S|C_k}(t) \tag{24.52}$$

Therefore, $\mathrm{MIF}_S(C_k)$ is easy to calculate when binary decision diagrams (BDDs) are implemented (see Chap. 21).

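24.2.3.3 MIF with Regards to Critical States and Exclusive Cofactor

Coherent Systems

According to Eq. (24.46), $\mathrm{MIF}_S(\bar{C}_k) = \Pr(\bar{S}|\bar{C}_k) - \Pr(\bar{S}|C_k)$ and this formula can be calculated by using Formulae (24.25) and (24.26):

– $\Pr(\bar{S}|\bar{C}_k) = \Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k + \overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)|\bar{C}_k] = \Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k] + \Pr[\overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)|\bar{C}_k]$
– $\Pr(\bar{S}|C_k) = \Pr[\overline{\mathrm{crit}}(\bar{S}, C_k)|C_k]$.

And, finally:

$$\mathrm{MIF}_S(\bar{C}_k) = \Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k] \tag{24.53}$$

But, according to Formula (24.19) and when the modelled system is coherent (monotone Boolean formula), $\mathrm{crit}(\bar{S}, \bar{C}_k) = \bar{C}_k \cdot \bar{S}^{\#}_{\bar{C}_k}$. Then $\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k = \bar{S}^{\#}_{\bar{C}_k}$ and, finally, $\mathrm{MIF}_S(\bar{C}_k)$ can be calculated from the exclusive cofactor $\bar{S}^{\#}_{\bar{C}_k}$:

$$\mathrm{MIF}_S(\bar{C}_k) = \Pr(\bar{S}^{\#}_{\bar{C}_k}) \tag{24.54}$$

On the other hand, $\frac{\partial \Pr(\bar{S})}{\partial \Pr(\bar{C}_k)} = \frac{\partial [1 - \Pr(S)]}{\partial [1 - \Pr(C_k)]} = \frac{\partial \Pr(S)}{\partial \Pr(C_k)}$, which results in:

$$\mathrm{MIF}_S(\mathrm{C}_k) = \mathrm{MIF}_S(\bar{C}_k) = \mathrm{MIF}_S(C_k) \tag{24.55}$$

Then, $\mathrm{MIF}_S(\mathrm{C}_k)$ can also be calculated as:

$$\mathrm{MIF}_S(\mathrm{C}_k) = \mathrm{MIF}_S(C_k) = \Pr[\mathrm{crit}(S, C_k)|C_k] \tag{24.56}$$

In the same way as above, $\mathrm{crit}(S, C_k) = C_k \cdot S^{\#}_{C_k}$ (Formula 24.18) in the case of a coherent system. Then $\mathrm{MIF}_S(\mathrm{C}_k)$ can also be calculated from the exclusive cofactor $S^{\#}_{C_k}$:

$$\mathrm{MIF}_S(\mathrm{C}_k) = \Pr(\bar{S}^{\#}_{\bar{C}_k}) = \Pr(S^{\#}_{C_k}) \tag{24.57}$$

Therefore, in the case of coherent systems, three different formulae are available to calculate the marginal importance factor.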

Non-coherent Systems

Andrews and Beeson were the first authors to propose a convincing formula extending the concept of MIF to non-coherent systems with regards to their components that can be either failure-critical or repair-critical. The corresponding mathematical procedure can be found in Andrews and Beeson (2003).

When dealing with a system S that is non-coherent with regards to one of its components $C_k$, the exclusive cofactor $S^{\#}_{\bar{C}_k}$ is not equal to zero. Then, in some up states (i.e. some minterms containing $\bar{C}_k$), the system can fail when component $C_k$ is repaired. Formula (24.57), which takes into account only the system failures due to the contribution of $C_k$ failures, therefore has to be extended to the contribution of $C_k$ repairs:

$$\mathrm{MIF}_S(C_k) = \Pr(S^{\#}_{C_k}) + \Pr(S^{\#}_{\bar{C}_k}) \tag{24.58}$$

But $S^{\#}_{C_k} = \bar{S}^{\#}_{\bar{C}_k}$ and $S^{\#}_{\bar{C}_k} = \bar{S}^{\#}_{C_k}$, then $\mathrm{MIF}_S(C_k)$ can be written as follows:

$$\mathrm{MIF}_S(C_k) = \Pr(\bar{S}^{\#}_{\bar{C}_k}) + \Pr(\bar{S}^{\#}_{C_k}) \tag{24.59}$$

Therefore:

$$\mathrm{MIF}_S(C_k) = \Pr(S^{\#}_{C_k} + S^{\#}_{\bar{C}_k}) = \Pr(\bar{S}^{\#}_{\bar{C}_k} + \bar{S}^{\#}_{C_k}) \tag{24.60}$$

As $\mathrm{S}^{\#}_{C_k} = S^{\#}_{C_k} + S^{\#}_{\bar{C}_k} = \bar{S}^{\#}_{\bar{C}_k} + \bar{S}^{\#}_{C_k}$ (see Formulae 24.6 and 24.7), $\mathrm{MIF}_S(C_k)$ is finally given by:

$$\mathrm{MIF}_S(C_k) = \Pr(\mathrm{S}^{\#}_{C_k}) \tag{24.61}$$

Conclusion About MIF Calculations

The above Formulae (24.58) and (24.59) provide a general procedure to calculate $\mathrm{MIF}_S(C_k)$ for both coherent and non-coherent systems. The only difference is that the probabilities of the exclusive cofactors $\Pr(S^{\#}_{\bar{C}_k})$ and $\Pr(\bar{S}^{\#}_{C_k})$ are equal to 0 for monotone functions (see 24.1.1).

The exclusive cofactors being combinations of conditional events (see 24.1.1.3), their probabilities are easy to calculate with Boolean models when BDDs are implemented (see Chap. 21).

It has to be noted that $\mathrm{MIF}_S(C_k)$ does not depend on the state of component $C_k$: this is the probability for system S to be in a critical state with regards to component $C_k$:
– if $C_k$ is in up state, $\Pr(C_k) \cdot \mathrm{MIF}_S(C_k)$ is the probability to be in a critical up state with regards to $C_k$;
– if $C_k$ is in down state, $\Pr(\bar{C}_k) \cdot \mathrm{MIF}_S(C_k)$ is the probability to be in a critical down state with regards to $C_k$.

These properties are very important because they allow the failure frequency, the Vesely failure rate and the reliability/unreliability of systems made of repaired items to be calculated not only from RBDs or FTs (Sect. 22.3) but also from Markovian models (see Chap. 31).
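The two contributions of Formula (24.58) can be illustrated on a small non-coherent function. The sketch below is an assumption-laden example: the up-state function `f`, the component names and the up-state probabilities are hypothetical, the components are assumed independent, and the failure-critical and repair-critical parts are obtained by enumerating the states of the other components.

```python
from itertools import product

NAMES = ["A", "B", "C"]
p = {"A": 0.9, "B": 0.8, "C": 0.7}  # hypothetical up-state probabilities

def f(a, b, c):
    """Hypothetical non-coherent up-state function: A·B + (not A)·C."""
    return (a and b) or ((not a) and c)

def exclusive_cofactor_probs(k):
    """Return (failure-critical probability, repair-critical probability) for component k."""
    idx = NAMES.index(k)
    others = [n for n in NAMES if n != k]
    fail_crit = repair_crit = 0.0
    for bits in product([True, False], repeat=len(others)):
        weight = 1.0
        for n, b in zip(others, bits):
            weight *= p[n] if b else 1 - p[n]
        state = list(bits)
        up_if_k_up = f(*state[:idx], True, *state[idx:])
        up_if_k_down = f(*state[:idx], False, *state[idx:])
        if up_if_k_up and not up_if_k_down:
            fail_crit += weight    # a failure of k brings the system down
        elif up_if_k_down and not up_if_k_up:
            repair_crit += weight  # a repair of k brings the system down
    return fail_crit, repair_crit

def mif(k):
    """Formula (24.58): sum of the two exclusive cofactor probabilities."""
    return sum(exclusive_cofactor_probs(k))

for k in NAMES:
    print(k, exclusive_cofactor_probs(k), mif(k))
```

For component A, the difference of the two conditional probabilities only gives 0.24 − 0.14 = 0.10, whereas the sum of the two contributions gives 0.38: this is the reason why the extension to repairs matters for non-coherent systems. For component B, with regards to which this function is monotone, the repair-critical part is zero.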

24.2.4 Lambert Importance Factor (CIF)

This factor is also known as the critical importance factor (CIF). The probability of failure/success of component $C_k$ is not taken into consideration when calculating the marginal importance factor. This is the main drawback of the MIF because, if several blocks/primary events play the same role in the logic structure, they will have similar MIFs even if their probabilities are very different: the Lambert importance factor (Vaurio 2016; Lambert 1975) has been introduced as an attempt to correct this drawback.

24.2.4.1 CIF Calculations from FTs

When coherent FTs are considered, the critical importance factor is developed from down states and it is defined as follows:

$$\mathrm{CIF}_S(\bar{C}_k) = \mathrm{MIF}_S(C_k) \cdot \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})} \tag{24.62}$$

As $\mathrm{MIF}_S(C_k) = \Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k]$, the critical importance factor can be written as:

$$\mathrm{CIF}_S(\bar{C}_k) = \frac{\Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k] \cdot \Pr(\bar{C}_k)}{\Pr(\bar{S})} = \frac{\Pr[\mathrm{crit}(\bar{S}, \bar{C}_k)]}{\Pr(\bar{S})} \tag{24.63}$$

Therefore, $\mathrm{CIF}_S(\bar{C}_k)$ is the probability for the system to be in a critical down state due to component $C_k$, normalized by the overall probability of system failure, and this normalization allows the importance of similar components belonging to different systems to be compared. When an FT is used to calculate the system unavailability, $\mathrm{MIF}_S(C_k)$ is equal to $[U_{S|\bar{C}_k}(t) - U_{S|C_k}(t)]$ and the CIF can be calculated as:

$$\mathrm{CIF}_S(\bar{C}_k, t) = \frac{[U_{S|\bar{C}_k}(t) - U_{S|C_k}(t)] \cdot U_{C_k}(t)}{U_S(t)} \tag{24.64}$$

Then, $\mathrm{CIF}_S(\bar{C}_k, t)$ can be calculated from an RBD by using the following formula:

$$\mathrm{CIF}_S(\bar{C}_k, t) = \frac{[A_{S|C_k}(t) - A_{S|\bar{C}_k}(t)] \cdot [1 - A_{C_k}(t)]}{1 - A_S(t)} \tag{24.65}$$

Therefore, $\mathrm{CIF}_S(\bar{C}_k)$ is easy to calculate when binary decision diagrams (BDDs) are implemented (see Chap. 21). As demonstrated in Sect. 24.2.2, it measures the complement to 1 of the impact of the minimal cut sets which do not involve component $C_k$.

24.2.4.2 CIF Calculations from RBDs

When RBDs are considered, the critical importance factor is developed from up states and it is defined as follows:

$$\mathrm{CIF}_S(C_k) = \mathrm{MIF}_S(C_k) \cdot \frac{\Pr(C_k)}{\Pr(S)} \tag{24.66}$$

As $\mathrm{MIF}_S(C_k) = \Pr[\mathrm{crit}(S, C_k)|C_k]$, the critical importance factor can be written as:

$$\mathrm{CIF}_S(C_k) = \frac{\Pr[\mathrm{crit}(S, C_k)|C_k] \cdot \Pr(C_k)}{\Pr(S)} = \frac{\Pr[\mathrm{crit}(S, C_k)]}{\Pr(S)} \tag{24.67}$$

Therefore, $\mathrm{CIF}_S(C_k)$ is the probability for the system to be in a critical up state, due to component $C_k$, normalized by the overall probability of system up state. Again, this normalization allows the importance of similar components belonging to different systems to be compared. When an RBD is used to calculate the system availability, $\mathrm{MIF}_S(C_k)$ is equal to $A_{S|C_k}(t) - A_{S|\bar{C}_k}(t)$ and the CIF can be calculated as:

$$\mathrm{CIF}_S(C_k) = \frac{[A_{S|C_k}(t) - A_{S|\bar{C}_k}(t)] \cdot A_{C_k}(t)}{A_S(t)} \tag{24.68}$$

Then, $\mathrm{CIF}_S(C_k)$ can be calculated from an FT by using the following formula:

$$\mathrm{CIF}_S(C_k) = \frac{[U_{S|\bar{C}_k}(t) - U_{S|C_k}(t)] \cdot [1 - U_{C_k}(t)]}{1 - U_S(t)} \tag{24.69}$$

Again, $\mathrm{CIF}_S(C_k)$ is easy to calculate when binary decision diagrams (BDDs) are implemented (see Chap. 21).

As for the Birnbaum importance factor, it has to be noted that in the case of non-coherent systems the overall importance of a component is obtained by summing its failure and repair contributions (Birnbaum 1969; Andrews and Beeson 2003; Lambert 1975; Vaurio 2016).

24.2.5 Diagnostic Importance Factor (DIF)

The diagnostic importance factor (DIF) (Dutuit and Rauzy 1999; Pagès and Gondran 1986) is not intended to rank the components with regards to the system failure but rather to rank them according to the probability that they are failed when the system is, itself, failed. Therefore, it is focused on system and component down states only. It allows the components which have to be examined first when the overall system is failed to be determined, and this is very useful to save time when diagnosing the failures, especially when the accessibility of the components is poor. The DIF has also been introduced by Vesely and Fussell and it should not be mixed up with the Vesely-Fussell importance factor analysed above. It is defined as follows for coherent RBDs/FTs:

$$\mathrm{DIF}_S(C_k) = \Pr[\bar{C}_k|\bar{S}] \tag{24.70}$$

No similar importance factor with regards to up states has been introduced because diagnosing which components are more likely to be in up state when the system is in up state is of very limited interest. This is why the DIF has been written with regards to the system (S) and its components ($C_k$) rather than to their down states ($\bar{S}$ and $\bar{C}_k$).

It has to be noted that $\Pr(\bar{C}_k|\bar{S}) = \Pr(\bar{S}|\bar{C}_k) \cdot \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})}$. Then the DIF can be written as:

$$\mathrm{DIF}_S(C_k) = \Pr(\bar{S}|\bar{C}_k) \cdot \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})} \tag{24.71}$$

But $\bar{S}|\bar{C}_k = \mathrm{crit}(\bar{S}, \bar{C}_k)|\bar{C}_k + \overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)|\bar{C}_k$ according to Formula (24.25) and then:

$$\mathrm{DIF}_S(C_k) = \frac{\Pr[\mathrm{crit}(\bar{S}, \bar{C}_k) + \overline{\mathrm{crit}}(\bar{S}, \bar{C}_k)]}{\Pr(\bar{S})} \tag{24.72}$$

Therefore, $\mathrm{DIF}_S(C_k)$ is the probability, normalized by $\Pr(\bar{S})$, for the system to be in a critical or a non-critical down state involving component $C_k$ (i.e. the sum of the probabilities of all the minterms containing $\bar{C}_k$). This is in fact the probability, normalized by the overall probability of system failure, to be in any down state involving $\bar{C}_k$. Therefore, this indicator gives the same weight to non-critical failures and critical failures involving $C_k$. This is why it is useful from a maintenance diagnostic point of view. When an FT is used to calculate the system unavailability, $U_S(t)$, the DIF can be calculated as:

$$\mathrm{DIF}_S(C_k) = \frac{U_{S|\bar{C}_k}(t) \cdot U_{C_k}(t)}{U_S(t)} \tag{24.73}$$

It can also be calculated from an RBD providing the system availability:

$$\mathrm{DIF}_S(C_k) = \frac{[1 - A_{S|\bar{C}_k}(t)] \cdot [1 - A_{C_k}(t)]}{1 - A_S(t)} \tag{24.74}$$

Therefore, $\mathrm{DIF}_S(C_k)$ is easy to calculate when binary decision diagrams (BDDs) are implemented (see Chap. 21).
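The definition of Eq. (24.70) and the Bayes form of Eq. (24.71) can be compared numerically, and the resulting ranking used as an inspection order. The sketch below relies on assumptions: a hypothetical 2-out-of-3 system with independent components and made-up failure probabilities, evaluated by exhaustive enumeration instead of BDDs.

```python
from itertools import product
from math import prod

NAMES = ["A", "B", "C"]
q = {"A": 0.01, "B": 0.02, "C": 0.05}  # hypothetical failure probabilities

def system_down(failed):
    """Hypothetical 2-out-of-3 system: down when at least two components are down."""
    return len(failed) >= 2

def pr(event, probs):
    """Probability of `event(failed)` by enumeration of all component states."""
    total = 0.0
    for bits in product([True, False], repeat=len(NAMES)):
        failed = {n for n, b in zip(NAMES, bits) if b}
        if event(failed):
            total += prod(probs[n] if n in failed else 1 - probs[n] for n in NAMES)
    return total

def dif_joint(k):
    """Eq. (24.70): Pr(k down and system down) / Pr(system down)."""
    return pr(lambda f: k in f and system_down(f), q) / pr(system_down, q)

def dif_bayes(k):
    """Eq. (24.71): Pr(system down | k down) * Pr(k down) / Pr(system down)."""
    return pr(system_down, {**q, k: 1.0}) * q[k] / pr(system_down, q)

ranking = sorted(NAMES, key=dif_joint, reverse=True)
print(ranking, [round(dif_joint(k), 4) for k in ranking])
```

With these values, component C is the most likely to be found failed when the system is down and would be examined first.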

24.2.6 Risk Achievement Worth (RAW), Risk Reduction Worth (RRW)

Two other importance factors which are widely used in risk analysis are the risk achievement worth (RAW) and the risk reduction worth (RRW). They are also known as the risk increase and risk decrease importance factors and are therefore focused on system and component down states. No equivalent terms exist from an up state point of view. Both of them have been initially defined by using the risk analysis terminology (Vesely and Davis 1985), but here they are presented in the framework of the safety and dependability terminology. In these conditions, the RAW is defined as follows for coherent RBDs/FTs:

$$\mathrm{RAW}_S(C_k) = \frac{\Pr[\bar{S}|\bar{C}_k]}{\Pr(\bar{S})} \tag{24.75}$$

As $\Pr[\bar{S}|\bar{C}_k] > \Pr(\bar{S})$, $\mathrm{RAW}_S(C_k) > 1$ and the difference $\Pr[\bar{S}|\bar{C}_k] - \Pr(\bar{S})$ is positive. This is an indicator of how much the system failure probability (assimilated to the risk) increases when component $C_k$ actually fails.

As $\mathrm{DIF}_S(C_k) = \Pr[\bar{S}|\bar{C}_k] \cdot \frac{\Pr(\bar{C}_k)}{\Pr(\bar{S})}$, $\mathrm{RAW}_S(C_k)$ is linked to $\mathrm{DIF}_S(C_k)$ by the following equation:

$$\mathrm{RAW}_S(C_k) = \frac{\mathrm{DIF}_S(C_k)}{\Pr(\bar{C}_k)} \tag{24.76}$$

The RAW is therefore an indicator of how important it is to maintain the current level of the up state probability of component $C_k$ and, as mentioned in Vesely and Davis (1985), "the components having the highest risk achievement worths are of particular interest for risk assurance programs, quality assurance programs, and inspection activities".

The RRW is defined as follows for coherent RBDs/FTs:

$$\mathrm{RRW}_S(C_k) = \frac{\Pr[\bar{S}|C_k]}{\Pr(\bar{S})} \tag{24.77}$$

As $\Pr[\bar{S}|C_k] < \Pr(\bar{S})$, $\mathrm{RRW}_S(C_k) < 1$ and the difference $\Pr[\bar{S}|C_k] - \Pr(\bar{S})$ is negative. This is an indicator of how much the system failure probability (assimilated to the risk) decreases if component $C_k$ does not fail. It may be used to select the best component candidates to improve the up state system probability.

As $\bar{S}|C_k = \overline{\mathrm{crit}}(\bar{S}, C_k)|C_k$ (see Formula 24.26), $\mathrm{RRW}_S(C_k)$ can be written as:

$$\mathrm{RRW}_S(C_k) = \frac{\Pr[\overline{\mathrm{crit}}(\bar{S}, C_k)|C_k]}{\Pr(\bar{S})} \tag{24.78}$$

Therefore, $\mathrm{RRW}_S(C_k)$ is also the probability, normalized by $\Pr(\bar{S})$, to be in a non-critical state regardless of the state of $C_k$. It has to be noted that the RRW is often defined as the inverse of Eq. (24.77):

$$\mathrm{RRW}'_S(C_k) = \frac{\Pr(\bar{S})}{\Pr[\bar{S}|C_k]} \tag{24.79}$$

Of course, $\mathrm{RRW}'_S(C_k)$ ranks the events in the reverse order to $\mathrm{RRW}_S(C_k)$. Then, the components having the highest $\mathrm{RRW}'$ (and therefore the lowest $\mathrm{RRW}$) are of particular interest for risk reduction efforts (once again see Vesely and Davis 1985).

When an FT is used to calculate the system unavailability, $U_S(t)$, or an RBD to calculate the system availability, $A_S(t)$, the RAW and the RRW can be calculated as follows:

$$\mathrm{RAW}_S(C_k) = \frac{U_{S|\bar{C}_k}(t)}{U_S(t)} = \frac{1 - A_{S|\bar{C}_k}(t)}{1 - A_S(t)} \tag{24.80}$$

$$\mathrm{RRW}_S(C_k) = \frac{U_{S|C_k}(t)}{U_S(t)} = \frac{1 - A_{S|C_k}(t)}{1 - A_S(t)} \tag{24.81}$$

Again, $\mathrm{RAW}_S(C_k)$ and $\mathrm{RRW}_S(C_k)$ or $\mathrm{RRW}'_S(C_k)$ are easy to calculate when binary decision diagrams (BDDs) are implemented (see Chap. 21).


24.2.7 Differential Importance Measure (DIM)

The differential importance measure (DIM) designed by Borgonovo and Apostolakis (2001) belongs to the family of additive importance measures, which also comprises those developed by Lemaire (1999) and Barlow and Proschan (1975). The original idea behind this importance factor is to consider an overall system dependability feature, $Q$ (e.g. reliability, availability, failure frequency), as a function of the various parameters, $x_i$ (e.g. reliability, availability, failure frequency, failure rate), of its components and to evaluate the impact of a small variation $dx_i$ on the value of $Q$. The DIM dedicated to basic events or components is defined for coherent RBDs/FTs as follows: $dQ_i = \frac{\partial Q}{\partial x_i} \cdot dx_i$ measures the impact on $Q$ of a small variation of a parameter $x_i$, and $dQ = \sum_j dQ_j$ measures the total variation of $Q$ with regard to all the component parameters. From these results, $\mathrm{DIM}(Q, x_i)$ is defined as follows:

$$\mathrm{DIM}(Q, x_i) = \frac{dQ_i}{dQ} = \frac{\frac{\partial Q}{\partial x_i} \cdot dx_i}{\sum_j \frac{\partial Q}{\partial x_j} \cdot dx_j} \qquad (24.82)$$

By construction, $\mathrm{DIM}(Q, x_i)$ is an additive importance factor and:

$$\mathrm{DIM}(Q, x_i, x_j) = \mathrm{DIM}(Q, x_i) + \mathrm{DIM}(Q, x_j) \qquad (24.83)$$

As defined above, $\mathrm{DIM}(Q, x_i)$ is difficult to handle and calculate. This is why two different assumptions are generally made in order to use it: the uniform variation (assumption H1) and the relative variation (assumption H2).

Uniform variation, H1

This consists in applying the same small variation $dx_i = dx_j\ \forall\, i, j$ to all the parameters involved in the calculation of $Q$. Then Eq. (24.82) can be simplified to:

$$\mathrm{DIM}^{H1}(Q, x_i) = \frac{\frac{\partial Q}{\partial x_i}}{\sum_j \frac{\partial Q}{\partial x_j}} \qquad (24.84)$$

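When $Q$ is the probability of system failure, $\Pr(S)$, and $x_i$ is a component failure probability, $\Pr(\overline{C}_k)$, then the DIM can be calculated from the marginal importance factors (MIFs) defined in Sect. 24.2.3:

$$\mathrm{DIM}_S^{H1}(\overline{C}_k) = \frac{\frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_k)}}{\sum_j \frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_j)}} = \frac{\mathrm{MIF}_S(C_k)}{\sum_j \mathrm{MIF}_S(C_j)} \qquad (24.85)$$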

24.2 Importance Factors


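Relative variation (percentage), H2

This consists in applying the same small relative variation $\frac{dx_i}{x_i} = \frac{dx_j}{x_j}\ \forall\, i, j$ to all the parameters involved in the calculation of $Q$. Then, replacing $dx_i$ by $\frac{dx_i}{x_i} \cdot x_i\ \forall\, i$ in Eq. (24.82) gives:

$$\mathrm{DIM}^{H2}(Q, x_i) = \frac{\frac{\partial Q}{\partial x_i} \cdot x_i}{\sum_j \frac{\partial Q}{\partial x_j} \cdot x_j} \qquad (24.86)$$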

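When $Q$ is the probability of system failure, $\Pr(S)$, and $x_i$ is the component failure probability, $\Pr(\overline{C}_k)$, then the DIM can be calculated from the critical importance factors (CIFs) defined in Sect. 24.2.4:

$$\mathrm{DIM}_S^{H2}(\overline{C}_k) = \frac{\frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_k)} \cdot \Pr(\overline{C}_k)}{\sum_j \frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_j)} \cdot \Pr(\overline{C}_j)} = \frac{\frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_k)} \cdot \frac{\Pr(\overline{C}_k)}{\Pr(S)}}{\sum_j \frac{\partial \Pr(S)}{\partial \Pr(\overline{C}_j)} \cdot \frac{\Pr(\overline{C}_j)}{\Pr(S)}} = \frac{\mathrm{CIF}_S(C_k)}{\sum_j \mathrm{CIF}_S(C_j)} \qquad (24.87)$$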

Therefore, in the above particular cases, the DIM can be calculated by using the MIFs or the CIFs which, in turn, involve conditional probabilities obtained through BDD calculations. It has to be noted that the DIM can be applied to any component parameters (see Borgonovo and Apostolakis 2001) and that it generalizes the usage of the MIF via Formula (24.85) and the usage of the CIF via Formula (24.87).
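As a minimal sketch of Formulae (24.85) and (24.87), the normalisation can be written in a few lines of Python. The MIF and CIF values are those reported for the illustrative example of Sect. 24.2.9 (Tables 24.5 and 24.6); the function name `normalise` and the dictionary layout are illustrative choices, not part of any tool mentioned in the text.

```python
# MIF values (Table 24.6) and CIF values (Table 24.5) of the illustrative example.
MIF = {"V1": 0.99977353, "P3": 0.02070274, "V3": 0.02051742,
       "P1": 0.0107875, "P2": 0.0107875, "V2": 0.01069093}
CIF = {"V1": 0.814148185, "P3": 0.168791729, "P1": 0.087951629,
       "P2": 0.087951629, "V3": 0.016708006, "V2": 0.008705973}


def normalise(factors):
    """Divide each importance factor by the sum over all components."""
    total = sum(factors.values())
    return {name: value / total for name, value in factors.items()}


dim_h1 = normalise(MIF)  # uniform variation, Formula (24.85)
dim_h2 = normalise(CIF)  # relative variation, Formula (24.87)

for name in MIF:
    print(f"{name}: DIM_H1 = {dim_h1[name]:.4f}, DIM_H2 = {dim_h2[name]:.4f}")

# Additivity, Eq. (24.83): the DIM of a group is the sum of the individual DIMs.
dim_h1_pumps = dim_h1["P1"] + dim_h1["P2"] + dim_h1["P3"]
print(f"DIM_H1 of the three pumps = {dim_h1_pumps:.4f}")
```

Because each set of values is divided by its own sum, the DIMs add up to 1 under either assumption, which is what makes the measure additive.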

24.2.8 Barlow-Proschan Importance Factor (BPIF)

Introduced by Barlow and Proschan (1975), this importance factor has been explicitly defined for systems made of repaired items. This distinguishes it from the other importance factors, for which this assumption is not mandatory. It represents the probability that component $C_k$ caused the system failure, given that the system failed at time $t$. The Barlow-Proschan importance factor is defined as follows for coherent RBDs/FTs:

$$\mathrm{BPIF}_S(C_k) = \frac{\mathrm{MIF}_S(C_k, t) \cdot dN_k(t)}{\sum_i \mathrm{MIF}_S(C_i, t) \cdot dN_i(t)} \qquad (24.88)$$

where Nk (t) is the number of failures over [0, t] of component Ck . Therefore, dNk (t) = Nk (t + dt) − Nk (t) = wk (t) · dt is the probability that Ck fails between t and t + dt given that it was as good as new at t = 0. Then wk (t) is the unconditional


failure intensity of $C_k$ (see Chap. 4). Replacing $dN_k(t)$ by $w_k(t) \cdot dt$ and simplifying by $dt$ leads to the following formula:

$$\mathrm{BPIF}_S(C_k) = \frac{\mathrm{MIF}_S(C_k, t) \cdot w_k(t)}{\sum_i \mathrm{MIF}_S(C_i, t) \cdot w_i(t)} \qquad (24.89)$$

In the above expression, the numerator represents the contribution of component $C_k$ to the system failure frequency, and the denominator represents the overall system failure frequency (i.e. the unconditional failure intensity of system $S$). Therefore, the BPIF ranks the components according to the contributions of their failure frequencies. Like the DIM introduced above (see 24.2.7), an important feature of this importance factor is its additivity:

$$\mathrm{BPIF}_S(C_k, C_q) = \mathrm{BPIF}_S(C_k) + \mathrm{BPIF}_S(C_q) \qquad (24.90)$$

An advantage of the BPIF over the DIM is that it is not necessary to choose between two assumptions like H1 and H2 (see 24.2.7) to use it.
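Formula (24.89) can be sketched as follows. The MIF values are those of Table 24.6 in the illustrative example of Sect. 24.2.9. The failure intensities are not given in the text: the ratio used here (pumps failing ten times as often as valves) is an assumption made for illustration only, and only the ratio matters since any common factor cancels out. With this assumed ratio the results come out close to the BP column of Table 24.5, but this should not be read as a reproduction of the original calculation.

```python
# MIF values from Table 24.6 (asymptotic values of the illustrative example).
MIF = {"V1": 0.99977353, "P3": 0.02070274, "V3": 0.02051742,
       "P1": 0.0107875, "P2": 0.0107875, "V2": 0.01069093}

# ASSUMPTION: relative unconditional failure intensities w_k (arbitrary unit).
# Pumps are assumed to fail ten times as often as valves.
W = {"V1": 1.0, "V2": 1.0, "V3": 1.0, "P1": 10.0, "P2": 10.0, "P3": 10.0}


def barlow_proschan(mif, intensity):
    """BPIF of each component: MIF_k * w_k divided by the sum over all components."""
    contributions = {name: mif[name] * intensity[name] for name in mif}
    system_frequency = sum(contributions.values())
    return {name: c / system_frequency for name, c in contributions.items()}


bpif = barlow_proschan(MIF, W)
for name, value in sorted(bpif.items(), key=lambda item: -item[1]):
    print(f"{name}: BPIF = {value:.4f}")
```

The denominator plays the role of the system failure frequency, so the BPIFs sum to 1 and the additivity of Eq. (24.90) follows directly.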

24.2.9 Application and Remarks About Importance Factors

24.2.9.1 Illustrative Example

The fault tree illustrated in Fig. 24.5 has been built in Chap. 16 to model the probability of failure of a simple pumping system. It has been qualitatively analysed in Chap. 17 and a semi-quantitative analysis has been proposed in Chap. 20. The development of the importance factors is an opportunity to proceed to more accurate quantitative analyses.

Fig. 24.5 Fault tree used for importance factor illustration


Table 24.4 Primary events ranking according to the Vesely-Fussell importance factor

Comp.   Approximated VF   Lower bound (Binf)   Exact VF    Upper bound (Bsup = DIF)
V1      0.8131833         0.8143326            0.8143326   0.8143326
P3      0.1698519         0.1689629            0.1691272   0.1770497
P1      0.0889654         0.0880495            0.0890107   0.0970128
P2      0.0889654         0.0880495            0.0890107   0.0970128
V3      0.0169648         0.0167280            0.0168924   0.0176837
V2      0.0088859         0.0087165            0.0088904   0.0096896

In this example, the probabilities of the primary events are modelled by simple Markov processes (see Chap. 31) with two states (up and down) and two transitions (failure and repair). The failures are modelled by failure rates, $\lambda_i$, and the repairs by repair rates, $\mu_i$. The same parameters ($\lambda_V$ and $\mu_V$) are used for the three valves, V1, V2 and V3, and the same parameters ($\lambda_P$ and $\mu_P$) are used for the three pumps, P1, P2 and P3. They have been chosen so that the asymptotic values $\lambda_V/(\lambda_V + \mu_V) = 10^{-3}$ and $\lambda_P/(\lambda_P + \mu_P) = 10^{-2}$ are the same as those used in Chap. 20 for the semi-quantitative analyses. Calculations have been performed for a time long enough to reach the asymptotic values, using the software package GRIF-Workshop (2020), which implements these calculations.
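For reference, the unavailability of such a two-state Markov component has a well-known closed form, $U(t) = U_\infty + (P_0 - U_\infty)\,e^{-(\lambda+\mu)t}$ with $U_\infty = \lambda/(\lambda+\mu)$. The sketch below evaluates it; the numerical rates are hypothetical and were chosen only so that the asymptotic value equals $10^{-3}$, since the actual rates used in the example are not given in the text.

```python
import math


def unavailability(t, lam, mu, p0=0.0):
    """Unavailability U(t) of a two-state (up/down) Markov component.

    lam: failure rate, mu: repair rate,
    p0: probability of being in the down state at t = 0.
    """
    u_inf = lam / (lam + mu)  # asymptotic unavailability
    return u_inf + (p0 - u_inf) * math.exp(-(lam + mu) * t)


# Hypothetical valve rates (per hour), chosen so that lam / (lam + mu) = 1e-3.
mu_v = 0.1
lam_v = mu_v * 1e-3 / (1.0 - 1e-3)

for hours in (0.0, 10.0, 100.0, 1000.0):
    print(f"t = {hours:6.0f} h  U(t) = {unavailability(hours, lam_v, mu_v):.6e}")
```

The time needed to reach the asymptotic value is governed by $1/(\lambda+\mu)$, which is why a "long enough" calculation time is sufficient to recover the values used in Chap. 20.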

Vesely-Fussell Example

Table 24.4 shows the results obtained by applying the formulae developed in Sect. 24.2.2. In addition, the exact value $\mathrm{VF}_S(\overline{C}_k) = \Pr(\cup_i Cs_{ik})/\Pr(S)$ has been calculated and added to the table. As expected, the exact value lies between the lower and the upper bounds determined in Sect. 24.2.2. Although the approximated value $\sum_i \Pr(Cs_{ik}) / \sum_n \Pr(Cs_n)$ is also lower than the upper bound, it can be lower (V1, P1, P2, V2) or greater (P3, V3) than the exact value and even lower than the lower bound (V1). Fortunately, the numerical values are very close: the approximation is slightly optimistic for V1, P1, P2 and V2 and slightly pessimistic for P3 and V3. Therefore, no general conclusion can be drawn from the comparison between the approximated and the exact values except that, as expected, the numerical results are very close.

The ranking of Table 24.4 is in accordance with the semi-quantitative analysis previously undertaken in Chap. 20, and this consolidates the opinion that V1 has to be considered first and P3 second. This table also shows that V2 has the smallest impact. The impacts of the various components are rather well discriminated, and P1 and P2, which have the same reliability parameters and similar positions within the FT structure, have the same impact.
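The difference between the exact Vesely-Fussell factor, $\Pr(\cup_i Cs_{ik})/\Pr(S)$, and the ratio of sums $\sum_i \Pr(Cs_{ik}) / \sum_n \Pr(Cs_n)$ can be explored with a small brute-force sketch. The list of minimal cut sets below is a hypothetical one chosen for illustration; it is not a transcription of Fig. 24.5. The event probabilities are the asymptotic values quoted above, and the events are assumed independent.

```python
from itertools import product

# Probabilities of the primary events (asymptotic values quoted in the text).
P_FAIL = {"V1": 1e-3, "V2": 1e-3, "V3": 1e-3, "P1": 1e-2, "P2": 1e-2, "P3": 1e-2}

# ASSUMPTION: hypothetical minimal cut sets, for illustration only.
CUT_SETS = [{"V1"}, {"P3", "P1"}, {"P3", "P2"}, {"P3", "V2"},
            {"V3", "P1"}, {"V3", "P2"}, {"V3", "V2"}]

NAMES = sorted(P_FAIL)


def prob_union(cut_sets):
    """Exact probability that at least one cut set is fully failed,
    obtained by enumerating every combination of component states."""
    total = 0.0
    for state in product([False, True], repeat=len(NAMES)):
        failed = {name for name, is_failed in zip(NAMES, state) if is_failed}
        if any(cs <= failed for cs in cut_sets):
            p = 1.0
            for name, is_failed in zip(NAMES, state):
                p *= P_FAIL[name] if is_failed else 1.0 - P_FAIL[name]
            total += p
    return total


def prob_cut_set(cut_set):
    """Probability of one cut set (product of independent event probabilities)."""
    p = 1.0
    for name in cut_set:
        p *= P_FAIL[name]
    return p


def vesely_fussell(component):
    """Return (exact value, ratio-of-sums approximation) for one component."""
    containing = [cs for cs in CUT_SETS if component in cs]
    exact = prob_union(containing) / prob_union(CUT_SETS)
    approx = sum(map(prob_cut_set, containing)) / sum(map(prob_cut_set, CUT_SETS))
    return exact, approx


for name in NAMES:
    exact, approx = vesely_fussell(name)
    print(f"{name}: exact = {exact:.7f}  approximation = {approx:.7f}")
```

State enumeration is only practical for a handful of components; for larger models the text relies on BDD-based calculations instead.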


Table 24.5 Comparison between DIF, CIF, both forms of the RRW and BP

Component   DIF          CIF           RRW (inverse form, Eq. 24.79)   RRW (Eq. 24.77)   BP
V1          0.81433261   0.814148185   5.38063079                      0.18585182        0.6875243
P3          0.17704975   0.168791729   1.20306791                      0.83120827        0.14250516
P1          0.09701279   0.087951629   1.09643307                      0.91204837        0.07425459
P2          0.09701279   0.087951629   1.09643307                      0.91204837        0.07425459
V3          0.01768373   0.016708006   1.01699191                      0.98329199        0.01410942
V2          0.00968964   0.008705973   1.00878243                      0.99129403        0.00735194

Table 24.6 Comparison between RAW and MIF

Component   RAW          MIF
V1          820.648849   0.99977353
P3          17.8208905   0.02070274
V3          17.8208905   0.02051742
P1          9.76479388   0.0107875
P2          9.76479388   0.0107875
V2          9.76479388   0.01069093

Other Examples Related to Importance Factors

The results for the DIF, CIF, RRW and BPIF have been gathered in Table 24.5 because these importance factors rank the components in the same order as the Vesely-Fussell importance factor analysed above. Again:

– the impacts of the various components are rather well discriminated;
– V1 is in first position, P3 in second position and V2 in last position;
– P1 and P2 show the same contributions;
– the numerical values provided by the DIF, CIF and BP are close to those provided by the Vesely-Fussell importance factor but rather different from those provided by the RRW.

Both forms of the RRW appear in the table: the inverse form of Eq. (24.79), which is greater than 1, and the form of Eq. (24.77), which is lower than 1. The latter has been included in order to show that it satisfies RRW + CIF = 1 in every case. This property may be useful for calculating the RRW from the CIF and vice versa.

The results for the RAW and the MIF have been gathered in Table 24.6 because these importance factors rank the components in a rather different order from the Vesely-Fussell importance factor analysed above:

– again, V1 is in first position, P3 in second position, V2 in last position, and P1 and P2 show the same contributions;
– but the RAW and MIF are less discriminating because P1, P2 and V2 (or P3 and V3) have the same value for the RAW and almost the same value for the MIF: this is due to the fact that they play the same role in the logic of the fault tree.
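The two relations between the columns of Table 24.5 can be checked numerically. The short sketch below uses the tabulated values only; the variable names are illustrative.

```python
# Values copied from Table 24.5:
# (CIF, RRW in the inverse form of Eq. 24.79, RRW in the form of Eq. 24.77).
TABLE_24_5 = {
    "V1": (0.814148185, 5.38063079, 0.18585182),
    "P3": (0.168791729, 1.20306791, 0.83120827),
    "P1": (0.087951629, 1.09643307, 0.91204837),
    "P2": (0.087951629, 1.09643307, 0.91204837),
    "V3": (0.016708006, 1.01699191, 0.98329199),
    "V2": (0.008705973, 1.00878243, 0.99129403),
}

residuals = {}
for name, (cif, rrw_inverse, rrw) in TABLE_24_5.items():
    sum_gap = rrw + cif - 1.0               # should vanish: RRW + CIF = 1
    product_gap = rrw * rrw_inverse - 1.0   # should vanish: the two forms are inverses
    residuals[name] = (sum_gap, product_gap)
    print(f"{name}: RRW + CIF - 1 = {sum_gap:+.1e}, RRW * inverse - 1 = {product_gap:+.1e}")
```

The remaining gaps are of the order of the rounding of the printed digits, so in practice only one of the three columns needs to be calculated.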


Table 24.7 Ranking of importance factor values for a given component (each column lists the factors in decreasing order of value)

V1            P3            P1, P2        V2             V3
RAW 820.64    RAW 17.821    RAW 9.7648    RAW 9.76479    RAW 17.821
RRW 5.3806    RRW 1.2031    RRW 1.09648   RRW 1.00878    RRW 1.0170
MIF 0.999     DIF 0.1770    DIF 0.09708   MIF 0.01069    MIF 0.02050
DIF 0.8143    VF 0.1699     VF 0.0899     DIF 0.00969    DIF 0.01768
CIF 0.8141    CIF 0.1688    CIF 0.0879    VF 0.00889     VF 0.01696
VF 0.8132     BP 0.1425     BP 0.0743     CIF 0.00871    CIF 0.01671
BP 0.6875     MIF 0.0207    MIF 0.0108    BP 0.00735     BP 0.01411

24.2.9.2 Final Remarks About Importance Factors

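Table 24.7 ranks the various importance factors for each primary event of the fault tree represented in Fig. 24.5. Except for the MIF and the Vesely-Fussell importance factors, the ranking is homogeneous and, for the example, this gives:

$$\mathrm{RAW}_S(\overline{C}_k) \ge \mathrm{RRW}_S(\overline{C}_k) \ge \mathrm{DIF}_S(\overline{C}_k) \ge \mathrm{VF}_S(\overline{C}_k) \ge \mathrm{CIF}_S(C_k) \ge \mathrm{BPIF}_S(C_k) \qquad (24.91)$$

where the RRW is taken in the inverse form of Eq. (24.79). This is in accordance with the ranking in the general case, which is given hereafter:

$$\mathrm{DIF}_S(\overline{C}_k) \ge \mathrm{VF}_S(\overline{C}_k) \ge \mathrm{CIF}_S(C_k) \qquad (24.92)$$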

The MIF moves from the third position (V1, V2, V3) to the fifth position (P1, P2, P3) and the Vesely-Fussell importance factor moves from the third position (P1, P2, P3, V2, V3) to the sixth position (V1); no systematic position with regard to the other importance factors can be identified. Among the importance factors, only the Vesely-Fussell importance factor is tractable by hand, and only when the number of minimal cut sets is not too high. When large RBDs/FTs are involved, the importance factors should be calculated by using an RBD or FT software package. This is why the above example has been processed with the GRIF-Workshop (2020) software package, which implements BDD calculations and directly provides the MIF, CIF, DIF, RAW, RRW and BPIF related to the primary events of an FT, as well as the minimal cut sets and the conditional probabilities. Paradoxically, the Vesely-Fussell importance factor is difficult to calculate when large RBDs/FTs are involved, and only a framing by a lower and an upper bound is achievable.


24.3 Associated Exercise

One exercise related to this chapter is proposed in Chap. 29:

• Exercise 24.1: calculate the various importance factors related to the items belonging to an overpressure protection system.

References

Aliee H, Borgonovo E, Glass M, Teich J (2017) On the Boolean extension of the Birnbaum importance to non-coherent systems. Reliab Eng Syst Saf 160:191–200. Elsevier
Andrews JD, Beeson S (2003) Birnbaum's measure of component importance for non-coherent systems. IEEE Trans Reliab 52(2):213–219. IEEE
Barlow RE, Proschan F (1975) Importance of system components and fault tree events. In: Stochastic processes and their applications 3. North-Holland Publishing Company, Elsevier, pp 153–173
Beeson S, Andrews JD (2003) Importance measures for non-coherent-system analysis. IEEE Trans Reliab 52(3):301–310. IEEE
Birnbaum ZW (1969) On the importance of different components in a multicomponent system. In: Krishnaiah PR (ed) Multivariate analysis II. Academic Press, New York, pp 581–592
Borgonovo E, Apostolakis GE (2001) A new importance measure for risk informed decision making. Reliab Eng Syst Saf 72:193–212. Elsevier
Dutuit Y, Rauzy A (1999) New algorithms to compute importance factors CPr, MIF, CIF, DIF, RAW, RRW. In: Proceedings of the European safety and reliability association conference, ESREL'99, vol 2. A.A. Balkema, pp 1015–1020. ISBN 90 5809111 2
Dutuit Y, Rauzy A (2014) Importance factors of coherent system: a review. Proc IMechE Part O J Risk Reliab 228(3):313–323. Sage
Dutuit Y, Lemaire O, Rauzy A (2000) New insight on measures of importance of components and systems in fault tree analysis. In: Proceedings of the international conference on probabilistic safety assessment and management (PSAM'5), Osaka. Universal Academy Press, pp 729–734
Fussell JB (1975) How to hand-calculate system reliability and safety characteristics. IEEE Trans Reliab R-24(3):169–174. IEEE
GRIF-Workshop (2020) Boolean module. Funded and developed by TOTAL. http://grif-workshop.fr/. Accessed Sept 2020
Kovalenko IN, Kuznetsov NY, Pegg PA (1997) Mathematical theory of reliability of time dependent systems with practical applications. In: Wiley series in probability and statistics. Wiley, New York. ISBN 06471-95060-2
Kuo W, Zhu X (2012) Importance measures in reliability, risk and optimization. Principles and applications. Wiley, Hoboken. ISBN 978-1-119-99344-5
Lambert HE (1975) Measures of importance of events and cut sets in fault trees. In: Barlow RE, Fussell JB, Singpurwalla ND (eds) Reliability and fault tree analysis. SIAM Press, Philadelphia, pp 77–100
Lemaire O (1999) Importance and contribution factor for systems. In: Proceedings of the ESREL conference 1999. Balkema, München-Garching, pp 1147–1151
Misra KB (1992) Reliability analysis prediction: a methodology oriented treatment. In: Fundamental studies in engineering, n°15. Elsevier, pp 767–768
Ou Y, Dugan JB (2000) Sensitivity analysis of modular dynamic fault trees. In: Proceedings of the IEEE international computer performance and dependability symposium, Chicago. IEEE. https://doi.org/10.1109/IPDS.2000.839462. ISBN 067695-0553-8
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer
Singh C (1981) Rules for calculating the time-specific frequency of system failure. IEEE Trans Reliab R-30(4):364–366. IEEE
Vaurio JK (2010) Ideas and development in importance measures and fault-tree techniques for reliability and risk analysis. Reliab Eng Syst Saf 95:99–107. Elsevier
Vaurio JK (2016) Importance of components and events in non-coherent systems and risk models. Reliab Eng Syst Saf 147:117–122. Elsevier
Vesely WE (1970) A time dependent methodology for fault tree evaluation. Nucl Eng Des 13(2):337–360. Elsevier
Vesely WE (1996) The use of risk importance for risk-based applications and risk-based regulations. In: Proceedings of the international topical meeting on probabilistic safety assessment, PSA'96, Park City, Utah. American Nuclear Society, La Grange Park, pp 1623–1631. ISBN 9780894486210
Vesely WE, Davis TC (1985) Two measures of risk importance and their application. Nucl Technol J 68(2):226–234. Taylor & Francis Group
Vesely WE, Goldberg FF, Roberts NH, Haasl DF (1981) Fault tree handbook. NUREG-0492. Nuclear Regulatory Commission
Zaitseva E, Levashenko V (2013) Importance analysis by logical calculus. Autom Remote Control 74:171–182. Springer
Zaitseva E, Levashenko V, Kostolny J (2015) Importance analysis based on logical differential calculus and binary decision diagram. Reliab Eng Syst Saf 138:135–144. Elsevier
Zhang HL, Zhang CH, Liu D, Xie GW (2010) Importance measure method for dynamic fault tree based on isomorphic node. In: Proceedings on the international conference on information computing and applications (ICICA). Springer, pp 9–16

Chapter 25

Uncertainty Handling with RBDs and FTs

25.1 Introduction

Probabilistic calculations are mainly performed by considering the reliability parameters (e.g. the component failure or repair rates) as point values theoretically known with perfect accuracy. However, these parameters are generally estimated from statistical data samples collected from installations in actual operation (field feedback, see Chap. 38). This makes it possible to perform statistical estimations (e.g. maximum likelihood estimations) and/or to build histograms from which average values, standard deviations or even full distributions can be estimated. The accuracy of these estimations is directly linked to the amount of accumulated experience (e.g. the accumulated observation time and the number of observed failures or repairs) and is not perfect. Therefore, the reliability parameters should be considered as random variables rather than simple point values in order to measure the impact of the uncertainties of the component parameters on the uncertainty of the probabilistic results of the overall system.

The present chapter focuses on uncertainty handling with RBDs and FTs, but a general analysis of data collection, estimation and uncertainty modelling can be found in Chap. 38. Taking uncertainty into consideration is mentioned in some standards like IEC 61025 (in progress) and ISO/TR 12489 (2013) and it is, in certain cases, mandatory for safety instrumented systems when implementing calculations related to the IEC 61508 (2010) standard. This is analysed in detail in Chap. 36.

Concerning RBDs or FTs, the estimation of the impact of input data uncertainties on probabilistic results is tractable using analytical calculations only in very specific and simple situations (e.g. components organized in series). Therefore, in the general case, as developed in this chapter, Monte Carlo simulation (see Sect. 32.5) has to be used instead.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_25


25.2 Principle and Application to Non-correlated Events

The principle of calculation is illustrated in Fig. 25.1 for an FT related to a redundant system (parallel structure) made of two components A and B, which has already been analysed in Chap. 22, Fig. 22.5. This is an example of an FT-driven Markov process as described in Chap. 27, where the primary event probabilities (unavailabilities) are modelled by simple individual Markov processes involving three parameters: initial condition, $P_0$, failure rate, $\lambda$, and repair rate, $\mu$. In addition, the calculations illustrated in Fig. 25.1 are performed with the following assumptions:

• The two components A and B are similar, i.e. $\lambda_A = \lambda_B$ and $\mu_A = \mu_B$.
• They are in the up state at $t = 0$, i.e. $P_{0,A} = P_{0,B} = 0$.
• There is no uncertainty on the repair rates $\mu_A$ and $\mu_B$.
• The failure rates $\lambda_A$ and $\lambda_B$ have similar log-normal distributions with an error factor of $q_{5\%} = 3$ (see Chap. 38), which are drawn in dotted lines at the bottom of Fig. 25.1.

The log-normal distribution has been chosen because it is widely used to model uncertainties (see Chap. 38 for other examples of random distributions). Another important assumption is that the failure rates of A and B are not correlated. Even if A and B are governed by the same distribution, this means that, if the random value of the failure rate of component A is low, this is not necessarily the case for the random value of the failure rate of component B, and vice versa. In other words, A and B are not necessarily of good or bad quality at the same time and the value of the random variable $\lambda_A$ is completely independent of the value of the random variable

Fig. 25.1 Uncertainty propagation through a simple FT made of an AND gate

$\lambda_B$. This implies that, in the Monte Carlo simulation, the random values of $\lambda_A$ and $\lambda_B$ have to be simulated separately. This is illustrated by the small circles drawn on the log-normal distributions at the bottom of Fig. 25.1. Then, using these failure rate values to calculate the probabilities of the two primary events at a given time $t_i$ (i.e. the component unavailabilities) leads to the probability of the top event (i.e. the system unavailability $U_S(t_i)$ at time $t_i$), which is also drawn as a small circle on the top right-hand side of the figure. According to the Monte Carlo terminology, this constitutes one history for time $t_i$:

• Performing many such histories (e.g. 1,000 for Fig. 25.1) provides a histogram of the unavailability at time $t_i$.
• From this histogram, the average value and the lower and upper bounds of the 90% confidence interval can be calculated by classical statistical calculations.
• Performing the same calculations for values of $t_i$ ranging from 0 to $t$ provides the curves illustrated on the top right-hand side of Fig. 25.1, where the average is drawn as a bold line and the confidence interval bounds as thin dotted lines.

In addition, the unavailability calculated without uncertainties on the failure rates has been drawn on the top right-hand side of Fig. 25.1: in the example, it is slightly higher than the average value given by the Monte Carlo simulation, which is then slightly non-conservative.

Figure 25.2 illustrates exactly the same thing as Fig. 25.1 but for an OR gate (series structure). Again, the average of the simulation is drawn as a bold line and the 90% confidence interval as dotted lines on the top right-hand side of Fig. 25.2 and again, in the example, the unavailability without uncertainties is slightly higher than the average value given by the Monte Carlo simulation, which is then slightly non-conservative.
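The procedure described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the implementation used for the figures: the median failure rate, the repair rate and the observation time are hypothetical, only the error factor of 3 comes from the text, the log-normal parameter is derived from the usual relation $EF = e^{1.645\,\sigma}$, and the spread of each histogram is summarised by the square root of the ratio of its 95% and 5% quantiles.

```python
import math
import random


def unavailability(t, lam, mu, p0=0.0):
    """Unavailability of a two-state Markov component at time t."""
    u_inf = lam / (lam + mu)
    return u_inf + (p0 - u_inf) * math.exp(-(lam + mu) * t)


def summarize(samples):
    """Return (mean, 5% quantile, 95% quantile, sqrt(q95 / q5)) of a histogram."""
    ordered = sorted(samples)
    n = len(ordered)
    q05 = ordered[int(0.05 * n)]
    q95 = ordered[int(0.95 * n)]
    return sum(ordered) / n, q05, q95, math.sqrt(q95 / q05)


def simulate(t, median_lam, error_factor, mu, histories=5000, seed=1):
    """Monte Carlo propagation through an AND gate and an OR gate at time t."""
    rng = random.Random(seed)
    sigma = math.log(error_factor) / 1.645  # 1.645: 95% quantile of the standard normal
    m = math.log(median_lam)
    and_gate, or_gate = [], []
    for _ in range(histories):
        # Non-correlated case: one independent random draw per component.
        u_a = unavailability(t, rng.lognormvariate(m, sigma), mu)
        u_b = unavailability(t, rng.lognormvariate(m, sigma), mu)
        and_gate.append(u_a * u_b)
        or_gate.append(1.0 - (1.0 - u_a) * (1.0 - u_b))
    return summarize(and_gate), summarize(or_gate)


# Hypothetical parameters (per hour); only the error factor of 3 comes from the text.
and_stats, or_stats = simulate(t=100.0, median_lam=1e-4, error_factor=3.0, mu=1e-2)
for label, (mean, q05, q95, spread) in (("AND", and_stats), ("OR", or_stats)):
    print(f"{label}: mean = {mean:.3e}, 90% interval = [{q05:.3e}, {q95:.3e}], "
          f"sqrt(q95/q5) = {spread:.2f}")
```

Repeating the call for a range of times $t_i$ gives the time-dependent curves; a fixed seed is used here only to make the sketch reproducible.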

Fig. 25.2 Uncertainty propagation through a simple FT made of an OR gate

Fig. 25.3 Comparison of the pseudo error factor related to the OR gate, AND gate and failure rate

Then, in both cases, the average of the Monte Carlo simulation is non-conservative and, in fact, this non-conservativeness increases when the data uncertainties increase. This is why it is necessary to be cautious with the simulated average values and why, for example, when such calculations are performed, the IEC 61508 (2010) standard requires the use of the 90% upper bound instead of the average value (see Chap. 36).

Looking at Figs. 25.1 and 25.2, the width of the confidence interval increases as time elapses, but it is difficult to see whether, with regard to the average value, the confidence increases or decreases. A measure of the confidence can be provided using the pseudo error factor introduced in Chap. 38: it is equal to $q_{5\%}(t_i) = \sqrt{U_{95\%}(t_i) / U_{5\%}(t_i)}$, where $[U_{5\%}(t_i), U_{95\%}(t_i)]$ is the 90% confidence interval at time $t_i$. This has been done in Fig. 25.3, where the pseudo error factors related to the AND and OR gates have been drawn and compared to those related to the unavailabilities of A and B. In addition, the error factor of the input failure rates has also been drawn in dotted lines. This figure shows that:

• the pseudo error factor linked to the component unavailability decreases when time increases;
• the pseudo error factor linked to the OR gate is lower than the error factor related to the failure rate and lower than the pseudo error factor related to the component unavailability;
• the pseudo error factor linked to the AND gate is higher than the error factor related to the failure rates and higher than the pseudo error factor related to the component unavailability;
• both pseudo error factors decrease when time increases (i.e. when the unavailability increases).

The same calculations have been performed in Fig. 25.4. The only difference is $P_{0,B}$, which has been changed from 0 to 0.5 (i.e. one chance in two of being in the up state at $t = 0$). This small modification considerably changes the pseudo error factor related to $U_B(t)$: it now increases from 1 (no uncertainty because $U_B(0) = P_{0,B} = 0.5$) instead of decreasing from 3. Both $U_A(t)$ and $U_B(t)$ reach the same asymptotic value.

Fig. 25.4 Uncertainties and pseudo error factors when the initial condition of B is P0,B = 0.5

This impacts the outputs of the AND and OR gates: now the output of the AND gate has a maximum and the output of the OR gate starts from 0.5 instead of 0. Again, the average of the simulation (bold line) is lower than the results of the calculations performed without uncertainties. On the right-hand side of Fig. 25.4, the pseudo error factors are now increasing whereas they were decreasing in the previous example but, nevertheless, the pseudo error factor linked to the OR gate is still lower than the one linked to the AND gate.

Another interesting parameter is the estimation of the average value of the system unavailability, $\overline{U}_S(T)$, over a given interval $[0, T]$ along with the corresponding 90% confidence interval. This may be required, for example, when dealing with safety instrumented systems (IEC 61508 2010), where this parameter is named PFDavg (average probability of failure on demand). This is illustrated in Fig. 25.5 for the small tree with the AND gate (Fig. 25.1) when the initial conditions are $P_{0,A} = P_{0,B} = 0$ (left-hand side) and $P_{0,A} = 0$, $P_{0,B} = 0.5$

Fig. 25.5 Average value and 90% confidence interval related to the AND gate


(right-hand side). The lower and upper bounds of $U_S(t)$ converge to the same values when time increases, but the average values $\overline{U}_S(T)$ and their corresponding 90% confidence intervals are different due to the difference in the transient period: the average value is lower on the left-hand side than on the right-hand side of the figure.

The pseudo error factors of the unavailabilities related to the AND gate are illustrated in Fig. 25.6 (drawn as black lines). Due to the Markovian property of the model (see Chap. 31), the impact of the initial conditions vanishes and they converge toward the same value (about 4.0 with the parameters used for the calculations). Over the interval $[0, T]$, the pseudo error factor related to the average unavailability is equal to 4.2 when $P_{0,A} = P_{0,B} = 0$ and to 3.1 when $P_{0,A} = 0$, $P_{0,B} = 0.5$. Therefore, the average unavailability is more scattered in the first case than in the second case, which is counter-intuitive with regard to Fig. 25.5, which shows a larger confidence interval in the second case.

The same estimation of the average value of the system unavailability $\overline{U}_S(T)$ over a given interval $[0, T]$, along with the corresponding 90% confidence interval, is illustrated in Fig. 25.7 for the small tree with the OR gate (Fig. 25.2) when the conditions are $P_{0,A} = P_{0,B} = 0$ (left-hand side) and $P_{0,A} = 0$, $P_{0,B} = 0.5$ (right-hand side). Again, the lower and upper bounds of $U_S(t)$ converge to the same values when time increases, but the average values $\overline{U}_S(T)$ and their corresponding 90% confidence intervals are different due to the difference in the transient period: again,

Fig. 25.6 Evolution of the pseudo error factors related to the AND gate

Fig. 25.7 Average value and 90% confidence interval related to the OR gate


Fig. 25.8 Evolution of the pseudo error factors related to the OR gate

the average value is lower on the left-hand side than on the right-hand side of the figure.

The pseudo error factors of the unavailabilities related to the OR gate are illustrated in Fig. 25.8 (drawn as black lines). Due to the Markovian property of the model (see Chap. 31), the impact of the initial conditions vanishes and they converge toward the same value (about 2.0 with the parameters used for the calculations). Over the interval $[0, T]$, the pseudo error factor related to the average unavailability is equal to 2.0 when $P_{0,A} = P_{0,B} = 0$ and to 1.5 when $P_{0,A} = 0$, $P_{0,B} = 0.5$. Therefore, the average unavailability is more scattered in the first case and, contrary to the case with the AND gate above, this is in accordance with the sizes of the confidence interval shown in Fig. 25.7. This supports the idea that the size of the confidence interval is not a good indicator of the uncertainty linked to a random variable and that a relative measure like the pseudo error factor is better.

The above results show again that the AND gate increases the size of the confidence interval while the OR gate reduces it. With the above example, the pseudo error factor of about 3 on the failure rate leads to a pseudo error factor of about 4 with the AND gate and about 2 with the OR gate. Then once again, when considering a whole FT made of many AND gates and many OR gates, it is not really possible to guess whether the result will be less or more scattered than the input data.
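The average unavailability over $[0, T]$ used above is the usual time average $\overline{U}_S(T) = \frac{1}{T}\int_0^T U_S(t)\,dt$. The sketch below evaluates it for the AND gate with a simple trapezoidal rule and without uncertainty, for the two initial conditions discussed in the text. The rates and the horizon are hypothetical values chosen for illustration.

```python
import math


def unavailability(t, lam, mu, p0=0.0):
    """Unavailability of a two-state Markov component at time t."""
    u_inf = lam / (lam + mu)
    return u_inf + (p0 - u_inf) * math.exp(-(lam + mu) * t)


def average_unavailability(system_u, horizon, steps=1000):
    """Trapezoidal estimate of (1/T) * integral of U_S(t) over [0, T]."""
    dt = horizon / steps
    total = 0.5 * (system_u(0.0) + system_u(horizon))
    for i in range(1, steps):
        total += system_u(i * dt)
    return total * dt / horizon


# Hypothetical parameters (per hour and hours), for illustration only.
LAM, MU, T = 1e-4, 1e-2, 1000.0


def and_gate(p0_b):
    """System unavailability of the AND gate when B starts with P0,B = p0_b."""
    return lambda t: unavailability(t, LAM, MU) * unavailability(t, LAM, MU, p0_b)


avg_both_up = average_unavailability(and_gate(0.0), T)
avg_b_half = average_unavailability(and_gate(0.5), T)
print(f"P0,B = 0   : average unavailability = {avg_both_up:.3e}")
print(f"P0,B = 0.5 : average unavailability = {avg_b_half:.3e}")
```

Wrapping this calculation in a Monte Carlo loop over the failure rates, one history per draw, would give the histogram of the average unavailability from which the confidence interval and the pseudo error factor are derived.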

25.3 Application to Correlated Events

In the calculations performed above (see Fig. 25.1 or Fig. 25.2), the failure rates of components A and B had similar distributions but were not correlated. With regard to the Monte Carlo simulation, A and B were therefore not necessarily of good or bad quality at the same time. However, in real industrial life, the similar components A and B may come from the same provider and from the same production batch. In this case, A and B are of good or bad quality at the same time and their failure rates become correlated. This introduces a kind of common cause failure, which is identified and named lineage common cause failure in Chap. 5. Then, for a given Monte Carlo history, the failure rates $\lambda_A$ and $\lambda_B$ are equal and shall be simulated by generating a single random

Fig. 25.9 Uncertainty propagation when the failures of the components are correlated (AND gate)

number. This is illustrated in Fig. 25.9 where, for each history, the same simulated random value is used for both $\lambda_A$ and $\lambda_B$ (shown by the small circle on the log-normal distribution). Then, using this single failure rate value to calculate the probabilities of the two primary events at a given time $t_i$ (i.e. the component unavailabilities) leads to the probability of the top event (i.e. the system unavailability $U_S(t_i)$ at time $t_i$), which is also drawn as a small circle on the top right-hand side of the figure. In the same way as for Fig. 25.1, performing many histories for various values of time makes it possible to obtain the evolution of the average value of $U_S(t)$ and of its 90% confidence interval. This is drawn on the top right-hand side of Fig. 25.9 where, in addition, the unavailability without uncertainty has also been drawn. The first impact of this correlation is shown in Fig. 25.9: the simulated value of $U_S(t)$ is now higher than the values obtained without uncertainties. The second impact of this correlation is shown in Fig. 25.10 which compares the

Fig. 25.10 Impact of correlation between similar components (AND gate)


results obtained with or without correlation: the average value $\overline{U}_S(T)$ is higher with correlation than without, and the width of the confidence interval is also larger. The evolution of the pseudo error factors related to $U_S(t)$ for the AND gate is drawn in Fig. 25.11 for both the correlated and non-correlated cases. Both decrease as time elapses, but the pseudo error factor related to the non-correlated case is lower than the one related to the correlated case. Therefore, the correlation increases the uncertainty of the results related to the top event.

In the same way, and as illustrated in Fig. 25.12, the impact of the correlation between the failure rates of A and B can be analysed for the small FT with the OR gate proposed in Fig. 25.2. The simulation is performed in exactly the same way as above in Fig. 25.9. Again, the simulated values of $U_S(t)$ and the values obtained without uncertainty are very

Fig. 25.11 Impact of correlation on the pseudo error factors (AND gate). Curves: A and B correlated; A and B not correlated

Fig. 25.12 Uncertainty propagation when the failures of the components are correlated (OR gate)


25 Uncertainty Handling with RBDs and FTs

Fig. 25.13 Impact of correlation between similar components (OR gate)

Fig. 25.14 Impact of correlation on the pseudo error factors (OR gate). Curves: A and B correlated; A and B not correlated

close but, in this case, the values taking the uncertainties into account are slightly lower rather than higher. As shown in Fig. 25.13, the average values of US(T) are very close with and without correlation but, again, the confidence interval is wider when the events are correlated than when they are not. The evolution of the pseudo error factors related to US(t) for the OR gate is plotted in Fig. 25.14 in both the correlated and non-correlated cases. Both decrease as time elapses but the pseudo error factor related to the non-correlated case is lower than the one related to the correlated case. Again, the correlation increases the uncertainty of the results related to the top event.
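The comparison between correlated and non-correlated failure rates can be explored with a small Monte Carlo sketch. It is only an illustration under stated assumptions: the median failure rate (1e-5 per hour), the error factor (3), the definition of the error factor as the ratio of the 95% quantile to the median, and the pseudo error factor taken as the square root of B95%/B5% are all choices made here, not values taken from Figs. 25.9 to 25.14.

```python
import math
import random
import statistics

def sample_lambda(median, error_factor, rng):
    # Log-normal failure rate; assumption: error factor = q95 / median,
    # hence sigma = ln(EF) / 1.645.
    sigma = math.log(error_factor) / 1.645
    return median * math.exp(sigma * rng.gauss(0.0, 1.0))

def unavailability(lam, t):
    # Non-repaired component: U(t) = 1 - exp(-lambda * t)
    return 1.0 - math.exp(-lam * t)

def propagate(gate, correlated, t=8760.0, median=1e-5, ef=3.0, n=20000, seed=1):
    """Return (average of US(t), pseudo error factor) for a two-input gate."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        lam_a = sample_lambda(median, ef, rng)
        # Correlated case: the same simulated value is reused for B.
        lam_b = lam_a if correlated else sample_lambda(median, ef, rng)
        ua, ub = unavailability(lam_a, t), unavailability(lam_b, t)
        results.append(ua * ub if gate == "AND" else ua + ub - ua * ub)
    results.sort()
    b5, b95 = results[int(0.05 * n)], results[int(0.95 * n) - 1]
    # Assumption: pseudo error factor = sqrt(B95% / B5%)
    return statistics.fmean(results), math.sqrt(b95 / b5)

for gate in ("AND", "OR"):
    for correlated in (True, False):
        mean, pef = propagate(gate, correlated)
        print(gate, "correlated" if correlated else "independent", mean, pef)
```

With these assumed inputs, the AND gate should show a higher average and a larger pseudo error factor in the correlated case, while the OR gate should show nearly equal averages and a larger pseudo error factor in the correlated case.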

25.4 Considerations About the Pseudo Error Factor

The uncertainty propagation has been performed above using FTs but it can be done in exactly the same way using RBDs. Nevertheless, the concept of pseudo error factor is most relevant with low probabilities (e.g. unavailabilities <

Minimal cut sets

                                               LS1 ∩ PSH2 = Cs4
LS1 ∩ PSH2 ∩ LS2                           =>  LS1 ∩ LS2 = Cs5
LS1 ∩ PSH2 ∩ LS2 ∩ SV2                     =>  LS1 ∩ SV2 = Cs6
LS1 ∩ PSH1 ∩ PSH2                          =>  PSH1 ∩ PSH2 = Cs7
LS1 ∩ PSH1 ∩ PSH2 ∩ LS2                    =>  PSH1 ∩ LS2 = Cs8
LS1 ∩ PSH1 ∩ SV1 ∩ PSH2                    =>  SV1 ∩ PSH2 = Cs1
LS1 ∩ PSH1 ∩ SV1 ∩ PSH2 ∩ LS2              =>  SV1 ∩ LS2 = Cs2
LS1 ∩ PSH1 ∩ SV1 ∩ PSH2 ∩ LS2 ∩ SV2        =>  SV1 ∩ SV2 = Cs3

Non-minimal cut sets

LS1 ∩ PSH1 ∩ PSH2 ∩ LS2 ∩ SV1 ∩ SV2        =>  PSH1 ∩ SV1 ∩ SV2 (included in Cs3)

analyses have to be performed to keep only the minimal cut sets. This has been done in Table 29.10 where 8 minimal cut sets and one non-minimal cut set have been identified. It can be verified that the minimal cut sets identified in Table 29.10 are identical to the minimal cut sets found in exercise 16.2 (Sect. 29.4.4). It has to be noted that developing the BDD in the order 2–1–3–4–5–6 (see Fig. 29.5) instead of 1–2–3–4–5–6, i.e. starting with PLS2, would have led to 5 disjoint paths instead of 6. The readers are invited to try this walk through the fault tree to verify that the size of the BDD depends on the order of the variables used to perform the decomposition.

29.4.10 Exercise 21.2: Comparison of Probabilistic Results (Disjoint Paths Versus Minimal Cut Sets)

The aim of the exercise is to use the disjoint paths identified in exercise 21.1 to calculate the probability of OPPS failure at t = 8760 h with the reliability data proposed in Table 29.1, in order to compare it with the approximate probability calculated by using the minimal cut sets. The exercise then pursues this comparison when the failure rates are multiplied by 10 and 100. The first step is to calculate the probability of failure of PSH, SV and LS for the failure rates multiplied by 1, 10 and 100. To simplify the calculations, the 6 disjoint paths involving PLS2 found in exercise 21.1 are considered instead of the 9 disjoint paths involving all the primary events of the considered fault tree. This also requires calculating the probability of failure of the macro component PLS2, which is an input for these calculations.

29.4 Solutions of the Exercises Related to the OPPS


Table 29.11 Item probabilities of failure at t = 8760 h. Failure rates multiplied by 1, 10 and 100

Item    Failure rate (λ) (h−1)   Repair rate (μ) (h−1)   Ui(t), λ × 1    Ui(t), λ × 10   Ui(t), λ × 100
PSH     4.00 × 10−7              0.10                    3.50 × 10−3     3.44 × 10−2     2.96 × 10−1
SV      4.03 × 10−6              0.10                    3.47 × 10−2     2.97 × 10−1     9.71 × 10−1
LS      5.00 × 10−5              0.10                    5.00 × 10−4     4.98 × 10−3     4.76 × 10−2
PLS2    /                        /                       4.00 × 10−3     3.92 × 10−2     3.29 × 10−1

This has been done in Table 29.11 where the following formulae have been used with t = 8760 h:

• UPSH(t) = 1 − exp(−λPSH · t)
• USV(t) = 1 − exp(−λSV · t)
• ULS = λLS/(λLS + μLS)
• UPLS2(t) = UPSH(t) + [1 − UPSH(t)] · ULS.

The probabilistic results are given in Table 29.12 where the calculations are performed with the reliability data provided in Table 29.1 (second column), with the failure rates multiplied by 10 (third column) and with the failure rates multiplied by 100 (fourth column). The exact results calculated by summing the probabilities of these paths are given in the penultimate line and compared to those obtained by using the minimal cut sets, given in the last line. With the reliability data provided in Table 29.1, the results are very close (0.45%). They differ by 4.5% when the failure rates are multiplied by 10 and by 51% when they are multiplied by 100. Therefore, the approximation is acceptable when the result is lower than or equal to about 0.1 but quickly becomes unrealistic when it is greater than that. In the last case (fourth column), it is even obviously false as it is greater than 1!

Table 29.12 Calculations with disjoint paths and comparison with minimal cut set results

Disjoint path \ multiplying factor      λ × 1      λ × 10     λ × 100
LS1 ∩ PLS2                              2.00E−6    1.95E−4    1.57E−2
LS1 ∩ PLS2 ∩ SV2                        1.73E−5    1.42E−3    3.10E−2
LS1 ∩ PSH1 ∩ PLS2                       1.40E−5    1.34E−3    9.27E−2
LS1 ∩ PSH1 ∩ SV1 ∩ PLS2                 1.38E−4    1.12E−2    2.14E−1
LS1 ∩ PSH1 ∩ SV1 ∩ PLS2 ∩ SV2           1.19E−3    8.17E−2    4.24E−1
LS1 ∩ PSH1 ∩ PLS2 ∩ SV1 ∩ SV2           4.19E−6    2.91E−3    1.78E−1
Probability of OPPS failure             1.37E−3    9.88E−2    0.956
Sum of the 6 minimal cut sets           1.38E−3    1.03E−1    1.44
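The figures of Tables 29.11 and 29.12 can be cross-checked with a short sketch. It applies the formulae quoted above for the item unavailabilities and then compares the exact probability of the top event with the sum of the minimal cut set probabilities. Two assumptions are made here: the items are independent, and the top event is taken as the union of the eight minimal cut sets Cs1 to Cs8 of Table 29.10. The exact value is obtained by brute-force enumeration of the 64 item states, not by the disjoint paths used in the exercise.

```python
import math
from itertools import product

T = 8760.0                                              # hours
BASE = {"PSH": 4.00e-7, "SV": 4.03e-6, "LS": 5.00e-5}   # failure rates (1/h), Table 29.11
MU_LS = 0.10                                            # LS repair rate (1/h), Table 29.11

# Minimal cut sets Cs1..Cs8 as listed in Table 29.10
CUT_SETS = [("SV1", "PSH2"), ("SV1", "LS2"), ("SV1", "SV2"), ("LS1", "PSH2"),
            ("LS1", "LS2"), ("LS1", "SV2"), ("PSH1", "PSH2"), ("PSH1", "LS2")]
ITEMS = ["LS1", "LS2", "PSH1", "PSH2", "SV1", "SV2"]

def item_unavailabilities(factor):
    """Item unavailabilities at t = T with the failure rates multiplied by factor."""
    lam = {name: rate * factor for name, rate in BASE.items()}
    u = {"PSH": 1 - math.exp(-lam["PSH"] * T),
         "SV": 1 - math.exp(-lam["SV"] * T),
         "LS": lam["LS"] / (lam["LS"] + MU_LS)}
    u["PLS2"] = u["PSH"] + (1 - u["PSH"]) * u["LS"]     # macro component PSH2 OR LS2
    return u

def top_probabilities(factor):
    """Return (exact probability, sum of minimal cut set probabilities)."""
    u = item_unavailabilities(factor)
    p = {name: u[name.rstrip("12")] for name in ITEMS}
    approx = sum(math.prod(p[i] for i in cs) for cs in CUT_SETS)
    exact = 0.0
    for state in product([False, True], repeat=len(ITEMS)):
        failed = {n for n, s in zip(ITEMS, state) if s}
        if any(set(cs) <= failed for cs in CUT_SETS):   # at least one cut set realised
            exact += math.prod(p[n] if n in failed else 1 - p[n] for n in ITEMS)
    return exact, approx

for factor in (1, 10, 100):
    exact, approx = top_probabilities(factor)
    print(factor, f"{exact:.3g}", f"{approx:.3g}")
```

The sum of the cut set probabilities is an upper bound that is only tight when the probabilities are small, which is why it exceeds 1 when the failure rates are multiplied by 100.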


29 Boolean Family Exercises

29.4.11 Exercise 22.1: Unavailability, Failure Frequency and Unreliability Calculations

The aim of the exercise is to calculate the PFDavg (average unavailability), the PFH (average failure frequency) and the unreliability (probability of failure) of the OPPS over 5 years of operation. Contrary to the previous exercises, which were tractable by hand, this one requires an FT software package. As the fault tree is not very large, this can be done, for example, by using the free demo version of the GRIF workshop (2020) module Tree software package. For the calculations, PSHs and SVs are modelled as tested repaired items (TPE/Extended periodic test in GRIF module Tree) and LSs as repaired items with self-revealed failures (IND/unavailability in GRIF-Tree).

The unavailability curve, U(t), presented in Fig. 29.12 has been obtained from the fault tree shown in Fig. 29.2 with the reliability data presented in Table 29.1. Due to the yearly proof tests, this is a saw tooth curve. The calculation performed over 5 years (43,800 h) provides an average value equal to PFDavg = 4.68 × 10−4. The value of the unavailability for t = 8760 h is equal to 1.37 × 10−3. This result is very close to the approximated result of 1.38 × 10−3 obtained by summing the probabilities of the minimal cut sets in Sect. 29.4.5, which appears to be slightly conservative (0.7%).

The failure frequency curve, w(t), presented in Fig. 29.13 has been obtained from the same fault tree and the same reliability data as above. This is also a saw tooth curve and the calculation performed over 5 years (43,800 h) provides an average value equal to PFH = 2.1 × 10−6 failures per hour.

Again, the unreliability curve, F(t), presented in Fig. 29.14 has been obtained from the same fault tree and the same reliability data as above. This is no longer

Fig. 29.12 Overpressure protection system unavailability and PFDavg


Fig. 29.13 Overpressure protection system failure frequency and PFH

Fig. 29.14 Overpressure protection system unreliability

a saw tooth curve as this is a non-decreasing function. It is continuous but not differentiable at the test instants. As the average value is meaningless, it is not shown in this figure.
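The saw tooth shape and its average can be illustrated on a single periodically tested item. The sketch below is not the OPPS calculation performed with the FT software: it assumes one item with the SV failure rate of Table 29.11, a perfect and instantaneous test and repair every 8760 h, and a simple midpoint integration. The function names are illustrative.

```python
import math

def sawtooth_unavailability(t, lam, tau):
    """Unavailability of an item tested every tau hours.

    Assumption: the test and the repair are perfect and instantaneous,
    so the item is as good as new after each test.
    """
    return 1.0 - math.exp(-lam * (t % tau))

def average_unavailability(lam, tau, horizon, steps=100_000):
    """Time average of the saw tooth curve over [0, horizon] (midpoint rule)."""
    dt = horizon / steps
    total = sum(sawtooth_unavailability((k + 0.5) * dt, lam, tau) for k in range(steps))
    return total * dt / horizon

lam_sv, tau = 4.03e-6, 8760.0          # SV failure rate (1/h) and yearly test interval (h)
pfd_avg = average_unavailability(lam_sv, tau, 5 * tau)
print(pfd_avg, lam_sv * tau / 2)       # numerical average versus the lambda * tau / 2 approximation
```

For small values of λτ the average is close to the well-known approximation λτ/2.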

29.4.12 Exercise 22.2: Unavailability Calculation with Partial and Full Stroking Tests

The aim of this exercise is to extend exercise 22.1 to model partial and full stroking tests of the safety valves. Taking into account the full and partial stroking tests of the safety valve requires splitting the elementary event “SV1 fails to close” (respectively “SV2 fails to close”) into two events organized with an OR gate: “SV1 stuck open” OR “SV1 not tight” (respectively “SV2 stuck open” OR “SV2 not tight”). The first event is tested

Fig. 29.15 Splitting SV failure modes between those detected by full and partial stroking tests. The event “SV1 fails to close” (SV1, periodic-test, 4.03E-6, 0.1, 8760.0, 0.0) is replaced by an OR gate “SV1 fails to close” with two inputs: “SV1 stuck open” (SV1_SO, periodic-test, 3.29E-6, 0.1, 2920.0, 0.0) and “SV1 not tight” (SV1_Leak, periodic-test, 7.4E-7, 0.1, 8760.0, 0.0)

every four months while the second one is tested only every year. This is illustrated in Fig. 29.15.

The unavailability curve, U(t), presented in Fig. 29.16 has been obtained from the fault tree shown in Fig. 29.2, modified as explained above, and with the reliability data presented in Tables 29.1 and 29.2. Due to the various proof test frequencies, the number of peaks of the saw tooth curve has been multiplied by 3 compared to the case without partial stroking. The calculation performed over 5 years (43,800 h) provides an average value equal to PFDavg = 1.1 × 10−4. Compared to the case without partial stroking, the PFDavg has dropped from 4.68 × 10−4 to 1.1 × 10−4, a decrease of about 76%: i.e. the PFDavg has been divided by about 4. The value of the unavailability for t = 8760 h is equal to 3.41 × 10−4. This result is close to the approximated result of 4.29 × 10−4 obtained by summing the probabilities of the minimal cut sets in Sect. 29.4.6, which appears to be more conservative (20%) than in the case where the partial and full stroking tests have not been considered.
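The effect of the split on a single valve can be sketched as follows. The failure rates and test intervals are those read from Fig. 29.15, assuming that the third figure of each parameter list is the test interval in hours. Tests and repairs are assumed perfect and instantaneous, so the result only concerns one valve and not the whole OPPS.

```python
import math

def u(lam, tau, t):
    """Saw tooth unavailability of a failure mode tested every tau hours (perfect tests assumed)."""
    return 1.0 - math.exp(-lam * (t % tau))

def mean_over(f, horizon, steps=87_600):
    """Time average of f over [0, horizon] (midpoint rule)."""
    dt = horizon / steps
    return sum(f((k + 0.5) * dt) for k in range(steps)) * dt / horizon

YEAR = 8760.0

# Single failure mode "SV1 fails to close", tested every year
single = mean_over(lambda t: u(4.03e-6, YEAR, t), YEAR)

# "SV1 stuck open" (tested every 2920 h) OR "SV1 not tight" (tested every year)
split = mean_over(
    lambda t: 1 - (1 - u(3.29e-6, 2920.0, t)) * (1 - u(7.4e-7, YEAR, t)), YEAR)

print(single, split, split / single)
```

Under these assumptions the average unavailability of one valve is roughly halved, and since the dominant minimal cut set involves both valves, a larger reduction at the system level is to be expected.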

Fig. 29.16 Overpressure protection system unavailability and PFDavg : impact of full and partial stroking tests of safety valves


29.4.13 Exercise 22.3: Unavailability Calculation with Common Cause Failures

The aim of this exercise is to extend exercise 22.1 to the modelling of common cause failures on PSHs, SVs and LSs. Taking into account the CCFs of safety valves, pressure sensors and logic solvers requires splitting the failure modes of each elementary event between independent and common cause failures. This leads to modifying the fault tree shown in Fig. 29.2 as follows (see Fig. 29.17):

• Replace all the elementary events in the FT by the corresponding independent failure modes. This gives the independent failure part of the FT.
• Add at the top an OR gate with, as inputs, the independent FT and the three CCFs on PSHs, SVs and LSs, which are minimal cut sets of order 1.

The unavailability curve, U(t), presented in Fig. 29.18 has been obtained from the fault tree shown in Fig. 29.2, modified as explained above, and with the reliability data presented in Tables 29.1 and 29.8. The shape of this saw tooth curve is similar to the one without CCFs and the calculation performed over 5 years (43,800 h) provides an average value equal to PFDavg = 1.42 × 10−3. This is an increase of more than 200% compared to the result without CCFs (PFDavg = 4.68 × 10−4) found above. The value of the unavailability for t = 8760 h is equal to 1.24 × 10−3. This result is close to the approximated result of 1.44 × 10−3 obtained by summing the probabilities of the minimal cut sets in Sect. 29.4.8, which appears to be conservative by 14%.

Fig. 29.17 Modification of the top event to include the common cause failures. The new top event “OPPS failure with CCFs” is an OR gate (Or7) with four inputs: “Independent failure of the OPPS” (And4, top event of Fig. 29.4), “CCF SV” (CCF_SV, periodic-test, 2.0E-7, 0.1, 8760.0, 0.0), “CCF PSH” (CCF_PSH, periodic-test, 2.0E-8, 0.1, 8760.0, 0.0) and “CCF LS” (CCF_LS, GLM, 0.0, 2.5E-6, 0.1)


Fig. 29.18 Overpressure protection system unavailability and PFDavg : impact of CCFs

29.4.14 Exercise 22.4: Unavailability Calculation with Test Staggering

The aim of this exercise is to extend exercise 22.1 to model the test staggering of the safety valves. Staggering the tests of the safety valves means testing the valves with the same test interval (1 year) but performing the first test of SV1 at t = 6 months and the first test of SV2 at t = 1 year. The unavailability curve, U(t), presented in Fig. 29.19 has been obtained from the fault tree shown in Fig. 29.2, where the tests of SV1 and SV2 have been staggered as explained above. Compared to the non-staggered case, the number of peaks of the saw tooth curve has been multiplied by 2. The calculation performed over 5 years (43,800 h) provides an average value equal to PFDavg = 3.02 × 10−4. This is a decrease of about 35% compared to the result found above without test staggering (PFDavg = 4.68 × 10−4).
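The benefit of staggering can be sketched on the pair SV1 AND SV2 alone. The sketch assumes independent valves, perfect and instantaneous tests, and the SV failure rate of Table 29.11. It does not include the other minimal cut sets of the OPPS, so it is not expected to reproduce the 35% figure exactly.

```python
import math

def age_since_test(t, tau, first_test):
    """Time elapsed since the last test (or since t = 0 before the first test)."""
    if t < first_test:
        return t
    return (t - first_test) % tau

def mean_pair_unavailability(lam, tau, first1, first2, horizon, steps=87_600):
    """Time average of U1(t) * U2(t) for two identical valves (midpoint rule)."""
    dt = horizon / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        u1 = 1 - math.exp(-lam * age_since_test(t, tau, first1))
        u2 = 1 - math.exp(-lam * age_since_test(t, tau, first2))
        total += u1 * u2
    return total * dt / horizon

lam, tau = 4.03e-6, 8760.0
# Both valves tested at 1 year, 2 years, ...
simultaneous = mean_pair_unavailability(lam, tau, tau, tau, 5 * tau)
# SV1 first tested at 6 months, SV2 at 1 year, same interval afterwards
staggered = mean_pair_unavailability(lam, tau, tau / 2, tau, 5 * tau)
print(simultaneous, staggered, staggered / simultaneous)
```

With staggering, the two valves are never both at the end of their test interval at the same time, which lowers the average of the product of their unavailabilities.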

Fig. 29.19 Overpressure protection system unavailability and PFDavg : impact of test staggering


29.4.15 Exercise 24.1: Importance Factor Calculations

The aim of this exercise is to extend exercise 22.1 to calculate the various importance factors related to the items belonging to the OPPS at t = 5 years and t = 39,420 h (middle of the last test interval). It also aims to extend exercise 22.4 to do the same at t = 5 years. Except for the Vesely-Fussell importance factor when the number of minimal cut sets is small, the calculation of the importance factors requires the use of an FT software package. The various importance factors related to the elementary events of the fault tree shown in Fig. 29.2 can be calculated by using GRIF-Tree (free demo version), which provides the MIF, CIF, DIF, RAW, RRW = 1/RRW and BIPF importance factors (see Chap. 24).

OPPS without partial stroking: t = 43,800 h (5 years)

The results obtained for t = 5 years and for the FT calculated in exercise 22.1 (i.e. OPPS without partial stroking) are presented in Table 29.13. This time has been chosen because it encompasses all the failures and repairs that occurred over [0, 5 years]. This makes it possible to draw up Table 29.14 where the various items belonging to the OPPS are ranked from left to right according to the various importance factors:

• CIF, DIF and RRW give the ranking already found with the Vesely-Fussell importance factor (VFIF) calculated by hand for t = 1 year.
• MIF, RAW and BIPF give a different ranking (see Chap. 24 for the meaning of each of the importance factors).

OPPS without partial stroking: t = 39,420 h

The results obtained for t = 39,420 h (i.e. the middle of the last test interval) and for the FT calculated in exercise 22.1 (i.e. OPPS without partial stroking) are presented in Table 29.15, which shows the same ranking as that presented in Table 29.14.

Table 29.13 Importance factors of the items belonging to the OPPS (FT without partial stroking)

Item    MIF          CIF          DIF          RAW          RRW          BPIF
LS1     3.719E−02    1.358E−02    1.407E−02    2.815E+01    9.864E−01    4.624E−01
LS2     3.719E−02    1.358E−02    1.407E−02    2.815E+01    9.864E−01    4.624E−01
PSH1    3.855E−03    9.851E−03    1.331E−02    3.806E+00    9.901E−01    3.823E−04
PSH2    3.730E−02    9.532E−02    9.848E−02    2.815E+01    9.047E−01    3.699E−03
SV1     3.851E−02    9.757E−01    9.766E−01    2.815E+01    2.428E−02    3.727E−02
SV2     3.503E−02    8.875E−01    8.914E−01    2.570E+01    1.125E−01    3.390E−02


Table 29.14 Ranking of the items according to various importance factors (FT without partial stroking)

Importance factor    Ranking of the items belonging to the OPPS
VFIF                 SV1    SV2    PSH2   LS1    LS2    PSH1
CIF                  SV1    SV2    PSH2   LS1    LS2    PSH1
DIF                  SV1    SV2    PSH2   LS1    LS2    PSH1
RRW                  SV1    SV2    PSH2   LS1    LS2    PSH1
MIF                  SV1    PSH2   LS1    LS2    SV2    PSH1
RAW                  LS1    LS2    PSH2   SV1    SV2    PSH1
BIPF                 LS1    LS2    SV1    SV2    PSH2   PSH1

Table 29.15 Importance factors of the items belonging to the OPPS (FT without partial stroking)

Item    MIF         CIF         DIF         RAW         RRW         BPIF
LS1     1.94E−02    2.70E−02    2.75E−02    5.50E+01    9.73E−01    4.62E−01
LS2     1.94E−02    2.70E−02    2.75E−02    5.50E+01    9.73E−01    4.62E−01
PSH1    2.21E−03    1.08E−02    1.25E−02    7.15E+00    9.89E−01    4.22E−04
PSH2    1.94E−02    9.47E−02    9.63E−02    5.50E+01    9.05E−01    3.70E−03
SV1     1.97E−02    9.62E−01    9.62E−01    5.50E+01    3.85E−02    3.73E−02
SV2     1.79E−02    8.76E−01    8.78E−01    5.02E+01    1.24E−01    3.40E−02

OPPS with partial and full stroking: t = 43,800 h (5 years)

The results obtained for t = 5 years and for the FT calculated in exercise 22.4 (i.e. OPPS with partial and full stroking) are presented in Table 29.16. This makes it possible to draw up Table 29.17 where the various items belonging to the OPPS are ranked according to the various importance factors:

• CIF, DIF and RRW give the same ranking. It is the same for the three most important items as in the previous case (without partial stroking) but changes for the three least important items.

Table 29.16 Importance factors of the items belonging to the OPPS (FT with partial stroking)

Item    MIF         CIF         DIF         RAW         RRW         BPIF
LS1     2.07E−02    1.35E−02    1.40E−02    2.79E+01    9.87E−01    3.27E−01
LS2     3.78E−02    2.46E−02    2.51E−02    5.03E+01    9.75E−01    5.98E−01
PSH1    3.86E−03    1.76E−02    2.10E−02    6.01E+00    9.82E−01    4.86E−04
PSH2    3.79E−02    1.73E−01    1.76E−01    5.03E+01    8.27E−01    4.78E−03
SV1     2.14E−02    9.68E−01    9.69E−01    2.79E+01    3.22E−02    2.63E−02
SV2     3.50E−02    7.99E−01    8.03E−01    4.59E+01    2.01E−01    4.39E−02


Table 29.17 Ranking of the items according to various importance factors (FT with partial stroking)

Importance factor    Ranking of the items belonging to the OPPS
CIF                  SV1    SV2    PSH2   LS2    PSH1   LS1
DIF                  SV1    SV2    PSH2   LS2    PSH1   LS1
RRW                  SV1    SV2    PSH2   LS2    PSH1   LS1
MIF                  PSH2   LS2    SV2    SV1    LS1    PSH1
RAW                  LS2    PSH2   SV2    LS1    SV1    PSH1
BIPF                 LS2    LS1    SV2    SV1    PSH2   PSH1

• MIF, RAW and BIPF give, again, a different ranking (see Chap. 24 for the meaning of each of the importance factors).
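The way such factors are obtained can be sketched for a few of them. The definitions used below are common textbook ones and are assumptions with respect to Chap. 24: MIF as the difference between the top event probabilities with the item failed and with the item working, CIF = Ui · MIF/US, DIF = Ui · P(top | item failed)/US, RAW = P(top | item failed)/US and 1/RRW = P(top | item working)/US. The structure function is taken as the union of the minimal cut sets Cs1 to Cs8 of Table 29.10, the items are assumed independent, and the item unavailabilities are those of Table 29.11 (λ × 1, t = 8760 h). The BIPF is not covered.

```python
import math
from itertools import product

# Minimal cut sets Cs1..Cs8 as listed in Table 29.10
CUT_SETS = [("SV1", "PSH2"), ("SV1", "LS2"), ("SV1", "SV2"), ("LS1", "PSH2"),
            ("LS1", "LS2"), ("LS1", "SV2"), ("PSH1", "PSH2"), ("PSH1", "LS2")]
ITEMS = ["LS1", "LS2", "PSH1", "PSH2", "SV1", "SV2"]

def top_probability(p):
    """Exact top event probability by enumeration of the item states (independent items)."""
    total = 0.0
    for state in product([False, True], repeat=len(ITEMS)):
        failed = {n for n, s in zip(ITEMS, state) if s}
        if any(set(cs) <= failed for cs in CUT_SETS):
            total += math.prod(p[n] if n in failed else 1 - p[n] for n in ITEMS)
    return total

def importance_factors(p):
    us = top_probability(p)
    table = {}
    for item in ITEMS:
        p1 = top_probability({**p, item: 1.0})   # item certainly failed
        p0 = top_probability({**p, item: 0.0})   # item certainly working
        mif = p1 - p0
        table[item] = {"MIF": mif,
                       "CIF": p[item] * mif / us,
                       "DIF": p[item] * p1 / us,
                       "RAW": p1 / us,
                       "1/RRW": p0 / us}
    return table

# Item unavailabilities at t = 8760 h (Table 29.11, failure rates multiplied by 1)
U = {"LS1": 5.00e-4, "LS2": 5.00e-4, "PSH1": 3.50e-3,
     "PSH2": 3.50e-3, "SV1": 3.47e-2, "SV2": 3.47e-2}
factors = importance_factors(U)
ranking_cif = sorted(ITEMS, key=lambda n: factors[n]["CIF"], reverse=True)
print(ranking_cif)
```

With these definitions, the column computed as 1/RRW equals 1 − CIF, which is consistent with the RRW columns of the tables above.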

29.4.16 Exercise 25.1: Uncertainty Propagation

The aim of this exercise is to extend exercise 22.1 to calculate the impact of the uncertainties given in Table 29.3 on the PFDavg (average unavailability) and the PFH (average failure frequency) over 5 years of operation. Taking into account the uncertainties on the failure rates requires replacing point values by probability distributions. This has been done in the FT built in exercise 16.1 according to the parameters provided in Table 29.3.

The results obtained for the unavailability are provided in Fig. 29.20. Calculations have been performed with error factors equal to 2 on the left-hand side and 3 on the right-hand side. In each case, the average value, Û(t), is drawn as a solid black line and the bounds of the centred 90% confidence interval, [B5%, B95%], as dotted grey lines. As expected, the confidence interval is smaller in the first case than in the second case.

Fig. 29.20 Impact of reliability data uncertainties on the unavailability


Fig. 29.21 Impact of reliability data uncertainties on the failure frequency

The average unavailability values (PFDavg) over 50,000 h and their confidence intervals have also been plotted on the same figure. The average values are almost equal but slightly higher in the first case (4.67 × 10−4) than in the second case (4.59 × 10−4). The pseudo error factor is equal to 2.4 in the first case and 3.9 in the second case: this is higher than the error factors related to the item failure rates.

The results obtained for the failure frequency are provided in Fig. 29.21. Calculations have been performed with error factors equal to 2 on the left-hand side and 3 on the right-hand side. In each case, the average value, ŵ(t), is drawn as a solid black line and the bounds of the centred 90% confidence interval, [B5%, B95%], as dotted grey lines. Again, and as expected, the confidence interval is smaller in the first case than in the second case. The average failure frequency values (PFH) over 50,000 h and their confidence intervals have also been plotted on the same figure. The average values are almost equal but slightly higher in the first case (2.05 × 10−6) than in the second case (2.02 × 10−6). The pseudo error factor is equal to 2.03 in the first case and 3.09 in the second case: this is very close to the error factors related to the item failure rates. In this case, the opposite effects of the OR and AND gates compensate for each other.

Reference

GRIF-workshop (2020) module Tree. Funded and developed by TOTAL. https://grif-workshop.fr/. Accessed Aug 2020

Part IV

Dynamic Systems and Stochastic Processes

Chapter 30

Introduction to Dynamic Systems and Stochastic Processes

30.1 Miscellaneous Dynamic Aspects

30.1.1 Dynamic Aspect Linked to System Operation

Theoretically speaking, only items with a single unchanging state are really static. This is, for example, the case of items for which no failure and no wear are expected during the period where they are going to be used. All the others, even the simplest non-repaired item, have a dynamic behaviour when they jump from their up state to their down state when a failure occurs. More generally, any industrial system has a dynamic behaviour and, from time to time, it jumps from one state to another according to events occurring randomly (e.g. failure or repair) or deterministically (e.g. planned maintenance or periodic tests). Therefore, any industrial system is a dynamic system but some are more dynamic than others, as analysed hereafter. Such a behaviour constitutes a random (or stochastic) process (Birolini 2014; Cocozza Thivent 1997; Çinlar 2013; Coleman 1974; Karlin and Taylor 1975, 1981; Žitković 2010).

When such systems do not undergo many changes, e.g. when the architecture does not change over time, they can be modelled by the static Boolean approaches described in Chaps. 15–28. This is illustrated in Fig. 30.1 with a very simple safety system made of two redundant flow transmitters, FT1 and FT2, operated at the same time, and one safety valve SDV. When the flow rate reaches a pre-set value, sensor A and/or sensor B send signals ordering the valve to close and the outgoing flow rate goes to zero. The functioning logic of this safety system does not change when time elapses and this is typically a case where a static model like a reliability block diagram (see Chap. 15) or a fault tree (see Chap. 16) can be used. This is done with the reliability block diagram presented on the right-hand side of the figure where blocks A and B model the redundant sensors and block C models the valve. In such a model, the failures propagate from the left

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_30


Fig. 30.1 Simple safety system and associated RBD model

Fig. 30.2 Simple safety system implementing cold standby

(input) to the right (output) and the main assumption is that the individual blocks (i.e. the corresponding components) are independent from each other. As explained in Chap. 22, this makes it possible to introduce time in the probabilistic calculations and thus to address, to some extent, the dynamic aspects (failures and repairs) related to individual components. As explained in Chap. 27, this is generally done by mixing the Boolean and Markovian approaches, the Markovian approach being described in detail in Chap. 31 hereafter.

In Fig. 30.2, the simple safety system of Fig. 30.1 is operated in a different way: FT1 is normally operating while FT2 is in cold standby. Therefore, when a failure of FT1 occurs and is detected, the functioning logic changes as FT2 is started to replace FT1. This constitutes a systemic dependency between the states of FT2 and the states of FT1 and it is represented by arrows in Fig. 30.2. Even if this dynamic dependency seems very simple, it falls outside the Boolean model area and cannot be directly modelled by using common RBDs or fault trees (FTs). The dynamic RBD (see Chap. 27) on the right-hand side of the figure has, for example, to be used instead.

Although the models in Figs. 30.1 and 30.2 propagate the impacts of failures from the input to the output of the model, the possible feedback impacts occurring from the output to the input are not taken into consideration. Yet, when the valve closes, if the process is protected downstream, the pressure is likely to increase upstream and this is another source of potential hazard. This kind of feedback is illustrated in Fig. 30.3 with a simplified pumping system made of two redundant pumps (P1 operating and P2 in cold standby) and a valve in series with the pumps. The functioning logic is similar to the previous example and therefore the previous


Fig. 30.3 Simple pumping system implementing cold standby and with feedback interactions

dynamic RBD can be used as a basis to model the pumping system. However, several dependencies have to be added and they have been represented by dotted lines on the right-hand side of the figure:

• When the valve closes, the flow rate drops to zero and the pump which is in service has to be shut down in order to avoid overheating (dotted line from C to A) and the redundant pump cannot be started (dotted line from C to B).
• When the pump which is in service stops or fails, the level in the tank increases and an overflow may occur as long as the redundant pump is not started (dotted lines from A to the input and from B to the input).
• When the two pumps are stopped or failed, an overflow may occur in the tank if nothing is done to stop the incoming flow. Therefore, the closure of the valve has a feedback impact on the tank level.

The conventional unreliability, availability and failure frequency (see Chap. 4) are the relevant parameters for the above safety and pumping systems, which are binary systems (i.e. systems with only two classes of states: e.g. up and down).

Figure 30.4 moves a step forward with regard to dynamic aspects. Even if the drawing is similar to the models discussed above, the diagram is now a flow diagram (see Sect. 33.1.5 and Chap. 35) and no longer a reliability block diagram. It represents a simplified oil and gas production process:

Fig. 30.4 Simple production system with several production levels (wells: 90 m3/h when activated, 50 m3/h when non-activated; process units PU1, PU2 and PU3, with capacities of 40, 70 and 100 m3/h shown in the figure; customer demand: 30 m3/h; production levels: 90, 70, 50, 40 and 0 m3/h)


• The crude oil comes from the wells, which can produce 50 m3/h in a natural way and 90 m3/h when they are activated (e.g. by using gas-lift).
• Then, the flow is divided between the process units PU1 and PU2, which have different capacities.
• Then, the flow processed by PU1 and PU2 enters PU3.
• Once processed by PU3, it is delivered to the customers.

Therefore, the analysis of the impact of the states of the wells and of the process units leads to the identification of 5 different production levels at the output of the system: 90, 70, 50, 40 and 0 m3/h. The main feedback effect on such a system is that the production levels of the process units and of the wells have to be adapted to each situation. For example:

• In the nominal case, the wells produce 90 m3/h and the capacities of PU1 and PU2 have to be decreased, e.g. to 60 and 30 m3/h.
• When PU2 fails, 70 m3/h can be processed by PU1 and the production rate of the wells and the capacity of PU3 have to be adapted to this value.
• If PU3 fails, the production drops to 0 and PU1, PU2 and the wells have to be stopped.
• When the gas-lift is lost, the production rate of the wells drops to 50 m3/h, PU1 and PU3 have to be adapted to this rate and PU2 has to be stopped.
• Etc.

In addition, Fig. 30.4 illustrates one situation often encountered when producing gas: the customer imposes day by day, by contract, the quantity to be delivered (here 30 m3/h) and the production of the wells and of the process units has to be adapted to this demand. It has to be noted that, in addition to a nominal capacity, a process unit can have a maximum and a minimum capacity which can be used by the operators to find the best configuration corresponding to a given production level. Between the “perfect” state producing 90 m3/h and the completely failed state producing nothing, this production system comprises several degraded production states which have to be taken into consideration. This is a typical multistate system as introduced in Chap. 5.

The above situation is already complicated but the behaviour of actual industrial systems is often more complicated than that, as illustrated in Fig. 30.5 for the same production system where the process unit PU3 produces electricity, fuel gas and gas-lift. The analysis of the main flow described in Fig. 30.4 does not change but new dependencies are introduced between PU3 and PU1, PU2 and the wells. More precisely, PU3 produces:

• the gas-lift (GL) which is used to activate the wells and to increase the flow rate from 50 to 90 m3/h;
• the fuel gas (FG) which is used to feed the gas turbines driving the compressors used in the process unit PU1 and also to produce the electricity with a gas turbine in unit EL;


Fig. 30.5 Simple production system with several interactions between process units (wells, PU1, PU2, PU3, EL, FG and GL)

• the electricity (EL) which is used elsewhere but mainly for electric motor driven compressors used in process unit PU2. The conventional unreliability, availability and failure frequency (see Chap. 4) are no longer relevant parameters for the above production systems which are multistate systems (i.e. systems with more than two classes of states) and it is necessary to switch to the production availability analysis (see Chap. 5).
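The five production levels of Fig. 30.4 can be recovered with a small enumeration sketch. The capacities assigned to the process units are assumptions: 70 m3/h for PU1 follows from the text, while 40 m3/h for PU2 and 100 m3/h for PU3 are inferred from the production levels and from the figure. The interactions of Fig. 30.5 (gas-lift, fuel gas, electricity) are deliberately ignored, and the function name is illustrative.

```python
from itertools import product

WELLS = {"activated": 90, "non-activated": 50}   # m3/h, from the text
PU1, PU2, PU3 = 70, 40, 100                      # m3/h, assumed capacities (see lead-in)

def production(wells_mode, pu1_up, pu2_up, pu3_up, demand=None):
    """Production level (m3/h) for a given state of the wells and process units."""
    upstream = WELLS[wells_mode]
    processing = (PU1 if pu1_up else 0) + (PU2 if pu2_up else 0)   # PU1 and PU2 in parallel
    export = PU3 if pu3_up else 0                                  # PU3 in series
    level = min(upstream, processing, export)
    return level if demand is None else min(level, demand)

levels = sorted(
    {production(w, a, b, c) for w in WELLS for a, b, c in product([True, False], repeat=3)},
    reverse=True)
print(levels)
```

Passing `demand=30` caps the level at the quantity imposed by the customer, which is the situation shown on the right-hand side of Fig. 30.4.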

30.1.2 Dynamic Aspect Linked to System Maintenance

In the previous Sect. 30.1.1, only the dynamic aspects linked to the way systems are operated have been considered, but another source of dynamic aspects is related to the way the system is maintained. The maintenance is split between preventive and corrective maintenance (see Chap. 4). The preventive maintenance is planned whereas the corrective maintenance is performed only when some failure has occurred. With regard to production and safety, both preventive and corrective maintenance are normally performed in order to ensure maximum production in safe conditions. When a failure occurs, this means reaching a safe state first and then recovering the maximum production as soon as possible. The maintenance philosophy of the plant in which the system is installed describes how to proceed to reach this target. It should, for example and as illustrated in Fig. 30.6, define in which order the failures have to be repaired. The safety related failures have to be repaired first (e.g. the failure of a safety system), then the failures for which the production is lost (e.g. a valve which untimely closes on a production pipeline), then the failures for which no production is lost but which are likely to lead to problems in the medium term (e.g. a valve stuck open on a production pipeline) and finally the failures which are related to something else (e.g. a burnt-out bulb). It has to be noted that, the maintenance teams being busy at any


30 Introduction to Dynamic Systems and Stochastic Processes

Fig. 30.6 Priority for repairs (flowchart: after a failure and its mitigation action, a safety related failure gets priority 1; a production related failure gets priority 2 when production is lost and priority 3 otherwise; any other failure gets priority 4)

time, when a failure occurs it has to wait its turn to be repaired. Therefore, except for the safety related failures with the highest priority, the usual assumption that there are as many repair teams as needed to repair all the failures at the same time is generally not realistic. When a failure has occurred, it cannot be repaired if other failures have a higher priority, nor if the spare part or the maintenance support needed to undertake the maintenance operations is not available. This is illustrated in Fig. 30.7. Spare part management is an important part of the maintenance philosophy and it may have a strong impact on the production availability of a production system: too many spare parts are costly, but a lack of spare parts delays the restoration of the system and this is also very costly. When a spare part is available, it may be necessary to mobilise a maintenance support (e.g. a specific tool, a specific maintenance specialist, a support vessel) before undertaking the maintenance. This may be time-consuming, for example when a dynamic positioning support vessel has to be mobilised to undertake a subsea intervention. In fact, for a detailed analysis of the corrective maintenance, many other elements have to be considered:

30.1 Miscellaneous Dynamic Aspects


Fig. 30.7 Spare part management and maintenance support mobilisation (flowchart: for a failure to be repaired, a spare part is ordered if none is available, the maintenance support is mobilised if it is not available, and the repair can start when there is no failure with a higher priority)

• the work order management and the related administrative delays described in Chap. 4;
• the diagnosis of the failure in order to proceed to a relevant maintenance. When operating remote installations, this can be difficult and time-consuming. For example, when a gas cloud is detected around an unmanned platform, it is necessary to remotely stop the leak before sending a maintenance team on board;
• the shift crew rotation and the rhythm of the work;
• the suspension of maintenance operations during the night when the workers do not work at night (this introduces a dynamic aspect practically not tractable with analytical methods);
• the transportation (e.g. by car, boat, helicopter) to the location where the failure has to be repaired;
• the meteorological conditions, which can have an impact on transportation delays and operations (e.g. when using a dynamic positioning vessel);
• etc.

Figure 30.8 illustrates that, when the maintenance is planned, it is planned in a period of time when it cannot worsen an already degraded situation. For example:
• In Figs. 30.1 or 30.2, the periodic proof test of the flow transmitter FT1 should be delayed when FT2 is under repair (and vice versa) in order to avoid making the safety system unavailable when it is already in a degraded state.
• In Fig. 30.3, the preventive maintenance of P1 should be delayed when P2 is under repair (and vice versa) in order to avoid making the pumping system unavailable when it is already in a degraded state.


Fig. 30.8 Planned maintenance organization (flowchart: if the planned maintenance has a potential detrimental impact, wait for a better situation; otherwise the planned maintenance starts)

• In Figs. 30.4 and 30.5, the preventive maintenance of PU1 should be delayed when PU2 is under repair (and vice versa) in order not to lose the remaining part of the production which is still available. But when PU3 is under repair, this may be an opportunity to undertake the preventive maintenance of PU1 or PU2 because the production is already lost.

The various examples above give an overview of the dynamic aspects related to the way a system is operated and maintained. They range from very simple (quasi static) to very complicated (even complex) situations which have to be considered when detailed analyses are performed.

30.2 Notion of Stochastic (Random) Processes

All the examples described above have several states and they move (jump) from state to state according to the events which occur (failures, repairs, planned actions, etc.). Figure 30.9 gives an example of the behaviour of a simple system made of two redundant binary components A and B. Such a system has 4 states which can be gathered into two different classes: up (available) and down (unavailable). Observing it over a period of time leads to a chronogram like the one sketched in Fig. 30.9. This chronogram is a so-called trajectory of the underlying random (stochastic) process related to the failures and repairs of components A and B. It represents one history which can be observed among all the possible histories of the system over this period of time. In this example, the system is repaired even when it fails completely, and this chronogram is typical of the behaviour of a repaired item when its availability is considered. When the system reliability is considered, only the first complete failure is of interest (see Sect. 4.9) and the chronogram has to be reduced to the one proposed in Fig. 30.10 because the behaviour after the first failure does not matter when reliability


Fig. 30.9 Typical behaviour of a binary system when availability is considered (chronogram of the system states, up or down, versus time up to T)

Fig. 30.10 Typical behaviour of a binary system when reliability is considered (chronogram of the system states, up or down, versus time up to T)

calculations are performed. This truncation after the first failure introduces a systemic dependency, as the components of the system are repaired only if a complete failure has not occurred before: in the example, state $\bar{A}B$ and state $A\bar{B}$ can be repaired but not state $\bar{A}\bar{B}$. This is one of the main sources of difficulty when calculating the reliability or the unreliability of repaired items. It has already been analysed with the Boolean approaches (Chap. 22) and will also be analysed in detail in the context of dynamic models. Figure 30.11 illustrates the case of a safety system with dangerous states inhibiting the safety action and safe states favouring the safety action (see Chap. 4). To simplify this figure, all the safe states have been gathered in a single one (state class Up). Therefore, such a system has three state classes and is no longer a binary item: this is a multistate system. Figure 30.12 represents the behaviour of a production system with many degraded production states between the perfect state and the completely failed state. This is the


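Fig. 30.11 Typical behaviour of a safety system with dangerous and safe states (chronogram of the state classes up, dangerous and safe versus time up to T)

Fig. 30.12 Typical behaviour of a production system (chronogram of the production states, i.e. flow rate levels from the perfect state E1 through the degraded states E2, E3, ... to the failed states En, versus time up to T)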

typical behaviour of systems like those analysed in Figs. 30.4 or 30.5 in Sect. 30.1.1 above. Again, this is the typical behaviour of a multistate system.


30.3 Dynamic Methods and Tools

Figure 30.13 shows the place of the dynamic models within the whole corpus of methods and tools implemented to perform probabilistic calculations (see Chap. 6). The two approaches developed in Chaps. 31 and 33 are identified on the right-hand side of the figure:
• Markovian approach;
• Stochastic Petri nets, which belong to the behavioural approaches.
Both of them are state-transition models and, mathematically speaking, they are particular cases of finite state automata (Wikipedia FSM 2020; Carroll and Long 1989), i.e. automata with a finite number of states.

Markovian approach (Chap. 31): this is the oldest and simplest one. It is based on analytical calculation. It makes it easy to model multiphase and multistate items (see Chaps. 5 and 6). Unfortunately, the size of the model (i.e. the number of states) increases exponentially with the number of components of the modelled system, and so this approach is limited to small complex systems. Therefore, it is mainly of interest for pedagogical purposes (e.g. to explain the various reliability concepts introduced in Chap. 4) and for use in association with other modelling techniques like fault trees (FT-driven Markov processes) or reliability block diagrams (RBD-driven Markov processes), as described in Chap. 27. This is illustrated by the rectangle in grey in

Fig. 30.13 Dynamic models within the corpus of methods and tools (diagram labels: probabilistic models, analytical approaches, Monte Carlo simulation, Taylor expansion, simplified formulae, specific formulae, Boolean approaches (RBD, FT, ET), Markovian approaches (Markov graphs), behavioural approaches (Petri nets), static models, dynamic models, state-transition models (finite state automata), generic tools, state of the art)


Fig. 30.13. In the simplest case, this can be used to develop the simplified formulae mentioned on the left-hand side of the figure.

Stochastic Petri nets (Chap. 33): they push the modelling beyond the limits of the Markovian approach, but the price to pay is to abandon the analytical calculations in favour of Monte Carlo simulation (Chap. 32) and to accept longer calculations. The size of the model increases linearly with the number of components of the modelled system, so very big systems can be handled with this approach. Petri nets are more and more used as they allow random and deterministic events to be mixed, and their modelling power is virtually unlimited.

Compared to the analytical approach, Monte Carlo simulation has long had the reputation of being a dirty approach, and this is often still the feeling in the university context. This position was relevant 30 or 40 years ago, when even mainframe computers were not fast enough to perform accurate Monte Carlo simulations, but the situation has changed due to the incredible increase of the computation speed. This opportunity has been seized by engineers and Monte Carlo simulation has become one of their favourite techniques to handle complicated models. This is also the case for reliability engineers, for whom this technique is now available on simple personal computers. However, while the analytical approach is beloved by the university (especially the Markovian approach), it generally seems unclear to engineers and, while Monte Carlo simulation is in line with the engineer's way of thinking, the university is often suspicious about it. Even if this opposition tends to soften nowadays, it is regrettable because these two techniques are in fact complementary with regard to probability calculations:
• analytical calculations: the accuracy increases when the probabilities decrease;
• Monte Carlo simulation: the accuracy increases when the probabilities increase.
Therefore, the efficiency of these techniques depends on the probability level to be calculated and it is the analyst's job to choose the relevant one: the analytical technique should be preferred for safety related calculations, where the probabilities of failure are normally (and fortunately) low, whereas Monte Carlo simulation should be preferred for non-safety related calculations (e.g. production availability), where the probabilities of failure may be high and even equal to 1, in which case failure frequencies have to be considered.
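The statement that the accuracy of Monte Carlo simulation increases with the probability to be estimated can be illustrated with a minimal sketch. It assumes the standard estimator (the fraction of simulated histories in which the event is observed) and its binomial standard error; this estimator, the function names and the numerical values are illustrative assumptions, not taken from the text above.

```python
import math

def relative_standard_error(p, n_histories):
    """Relative standard error of the estimate k/n of a probability p,
    where k counts the histories in which the event is observed."""
    return math.sqrt((1.0 - p) / (n_histories * p))

def histories_needed(p, target_error):
    """Number of histories needed to reach a target relative standard error."""
    return math.ceil((1.0 - p) / (p * target_error ** 2))

if __name__ == "__main__":
    # Hypothetical probability levels, from "production" to "safety" orders of magnitude.
    for p in (1e-1, 1e-2, 1e-4):
        print(p, relative_standard_error(p, 100_000), histories_needed(p, 0.1))
```

For a fixed number of histories, the relative error grows as the probability decreases, which is why low safety related probabilities are costly to estimate by simulation.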

30.4 Systems Typology to Select a Relevant Approach

As said above and as explained in Chap. 22, even the static (Boolean) approaches are able, to some extent, to handle simple dynamic aspects. Therefore, a question that analysts have to ask themselves when undertaking a study is whether a dynamic model is needed or whether a simpler model can reasonably be used. This depends both on


the nature of the system itself and on the type of results to be obtained. In fact, two determining axes can be identified:
• whether or not the components of the system are independent from each other;
• whether the components of the system are repaired or non-repaired.

The mathematical approaches used to calculate probabilities are generally based on the strong assumption that the components are independent. This means that they behave independently from each other and that the state of a given component has no impact on the behaviour of the other components. Of course, this is somewhat theoretical, as no industrial system fully complies with this assumption, which is true only to some extent. When the dependencies are weak, they can be neglected, but when they are not negligible, they have to be explicitly modelled.

Except in particular cases, the failures of industrial systems are generally repaired when they occur and therefore these systems are dynamic to some extent. Nevertheless, in most cases, steady states are reached and, in some way, the systems become static. For example, the steady state average unavailability of an item with a failure rate λ and a repair rate μ is approximately equal to λ/μ when λ ≪ μ (see Chap. 22). Therefore, although the failure/repair process is dynamic, the process related to the average unavailability becomes static as soon as the steady state is reached. The faster the failures are detected and repaired, the faster the steady state is reached. It is this property which allows fault trees or reliability block diagrams to be used, to some extent, to handle repaired systems. Table 30.1 compares the usefulness of FTs and of dynamic models according to the two features identified above. When dealing with safety systems, the probabilities of failure are low and, from a mathematical point of view, the common approximations used to calculate the probabilities work well.
In addition, safety systems are generally repaired in priority and this reduces the dependencies due to a limited number of repair teams. Therefore, fault trees work rather well for unavailability calculations of safety systems and this is why they have been successfully used for that purpose for a long time. Unreliability calculations with FTs are more difficult due to the systemic dependency introduced by the consideration of the first complete failure only (see Fig. 30.10). Nevertheless, when the failures are quickly detected and repaired, approximations are available (see Chap. 22) which work pretty well.

Table 30.1 Typology of systems

Case  Independent components  Repaired components  Fault tree            Dynamic models
1     Yes                     No                   Exact results         Sledgehammer!
2     Yes                     Yes                  Approximated results  Useful
3     No                      No                   Doubtful result       Necessary
4     No                      Yes                  Doubtful result       Necessary
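The steady state argument used above for case 2 can be checked numerically with a minimal sketch. It assumes a single item that is up at t = 0, with constant failure rate λ and repair rate μ, for which the instantaneous unavailability and its limit λ/(λ + μ) have well known closed forms; the numerical values and function names are hypothetical.

```python
import math

def unavailability(t, lam, mu):
    """Instantaneous unavailability U(t) of one repairable item, up at t = 0."""
    s = lam + mu
    return lam / s * (1.0 - math.exp(-s * t))

def steady_state_unavailability(lam, mu):
    """Limit of U(t) when t tends to infinity."""
    return lam / (lam + mu)

LAM, MU = 1e-4, 1e-1  # hypothetical rates (per hour)

exact = steady_state_unavailability(LAM, MU)
approx = LAM / MU                          # approximation valid when LAM << MU
relative_gap = (approx - exact) / exact    # algebraically equal to LAM / MU

# Fraction of the steady state value reached after 5 time constants 1/(LAM + MU).
reached = unavailability(5.0 / (LAM + MU), LAM, MU) / exact
```

With these values the approximation λ/μ overestimates the exact limit by a relative amount equal to λ/μ, and the steady state is practically reached after a few mean repair times, which is what makes a static model acceptable.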


Therefore, the implementation of dynamic models is mainly of interest for cases 3 and 4 of Table 30.1, for which the use of a Boolean approach is likely to produce irrelevant results. Of course, dealing with multistate systems (e.g. for a production availability purpose) requires the use of dynamic models in any case. It is of utmost importance for the analyst to know the pros and cons as well as the possibilities and limits of the various approaches in order to make an informed choice when selecting one of them for a particular study.

References

Birolini A (2014) Appendix 7 in Reliability engineering: theory and practice, 7th edn. Springer Science and Business Media, Dordrecht
Carroll J, Long D (1989) Theory of finite automata with an introduction to formal languages. Prentice Hall, Englewood Cliffs
Çinlar E (2013) Introduction to stochastic processes. In: Dover books on mathematics, Reprint edn
Cocozza Thivent C (1997) Processus stochastiques et fiabilité des systèmes. Springer-Verlag, Berlin
Coleman R (1974) Stochastic processes. Springer. ISBN 978-0-04-519017-1
Karlin S, Taylor HM (1975) A first course in stochastic processes. Academic Press
Karlin S, Taylor HM (1981) A second course in stochastic processes. Academic Press
Wikipedia FSM (2020) https://en.wikipedia.org/wiki/Finite-state_machine. Accessed Sept 2020
Žitković G (2010) Introduction to stochastic processes: lecture notes. Department of Mathematics, The University of Texas at Austin

Chapter 31

Markovian Modelling

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_31

31.1 Basis of the Classical Markov Approach

31.1.1 Introduction and Overview of the Markovian Approach

The Markovian approach is a popular approach widely used to perform probabilistic calculations in numerous scientific fields. This includes reliability engineering: its use for dependability purposes is described in IEC 61165 (2006), for functional safety purposes in ISO/TR 12489 (2013) and for both in Signoret (2005). The Markovian approach is based on Markov processes, which are a particular class of random processes developed at the beginning of the twentieth century by the mathematician Andrey Andreyevich Markov (1856–1922). The specific property of Markov processes is that they are memoryless: the future of the process from a time t depends only on the state of the process at this time t and not at all on the way this state has been reached.

As illustrated in Fig. 31.1, the Markovian approach belongs to the analytical state-transition approaches. The principle is to identify the states of the system under consideration and to analyse how the system moves from one state to another. This leads to behaviours similar to those represented by the chronograms proposed in Chap. 30. Its graphical representation (called Markov graph) is a very interesting feature which makes it possible to illustrate in a simple way the core concepts described in a theoretical way in Chap. 4. This is done hereafter in 31.3 for many of them. The graphical representation is also a big advantage for the analysts, but it may encourage them to use this approach as a black box. As with any other modelling and calculation technique, this can lead to irrelevant results and a minimum knowledge of


Fig. 31.1 Location of the Markovian approach among the various probabilistic models (diagram labels: probabilistic models, analytical approaches, Monte Carlo simulation, Taylor expansion, simplified formulae, specific formulae, Boolean approaches (RBD, FT, ET), Markovian approaches (Markov graphs), behavioural approaches (Petri nets), static models, dynamic models, state-transition models (finite state automata), generic tools, state of the art)

the underlying mathematics is needed: analytical calculations are described hereafter in 31.4.

The memoryless property is another important feature. It opens the way to model multiphase systems, i.e. systems with several functioning phases occurring in sequence. This is very useful to model the periodically tested systems encountered when dealing with safety analyses in general and functional safety in particular (see Chaps. 6 and 36).

Although the Markovian approach deals primarily with the probability of the system being in a given state at a given time t, it can easily be extended to the calculation of the accumulated time spent in a given state. This opens the way to the modelling of multistate systems, i.e. systems with more than two classes of states. This makes it possible to take degraded states into consideration, which is very effective to model, e.g., production systems and to extend the concept of availability to that of production availability (see Chaps. 5 and 35).

Exercises related to the Markovian approach are proposed and developed in Chap. 34. The list of these exercises is provided in Sect. 31.9 with a brief description of each of them and links to the relevant sections or subsections. In order to illustrate this chapter with relevant curves, all the Markovian calculations have been carried out with the free version of the GRIF workshop software package (GRIF-Workshop 2020).


31.1.2 Graphical Representation of Markov Process

The system illustrated in Fig. 31.2 is very simple but it is sufficient to explain the principle of the Markovian approach. It is proposed in the form of a generic reliability block diagram which can model any system made of two redundant components (e.g. two similar pumps, two similar sensors, etc.). Only one repair team is available and therefore, when both components are failed at the same time, only one failure can be repaired at once and a maintenance policy has to be defined: it has been chosen to repair component B first. This implies that when A is failed it is repaired only if B is in the up state. The state of A depends on the state of B and this introduces a systemic dependency between components A and B. Due to this dependency, the system becomes dynamic and conventional static models like reliability block diagrams or fault trees are no longer able to model it. Fortunately, this is rather simple when the Markovian approach is used.

A Markovian process is a state-transition process which can be represented by a state-transition graph showing the states of the system and the transitions between these states. The graphic conventions to draw such a graph are the following:
• states are represented by circles;
• transitions are represented by arrows.
Therefore, the first step is to identify and draw the various system states and the second one is to identify the transitions between the states. When the system is made of a small number of binary items, it is rather easy to identify the states, as a simple truth table can be built. This is shown in Fig. 31.2: the system made up of 2 components A and B has $2^2 = 4$ states: $AB$, $\bar{A}B$, $A\bar{B}$, $\bar{A}\bar{B}$.

Fig. 31.2 Simple dynamic system: 2 redundant components with repair priority

When the system is less simple and comprises many states, it is better to identify the states step by step when building the Markov graph:
1. Identify and draw the nominal state;
2. Identify the states toward which the system can jump from this state and
   – draw these new states;
   – draw the corresponding transitions.
3. Choose one state not analysed yet and continue at step 2;
4. Stop when no more states and transitions are identified.

Using this algorithm with the example presented in Fig. 31.2 leads to start with state $AB$ and to draw it (step 1):
• from state $AB$ component A can fail and this leads to state $\bar{A}B$: this allows this new state and the transition from $AB$ to $\bar{A}B$ to be drawn;
• from state $AB$ component B can fail and this leads to state $A\bar{B}$: this allows this new state and the transition from $AB$ to $A\bar{B}$ to be drawn;
• all possible transitions from $AB$ being identified, a new starting state can be chosen: for example, $\bar{A}B$;
• from state $\bar{A}B$ component B can fail and this leads to state $\bar{A}\bar{B}$: this allows this new state and the transition from $\bar{A}B$ to $\bar{A}\bar{B}$ to be drawn;
• from state $\bar{A}B$ component A can be repaired and this leads to state $AB$: this state already exists and only the new transition from $\bar{A}B$ to $AB$ is drawn;
• the same analysis can be done with state $A\bar{B}$: no new states are identified and the two new transitions from $A\bar{B}$ to $\bar{A}\bar{B}$ and from $A\bar{B}$ to $AB$ are drawn;
• then state $\bar{A}\bar{B}$ is considered: no new state is identified and from this state only B can be repaired, which leads to a transition from $\bar{A}\bar{B}$ to $\bar{A}B$;
• no more states being left to analyse, the process stops and the Markov graph is completed.

The above algorithm can be used to automatically generate large Markov graphs from higher level models like those based on Petri nets (see Chap. 33) or formal languages (see, for example, the AltaRica language (Rauzy et al. 1998; Boiteau et al. 2006; Brameret et al. 2015)).

When the states and the transitions are identified, a state-transition graph is obtained but, to really obtain a Markov graph, it remains to indicate the probability that, when in a given state, the system jumps out of this state to another.
This is done by assigning a transition rate, $\alpha_{i,j}(t)$, to each transition:

Transition rate from state i to state j, $\alpha_{i,j}(t)$: parameter such that $\alpha_{i,j}(t) \cdot dt$ is the probability to move from i to j within the interval [t, t + dt], provided the system is in state i at time t.

Two cases of Markov processes have to be considered:
• Semi-Markov processes: this is the general case where $\alpha_{i,j}(t)$ depends on the way state i has been reached. They are very difficult to handle in an analytical way and Monte Carlo simulation is generally used for that purpose (see Chap. 32).
• Homogeneous Markov processes: in this case the transition rates are constant, i.e. $\alpha_{i,j}(t) \equiv \alpha_{i,j}$, and the calculations are easier and can be undertaken analytically.
Both semi-Markov and homogeneous Markov processes are memoryless but, in the present chapter, only the homogeneous Markov processes (Markov processes in short) are considered. In this case, the transition rates related to failures are equivalent to constant failure rates and the transition rates related to repairs are equivalent to repair rates (see Chap. 4).


Fig. 31.3 Markov graph related to the system presented in Fig. 31.2

Applying the above rules to the system presented in Fig. 31.2 leads to the Markov graph presented in Fig. 31.3. In this Markov graph, the systemic dependency due to the single repair team and the maintenance policy results only in the absence of a transition from state $E_4$ to $E_2$. This is extremely simple for a dependency that is impossible to model with the Boolean approaches. On this graph, the states have been split into two classes:
• up state class: states $E_1$, $E_2$ and $E_3$;
• down state class: state $E_4$.
This splitting into two classes is the basis for using the Markov approach to calculate the various probabilistic parameters relevant within reliability analyses. Exercise 31.1 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.
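The step-by-step construction of the Markov graph described in this subsection can be sketched as a breadth-first exploration of the states. The following is only an illustration under stated assumptions: a state is encoded as the set of failed components (an encoding chosen here, not the E1 to E4 numbering of Fig. 31.3), and the rate values are hypothetical.

```python
from collections import deque

# Hypothetical constant rates (per hour); not taken from the source.
LAMBDA = {"A": 1e-3, "B": 2e-3}   # failure rates
MU = {"A": 1e-1, "B": 1e-1}       # repair rates

def successors(state):
    """Yield (next_state, rate) pairs; a state is the frozenset of failed components."""
    for comp in ("A", "B"):
        if comp not in state:          # a working component can fail
            yield state | {comp}, LAMBDA[comp]
    if "B" in state:                   # single repair team: B is repaired first
        yield state - {"B"}, MU["B"]
    elif "A" in state:                 # A is repaired only when B is up
        yield state - {"A"}, MU["A"]

def build_markov_graph(nominal=frozenset()):
    """Explore the states step by step, starting from the nominal state."""
    transitions = {}
    todo = deque([nominal])
    seen = {nominal}
    while todo:                        # stop when no state is left to analyse
        state = todo.popleft()
        for nxt, rate in successors(state):
            transitions[(state, nxt)] = rate
            if nxt not in seen:        # a new state is drawn only once
                seen.add(nxt)
                todo.append(nxt)
    return seen, transitions

states, transitions = build_markov_graph()
```

The repair priority appears only in `successors`: from the state where both components are failed, the only transition generated is the repair of B, so no transition toward the state "B failed alone" exists.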

31.2 Mathematical Foundations

31.2.1 Basic Formula for Time-Dependent Calculations

A Markov process is a stochastic process (also called random process) that satisfies the Markov property, i.e., if the system is in a given state i at time t, the probability for the system to be in a specific state j in the future depends on state i only. As said above, this property is also named the memoryless property (or sometimes memorylessness): given the present state, the past and the future of the process are independent. Therefore, the history of the system until time t does not matter with regard to the states which will be reached in the future. This is a very important property which will be used for modelling multistate items (see 31.5.3). A Markov graph like the one in Fig. 31.3 describes a Markov process where the time is continuous and the number of states is countable. As the transition rates are constant,


Fig. 31.4 Basis for establishing the basic Markovian formula

it is related to a homogeneous Markov process. It embeds all the sequences of events which can be observed over a given period as, for example: $E_1 \rightarrow E_2 \rightarrow E_1 \rightarrow E_3 \rightarrow E_4 \rightarrow \cdots$ Each sequence is a trajectory of the stochastic process (i.e. a history of the related system). This constitutes a so-called Markov chain. When the Markov process is homogeneous, the resulting Markov chains are also homogeneous. Markov chains are presented in IEC 61165 (2006) and mathematical developments are available, for example, in Çinlar (1975), Pagès and Gondran (1986) or Leroy (2018).

Let us consider a system S with n countable states and let us note:
• $\mathrm{Pr}_i(t)$ the probability of S being in state i at time t;
• $\alpha_{k,i}(t)$ the transition rate from state k to state i.
The Markov graph related to such a system and with regard to a given state i is illustrated in Fig. 31.4. The idea to establish the general basic formula is to use the transition rates and the probabilities of the system states at time t to calculate the probabilities of the system states at time t + dt. From the definition of a transition rate, $\alpha_{k,i} \cdot dt$ is the conditional probability to move from state k to state $i \neq k$ within the interval [t, t + dt]. Therefore, the probability to move to state i from any state k is equal to the sum of the possible jumps toward i:

$$\sum_{k \neq i} \alpha_{k,i} \cdot \mathrm{Pr}_k(t) \cdot dt \qquad (31.1)$$

In the same way, the probability to move from state i to another state $j \neq i$ is equal to the sum of the possible jumps out of i:

$$\Big[\sum_{j \neq i} \alpha_{i,j}\Big] \cdot \mathrm{Pr}_i(t) \cdot dt = \alpha_i \cdot \mathrm{Pr}_i(t) \cdot dt \qquad (31.2)$$

where $\alpha_i = \sum_{j \neq i} \alpha_{i,j}$ is the transition rate out of state i. Then the probability to remain in state i over the interval [t, t + dt] is equal to:

$$(1 - \alpha_i \cdot dt) \cdot \mathrm{Pr}_i(t) \qquad (31.3)$$

Finally, the probability $\mathrm{Pr}_i(t + dt)$ of the system being in state i at time t + dt is the sum of:
• the probability to move to state i (Formula 31.1);
• the probability to stay in state i (Formula 31.3).
This leads to the following formula:

$$\mathrm{Pr}_i(t + dt) = \sum_{k \neq i} \alpha_{k,i} \cdot \mathrm{Pr}_k(t) \cdot dt + (1 - \alpha_i \cdot dt) \cdot \mathrm{Pr}_i(t) \qquad (31.4)$$

It has to be noted that the probability of a double jump within the same time increment dt is of the second order compared to that of a single jump. It is therefore negligible and has been neglected. From Formula (31.4), it follows that:

$$\mathrm{Pr}_i(t + dt) - \mathrm{Pr}_i(t) = -\alpha_i \cdot dt \cdot \mathrm{Pr}_i(t) + \sum_{k \neq i} \alpha_{k,i} \cdot \mathrm{Pr}_k(t) \cdot dt$$

$$\frac{\mathrm{Pr}_i(t + dt) - \mathrm{Pr}_i(t)}{dt} = -\alpha_i \cdot \mathrm{Pr}_i(t) + \sum_{k \neq i} \alpha_{k,i} \cdot \mathrm{Pr}_k(t)$$

And finally, this gives the derivative $\frac{d\mathrm{Pr}_i(t)}{dt}$ of the probability for the system to be in state i at time t:

$$\frac{d\mathrm{Pr}_i(t)}{dt} = -\alpha_i \cdot \mathrm{Pr}_i(t) + \sum_{k \neq i} \alpha_{k,i} \cdot \mathrm{Pr}_k(t) \qquad (31.5)$$

It has to be noted that the above calculations are based on the assumption that the probabilities are differentiable. This assumption generally holds with the usual probabilistic laws and, in particular, in the case of homogeneous Markov processes. Formula (31.5) provides n different equations which can be presented in a vectorial form:

$$\frac{d\overrightarrow{\mathrm{Pr}}(t)}{dt} = M \cdot \overrightarrow{\mathrm{Pr}}(t) \qquad (31.6)$$

with:
– $\overrightarrow{\mathrm{Pr}}(t)$: the column vector of the probabilities of the states of the system.
– $M$: the square (n, n) Markovian matrix.
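Formula (31.4) can be applied directly as a numerical scheme by replacing dt with a small finite time step. The sketch below does this for the two-component system with repair priority of Fig. 31.2; the state order, the rate values and the step size are assumptions chosen for illustration, and this simple explicit scheme is not claimed to be the method used by any particular tool.

```python
# State order (illustrative): both up, A failed, B failed, both failed.
LAM_A, LAM_B, MU_A, MU_B = 1e-3, 2e-3, 1e-1, 1e-1  # hypothetical rates per hour

# RATES[k][i] is the transition rate alpha_{k,i} from state k to state i.
RATES = [
    [0.0, LAM_A, LAM_B, 0.0],   # from both up
    [MU_A, 0.0, 0.0, LAM_B],    # from A failed
    [MU_B, 0.0, 0.0, LAM_A],    # from B failed
    [0.0, MU_B, 0.0, 0.0],      # from both failed: only B is repaired
]

def step(prob, dt):
    """Apply Formula (31.4) once to every state."""
    n = len(prob)
    new = []
    for i in range(n):
        inflow = sum(RATES[k][i] * prob[k] for k in range(n) if k != i)
        alpha_i = sum(RATES[i][j] for j in range(n) if j != i)
        new.append(inflow * dt + (1.0 - alpha_i * dt) * prob[i])
    return new

def state_probabilities(t_end, dt=0.1):
    """State probabilities at t_end, starting from the nominal state."""
    prob = [1.0, 0.0, 0.0, 0.0]
    for _ in range(round(t_end / dt)):
        prob = step(prob, dt)
    return prob

probs = state_probabilities(1000.0)
unavailability = probs[3]   # probability of the down state class
```

The time step has to remain small compared to the mean sojourn times (here alpha_i · dt is about 0.01 at most), otherwise the neglected double jumps are no longer negligible.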


Fig. 31.5 Markovian matrix related to the Markov graph in Fig. 31.2

Formula (31.6) is the vectorial form of a set of first order homogeneous linear differential equations with constant coefficients. This set mathematically defines the Markovian process. Figure 31.5 gives an example of the Markovian matrix related to the Markov graph in Fig. 31.2. Element $m_{ij}$ in line i, column j ≠ i corresponds to the transition rate from state j to state i, so that line i of M reproduces the differential equation of $Pr_i(t)$. Element $m_{ii}$ on the diagonal of the matrix is equal to $-\alpha_i$, where $\alpha_i$ is the sum of the transition rates leaving state i. The sum $\sum_i m_{ij}$ of the elements of a same column is therefore equal to 0, and this is the characteristic of a Markovian matrix. This is also the characteristic of a singular matrix whose determinant is equal to 0: the matrix is not invertible and the differential equations are not linearly independent.

In addition, each state being linked to only a few other states, numerous $m_{ij}$ are equal to 0. About one third of the elements are equal to zero in Fig. 31.5, but this proportion increases rapidly when the number of states increases and this leads to a so-called sparse matrix. This implies that appropriate mathematical techniques have to be used to calculate the state probabilities from Formula (31.6). They are briefly described in Sect. 31.4, but Formula (31.6) can be analysed in a general way hereafter. If it were a scalar differential equation, and provided that initial conditions were given, the solution would be a simple exponential law. In the vectorial space, it is the same but with the exponential of a matrix:

$$\overrightarrow{Pr}(t) = e^{t \cdot M}\,\overrightarrow{Pr}_0 \tag{31.7}$$
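The structure of the Markovian matrix can be checked on a small example. The sketch below is only an illustration: Fig. 31.2 and Fig. 31.5 are not reproduced here, so the four-state graph (two repairable components A and B, repair priority on B when both have failed), the numerical rates and the names `markov_matrix` and `TRANSITIONS` are assumptions inferred from the formulae of this chapter.

```python
# Hypothetical rates (per hour); the graph below is an assumed reading of Fig. 31.2.
LAMBDA_A, LAMBDA_B = 1.0e-3, 2.0e-3   # failure rates of A and B
MU_A, MU_B = 0.1, 0.2                 # repair rates of A and B

# (source state, destination state, rate); states E1..E4 are indexed 0..3.
TRANSITIONS = [
    (0, 1, LAMBDA_B), (0, 2, LAMBDA_A),   # first failure from the perfect state E1
    (1, 0, MU_B), (2, 0, MU_A),           # repair of the single failed component
    (1, 3, LAMBDA_A), (2, 3, LAMBDA_B),   # second failure: system down (E4)
    (3, 2, MU_B),                         # assumed repair priority on B
]


def markov_matrix(n, transitions):
    """Build M such that dPr/dt = M . Pr for a column vector Pr."""
    m = [[0.0] * n for _ in range(n)]
    for src, dst, rate in transitions:
        m[dst][src] += rate   # inflow into dst coming from src
        m[src][src] -= rate   # diagonal term: minus the total rate leaving src
    return m


M = markov_matrix(4, TRANSITIONS)
column_sums = [sum(row[j] for row in M) for j in range(4)]
zero_share = sum(x == 0.0 for row in M for x in row) / 16

print("column sums:", column_sums)
print("share of zero elements:", zero_share)
```

Each column sums to zero (up to rounding), which is the singularity property discussed above, and the share of zero elements can be read directly from the matrix.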

This formula has almost the same properties as in the scalar world. The difference is that the multiplication of matrices is not commutative (the set of square matrices forms a non-commutative ring). Therefore, the order of the terms has to be taken into account in the calculations. Except for this difficulty, this formula can be handled as an ordinary exponential. This is done in the following formula, where t has been replaced by (t − θ) + θ; the splitting of the exponential is valid here because (t − θ)·M and θ·M commute:

$$\overrightarrow{Pr}(t) = e^{[(t-\theta)+\theta] \cdot M}\,\overrightarrow{Pr}_0 = e^{(t-\theta) \cdot M}\, e^{\theta \cdot M}\,\overrightarrow{Pr}_0 = e^{(t-\theta) \cdot M}\,\overrightarrow{Pr}(\theta) \tag{31.8}$$

This illustrates the fact that, when the probability vector of the states is known for a given time θ, this is sufficient to calculate what is going to happen from this time onwards, as $\overrightarrow{Pr}(\theta)$ becomes a new initial condition. Therefore, this confirms the memoryless property of the Markov processes. The knowledge of the state probabilities leads directly to the probability to be in an up state or to the probability to be in a down state; whether this leads to availability or reliability results is discussed in the sections hereafter.
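The restart property expressed by Formula (31.8) can be illustrated numerically. The sketch below assumes a single repairable item with two states (up and down), a failure rate `LAM` and a repair rate `MU`; these values and the function name `up_probability` are hypothetical. It relies on the standard closed-form solution of the scalar equation dPr1/dt = μ − (λ + μ)·Pr1.

```python
import math

LAM, MU = 0.01, 0.1   # hypothetical failure and repair rates (per hour)


def up_probability(p_up_start, duration):
    """Probability of the up state after `duration`, starting from p_up_start."""
    steady = MU / (LAM + MU)
    return steady + (p_up_start - steady) * math.exp(-(LAM + MU) * duration)


t, theta = 20.0, 8.0
direct = up_probability(1.0, t)                  # from the origin up to t
at_theta = up_probability(1.0, theta)            # state probability at theta
restarted = up_probability(at_theta, t - theta)  # restart from Pr(theta)

print(direct, restarted)
```

Both values agree up to rounding: the probability vector at θ is the only information needed to continue the calculation.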


Beyond the state probabilities, Formula (31.6) can also be used to calculate the accumulated sojourn times (ASTs) spent in the various states over a given period of time. This can be done by noticing that $Pr_i(t) \cdot dt$ is the mean time spent in state i within the interval [t, t + dt]. Therefore, the accumulated sojourn time spent in this state over [0, T] is the sum of all the $Pr_i(t) \cdot dt$ over this interval, i.e. the integral of $Pr_i(t) \cdot dt$ over [0, T]. This leads to the following vectorial equation:

$$\overrightarrow{AST}(T) = \int_0^T \overrightarrow{Pr}(t) \cdot dt \tag{31.9}$$

As defined above, $\alpha_i = \sum_{j \ne i} \alpha_{i,j}$ is the transition rate out of state i and, as a property of the exponential law, $MST_i = 1/\alpha_i$ is the mean sojourn time in state i. For example, when $\alpha_i$ is the failure rate of a component, this mean sojourn time is the mean time to failure (MTTF) of this component. Therefore, dividing the accumulated sojourn time in state i by the mean sojourn time in the same state gives the mean visit number $MVN_i(T)$ in this state over a period of interest [0, T]:

$$MVN_i(T) = \alpha_i \cdot AST_i(T) \tag{31.10}$$

The average visit frequency of state i is then:

$$w_i(T) = \frac{MVN_i(T)}{T} = \frac{\alpha_i \cdot AST_i(T)}{T} \tag{31.11}$$
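Formulae (31.9) to (31.11) can be illustrated with a rough numerical integration. The sketch below again assumes a single repairable two-state item (state 1 up, state 2 down) with hypothetical rates; the trapezoidal rule and the helper names `pr_up` and `integrate` are choices made for this illustration only.

```python
import math

LAM, MU = 0.01, 0.1          # hypothetical rates (per hour)
T, STEPS = 100.0, 10_000     # period of interest and number of integration steps


def pr_up(t):
    """Closed-form probability of the up state, starting in the up state."""
    a_inf = MU / (LAM + MU)
    return a_inf + (1.0 - a_inf) * math.exp(-(LAM + MU) * t)


def integrate(f, t_end, steps):
    """Trapezoidal approximation of the integral of f over [0, t_end]."""
    h = t_end / steps
    total = 0.5 * (f(0.0) + f(t_end))
    total += sum(f(k * h) for k in range(1, steps))
    return total * h


ast_up = integrate(pr_up, T, STEPS)                       # Formula (31.9), state 1
ast_down = integrate(lambda t: 1.0 - pr_up(t), T, STEPS)  # Formula (31.9), state 2
mvn_up = LAM * ast_up    # Formula (31.10): the rate out of the up state is LAM
w_up = mvn_up / T        # Formula (31.11)

print(ast_up, ast_down, mvn_up, w_up)
```

The two accumulated sojourn times add up to T, as expected for a process which is always in one of its states.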

31.2.2 Basic Formula for Asymptotic Calculations

After a transient period, the probabilities of a classical Markov process reach asymptotic values. This is the indication that the system has reached a steady state where the probability to go into a given state is equal to the probability to go out of this state. The duration of the transient period is about three times the shortest mean transition time ($1/\alpha_{ij}$) which, generally, is related to the repair of a failure. Then, when the failures are quickly detected and repaired, the transient period is very short and, over a longer period of time, the probabilities can be considered to be constant. When the steady state is reached, the derivatives of the state probabilities go to zero and Formula (31.5) becomes:

$$-\alpha_i \cdot Pr_i(\infty) + \sum_{k \ne i} \alpha_{k,i} \cdot Pr_k(\infty) = 0 \tag{31.12}$$

Then the set of differential Eq. (31.6) is replaced by the following simple set of linear algebraic equations:

$$M \cdot \overrightarrow{Pr}(\infty) = \vec{0} \tag{31.13}$$

This system is simpler to solve than the differential equations and this is why it is often used to simplify the calculations of the asymptotic probabilities $Pr_i(\infty)$. As said above, matrix M is not invertible and the n equations are not linearly independent, as every equation can be found by combining the (n − 1) other equations. The solution is to replace one of the equations by another equation independent of the (n − 1) other equations. This is provided by the sum of the probabilities, which is equal to 1: $\sum_{i=1}^{n} Pr_i(\infty) = 1$. This leads to replace by 1:
– the coefficients $m_{n,j}$ of the last line of matrix M;
– the n-th coefficient of vector $\vec{0}$.
This leads to a new matrix B and a new vector $\vec{\nu}$ and to a new vectorial equation:

$$B \cdot \overrightarrow{Pr}(\infty) = \vec{\nu} \tag{31.14}$$

As B is now invertible, the asymptotic probabilities can be calculated by multiplying both members of the above equation on the left by $B^{-1}$: $B^{-1} \cdot B \cdot \overrightarrow{Pr}(\infty) = B^{-1} \cdot \vec{\nu}$. This leads to:

$$\overrightarrow{Pr}(\infty) = B^{-1} \cdot \vec{\nu} \tag{31.15}$$
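The replacement of the last equation by the normalisation condition can be sketched as follows. This is a minimal illustration, not the algorithm of any particular tool: the four-state graph and its rates are the same assumptions as those inferred above for Fig. 31.2, and `solve` is a plain Gaussian elimination written for this example rather than an explicit calculation of $B^{-1}$.

```python
def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Hypothetical rates and assumed graph (E1..E4 indexed 0..3, E4 is the down state).
LAMBDA_A, LAMBDA_B, MU_A, MU_B = 1.0e-3, 2.0e-3, 0.1, 0.2
TRANSITIONS = [(0, 1, LAMBDA_B), (0, 2, LAMBDA_A), (1, 0, MU_B), (2, 0, MU_A),
               (1, 3, LAMBDA_A), (2, 3, LAMBDA_B), (3, 2, MU_B)]
N = 4
M = [[0.0] * N for _ in range(N)]
for src, dst, rate in TRANSITIONS:
    M[dst][src] += rate
    M[src][src] -= rate

B = [row[:] for row in M[:-1]] + [[1.0] * N]   # last line of M replaced by ones
nu = [0.0] * (N - 1) + [1.0]                   # last coefficient of the zero vector replaced by 1
pr_inf = solve(B, nu)                          # Formula (31.15)

availability = sum(pr_inf[:3])                 # up states E1, E2, E3
unavailability = pr_inf[3]                     # down state E4
print(pr_inf, availability, unavailability)
```

The discarded equation is still satisfied by the result, which can be verified by multiplying the original matrix M by the solution.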

This new equation is related to a very classical numerical problem for which many algorithms are available. When the asymptotic values are found, then the asymptotic availability and unavailability can be calculated by the following formulae:

$$A(\infty) = \sum_{i_U} Pr_{i_U}(\infty) \tag{31.16}$$

$$U(\infty) = \sum_{i_D} Pr_{i_D}(\infty) \tag{31.17}$$

In the above formulae, $i_U$ represents the up states and $i_D$ the down states. With regards to the basic definitions, these formulae are analysed in more detail in Sect. 31.3. The asymptotic values can also be used to calculate the accumulated sojourn times and the long-range average availability and unavailability. Let us consider a time interval $[T_1, T_2]$ located far enough from the origin of time in order that the steady state is reached. In this case, Formula (31.9) becomes $\overrightarrow{AST}(T_1, T_2) = \overrightarrow{Pr}(\infty) \cdot (T_2 - T_1)$ and the average availability and unavailability are obtained with the following formulae:

$$A(T_1, T_2) = \frac{1}{T_2 - T_1} \sum_{i_U} AST_{i_U}(T_1, T_2) = \sum_{i_U} Pr_{i_U}(\infty) \tag{31.18}$$

$$U(T_1, T_2) = \frac{1}{T_2 - T_1} \sum_{i_D} AST_{i_D}(T_1, T_2) = \sum_{i_D} Pr_{i_D}(\infty) \tag{31.19}$$

In the above formulae, T2 > T1 → ∞ and iU represents the up states and iD the down states. Therefore, when the transient period is elapsed, the asymptotic availability and unavailability are equal to the average availability and unavailability. When the availability is high, the transient period is short and the above formulae are often accurate enough to calculate the availability and the average availability of a system.

31.3 Link with Basic Definition

31.3.1 Preamble

The probabilities of the states and the accumulated sojourn times in the states are the basis to make the link with the basic definitions and the calculations of the reliability parameters defined in Chap. 4. Coming back to Fig. 31.3, a question arises: which parameter can be calculated by using this Markov graph? More precisely, is it the availability or the reliability of the modelled system? The answer to this question is a key point when handling the Markovian approach and the readers are invited to think about it for a few minutes before looking at the answer explained hereafter.

31.3.2 Availability

As shown in Fig. 31.6, the Markov graph models a system which is repaired after a complete failure. Therefore, this Markov graph is an availability Markov graph: the probability for the system to be in an up state is the availability A(t) of this system and the probability to be in the down state is the unavailability U(t). These can be calculated from the probabilities $Pr_i^A(t)$ to be in state $E_i$ of the availability Markov graph at time t:

$$A(t) = Pr_1^A(t) + Pr_2^A(t) + Pr_3^A(t) \tag{31.20}$$

$$U(t) = Pr_4^A(t) \tag{31.21}$$


Fig. 31.6 Availability Markov graph related to the system presented in Fig. 31.2

Figure 31.7 is an illustration of the evolution of the state probabilities of the availability Markov graph (Fig. 31.6) when A and B have the same failure and repair rates. The probability to be in state E3 is slightly higher than the probability to be in state E2 because of the maintenance policy. In this figure, the scale (ordinates) has not been respected because the differences between the values are too large. If $i_U$ represents the up states, $i_D$ the down states and $Pr_i^A(t)$ the state probabilities calculated by an availability Markov graph, the formulae for availability and unavailability calculations can be easily generalized to:

$$A(t) = \sum_{i_U} Pr_{i_U}^A(t) \tag{31.22}$$

$$U(t) = \sum_{i_D} Pr_{i_D}^A(t) \tag{31.23}$$

Fig. 31.7 Example of state probabilities and resulting availability and unavailability (scale not respected)


As said in the previous section, after a transient period, the probabilities reach asymptotic values indicating that the system has reached a steady state. In this case, the probability to go into a given state is equal to the probability to go out of this state, the derivatives of the state probabilities go to zero and Formula (31.12) becomes:

$$-\alpha_i \cdot Pr_i^A(\infty) + \sum_{k \ne i} \alpha_{k,i} \cdot Pr_k^A(\infty) = 0 \tag{31.24}$$

Then the set of differential equations is replaced by a simple linear system which is simpler to solve and gives the asymptotic values $Pr_i(\infty)$ directly. This property is often used to simplify the calculations. The duration of the transient period is about three times the mean time to restore. When the failures are quickly detected and repaired, the transient period is very short and, over a long period of time, the probabilities can be considered to be constant. According to Formula (31.9), the same graph allows the calculation of the accumulated sojourn times $AST_i(0, t)$ over the interval [0, t] and this leads to the calculation of the accumulated up time (AUT) and of the accumulated down time (ADT) of the modelled system:

$$AUT(0, t) = AST_1(0, t) + AST_2(0, t) + AST_3(0, t) \tag{31.25}$$

$$ADT(0, t) = AST_4(0, t) \tag{31.26}$$

It has to be noted that both AUT(0, t) and ADT(0, t) increase and go to infinity when t goes to infinity. The average availability A(0, t) is the ratio of the accumulated up time AUT(0, t) over a given period [0, t] to the duration t of this period and this leads to:

$$A(0, t) = \frac{AUT(0, t)}{t} = \frac{AST_1(0, t) + AST_2(0, t) + AST_3(0, t)}{t} \tag{31.27}$$

In the same way, the average unavailability over [0, t] is given by:

$$U(0, t) = \frac{ADT(0, t)}{t} = \frac{AST_4(0, t)}{t} \tag{31.28}$$

The duration t being equal to the sum of the accumulated sojourn times in all the states over this period, the above formulae can be generalized to:

$$A(0, t) = \frac{AUT(0, t)}{t} = \frac{\sum_{i_U} AST_{i_U}^A(0, t)}{\sum_{i} AST_i^A(0, t)} \tag{31.29}$$

and:

$$U(0, t) = \frac{ADT(0, t)}{t} = \frac{\sum_{i_D} AST_{i_D}^A(0, t)}{\sum_{i} AST_i^A(0, t)} \tag{31.30}$$

Fig. 31.8 Comparison between the unavailability U (t) and the average unavailability U (0, t)

In the above formulae, $AST_i^A(0, t)$ indicates an accumulated sojourn time calculated with an availability Markov graph. Figure 31.8 makes the comparison between the unavailability U(t) and the average unavailability U(0, t). They tend to the same asymptotic value, but the convergence of the average is rather slow because the unavailability during the transient period is lower than the asymptotic value. Therefore, before the Markov process has entered the steady state (i.e. within the transient period), the unavailability and the average unavailability have very different numerical values. When the Markov process has entered the steady state (i.e. beyond the transient period), the unavailability and the average unavailability progressively reach the same numerical value (Fig. 31.8). The MUT and MDT can also be obtained from AUT(0, t) and ADT(0, t) provided that the mean number of failures over the interval [0, t] is calculated: this is explained hereafter in the section about failure frequency. Exercises 31.6, 31.7 and 31.12 related to this subsection are described in Sect. 31.9 and their solutions can be found in Chap. 34.
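The difference between U(t) and U(0, t) during the transient period can be reproduced with a simple time integration. The sketch below is only an illustration under assumptions: the four-state availability graph and its rates are inferred from the formulae of this chapter, and a classical fourth-order Runge-Kutta scheme (function `rk4_step`, introduced here) is used instead of the matrix exponentiation described later.

```python
# Hypothetical rates and assumed availability graph (E1..E4 indexed 0..3).
LAMBDA_A, LAMBDA_B, MU_A, MU_B = 1.0e-3, 2.0e-3, 0.1, 0.2
TRANSITIONS = [(0, 1, LAMBDA_B), (0, 2, LAMBDA_A), (1, 0, MU_B), (2, 0, MU_A),
               (1, 3, LAMBDA_A), (2, 3, LAMBDA_B), (3, 2, MU_B)]
N = 4
M = [[0.0] * N for _ in range(N)]
for src, dst, rate in TRANSITIONS:
    M[dst][src] += rate
    M[src][src] -= rate


def derivative(pr):
    """Right-hand side of Formula (31.6): M . Pr."""
    return [sum(M[i][j] * pr[j] for j in range(N)) for i in range(N)]


def rk4_step(pr, h):
    """One classical Runge-Kutta step of length h."""
    k1 = derivative(pr)
    k2 = derivative([p + 0.5 * h * k for p, k in zip(pr, k1)])
    k3 = derivative([p + 0.5 * h * k for p, k in zip(pr, k2)])
    k4 = derivative([p + h * k for p, k in zip(pr, k3)])
    return [p + h * (a + 2 * b + 2 * c + d) / 6
            for p, a, b, c, d in zip(pr, k1, k2, k3, k4)]


def unavailabilities(t_end, steps):
    """Return (U(t_end), average unavailability over [0, t_end])."""
    h = t_end / steps
    pr = [1.0, 0.0, 0.0, 0.0]      # the system starts in the perfect state E1
    adt = 0.0                      # accumulated down time (sojourn time in E4)
    for _ in range(steps):
        nxt = rk4_step(pr, h)
        adt += 0.5 * h * (pr[3] + nxt[3])   # trapezoidal rule on Pr4(t)
        pr = nxt
    return pr[3], adt / t_end


u_t, u_avg = unavailabilities(50.0, 5000)
print("U(t) =", u_t, " U(0, t) =", u_avg)
```

With these assumed values, the average unavailability over [0, t] remains below the instantaneous unavailability at t, because it still carries the low values of the beginning of the transient period.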

31.3.3 Reliability

The reliability of a system being related to the first failure of the considered system, it is necessary to remove from the availability Markov graph presented in Fig. 31.6 all the sequences of events coming back to the up state class after a complete system failure. This is easily done just by removing the transition from E4 to E3, as in Fig. 31.9. This simple modification transforms the Markov graph into a reliability Markov graph and this has profound impacts on the behaviour of the underlying Markov process:
• state E4 has become an absorbing state (i.e. when it is reached it is no longer possible to jump out of it);


Fig. 31.9 Reliability Markov graph related to the system presented in Fig. 31.2

• the repairable components A and B are now repaired only if a complete system failure has not occurred. This introduces a systemic dependency between all the components of the system, for which the state repaired/non-repaired depends on the states of all the other components.

From a probabilistic calculation point of view, the reliability Markov graph now allows the calculation of the reliability R(t) and of the unreliability F(t) of the modelled system. This gives for the analysed example:

$$R(t) = Pr_1^R(t) + Pr_2^R(t) + Pr_3^R(t) \tag{31.31}$$

$$F(t) = Pr_4^R(t) \tag{31.32}$$

It can be observed that the formulae are exactly the same as those used above for availability and unavailability calculations. Therefore, it is the nature of the Markov graph (availability or reliability Markov graph) which determines whether availability/unavailability or reliability/unreliability are calculated, and not the formulae which are used. The superscript R has been used to make the difference between the state probabilities $Pr_i^R(t)$ calculated with the reliability Markov graph and the state probabilities $Pr_i^A(t)$ calculated with the availability Markov graph. Figure 31.10 is an illustration of the evolution of the state probabilities of the reliability Markov graph (Fig. 31.9) when A and B have the same failure and repair rates. In this case, the graph is symmetrical and states E2 and E3 have the same probability. The behaviour is very different from the availability Markov graph as the asymptotic values are 0 for the up states E1, E2 and E3 and 1 for the down state E4. In this case, the steady state is reached when time goes to infinity and all the probability is accumulated in the absorbing state at this time. It has to be noted that the probabilities of states E2 and E3 start from 0 (initial conditions), increase to a maximum and then decrease to 0 (a large overall repair time has been used to show that in the figure).


Fig. 31.10 Example of state probabilities and resulting reliability and unreliability

If $i_U$ represents the up states, $i_D$ the down states and $Pr_i^R(t)$ the state probabilities calculated by a reliability Markov graph, the formulae for reliability and unreliability calculations can be easily generalized to:

$$R(t) = \sum_{i_U} Pr_{i_U}^R(t) \tag{31.33}$$

$$F(t) = \sum_{i_D} Pr_{i_D}^R(t) \tag{31.34}$$

The reliability decreases from 1 to 0 while the unreliability increases from 0 to 1 when time goes from 0 to infinity. This allows the scale (ordinates) to be respected in Fig. 31.10. Again, when this steady state is reached, the derivatives of the state probabilities go to zero, but this is useless for the calculations as the asymptotic values (0 and 1) are already known. Again, the graph allows the calculation of the accumulated up time AUT(0, t) and of the accumulated down time ADT(0, t) as in the availability case:

$$AUT(0, t) = AST_1(0, t) + AST_2(0, t) + AST_3(0, t) \tag{31.35}$$

$$ADT(0, t) = AST_4(0, t) \tag{31.36}$$

While the accumulated down time goes to infinity when time goes to infinity, the accumulated up time tends to an asymptotic value, and this asymptotic value is the mean time to failure (MTTF) of the system modelled by the reliability Markov graph. This is illustrated in Fig. 31.15 and discussed in Sect. 31.3.8. Exercises 31.1 and 31.11 related to this subsection are described in Sect. 31.9 and their solutions can be found in Chap. 34.
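The behaviour of a reliability Markov graph can be sketched by integrating the state probabilities and the accumulated up time over time. The example below is an assumption-laden illustration: the graph is the assumed four-state graph with the transition from E4 removed, both components share the same hypothetical rates `LAM` and `MU`, and the Runge-Kutta integration (function `reliability_run`) is a choice made for this sketch.

```python
LAM, MU = 0.01, 0.1   # hypothetical common failure and repair rates (per hour)

# Reliability graph: E4 (index 3) is absorbing, there is no transition out of it.
TRANSITIONS = [(0, 1, LAM), (0, 2, LAM), (1, 0, MU), (2, 0, MU),
               (1, 3, LAM), (2, 3, LAM)]
N = 4
M = [[0.0] * N for _ in range(N)]
for src, dst, rate in TRANSITIONS:
    M[dst][src] += rate
    M[src][src] -= rate


def derivative(pr):
    return [sum(M[i][j] * pr[j] for j in range(N)) for i in range(N)]


def rk4_step(pr, h):
    k1 = derivative(pr)
    k2 = derivative([p + 0.5 * h * k for p, k in zip(pr, k1)])
    k3 = derivative([p + 0.5 * h * k for p, k in zip(pr, k2)])
    k4 = derivative([p + h * k for p, k in zip(pr, k3)])
    return [p + h * (a + 2 * b + 2 * c + d) / 6
            for p, a, b, c, d in zip(pr, k1, k2, k3, k4)]


def reliability_run(t_end, steps):
    """Return (R(t_end), F(t_end), AUT(0, t_end))."""
    h = t_end / steps
    pr = [1.0, 0.0, 0.0, 0.0]
    aut = 0.0
    for _ in range(steps):
        nxt = rk4_step(pr, h)
        aut += 0.5 * h * (sum(pr[:3]) + sum(nxt[:3]))   # integral of R(t)
        pr = nxt
    return sum(pr[:3]), pr[3], aut


for horizon in (100.0, 1000.0, 20000.0):
    r, f, aut = reliability_run(horizon, int(horizon))
    print(horizon, r, f, aut)
```

For equal rates, the accumulated up time obtained for a long horizon can be compared with the classical closed form (3λ + μ)/(2λ²) of a repairable system made of two redundant components.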


31.3.4 Vesely Failure Rate and Failure Frequency

The Vesely failure rate is also named conditional failure intensity, and the failure frequency is also named unconditional failure intensity (see Chap. 4). They can be calculated by considering the critical states of the availability Markov graph illustrated in Fig. 31.11. In this figure, states E2 and E3 are critical because they are only one transition away from the down state. Therefore, when in state E2, the failure of A is a critical failure with regards to the system availability and, when in E3, the failure of B is a critical failure with regards to the system availability. By definition of the instantaneous unconditional failure intensity (instantaneous failure frequency), w(t)·dt is the probability to move from an up state to a down state between t and t + dt given that the system is in an up state at t = 0. The condition is so light that the term “unconditional” is used to name this parameter. In particular, there is no condition about how many failures have occurred before t. Then this parameter can be calculated from an availability Markov graph. For the example in Fig. 31.11 this leads to:

$$w(t) = \lambda_a \cdot Pr_2^A(t) + \lambda_b \cdot Pr_3^A(t) \tag{31.37}$$

In this formula, $Pr_i^A(t)$ is the probability of state i of the availability Markov graph. It can be easily generalized to:

$$w(t) = \sum_{i_C,\, j_D} \alpha_{i_C, j_D} \cdot Pr_{i_C}^A(t) \tag{31.38}$$

In the above formula, $i_C$ describes the critical up states and $j_D$ the down states. The instantaneous conditional failure intensity (Vesely failure rate) has the same definition but with a stronger condition: to be in an up state, i.e. available, at time t. This leads to:

$$\lambda_V(t) = w(t)/A(t) \tag{31.39}$$

And, therefore:

$$w(t) = \lambda_V(t) \cdot A(t) \tag{31.40}$$

When the steady state is reached, the Vesely failure rate as well as the availability reach asymptotic values and:

$$w(\infty) = \lambda_V(\infty) \cdot A(\infty) \tag{31.41}$$

Fig. 31.11 Critical states of an availability Markov graph with regards to failures

The average failure frequency w(0, T) can be calculated from w(t) by a classical integral: $w(0, T) = \frac{1}{T} \int_0^T w(t) \cdot dt$. This parameter is called PFH (probability of failure per hour) in the functional safety standards (see Chap. 36). Using the relationship between the probabilities and the accumulated sojourn times, the average failure frequency can be calculated, in the case of the example, as:

$$w(0, t) = \left[\lambda_a \cdot AST_2^A(0, t) + \lambda_b \cdot AST_3^A(0, t)\right] / t \tag{31.42}$$

And this can be generalized to:

$$w(0, t) = \sum_{i_C,\, j_D} \alpha_{i_C, j_D} \cdot AST_{i_C}^A(0, t) \Big/ \sum_{i} AST_i^A(0, t) \tag{31.43}$$

The average failure frequency also allows the calculation of the mean number of failures over a given period:

$$k(0, t) = t \cdot w(0, t) \tag{31.44}$$

When the steady state is reached, the average failure frequency becomes equal to the asymptotic failure frequency. The mean number of failures for a time period [t, t + T] located after the transient period is equal to:

$$k(T) = T \cdot w(\infty) \tag{31.45}$$
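The asymptotic forms of Formulae (31.38), (31.39) and (31.45) can be sketched on the assumed four-state availability graph used for illustration in this chapter. The rates, the list `CRITICAL` and the helper `solve` below are hypothetical names and values introduced for this example; the steady-state probabilities are obtained as in Formula (31.15).

```python
def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


LAMBDA_A, LAMBDA_B, MU_A, MU_B = 1.0e-3, 2.0e-3, 0.1, 0.2   # hypothetical rates
TRANSITIONS = [(0, 1, LAMBDA_B), (0, 2, LAMBDA_A), (1, 0, MU_B), (2, 0, MU_A),
               (1, 3, LAMBDA_A), (2, 3, LAMBDA_B), (3, 2, MU_B)]
N = 4
M = [[0.0] * N for _ in range(N)]
for src, dst, rate in TRANSITIONS:
    M[dst][src] += rate
    M[src][src] -= rate

B = [row[:] for row in M[:-1]] + [[1.0] * N]
pr_inf = solve(B, [0.0, 0.0, 0.0, 1.0])

# (critical up state, rate of the transition towards the down state E4)
CRITICAL = [(1, LAMBDA_A), (2, LAMBDA_B)]

w_inf = sum(rate * pr_inf[i] for i, rate in CRITICAL)   # failure frequency
a_inf = sum(pr_inf[:3])                                 # availability
vesely_inf = w_inf / a_inf                              # Vesely failure rate
failures_per_year = 8760.0 * w_inf                      # k(T) for T = one year

print(w_inf, vesely_inf, failures_per_year)
```

Because the availability is close to 1 with these values, the failure frequency and the Vesely failure rate are numerically very close.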

31.3.5 Failure Rate and Failure Density

The failure rate and the failure density (see Chap. 4) can also be calculated by considering the critical states of the reliability Markov graph illustrated in Fig. 31.12. Again, states E2 and E3 are critical because they are only one transition away from the down state. Therefore, when in state E2, the failure of A is a critical failure with regards to the system reliability and, when in E3, the failure of B is a critical failure with regards to the system reliability. By definition of the failure density, f(t)·dt is the probability for the system to fail between t and t + dt given that it is in an up state at t = 0 and that the failure is not repaired. This condition corresponds to what is modelled by a reliability Markov graph and, for the example in Fig. 31.12, this leads to:

$$f(t) = \lambda_a \cdot Pr_2^R(t) + \lambda_b \cdot Pr_3^R(t) \tag{31.46}$$

Fig. 31.12 Critical states of a reliability Markov graph

In this formula, $Pr_i^R(t)$ is the probability of state i of the reliability Markov graph. It can be easily generalized to:

$$f(t) = \sum_{i_C,\, j_D} \alpha_{i_C, j_D} \cdot Pr_{i_C}^R(t) \tag{31.47}$$

In the above formula, $i_C$ describes the critical up states and $j_D$ the down states. The failure rate has the same definition but with a stronger condition: to have been in an up state all over [0, t], i.e. reliable over [0, t]. This leads to:

$$\lambda(t) = f(t)/R(t) \tag{31.48}$$

And, therefore:

$$f(t) = \lambda(t) \cdot R(t) \tag{31.49}$$


31.3.6 Comparison λ(t) Versus λV(t) and f(t) Versus w(t)

As shown in Fig. 31.13, even if they are calculated with the similar Formulae (31.37) and (31.46), the failure density f(t) and the failure frequency w(t) have very different behaviours:
• The failure density increases from 0 to a maximum value and then decreases to zero. This is due to the fact that only the first failure is considered and therefore, when time increases, the probability to have a first failure before t increases and the probability to have a failure after t decreases. In other words, when the system has already failed, it cannot fail again.
• The failure frequency increases until an asymptotic value is reached when the steady state is established.

As shown in Fig. 31.13, the behaviours of the failure rate and of the Vesely failure rate are similar to that of the failure frequency: they increase until asymptotic values are reached when the steady state is established. The parameters used to perform the calculation have been chosen to make visible the difference between these three parameters, which are numerically close. When the failures in states E2 and E3 are very quickly repaired, the asymptotic availability is close to 1 and the three parameters converge toward almost the same value, and this is why:
• they are often mixed up;
• the Vesely failure rate, rather easy to calculate from Boolean models, is used to assess the reliability from fault trees or reliability block diagram models instead of the failure rate itself, which cannot be calculated. This provides a generally good conservative approximation.

Fig. 31.13 Comparison failure rate/Vesely failure rate and failure density/failure frequency


Fig. 31.14 Critical states of an availability Markov graph with regards to repair

31.3.7 Repair Intensities

The same developments made for the failure intensities can be made for the repair intensities. The difference is just to consider the states critical for repairs and the unavailability instead of the availability. For the example in Fig. 31.14, only E4 is critical with regards to the repairs and this leads to:
• $\rho(t) = \mu_b \cdot Pr_4^A(t)$ for the repair frequency;
• $\mu_V(t) = \rho(t)/U(t)$ for the conditional repair intensity;
• $\rho(0, T) = \frac{1}{T} \int_0^T \rho(t) \cdot dt$ for the average repair frequency;
• $k_r(0, t) = t \cdot \rho(0, t)$ for the mean number of repairs.

These formulae could be developed and generalized exactly in the same way as has been done for the failure intensities, but this would be useful only for the transient period because, when the steady state is reached, the failure and repair frequencies are equal: $\rho(\infty) = w(\infty)$, and $k_r(t_1, t_2) = k(t_1, t_2)$ when the system is in the steady state at $t_1$. Due to the absorbing state, the reliability Markov graph presented in Fig. 31.9 has no transition from down states to up states and, therefore, has no critical state with regards to repair. This implies that, in the case of a reliability Markov graph, all the above parameters are equal to zero.

31.3.8 MUT, MDT, MTBF and MTTF

The mean up time (MUT) and the mean down time (MDT) are classically associated with renewal processes in their steady states. The use of availability Markov processes allows their definition to be extended to any period of time, which may or may not include the transient period where the steady state is not reached yet.


Mean up time MUT(0, t) and MUT(∞)

According to the above remark, the mean up time over a given period can be defined as the ratio of the accumulated up time over this period to the mean number of failures observed in the same period of time:

$$MUT(0, t) = AUT(0, t)/k(0, t) \tag{31.50}$$

For the example in Fig. 31.6, this leads to:

$$MUT(0, t) = \frac{AST_1^A(0, t) + AST_2^A(0, t) + AST_3^A(0, t)}{\lambda_a \cdot AST_2^A(0, t) + \lambda_b \cdot AST_3^A(0, t)} \tag{31.51}$$

This can be generalized to:

$$MUT(0, t) = \frac{\sum_{i_U} AST_{i_U}^A(0, t)}{\sum_{i_C,\, j_D} \alpha_{i_C, j_D} \cdot AST_{i_C}^A(0, t)} \tag{31.52}$$

When the effect of the transient period has vanished, i.e. for a period of time $[t_1, t_1 + T]$ with $t_1$ far enough from the origin of time, $MUT(t_1, t_1 + T)$ reaches an asymptotic value MUT(∞) and the rather complicated formula above can be simplified to:

$$MUT(\infty) = \frac{T \cdot \sum_{i_U} Pr_{i_U}^A(\infty)}{w(\infty) \cdot T} = \frac{A(\infty)}{w(\infty)} \tag{31.53}$$

According to Formula (31.41), $w(\infty) = \lambda_V(\infty) \cdot A(\infty)$ and then:

$$MUT(\infty) = \frac{1}{\lambda_V(\infty)} \tag{31.54}$$

Mean down time MDT(0, t) and MDT(∞)

The analysis undertaken for the MUT can also be done for the mean down time MDT(0, t):

$$MDT(0, t) = ADT(0, t)/k(0, t) \tag{31.55}$$

This is mainly useful when the steady state has been reached and, for an interval $[t_1, t_1 + T]$ beyond the transient period, this gives:

$$MDT(\infty) = \frac{U(\infty)}{w(\infty)} = \frac{1}{\mu_V(\infty)} \tag{31.56}$$


In the above formula, $\mu_V(\infty)$ is the asymptotic value of the conditional repair intensity.

Mean time between failures MTBF(0, t) and MTBF(∞)

By definition, the mean time between failures (MTBF) for a given period is equal to the sum of the mean up time and of the mean down time over this given period:

$$MTBF(0, t) = MUT(0, t) + MDT(0, t) \tag{31.57}$$

It is also equal to the ratio of the length of this period to the mean number of failures during this period of time:

$$MTBF(0, t) = \frac{t}{k(0, t)} = \frac{t}{t \cdot w(0, t)} = \frac{1}{w(0, t)} \tag{31.58}$$

Therefore, the MTBF for a given period is simply equal to the inverse of the average failure frequency over the same period. When the steady state is reached, this leads to:

$$MTBF(\infty) = \frac{1}{w(\infty)} \tag{31.59}$$

Replacing 1 by A(∞) + U(∞) in the numerator of the above formula leads again to the sum of Formulae (31.53) and (31.56):

$$MTBF(\infty) = \frac{A(\infty) + U(\infty)}{w(\infty)} = MUT(\infty) + MDT(\infty) \tag{31.60}$$
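As a quick check of Formulae (31.53) to (31.60), consider the simplest possible case, assumed here for illustration only: a single repairable item with failure rate λ and repair rate μ, whose up state is the only critical state. Its well-known asymptotic availability and unavailability are μ/(λ + μ) and λ/(λ + μ), so that:

$$w(\infty) = \lambda \cdot A(\infty) = \frac{\lambda \cdot \mu}{\lambda + \mu}, \qquad MUT(\infty) = \frac{A(\infty)}{w(\infty)} = \frac{1}{\lambda}, \qquad MDT(\infty) = \frac{U(\infty)}{w(\infty)} = \frac{1}{\mu}$$

$$MTBF(\infty) = \frac{1}{w(\infty)} = \frac{\lambda + \mu}{\lambda \cdot \mu} = \frac{1}{\lambda} + \frac{1}{\mu} = MUT(\infty) + MDT(\infty)$$

The usual results for a single repairable component are recovered, which is consistent with the general formulae.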

Mean time to failure MTTF(0, t) and MTTF(∞)

For a reliability Markov process (see Sect. 31.3.3), the accumulated down time ADT(0, t) goes to infinity when time goes to infinity, but the accumulated up time AUT(0, t) tends to an asymptotic value. The accumulated up time AUT(0, t) is lower than t: it is the mean time to failure of the modelled system when the observation is truncated at t (i.e. the mean up time elapsed before the first failure or before t, whichever comes first).

$$MTTF(0, t) = AUT(0, t) \tag{31.61}$$

When t increases, MTTF(0, t) also increases and it reaches an asymptotic value MTTF(∞) which is the conventional mean time to failure (MTTF) of the system modelled by the reliability Markov graph. This is illustrated in Fig. 31.15. The formula of the accumulated up time can be generalized to:

$$AUT(0, t) = \sum_{i_U} AST_{i_U}^R(0, t) \tag{31.62}$$


Fig. 31.15 Example of convergence of the accumulated up time toward the MTTF

and:

$$MTTF = AUT(0, \infty) = \sum_{i_U} AST_{i_U}^R(0, \infty) \tag{31.63}$$

Therefore, to calculate the MTTF, it is necessary to calculate the accumulated sojourn times in the up states for a time long enough for allowing them to converge toward their asymptotic values. Otherwise, they can also be calculated by using the formulae proposed in Sect. 31.4.1.

31.4 Analytical Calculations of Markov Processes

Solving a set of first order linear differential equations is a classical problem and several classical methods (see Pagès and Gondran 1986) are available for this purpose: Laplace transform, matrix inversion or numerical methods. They are briefly explained below before explaining a less conventional approach based on the series development of the exponential of a matrix (the matrix exponentiation approach), which has proven to be very effective with regards to Markov calculations.

31.4.1 Classical Calculation Techniques

31.4.1.1 Laplace Transform

The Laplace transform $\mathcal{L}[f(t)]$ of a function f(t) (with t ≥ 0) is defined by $\mathcal{L}[f(t)] = \int_0^\infty f(t) \cdot e^{-st}\,dt$. The basic property of the transformation is to change a derivative into a simple multiplication: $\mathcal{L}\left[\frac{df(t)}{dt}\right] = s \cdot \mathcal{L}[f(t)] - f(0)$. The Laplace transform is used as follows:
• Transforming the differential equations into linear equations.
• Transforming the linear equations into a sum of elementary polynomial fractions.
• Using the inverse Laplace transform to revert to the original domain.

Fig. 31.16 Simple example to implement the Laplace transform

The principle of the Laplace transform is implemented hereafter on the simple example presented in Fig. 31.16. In this example, $Pr_1(t)$ is the probability of the up state and $P_1(s)$ its Laplace transform. Under the assumption that $Pr_1(0) = 1$, the Laplace transform applied to $\frac{dPr_1(t)}{dt} = \mu - (\lambda + \mu) \cdot Pr_1(t)$ (obtained by using $Pr_2(t) = 1 - Pr_1(t)$) gives:

$$s \cdot P_1(s) - 1 = \frac{\mu}{s} - (\lambda + \mu) \cdot P_1(s) \;\Rightarrow\; P_1(s) = \frac{\mu + s}{s \cdot (s + \lambda + \mu)}$$

Then $P_1(s)$ has to be represented as a sum of polynomial fractions:

$$P_1(s) = \frac{\mu + s}{s \cdot (s + \lambda + \mu)} \equiv \frac{a}{s} + \frac{b}{s + \lambda + \mu}$$

Now coefficients a and b have to be evaluated:
– multiplication by s and then s = 0 gives $a = \frac{\mu}{\lambda + \mu}$;
– multiplication by (s + λ + μ) and then s = −(λ + μ) gives $b = \frac{\lambda}{\lambda + \mu}$.

Tables of inverse Laplace transforms (Rade and Westergren 2004) can now be used to obtain the solution: $\mathcal{L}^{-1}\left[\frac{a}{s}\right] = a$ and $\mathcal{L}^{-1}\left[\frac{b}{s + \lambda + \mu}\right] = b \cdot e^{-(\lambda + \mu) t}$.

$Pr_1(t)$ is obtained as $Pr_1(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) \cdot t}$.

$Pr_2(t)$ is obtained as $1 - Pr_1(t) = 1 - \frac{\mu}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) \cdot t} = \frac{\lambda}{\lambda + \mu}\left[1 - e^{-(\lambda + \mu) \cdot t}\right]$.

The last formula is a well-known formula for the unavailability $U(t) = 1 - Pr_1(t)$ of a repaired item:

$$U(t) = \frac{\lambda}{\lambda + \mu}\left[1 - e^{-(\lambda + \mu) \cdot t}\right] \tag{31.64}$$

When time goes to infinity, this leads to the asymptotic values of the unavailability and of the availability:

$$U(\infty) = \frac{\lambda}{\lambda + \mu} \quad \text{and} \quad A(\infty) = \frac{\mu}{\lambda + \mu} \tag{31.65}$$
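The closed form (31.64) can be checked numerically against the differential equation it is supposed to solve. The sketch below uses hypothetical rates and a central finite difference (function `residual`, introduced here) to verify that dU/dt = λ·(1 − U) − μ·U, which is the equation of the down state of the two-state graph.

```python
import math

LAM, MU = 0.01, 0.1   # hypothetical failure and repair rates (per hour)


def unavailability(t):
    """Formula (31.64)."""
    return LAM / (LAM + MU) * (1.0 - math.exp(-(LAM + MU) * t))


def residual(t, h=1e-4):
    """dU/dt - [LAM.(1 - U) - MU.U]; close to zero if (31.64) is a solution."""
    du_dt = (unavailability(t + h) - unavailability(t - h)) / (2.0 * h)
    u = unavailability(t)
    return du_dt - (LAM * (1.0 - u) - MU * u)


worst = max(abs(residual(t)) for t in (1.0, 5.0, 20.0, 80.0))
u_limit = unavailability(1.0e4)   # far beyond the transient period

print("largest residual:", worst)
print("U(large t):", u_limit, " LAM/(LAM+MU):", LAM / (LAM + MU))
```

The residual stays at the level of the finite difference error and the long-term value matches Formula (31.65).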

The main drawback of this method is that it is not so easy to implement, even in the case of simple Markov processes. The coefficients of the polynomial fractions (a and b above) can be difficult to calculate when the poles of the fractions (i.e. the values of s for which the denominators are equal to 0) are multiple (i.e. several poles for a single denominator) and/or complex numbers. This can cause numerical problems when coming back to the real number domain.

31.4.1.2 Matrix Inversion

The use of matrix inversion is a classical way to solve systems of equations and this technique can be used for handling Markov processes in steady states. According to Formula (31.13), $M \cdot \overrightarrow{Pr}(\infty) = \vec{0}$. This is an equation in matrix form which cannot be solved directly, as the Markovian matrix M is not invertible because each equation in the set of linear equations can be calculated from the others. Therefore, the first step is to replace matrix M by an equivalent but invertible matrix B where one of the equations is replaced by the sum of the state probabilities $\sum_i Pr_i(t) = 1$. This leads to a matrix B such that $b_{ij} = m_{ij}$ for i < n and $b_{nj} = 1$, and to a vector $\vec{V}$ such that $v_i = 0$ for i < n and $v_n = 1$. The above equation becomes: $B \cdot \overrightarrow{Pr}(\infty) = \vec{V}$. As matrix B is invertible, $\overrightarrow{Pr}(\infty)$ can be obtained as:

$$\overrightarrow{Pr}(\infty) = B^{-1} \cdot \vec{V} \tag{31.66}$$

The matrix inversion can also be used to calculate the MTTF related to a reliability Markov graph. In Sect. 31.4.2 it is demonstrated that $M \cdot \overrightarrow{AST}(\infty) = \overrightarrow{Pr}(\infty) - \overrightarrow{Pr}(0)$. Again, the $AST_i(\infty)$ cannot be calculated by using this equation as M is not invertible. If the absorbing state is state n, the last column of matrix M is a column of zeros. In addition, the elements of $\overrightarrow{Pr}(\infty)$ are equal to 0 except for the last one. This allows the last column and the last line of matrix M to be removed in order to obtain a system of equations related only to the accumulated times in the up states: $M' \cdot \overrightarrow{AST}'(\infty) = M' \cdot \overrightarrow{AUT}(\infty) = -\overrightarrow{Pr}'(0)$. Matrix M′ is an (n − 1) × (n − 1) square matrix and vector $\overrightarrow{Pr}'(0)$ a column vector with n − 1 elements. As M′ is now invertible, this leads to:

$$\overrightarrow{AUT}(\infty) = -M'^{-1} \cdot \overrightarrow{Pr}'(0) \tag{31.67}$$

This equation provides the accumulated times $AUT_i$ spent over [0, ∞] in each of the up states of the reliability Markov process. Then their sum leads to the calculation of the MTTF:

$$MTTF = \sum_{i=1}^{n-1} AUT_i(\infty) \tag{31.68}$$
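Formulae (31.67) and (31.68) can be sketched on a small reliability graph. In the example below, the graph (two redundant repairable components with the same hypothetical rates `LAM` and `MU`, absorbing state E4) is an assumption made for illustration, and the linear system is solved by a plain Gaussian elimination (`solve`) rather than by explicitly forming the inverse of M′.

```python
def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


LAM, MU = 1.0e-3, 0.1   # hypothetical common failure and repair rates (per hour)

# Reliability graph: states E1..E4 indexed 0..3, E4 absorbing (no way out).
TRANSITIONS = [(0, 1, LAM), (0, 2, LAM), (1, 0, MU), (2, 0, MU),
               (1, 3, LAM), (2, 3, LAM)]
N = 4
M = [[0.0] * N for _ in range(N)]
for src, dst, rate in TRANSITIONS:
    M[dst][src] += rate
    M[src][src] -= rate

M_reduced = [row[:N - 1] for row in M[:N - 1]]   # last line and column removed
pr0_reduced = [1.0, 0.0, 0.0]                    # the system starts in E1

aut = solve(M_reduced, [-p for p in pr0_reduced])   # Formula (31.67)
mttf = sum(aut)                                     # Formula (31.68)

print("accumulated up times:", aut)
print("MTTF:", mttf)
```

For this symmetrical case, the result can be compared with the classical closed form (3λ + μ)/(2λ²) of two redundant repairable components.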

The inversion of the above matrices B or M′ needs the calculation of the eigenvalues of these matrices (Wikipedia Eigenvalues 2020) and this is the main drawback of this method because, for systems with high reliability or availability, one of these eigenvalues is far smaller than the other ones. This has a strong impact on the accuracy of the calculations, which is then poor for large systems. This approach has been explained above for the steady state: it can also be used to perform the calculations during the transient period but this is more complicated.

31.4.1.3 Numerical Calculations

Many algorithms are available to handle a set of linear equations or a set of differential equations. For example, the Runge-Kutta methods (Wikipedia Runge-Kutta 2020) are widely used for the latter purpose. However, the probability range of the system states of a Markov process is very wide and the low probabilities are often difficult to calculate with enough accuracy. This is regrettable because they are generally of great interest, being related to the unavailability or the unreliability of the modelled system. This is the main drawback of this method with regard to Markov processes.

31.4.2 Matrix Exponentiation

31.4.2.1 State Probability Calculations

The following algorithm, based on the exponentiation of a matrix, was developed in 1983 by one of the authors of this book (Signoret 1983) to perform the calculations during the transient period of a Markov process. It is based on the series expansion of the exponential of a matrix and on the memoryless property of the Markov processes.

$$\vec{Pr}(t) = e^{t \cdot M} \cdot \vec{Pr}_0 = \left[\lim_{k \to \infty} \sum_{k} \frac{(t \cdot M)^k}{k!}\right] \cdot \vec{Pr}_0 \qquad (31.69)$$

Formula (31.69) can be calculated by recurrence, as illustrated in Fig. 31.17. Therefore it can be written as:

$$\vec{Pr}(t) = \sum_{k=0}^{\infty} \vec{Pr}^{(k)}(t) \qquad (31.70)$$

In this formula, $(k)$ indicates the rank in the series expansion, $\vec{Pr}^{(0)}(t) = \vec{Pr}_0$ and $\vec{Pr}^{(k)}(t) = \frac{t \cdot M}{k} \cdot \vec{Pr}^{(k-1)}(t)$. It is interesting from an algorithmic point of view as the term $\vec{Pr}^{(k)}(t)$ can be calculated by recurrence from the term $\vec{Pr}^{(k-1)}(t)$ previously calculated. In addition, it replaces the matrix × matrix products implied by the exponential of the matrix (of the order of $n^3$ operations each) by simpler matrix × vector products (of the order of $n^2$ operations each). Therefore, this is very economical from a computation time point of view.

Fig. 31.17 Recurrence to be used for the calculation of the exponential of a matrix

However, the above formula is tractable only if it converges quickly, so that the result is obtained with a good approximation after a limited number of terms. As for the series expansion of the scalar exponential $e^{-\lambda \cdot t} = \sum_{k=0}^{\infty} \frac{(-\lambda \cdot t)^k}{k!}$, which converges quickly when $\lambda \cdot t \le 1$, the exponential $e^{tM}$ also converges quickly if $t \cdot |m_{ij}| \le 1\ \forall\, i$ and $j$. Beyond the convergence, this also prevents numerical problems from arising when intermediate results become too large to be stored exactly and lead to incorrect results. This arises with numbers exceeding the number of significant digits stored in computer memory. For example, with 32-bit floating-point numbers, only about 7 significant digits are kept and the number 1234567.89 is stored as 1.234567 E6, whereas it is the part 0.89 which is useful for the probabilistic calculations. With 64-bit floating-point numbers, about 16 significant digits are kept: this reduces the problem without really solving it. The memoryless property of the Markov process can be used to avoid the problem by:

– splitting time $t$ into $q$ intervals so that $t = q \cdot \Delta$;
– choosing $\Delta = 1/\sup |m_{ij}|$, such that $\Delta$ is the shortest mean sojourn time in a state of the Markov process;
– writing Formula (31.7) as:

$$\vec{Pr}(t) = e^{q \cdot \Delta \cdot M} \cdot \vec{Pr}(0) \qquad (31.71)$$

Then, $\vec{Pr}(t)$ can be calculated by steps of $\Delta$:

$$\begin{aligned}
\vec{Pr}(\Delta) &= e^{\Delta \cdot M} \cdot \vec{Pr}(0) \\
\vec{Pr}(2 \cdot \Delta) &= e^{\Delta \cdot M} \cdot \vec{Pr}(\Delta) \\
\vec{Pr}(3 \cdot \Delta) &= e^{\Delta \cdot M} \cdot \vec{Pr}(2 \cdot \Delta) \\
&\ \,\vdots \\
\vec{Pr}(q \cdot \Delta) &= e^{\Delta \cdot M} \cdot \vec{Pr}((q-1) \cdot \Delta)
\end{aligned}$$

The principle of calculation is illustrated in Fig. 31.18: starting from the right and going to the left until $\vec{Pr}(q \cdot \Delta)$ is calculated. It has been empirically verified that, if $x$ is the computing time needed to calculate directly for a time $t$ with a given accuracy, the computing time for a time $\Delta$, and for the same accuracy, is equal to about $x/q$. Therefore, the faster convergence compensates for the $q$ calculations. In addition to providing accurate results, the above calculations also make it possible to draw the curves showing the time-dependent evolution of the state probabilities.
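The recurrence of Formula (31.70) combined with the time steps of Formula (31.71) can be sketched as follows in Python. This is a minimal illustration and not the original 1983 implementation: the function names, the stopping tolerance, the choice $q = \lceil t \cdot \sup|m_{ij}| \rceil$ (so that $\Delta \le 1/\sup|m_{ij}|$) and the two-state example are assumptions.

```python
import math


def mat_vec(m, v):
    """Matrix x vector product (of the order of n^2 operations)."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def exp_step(m, p, delta, tol=1e-16, max_terms=200):
    """Return e^(delta.M) . p using Pr^(k) = (delta.M / k) . Pr^(k-1)."""
    term = p[:]    # Pr^(0)
    total = p[:]   # running sum of the series
    for k in range(1, max_terms + 1):
        term = [delta * x / k for x in mat_vec(m, term)]
        total = [s + x for s, x in zip(total, term)]
        if max(abs(x) for x in term) < tol:
            break
    return total


def state_probabilities(m, p0, t):
    """Pr(t) computed in q steps of duration delta, with delta . sup|m_ij| <= 1."""
    rate_max = max(abs(x) for row in m for x in row)
    q = max(1, math.ceil(t * rate_max))
    delta = t / q
    p = p0[:]
    for _ in range(q):
        p = exp_step(m, p, delta)
    return p


# Assumed example: single repaired component, states (up, down), rates per hour.
lam, mu = 1.0e-3, 0.1
M = [[-lam, mu], [lam, -mu]]   # column convention: M[i][j] = rate from j to i
pr_t = state_probabilities(M, [1.0, 0.0], 100.0)
print(pr_t)
```

Storing the intermediate vectors at each step of the outer loop would give the points needed to plot the probability curves.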


Fig. 31.18 Principle of calculation with steps of time

The state probabilities calculated in this way can be used to assess the probability parameters (availability, reliability, failure rates and intensities, etc.) as described in 31.3.2–31.3.7.

31.4.2.2 Accumulated Sojourn Time Calculations

According to Formulae (31.9) and (31.69), the accumulated sojourn times can be obtained from the integral of the probability vector:

$$\vec{AST}(t) = \int_0^t \vec{Pr}(\tau) \cdot d\tau = \lim_{k \to \infty} \sum_{k} \int_0^t \frac{(\tau \cdot M)^k}{k!}\, d\tau \cdot \vec{Pr}_0 \qquad (31.72)$$

The primitive of $(\tau \cdot M)^k$ being $\frac{\tau^{k+1}}{k+1} \cdot M^k$, the above equation gives:

$$\vec{AST}(t) = \sum_{k=0}^{\infty} \frac{t}{k+1} \cdot \frac{(t \cdot M)^k}{k!} \cdot \vec{Pr}_0 \qquad (31.73)$$

As $\vec{Pr}^{(k)}(t) = \frac{(t \cdot M)^k}{k!} \cdot \vec{Pr}_0$, the formula can finally be written as:

$$\vec{AST}(t) = \sum_{k=0}^{\infty} \vec{AST}^{(k)}(t) \qquad (31.74)$$

with

$$\vec{AST}^{(k)}(t) = \frac{t}{k+1} \cdot \vec{Pr}^{(k)}(t) \qquad (31.75)$$

The accumulated sojourn times calculated in this way can be used to assess the probability parameters (MUT, MDT, MTBF and MTTF) as described in 31.3.8.


It has to be noted that the calculations of the state probabilities and of the accumulated sojourn times can be performed in parallel and this is very effective from a computational point of view.
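A minimal Python sketch of this parallel calculation is given below: the same series term $\vec{Pr}^{(k)}$ feeds both the state probabilities (Formula 31.70) and the accumulated sojourn times (Formula 31.75). The function names, the tolerance, the step choice and the two-state example are assumptions; the accumulated times of the successive steps of duration $\Delta$ are simply added.

```python
import math


def mat_vec(m, v):
    """Matrix x vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def step(m, p, delta, tol=1e-16, max_terms=200):
    """From Pr(0) = p, return (Pr(delta), AST(delta)) with one pass over the series."""
    term = p[:]                      # Pr^(0)
    prob = p[:]
    ast = [delta * x for x in p]     # AST^(0) = delta / (0 + 1) . Pr^(0)
    for k in range(1, max_terms + 1):
        term = [delta * x / k for x in mat_vec(m, term)]            # Pr^(k)
        prob = [a + x for a, x in zip(prob, term)]
        ast = [a + delta * x / (k + 1) for a, x in zip(ast, term)]  # AST^(k)
        if max(abs(x) for x in term) < tol:
            break
    return prob, ast


def probabilities_and_sojourn_times(m, p0, t):
    """Pr(t) and AST(t) over [0, t], computed in q steps of duration delta."""
    rate_max = max(abs(x) for row in m for x in row)
    q = max(1, math.ceil(t * rate_max))
    delta = t / q
    p, total = p0[:], [0.0] * len(p0)
    for _ in range(q):
        p, ast = step(m, p, delta)
        total = [a + b for a, b in zip(total, ast)]
    return p, total


# Assumed example: single repaired component, states (up, down), rates per hour.
lam, mu = 1.0e-3, 0.1
M = [[-lam, mu], [lam, -mu]]
pr_t, ast_t = probabilities_and_sojourn_times(M, [1.0, 0.0], 100.0)
print(pr_t, ast_t)
```

The accumulated sojourn times over all the states must add up to the elapsed time, which gives a simple consistency check.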

31.5 Advanced Modelling

31.5.1 Failure on Demand and Zero-Duration State

In addition to failing while running, an item can also fail upon demand. This is the case of an item B operated in cold standby which has to start when the redundant running item A fails. This failure mode is characterized by a constant probability, which cannot be included straightforwardly in a Markov graph since such a graph includes only transitions characterized by transition rates. Figure 31.19 illustrates this situation:

• in state $E_1$, the item A can fail (transition rate $\lambda_a$);
• when it fails, B is demanded to start and the system enters a state (small circle in dotted line in the figure) from which:
  1. either B fails to start (probability $\gamma$) and the system moves to state $E_4$;
  2. or B starts properly (probability $1 - \gamma$) and the system moves to state $E_3$.

The state represented by a small circle in dotted line in Fig. 31.19 has a very short duration and it is named transient, zero-duration, transparent or even non-permanent state, as opposed to the genuine Markovian states which are sometimes named permanent states. It is not a Markovian state but, due to its very short duration, it has no impact on the sojourn times in the other states and it can be eliminated to obtain the equivalent Markov graph presented in Fig. 31.20, which contains only transitions with constant transition rates.

However, the zero-duration states are generally kept in the Markov graph because they clarify the behaviour of the system which is modelled. This is illustrated in Fig. 31.21 for a system with 3 redundant components where A is normally running and B and C are used in standby.

Fig. 31.19 On demand failure and zero-duration state


Fig. 31.20 Markov graph equivalent to that in Fig. 31.19

Fig. 31.21 Modelling of standby items failing in cascade

The graph on the left-hand side of Fig. 31.21 is obviously far more explicit than the equivalent Markov graph on the right-hand side. In addition, this decreases the risk of mistakes when entering the transition rates. Exercise 31.5 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.
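The elimination of a zero-duration state can be sketched as follows. The sketch assumes the usual rule that the incoming rate is split in proportion to the branch probabilities (here $\lambda_a \cdot \gamma$ towards $E_4$ and $\lambda_a \cdot (1 - \gamma)$ towards $E_3$); the function name and the numerical values are hypothetical.

```python
def eliminate_zero_duration_state(rate_in, branch_probabilities):
    """Replace 'rate_in -> zero-duration state -> branches' by direct transitions.

    branch_probabilities maps each target state to the probability of reaching
    it from the zero-duration state; the probabilities must sum to 1.
    """
    if abs(sum(branch_probabilities.values()) - 1.0) > 1e-12:
        raise ValueError("branch probabilities must sum to 1")
    return {target: rate_in * p for target, p in branch_probabilities.items()}


# Assumed values: failure rate of A (per hour) and probability of failure to start of B.
lambda_a = 1.0e-4
gamma = 0.02
equivalent_rates = eliminate_zero_duration_state(
    lambda_a, {"E4": gamma, "E3": 1.0 - gamma}
)
print(equivalent_rates)
```

The sum of the equivalent rates remains equal to $\lambda_a$, so the sojourn time in $E_1$ is unchanged by the transformation.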

31.5.2 Sequence Modelling

31.5.2.1 Availability Versus Reliability Markov Processes with Regards to Event Sequences

A Markov graph embeds, in a compact way, all the sequences of events which can occur over a given period of time. Each particular sequence can be represented by a chronogram (see Chap. 30) which represents a particular history of the system and, mathematically speaking, a trajectory of the underlying random processes. For example, the reliability Markov process proposed in Fig. 31.22 embeds the following event sequences to reach state $E_3$ from state $E_1$:

– $E_1 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_1 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_1 \to E_2 \to E_1 \to E_2 \to E_3$
– etc.


Fig. 31.22 Reliability and availability Markov graphs of 1oo2 system

The sequences stop when the absorbing state is reached. The situation is different with an availability Markov process because there is no absorbing state, and the availability Markov graph proposed in Fig. 31.22 embeds the following event sequences to reach state $E_3$ from state $E_1$:

– $E_1 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_3 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_1 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_3 \to E_2 \to E_3 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_1 \to E_2 \to E_1 \to E_2 \to E_3$
– $E_1 \to E_2 \to E_3 \to E_2 \to E_1 \to E_2 \to E_3$
– etc.

Even if the reliability Markov graph embeds fewer sequences than the availability Markov graph, their number is infinite in both cases. This makes it difficult to judge which sequences contribute most to the system unreliability over a given period $[0, T]$ or to the system unavailability at a given time $T$. This is analysed hereafter in order to evaluate the contribution of the shortest sequences to the system unreliability and unavailability. From the above sequences it is also possible to calculate how they contribute to the system failure rate or failure intensity and this is also analysed hereafter.

31.5.2.2 Probability Calculation of a Sequence of Events

A general sequence of $k - 1$ transitions is presented in Fig. 31.23: this sequence starts from state $E_1^s$ and ends in state $E_k^s$. The first step to calculate the probability of such a sequence is to establish the probability density of a jump from $E_i$ to $E_j$ given that the system is in state $E_i$:

– $\alpha_i = \sum_{j \ne i} \alpha_{i,j}$ is the transition rate out of state $E_i$ (see Sect. 31.2.1).
– $\alpha_i \cdot e^{-\alpha_i \cdot t}\, dt$ is the probability to move out of $E_i$ between $t$ and $t + dt$ given that the system entered state $E_i$ at time 0.
– $\gamma_{i,j} = \alpha_{i,j}/\alpha_i$ is the probability that the system moves specifically to $E_j$ when it jumps out of $E_i$ (this is a property of the exponential law).

Fig. 31.23 General sequence of events

And finally, the probability density to jump from $E_i$ to $E_j$, given being in state $E_i$, is given by:

$$f_{i,j}(t) = \alpha_{i,j}\, e^{-\alpha_i \cdot t} \qquad (31.76)$$

This is also the probability density of the random variable $\tau_{i,j}$, the sojourn time in state $E_i$ given that the system moves to $E_j$. With regards to the sequence in Fig. 31.23, and to simplify the notation, let us note:

– $\tau_i^s = \tau_{i,i+1}$ the sojourn time in state $E_i^s$ before jumping to $E_{i+1}^s$;
– $f_i^s(t) = f_{\tau_i^s}(t)$ the probability density of $\tau_i^s$;
– $\tau^s = \sum_{i=1}^{k-1} \tau_i^s$ the time elapsing to realize the sequence from $E_1^s$ to $E_k^s$.

As $\tau^s$ is a sum of random variables, its density $f_{\tau^s}(\tau^s)$ is equal to the convolution (noted "*") of the densities of each of the random variables in the sum:

$$f_{\tau^s}(\tau^s) = \left(f_1^s * f_2^s * \cdots * f_{k-1}^s\right)(\tau^s) \qquad (31.77)$$

The sequence $E_1^s$ to $E_k^s$ occurs within the interval $[0, T]$ only if $\tau^s$ is smaller than $T$ and if no jump out of $E_k^s$ occurs within the interval $[\tau^s, T]$. Therefore, the probability $Pr^s(T)$ of such a sequence is equal to $Pr^s(T) = \int_0^T f_{\tau^s}(t) \cdot e^{-\alpha_k (T - t)}\, dt$. Again, a convolution product is found and:

$$Pr^s(T) = f_{\tau^s}(T) * e^{-\alpha_k \cdot T} \qquad (31.78)$$

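The use of the Laplace transform is useful to solve the above equation. Let $\hat{f}_{i,j}(s) = \mathcal{L}\left\{\alpha_{i,j}\, e^{-\alpha_i \cdot t}\right\} = \frac{\alpha_{i,j}}{\alpha_i + s}$ and this leads to:

$$\hat{P}^s(s) = \frac{\alpha_{1,2} \cdot \alpha_{2,3} \cdots \alpha_{k-1,k}}{(\alpha_1 + s)(\alpha_2 + s) \cdots (\alpha_{k-1} + s)(\alpha_k + s)} \qquad (31.79)$$

The above formula can be decomposed in simple elements:

$$\hat{P}^s(s) = \frac{C_1^s}{\alpha_1 + s} + \frac{C_2^s}{\alpha_2 + s} + \cdots + \frac{C_{k-1}^s}{\alpha_{k-1} + s} + \frac{C_k^s}{\alpha_k + s} \qquad (31.80)$$

When all the $\alpha_i$ are different, the coefficients $C_i^s$ are easy to calculate by:

– multiplying each side by $(\alpha_i + s)$;
– replacing $s$ by $-\alpha_i$.

This gives:

$$C_i^s = \frac{\prod_{j=1}^{k-1} \alpha_{j,j+1}}{\prod_{j=1,\, j \ne i}^{k} (\alpha_j - \alpha_i)} \qquad (31.81)$$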

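For more accurate numerical results, $C_i^s$ (for $i < k$) should be calculated as:

$$C_i^s = \frac{1}{\dfrac{\alpha_k}{\alpha_{i,i+1}} - \dfrac{\alpha_i}{\alpha_{i,i+1}}} \cdot \prod_{j=1,\, j \ne i}^{k-1} \frac{1}{\dfrac{\alpha_j}{\alpha_{j,j+1}} - \dfrac{\alpha_i}{\alpha_{j,j+1}}} \qquad (31.82)$$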

and finally:

$$Pr^s(T) = \sum_{i=1}^{k} C_i^s \cdot e^{-\alpha_i \cdot T} \qquad (31.83)$$

– If $E_k^s$ is an absorbing state (reliability graph), then $\alpha_k$ is equal to zero and $e^{-\alpha_k \cdot T}$ is equal to 1 in the above formula. This implies that $Pr^s(\infty) = C_k^s$. Therefore, the contribution of a given sequence to the system unreliability increases when time increases until a maximum value, $C_k^s$, is reached.
– If $E_k^s$ is not an absorbing state (availability graph), then $Pr^s(\infty) = 0$. Therefore, the contribution of a given sequence to the system unavailability decreases when time increases.

Availability Markov graph

Let us come back to the example of Fig. 31.22 and consider the single direct sequence $E_1 \to E_2 \to E_3$ of the availability Markov graph. The following coefficients are obtained:

– $\alpha_{1,2} = 2\lambda$ and $\alpha_1 = 2\lambda$
– $\alpha_{2,3} = \lambda$ and $\alpha_2 = \lambda + \mu$
– $\alpha_3 = 2\mu$

Applying Formula (31.81) allows the coefficients $C_1$, $C_2$ and $C_3$ of this sequence comprising three events to be calculated:

– $C_1 = \dfrac{2\lambda^2}{(\alpha_2 - \alpha_1)(\alpha_3 - \alpha_1)} = \dfrac{2\lambda^2}{(\lambda + \mu - 2\lambda)(2\mu - 2\lambda)} = \dfrac{2\lambda^2}{2(\mu - \lambda)^2} = \dfrac{\lambda^2}{(\mu - \lambda)^2}$
– $C_2 = \dfrac{2\lambda^2}{(\alpha_1 - \alpha_2)(\alpha_3 - \alpha_2)} = \dfrac{2\lambda^2}{(2\lambda - \lambda - \mu)(2\mu - \lambda - \mu)} = \dfrac{2\lambda^2}{(\lambda - \mu)(\mu - \lambda)} = -\dfrac{2\lambda^2}{(\mu - \lambda)^2}$
– $C_3 = \dfrac{2\lambda^2}{(\alpha_1 - \alpha_3)(\alpha_2 - \alpha_3)} = \dfrac{2\lambda^2}{(2\lambda - 2\mu)(\lambda + \mu - 2\mu)} = \dfrac{2\lambda^2}{2(\lambda - \mu)(\lambda - \mu)} = \dfrac{\lambda^2}{(\lambda - \mu)^2}$

This leads to the formula giving the probability of occurrence of this sequence:

$$Pr^s(T) = C_1 \cdot e^{-2\lambda \cdot T} + C_2 \cdot e^{-(\lambda + \mu) \cdot T} + C_3 \cdot e^{-2\mu \cdot T} \qquad (31.84)$$

The comparison between the unavailability and the probability of the direct sequence $E_1 \to E_2 \to E_3$ is done in Fig. 31.24. They reach very different asymptotic values in the long term (right-hand side of the figure) and they have similar values only in the very short term: as shown in the figure, the difference is already more than 9% at one tenth of the mean sojourn time spent in state $E_1$. Therefore, the probability of the direct sequence from the up to the down state cannot be used to approximate the unavailability of the modelled system.

Reliability Markov graph

Let us come back to the example of Fig. 31.22 and consider the reliability Markov graph. The single direct sequence is also $E_1 \to E_2 \to E_3$ but the coefficient $\alpha_3$ is now equal to zero:

– $\alpha_{1,2} = 2\lambda$ and $\alpha_1 = 2\lambda$
– $\alpha_{2,3} = \lambda$ and $\alpha_2 = \lambda + \mu$
– $\alpha_3 = 0$

Applying Formula (31.81) allows the coefficients $C_1$, $C_2$ and $C_3$ of this sequence comprising three events to be calculated:

– $C_1 = \dfrac{2\lambda^2}{(\alpha_2 - \alpha_1)(\alpha_3 - \alpha_1)} = \dfrac{2\lambda^2}{(\lambda + \mu - 2\lambda)(-2\lambda)} = \dfrac{2\lambda^2}{-2\lambda(\mu - \lambda)} = \dfrac{\lambda}{\lambda - \mu}$
– $C_2 = \dfrac{2\lambda^2}{(\alpha_1 - \alpha_2)(\alpha_3 - \alpha_2)} = -\dfrac{2\lambda^2}{(2\lambda - \lambda - \mu)(\lambda + \mu)} = \dfrac{2\lambda^2}{(\mu - \lambda)(\lambda + \mu)}$
– $C_3 = \dfrac{2\lambda^2}{(\alpha_1 - \alpha_3)(\alpha_2 - \alpha_3)} = \dfrac{2\lambda^2}{2\lambda(\lambda + \mu)} = \dfrac{\lambda}{\lambda + \mu}$

Fig. 31.24 Comparison between unavailability and the shortest event sequence


Fig. 31.25 Comparison between unreliability and the shortest event sequence

This leads to the formula giving the probability of occurrence of this sequence:

$$Pr^s(T) = C_1 \cdot e^{-2\lambda \cdot T} + C_2 \cdot e^{-(\lambda + \mu) \cdot T} + C_3 \qquad (31.85)$$

Therefore, this probability converges toward $Pr^s(\infty) = \frac{\lambda}{\lambda + \mu}$, which is the asymptotic unavailability of a single repaired component. The comparison between the unreliability and the probability of the direct sequence $E_1 \to E_2 \to E_3$ is done in Fig. 31.25. As in the availability case, they reach very different asymptotic values in the long term (right-hand side of the figure) and they have similar values only in the very short term: as shown in the figure, the difference already exceeds about 5% at one tenth of the mean sojourn time spent in state $E_1$. Again, the probability of the direct sequence from the up to the down state cannot be used to approximate the unreliability of the modelled system.

The conclusion of this section is that good approximations cannot be provided by using the direct sequences only: this implies that, even in the short term, looped sequences also have to be considered. The probabilities of looped sequences can be calculated as above by using the same principle based on convolution products and Laplace transforms. Nevertheless, when loops are considered, some coefficients $\alpha_i$ appear several times in the equations. This introduces poles with an order of multiplicity greater than 1, which makes it difficult to calculate the coefficients $C_i$ needed to determine the inverse Laplace transform. Specific computer software packages have been developed on this basis for reliability calculations (Bouissou and Muffat 2003), but it is beyond the scope of this book to develop such complicated calculations which, however, are useful if the probability of a given sequence is needed for a specific purpose.
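For a direct sequence whose $\alpha_i$ are all different, Formulae (31.81) and (31.83) can be sketched in Python as follows. The function names and the numerical values of λ and μ are assumptions; the sketch does not handle looped sequences (repeated $\alpha_i$), for which the poles are multiple.

```python
import math


def sequence_coefficients(rates_along, rates_out):
    """Coefficients C_i of Formula (31.81).

    rates_along: [alpha_12, alpha_23, ...], the k - 1 rates followed by the sequence.
    rates_out:   [alpha_1, ..., alpha_k], total rates out of each state (all distinct).
    """
    numerator = math.prod(rates_along)
    coefficients = []
    for i, alpha_i in enumerate(rates_out):
        denominator = math.prod(
            alpha_j - alpha_i for j, alpha_j in enumerate(rates_out) if j != i
        )
        coefficients.append(numerator / denominator)
    return coefficients


def sequence_probability(coefficients, rates_out, t):
    """Probability of the sequence at time t, Formula (31.83)."""
    return sum(c * math.exp(-a * t) for c, a in zip(coefficients, rates_out))


# Assumed illustrative rates (per hour) for the 1oo2 example.
lam, mu = 1.0e-3, 0.1
along = [2 * lam, lam]

# Availability graph: alpha_3 = 2.mu; reliability graph: alpha_3 = 0 (absorbing).
out_avail = [2 * lam, lam + mu, 2 * mu]
out_rel = [2 * lam, lam + mu, 0.0]
c_avail = sequence_coefficients(along, out_avail)
c_rel = sequence_coefficients(along, out_rel)

print(c_avail, c_rel)
print(sequence_probability(c_rel, out_rel, 1.0e5))
```

In both cases the coefficients sum to zero, which reflects the fact that the probability of the sequence is null at $T = 0$.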

31.5.2.3 Equivalent Failure Rates for Quickly Repaired Systems

If the calculation of the probabilities of the shortest direct sequences is not sufficient to provide good approximations of the system unreliability or unavailability, these sequences are, on the contrary, very effective to estimate the equivalent failure and repair rates of the modelled system under the assumption of quick detection and repair of the occurring failures.

Sections 31.3.4, 31.3.5 and 31.3.7 explain how to calculate the system failure rate, $\Lambda(t)$, the Vesely failure rate, $\lambda_V(t)$, and the repair intensity, $\mu_V(t)$, of a system by considering the critical states (with regards to system failure or with regards to system repair) of the Markov graph modelling this system. This allows the whole system to be modelled as equivalent macro-components, as represented in Fig. 31.26 for reliability and availability calculations. Unfortunately, the equivalent failure and repair rates obtained in this way are time-dependent and the Markov graphs in Fig. 31.26 are no longer homogeneous: the formulae developed above cannot be used. Fortunately, when the failures of the components of the modelled system are quickly detected and repaired, $\Lambda(t)$, $\lambda_V(t)$ and $\mu_V(t)$ converge rather quickly toward asymptotic values providing equivalent failure and repair rates usable beyond the transient period: $\Lambda_{eq} = \Lambda(\infty) \approx \lambda_V(\infty)$ and $M_{eq} = \mu_V(\infty)$. In this case, the event sequences can be used to calculate approximations of the failure and repair rates of the modelled system as well as its asymptotic availability or unavailability.

These calculations are based on the observation of the mean sojourn times, $MST_i = 1/\alpha_i$, in the various states of the modelled system. For example, using the availability Markov graph proposed in Fig. 31.6 with $\lambda_a = 1.0 \times 10^{-4}\ \mathrm{h}^{-1}$, $\lambda_b = 2.0 \times 10^{-4}\ \mathrm{h}^{-1}$, $\mu_a = \mu_b = 0.1\ \mathrm{h}^{-1}$ leads to the following mean sojourn times:

– $MST_1 = 1/(\lambda_a + \lambda_b) = 3333$ h.
– $MST_2 = 1/(\lambda_a + \mu_b) = 9.99$ h.
– $MST_3 = 1/(\lambda_b + \mu_a) = 9.98$ h.
– $MST_4 = 1/\mu_b = 10$ h.

Among the above sojourn times, one of them ($MST_1$) is far greater than the others: the time spent in the initial state $E_1$, where only failures occur, is 333 times greater than the time spent in the states where repairs actually occur ($E_2$, $E_3$ and $E_4$). Therefore, in this example, the times spent in states $E_2$, $E_3$ and $E_4$ are negligible with regards to the time spent in state $E_1$ and these states can be assimilated to zero-duration states (see 31.5.1). This behaviour is typical of a system where all component failures are quickly detected and repaired: the system stays most of the time in state $E_1$ and, when it jumps out of this state toward $E_2$ or $E_3$, it leaves $E_2$ or $E_3$ almost instantaneously to come back to $E_1$ or to reach $E_4$.

Fig. 31.26 Macro-component with time-dependent failure and repair rates

Due to a property of the exponential laws, $\gamma_{i,k} = \alpha_{i,k} / \sum_{j \ne i} \alpha_{i,j} = \alpha_{i,k}/\alpha_i$ is the probability to jump from $i$ to $k$ when the system jumps out of state $i$. With the above parameters, this leads to, for example:

– $\gamma_{2,4} = \lambda_a/(\lambda_a + \mu_b) = 9.99 \times 10^{-4}$ and $\gamma_{2,1} = \mu_b/(\lambda_a + \mu_b) = 0.999$
– $\gamma_{3,4} = \lambda_b/(\lambda_b + \mu_a) = 2.00 \times 10^{-3}$ and $\gamma_{3,1} = \mu_a/(\lambda_b + \mu_a) = 0.998$.

Therefore, when the system reaches state $E_2$, it jumps out of it almost immediately and moves back to state $E_1$ with a probability 1000 times higher than that of moving to state $E_4$. In a similar way, when it moves to state $E_3$, it jumps out of it almost immediately and moves back to state $E_1$ with a probability 500 times higher than that of moving to state $E_4$. This behaviour is illustrated in Fig. 31.27. It can be used to estimate the contributions to the failure rate of sequence $E_1 \to E_3 \to E_4$ and of sequence $E_1 \to E_2 \to E_4$:

– contribution of sequence $E_1 \to E_3 \to E_4$: $\lambda_a \dfrac{\lambda_b}{\lambda_b + \mu_a}$;
– contribution of sequence $E_1 \to E_2 \to E_4$: $\lambda_b \dfrac{\lambda_a}{\lambda_a + \mu_b}$.

Then, the approximation of the overall equivalent failure rate $\Lambda_{eq}$ of the system is given by the sum of the two above transition rates:

$$\Lambda_{eq} = \lambda_a \frac{\lambda_b}{\lambda_b + \mu_a} + \lambda_b \frac{\lambda_a}{\lambda_a + \mu_b} \qquad (31.86)$$

With the previous values, this leads to $\Lambda_{eq} = 3.99 \times 10^{-7}\ \mathrm{h}^{-1}$. This calculation does not distinguish between the availability and the reliability Markov graphs because, with the assumption of a quick repair, the genuine failure rate and the Vesely failure rate have almost the same values. Then, the unreliability of the modelled system can be calculated as:

$$F(t) \approx 1 - e^{-\Lambda_{eq} \cdot t} \qquad (31.87)$$

Fig. 31.27 Sequences leading to the absorbing state
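The numerical example above can be reproduced with the short Python sketch below, which evaluates the mean sojourn times, the jump probabilities, Formula (31.86) and Formula (31.87). The variable and function names are chosen for this illustration only.

```python
import math

# Rates of the example (per hour).
lam_a, lam_b = 1.0e-4, 2.0e-4
mu_a = mu_b = 0.1

# Mean sojourn times MST_i = 1 / alpha_i (hours).
mst = {
    "E1": 1.0 / (lam_a + lam_b),
    "E2": 1.0 / (lam_a + mu_b),
    "E3": 1.0 / (lam_b + mu_a),
    "E4": 1.0 / mu_b,
}

# Probabilities to jump toward E4 when leaving E2 or E3.
gamma_24 = lam_a / (lam_a + mu_b)
gamma_34 = lam_b / (lam_b + mu_a)

# Formula (31.86): sum of the contributions of the two direct sequences.
lambda_eq = lam_a * gamma_34 + lam_b * gamma_24


def unreliability(t):
    """Formula (31.87); expm1 keeps accuracy when lambda_eq * t is very small."""
    return -math.expm1(-lambda_eq * t)


print(mst)
print(gamma_24, gamma_34, lambda_eq)
print(unreliability(8760.0))
```

The use of `math.expm1` instead of `1 - math.exp(...)` is a design choice of the sketch: it avoids the loss of significant digits discussed in Sect. 31.4.2.1 when the unreliability is very small.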

In this case, the Markov graph is reduced to the simple Markov graph with only a failure transition presented at the top of Fig. 31.28. In this figure, the exact and the approximated results of the modelled system unreliability are compared. On the left-hand side, where the mean overall repair time is short (MORT = 10 h, i.e. $\mu_a = \mu_b = 0.1\ \mathrm{h}^{-1}$), it is practically impossible to distinguish the approximation from the exact value. On the right-hand side, the mean overall repair time has been increased (MORT = 1000 h, i.e. $\mu_a = \mu_b = 0.001\ \mathrm{h}^{-1}$) to show a difference between the two calculations. Therefore, provided that the faults are actually repaired, the approximation is very robust and holds even if the repairs are not really very fast.

The above result can be generalized for large Markov graphs by identifying all the direct event sequences, like that illustrated in Fig. 31.29, from the perfect state to the absorbing state. The contribution to the equivalent failure rate of such a sequence $s$ is equal to $\alpha^s = \alpha_{1,2}^s \cdot \prod_{j=2}^{k-1} \gamma_{j,j+1}^s$. Then the approximation of the equivalent failure rate is obtained by adding the contributions of all the sequences: $\Lambda_{eq} = \sum_s \alpha^s$.

Fig. 31.28 Accuracy of the approximation of reliability calculations according to the repair rate

Fig. 31.29 Generalization of a sequence $s$ of events $E_1^s \to E_2^s \to E_3^s \cdots \to E_k^s$: reliability case

31.5.2.4 Asymptotic Unavailability and Equivalent Repair Rate for Quickly Repaired Systems

Let us consider the availability Markov graph in Fig. 31.6. In this case, the direct sequences are the same as for the reliability Markov graph but, as illustrated in Fig. 31.30, state $E_4$ is no longer an absorbing state. The transition rate out of $E_4$ being $\mu_b$, the mean sojourn time in this state is equal to $MST_4 = 1/\mu_b$ and, therefore, the contribution of the above sequences to the asymptotic value of the unavailability can be approximated by:

– contribution of sequence $E_1 \to E_3 \to E_4$: $\lambda_a \dfrac{\lambda_b}{\lambda_b + \mu_a} \cdot \dfrac{1}{\mu_b}$
– contribution of sequence $E_1 \to E_2 \to E_4$: $\lambda_b \dfrac{\lambda_a}{\lambda_a + \mu_b} \cdot \dfrac{1}{\mu_b}$.

Then, the approximation of the asymptotic unavailability of the system is given by the sum of the two above contributions:

$$U(\infty) \approx \left[\lambda_a \frac{\lambda_b}{\lambda_b + \mu_a} + \lambda_b \frac{\lambda_a}{\lambda_a + \mu_b}\right] \cdot \frac{1}{\mu_b} = \frac{\Lambda_{eq}}{\mu_b} \qquad (31.88)$$

Under the assumption of quick detection and repair of failures, the asymptotic value of the unavailability can be approximated by $U(\infty) \approx \frac{\Lambda_{eq}}{M_{eq}}$ where $M_{eq}$ is the equivalent repair rate of the modelled system. Therefore, for the above example, the equivalent repair rate is given by $M_{eq} = \mu_b$. With the above equivalent failure and repair rates, the Markov graph can be reduced to the Markov graph on the left-hand side of Fig. 31.31 with only two transitions. In this figure, the exact and the approximated results of the modelled system unavailability are compared: the approximation is conservative and converges toward the same asymptotic value $U(\infty)$ of the unavailability. However, the fit is not as good as in the reliability case for the transient period.

Fig. 31.30 Sequences leading to the down state

Fig. 31.31 Accuracy of the approximation of availability calculations

The above result can be generalized for large Markov graphs by identifying all the direct event sequences, like that illustrated in Fig. 31.32, from the perfect state to the down state class. In this case, the system can jump out of the down state class to the up state class. The contribution to the equivalent failure rate of such a sequence is the same as in the reliability case, $\alpha^s = \alpha_{1,2}^s \cdot \prod_{j=2}^{k-1} \gamma_{j,j+1}^s$, and its contribution to the asymptotic unavailability is equal to $U^s(\infty) = \alpha^s / \alpha_{k^s,k^s-1}^s$. Considering that $U(\infty) = \Lambda_{eq}/(\Lambda_{eq} + M_{eq})$ leads to the following parameters:

– equivalent failure rate: $\Lambda_{eq} = \sum_s \alpha^s$
– asymptotic unavailability: $U(\infty) = \sum_s U^s(\infty)$
– asymptotic availability: $A(\infty) = 1 - U(\infty)$
– equivalent repair rate: $M_{eq} = \Lambda_{eq} \cdot \dfrac{1 - U(\infty)}{U(\infty)} \approx \dfrac{\Lambda_{eq}}{U(\infty)}$.

Fig. 31.32 Generalization of a sequence $s$ of events $E_1^s \to E_2^s \to E_3^s \cdots \to E_k^s$: availability case


When the equivalent repair rate is easier to calculate than the equivalent failure rate, it should be calculated first. The equivalent failure rate is then obtained by $\Lambda_{eq} = M_{eq} \dfrac{U(\infty)}{1 - U(\infty)} \approx M_{eq} \cdot U(\infty)$. This general approach is useful to define macro-components made of several components (see Sect. 31.6.1.3) in order to reduce the size of a large Markov graph.
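The parameters listed above can be evaluated for the two-component example with the Python sketch below. The representation of each sequence by a small dictionary (first rate, intermediate jump probabilities and rate out of the down state) and all the names used are assumptions made for the illustration.

```python
# Rates of the example (per hour).
lam_a, lam_b = 1.0e-4, 2.0e-4
mu_a = mu_b = 0.1

# Each direct sequence: rate of its first transition, jump probabilities of the
# intermediate states, and transition rate out of the final (down) state.
sequences = [
    {"first_rate": lam_a, "gammas": [lam_b / (lam_b + mu_a)], "rate_out": mu_b},  # E1->E3->E4
    {"first_rate": lam_b, "gammas": [lam_a / (lam_a + mu_b)], "rate_out": mu_b},  # E1->E2->E4
]


def failure_rate_contribution(seq):
    """alpha^s: first transition rate times the jump probabilities along the sequence."""
    alpha_s = seq["first_rate"]
    for gamma in seq["gammas"]:
        alpha_s *= gamma
    return alpha_s


lambda_eq = sum(failure_rate_contribution(s) for s in sequences)
u_inf = sum(failure_rate_contribution(s) / s["rate_out"] for s in sequences)
a_inf = 1.0 - u_inf
m_eq = lambda_eq * (1.0 - u_inf) / u_inf

print(lambda_eq, u_inf, a_inf, m_eq)
```

With these values $M_{eq}$ is very close to $\mu_b$, as stated for this example.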

31.5.3 Multistate Modelling and Production Availability

As explained in Chap. 5, a classical Markov graph comprises only two classes of states (e.g. up and down) but this can be extended to more than two classes when intermediate states exist between the perfect state and the completely failed state. These intermediate states are neither perfect nor completely faulty. Such systems are called multistate systems and the concept of efficiency has been introduced in Chap. 5 to distinguish between the various classes of states.

A pumping system with a nominal production rate of 120 m³/h is illustrated in Fig. 31.33. It is made of two pumps with different pumping capacities. The corresponding Markov graph is similar to the availability Markov graph already analysed in the previous chapter but the state efficiencies $\varepsilon_i$ have been indicated:

– state $E_1$ could provide 144 m³/h but is limited to 120 m³/h and then its efficiency $\varepsilon_1$ is 100%;
– state $E_2$ could provide 60 m³/h and its efficiency $\varepsilon_2$ is 50%;
– state $E_3$ could provide 84 m³/h and its efficiency $\varepsilon_3$ is 70%;
– state $E_4$ provides nothing and its efficiency $\varepsilon_4$ is 0%.

State $E_3$ therefore has a better efficiency than state $E_2$ and this explains why pump B is repaired first when the system is completely failed: this restores an efficiency of 70% rather than only 50% if A was repaired first.

Fig. 31.33 Modelling a multistate system with a Markov graph


As shown in Chap. 5, if $\rho$ is the maximum production rate (120 m³/h), the expected instantaneous production rate $Pdr(t)$ in m³/h is given by:

$$Pdr(t) = 120 \cdot [Pr_1(t) + 50\% \cdot Pr_2(t) + 70\% \cdot Pr_3(t)] \qquad (31.89)$$

The above formula can be written $Pdr(t) = 120 \cdot \varepsilon_S(t)$ where $\varepsilon_S(t)$ is the time-dependent system efficiency. As shown in Chap. 5, this is also the instantaneous productivity $Pdy(t)$ of the production system, i.e. the ratio of the instantaneous production rate to the maximum production rate:

$$\varepsilon_S(t) = Pdy(t) = \frac{Pdr(t)}{120} = [Pr_1(t) + 50\% \cdot Pr_2(t) + 70\% \cdot Pr_3(t)] \qquad (31.90)$$

This can be easily generalized to:

$$Pdr(t) = \rho \cdot \sum_i \varepsilon_i\, Pr_i(t) = \rho \cdot \varepsilon_S(t) \qquad (31.91)$$

and

$$\varepsilon_S(t) = Pdy(t) = \frac{Pdr(t)}{\rho} = \sum_i \varepsilon_i\, Pr_i(t) \qquad (31.92)$$
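Formulae (31.91) and (31.92) amount to a weighted sum of the state probabilities, as in the following Python sketch. The efficiencies and the maximum production rate are those of the pumping example; the function names and the state probabilities used in the call are arbitrary illustrative values, not results from the book.

```python
RHO = 120.0                          # maximum production rate (m3/h)
EFFICIENCIES = [1.0, 0.5, 0.7, 0.0]  # efficiencies of states E1 to E4


def production_availability(state_probabilities, efficiencies=EFFICIENCIES):
    """System efficiency eps_S(t) = Pdy(t), Formula (31.92)."""
    return sum(e * p for e, p in zip(efficiencies, state_probabilities))


def production_rate(state_probabilities, rho=RHO, efficiencies=EFFICIENCIES):
    """Expected instantaneous production rate Pdr(t), Formula (31.91)."""
    return rho * production_availability(state_probabilities, efficiencies)


# Assumed state probabilities at some time t (hypothetical values).
pr_t = [0.90, 0.05, 0.04, 0.01]
print(production_availability(pr_t), production_rate(pr_t))

# Conventional availability: efficiencies 1 for the up states, 0 for the down state.
print(production_availability(pr_t, [1.0, 1.0, 1.0, 0.0]))
```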

When conventional availability calculations are performed, $\rho$ is equal to 1 and the states are split into only two classes, $\varepsilon_i = 1$ for the up states and $\varepsilon_i = 0$ for the down states. Then the two above formulae give the same result and $\varepsilon_S(t)$ is the same as the availability $A(t)$ of the modelled system. When this is applied to a production system, $\varepsilon_S(t) = Pdy(t)$ is called production availability (see ISO 20815 Ed. 2.0 2018).

Figure 31.34 illustrates the comparison between the availability $A(t)$ and the production availability (efficiency) $Pdy(t)$ of the system presented in Fig. 31.33. Both decrease from 1 and tend toward asymptotic values. The production availability is lower than the availability because the efficiencies of the degraded states are lower than 1 in this case.

Fig. 31.34 Comparison between availability and production availability

More information about multistate systems can be found in Chap. 5 about:

– accumulated production $Apd(T)$;
– average productivity $Pdy(T)$ over a given interval, $[0, T]$;
– equivalent production time $T_{eq}(T)$;
– extension to state efficiency related to incomes and costs.

Exercise 31.13 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.

31.5.4 Multiphase Modelling

Exercise 31.4 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.

31.5.4.1 Introduction and Principle

Until now, it has been considered that the system behaviour was modelled by the same Markov process over the whole period of interest. However, in the case of actual industrial systems, this period is often split into several phases which cannot be modelled by the same Markov graph. This is the case when:

– the level of redundancy changes (e.g. a system redundant in one phase becomes non-redundant in another phase);
– the repaired/non-repaired nature changes (e.g. a system non-repaired in winter is repaired during the other seasons);
– components are periodically tested (e.g. safety systems);
– the production capacity changes (e.g. the demand in gas is higher in winter);
– several of the above cases are combined;
– etc.

Illustrations of some of the above examples are given hereafter.

31.5.4.2 System Repaired or Non-repaired According to the Phase

The multiphase system illustrated in Fig. 31.35 is based on the example of the production system analysed above (Fig. 31.33). It is now considered that this system is located in a harsh environment, such that it is not repaired in winter and repaired in the other seasons. Therefore, the system alternates between one phase of 9 months (duration $\tau$) where it belongs to the repaired systems and a phase of 3 months (duration $\theta$) where it belongs to the non-repaired systems. This is a typical multiphase system with recurring phases, i.e. the two different phases are repeated again and again and, for example, over 5 years, the repaired/non-repaired alternation is observed 5 times.


Fig. 31.35 Example of a production system with recurring phases

The Markov graph of Fig. 31.33 is kept as it is within the repaired phases, and all the repair transitions are removed for the non-repaired phase. As the states do not change, the same names have been kept from one phase to another. After having described the behaviour of the system in the various phases, the next step is to describe what happens when the system goes from one phase to the next one, i.e. how phase $k + 1$ is linked to phase $k$. This is done by introducing the concept of linking matrix:

Linking matrix, $C^{k,k+1}$: matrix such that the coefficient $c_{i,j}^{k,k+1}$ is the probability that state $E_i^k$ at the end of phase $k$ gives state $E_j^{k+1}$ at the beginning of phase $k + 1$.

Therefore, the linking matrix $C^{k,k+1}$ allows the initial condition of phase $k + 1$ to be calculated from the state probabilities at the end of phase $k$ of duration $\tau_k$:

$$\vec{Pr}^{k+1}(0) = C^{k,k+1} \cdot \vec{Pr}^{k}(\tau_k) \qquad (31.93)$$

According to the memoryless property of the Markov processes, the knowledge  k+1 (0) is necessary and sufficient to calculate what happens in phase k + 1. of Pr In Fig. 31.35, the linking between the phases is very simple as state Eik gives the same state Eik+1 when moving from phase k to phase k + 1. Then, in this case, the k,k+1 k,k+1 = 1 and ci,j = 0, ∀i = linking matrix C k,k+1 is a square diagonal matrix (ci,i j). Formula (31.93) is a recurrent formula which can be started as soon as the initial − → − → conditions Pr 1 (0) = Pr 0 of the first phase are defined: the process is the following: − → − → 1. define the initial condition of phase 1: Pr 1 (0) = Pr 0 ; 2. phase 1 becomes the current phase k; 3. use the Markov graph describing the behaviour in phase k to calculate the − → parameters of interest and the state probabilities at the end of phase k: Pr k (τk ); − →k+1 4. calculate the initial condition of phase k + 1: Pr (0); 5. phase k + 1 becomes the current phase k; k = k + 1; 6. go to 3;


31 Markovian Modelling

7. stop when the whole period of interest has been covered.

Since $\vec{Pr}^{\,1}(0)$ is defined, all the calculations described above in the present chapter can be performed within each of the phases: availability and reliability, average availability, failure intensity, efficiency, etc. However, as the phases now have finite durations, the asymptotic values as previously defined are no longer relevant. Nevertheless, with recurring phases, the average values may converge toward asymptotic values (see Figs. 31.42, 31.43, 31.44, 31.45 and 31.46).

Figure 31.36 illustrates the behaviour of the multiphase-multistate production system presented in Fig. 31.35. During the first phase, the behaviour is similar to Fig. 31.34: both availability A(t) and production availability Pdy(t) decrease from one and reach asymptotic values. When the system enters a non-repaired phase, both decrease from the previous asymptotic values. When entering the following repaired phase, both increase until the asymptotic values are reached again. And so on. The production availability is always lower than the availability for the same reason as explained above: the efficiency of the degraded states is lower than 1.

As shown in Fig. 31.36, the curves do not reach asymptotic values when t goes to infinity. Nevertheless, an asymptotic limit shape is reached from the 2nd phase onwards, as shown in Fig. 31.37, where the average production availability Pty(t) over [0, t] is represented in dotted line. This figure also shows that A(θ) and A(τ) reach asymptotic values, as well as Pty(θ) and Pty(τ). Therefore, in two consecutive phases located far from the origin of time, A(θ + τ) and Pty(θ + τ) converge toward asymptotic values.

The multiphase Markov graph in Fig. 31.35 can be used to calculate the unreliability F(t) of the modelled system. To do so, state E4 has to be made absorbing in every phase.
The transitions from E4 to E3 have to be removed in phases 1 and 3 and the resulting Markov graph is shown in Fig. 31.38 on the left. As E4 is already an absorbing state in phase 2, the Markov graph for this phase can be kept as it is.
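The recurrence based on formula (31.93) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used to produce the figures: the two-state component, the rates `LAM` and `MU`, the phase durations and the function names are assumptions chosen for the example. State probabilities are handled as row vectors, so that the probability of starting phase k + 1 in state j is the sum over i of the probability of ending phase k in state i times the coefficient c[i][j].

```python
def rk4_step(p, q, h):
    """One Runge-Kutta step of dp/dt = p.Q (p: row vector, Q: transition rate matrix)."""
    def deriv(v):
        n = len(v)
        return [sum(v[i] * q[i][j] for i in range(n)) for j in range(n)]
    k1 = deriv(p)
    k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = deriv([x + h * k for x, k in zip(p, k3)])
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def run_phases(p0, phases, steps=1000):
    """phases: list of (Q, duration, C), C being the linking matrix to the next phase.
    Returns the state probabilities at the end of each phase (before linking)."""
    p, ends = list(p0), []
    for q, tau, c in phases:
        h = tau / steps
        for _ in range(steps):          # step 3: behaviour inside phase k
            p = rk4_step(p, q, h)
        ends.append(p)
        # step 4: initial condition of phase k + 1, formula (31.93)
        p = [sum(p[i] * c[i][j] for i in range(len(p))) for j in range(len(c[0]))]
    return ends


LAM, MU = 1e-3, 1e-1                          # assumed rates (per hour)
Q_REPAIRED = [[-LAM, LAM], [MU, -MU]]         # states: 0 = up, 1 = down
Q_NOT_REPAIRED = [[-LAM, LAM], [0.0, 0.0]]    # repair transition removed
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]           # state i gives the same state i

ends = run_phases([1.0, 0.0],
                  [(Q_REPAIRED, 500.0, IDENTITY),
                   (Q_NOT_REPAIRED, 200.0, IDENTITY),
                   (Q_REPAIRED, 500.0, IDENTITY)])
for k, p in enumerate(ends, start=1):
    print(f"end of phase {k}: unavailability = {p[1]:.5f}")
```

With these assumed values, the unavailability increases during the non-repaired phase and returns to its previous level during the following repaired phase, which is the qualitative behaviour described for Fig. 31.36.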

Fig. 31.36 Availability and production availability of a production system with recurring phases


Fig. 31.37 Production availability and average values

Fig. 31.38 Unreliability of the multiphase Markov process illustrated in Fig. 31.35

The resulting unreliability F(t) is shown in Fig. 31.38. The slope is less steep in phases 1 and 3 because states E2 and E3 are repaired but, like any unreliability curve, F(t) is an increasing (or rather a non-decreasing) curve evolving from 0 to 1.

31.5.4.3 System Changing of Level of Redundancy

The system modelled in Fig. 31.39 is non-redundant in phases 1 and 3 and redundant in phase 2. The states are the same but the splitting between up and down states is different. In this figure, dotted lines are used to split the states between the two classes and the down states are highlighted in grey: E2 and E3, which are down states in phases 1 and 3, are up states in phase 2. There is no difference with the previous example with regard to the linking matrices. In addition, it has been considered that, in the non-redundant phase, state E2 has an efficiency of only 95% in order to make the difference between availability and production availability in this phase.


Fig. 31.39 System with change in redundancy according to the phase

Fig. 31.40 Availability and production availability of a production system with redundancy changes

The availability A(t) and production availability Pdy(t) of the production system are illustrated in Fig. 31.40. From a probabilistic point of view, there is a single Markov process over all the phases and then the state probabilities converge toward their asymptotic values. The jumps observed in the figure are only due to the change in the splitting between up and down states.

31.5.4.4 Periodically Tested Systems

The behaviour of a simple periodically tested system is presented in Fig. 31.41. In the first phase, the system has only two states: available (A) and faulty (F). When it fails, the fault remains hidden until a test is performed. When a test is performed, the fault is revealed and the repair starts at once. Therefore, a third state (repair, R) is considered in the second phase. From state R the system moves to state A when the repair is finished but, if a failure occurs from A, it remains hidden until being revealed by the next test. Then, as indicated by the linking matrix, when the test is performed, state A gives A, state F gives R and state R gives R. Therefore, the probability to be in state R at the beginning of the next phase is the sum of the probability to be in state F and of the probability to be in state R at the end of the previous phase. The next phases and linking matrices are similar to the second phase. Therefore, a periodically tested system can be modelled by a multiphase Markov process and, in the simplest case, this is simply the same phase which is linked to itself in a recurrent way.

Fig. 31.41 Simple periodically tested system

The unavailability U(t) of such a periodically tested system is illustrated in Fig. 31.42. In this figure, the average unavailability $\overline{U}(t)$ has been drawn in dotted line. It does not converge toward an asymptotic value but toward an asymptotic limit shape which remains the same from one phase to the next. The average unavailability $\overline{U}(0, T)$ over the period of interest [0, T] is indicated by a small circle. This average unavailability $\overline{U}(0, T)$ is called PFDavg (average of the probability of failure on demand) in the functional safety standards (IEC 61508 2010).

Fig. 31.42 Unavailability of a simple periodically tested system

The system modelled in Fig. 31.43 is the same as above but it has been considered that the test can provoke a failure with a probability γ. Therefore, the only change is with the linking matrix, which is presented on the left-hand side of the figure: now the probability to start a repair at the beginning of one phase is the sum of the probability to be in F, the probability to be in R and γ times the probability to be in A at the end of the previous phase. This changes the shape of the unavailability curve U(t) as a jump equal to γ is observed each time a test is performed.

Fig. 31.43 Unavailability of a simple periodically tested system with probability of failure due to the test itself

The system modelled in Fig. 31.44 is the same as in Fig. 31.43 but a probability of human failure 1 − ζ has been considered. In this case, when a test is performed, a pre-existing fault or a fault created by the test itself is detected by the maintenance team with a probability ζ ≤ 1. A fault which is not detected remains undetected until the following test is performed at the end of the following phase. As in the previous examples, the unavailability U(t) has no asymptotic value but, after a sufficient number of phases, it reaches a similar shape from phase to phase.

Fig. 31.44 Unavailability of a simple periodically tested system with human failure

The above example should not be mixed up with the test coverage, which is illustrated in Fig. 31.45. In this case, the faults are split between those which can be detected (i.e. covered) by the test and those which cannot be detected (i.e. not covered).
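The linking matrices discussed for Figs. 31.41, 31.43 and 31.44 can be written down explicitly. The sketch below is an illustration under stated assumptions: the state order (A, F, R), the function names and the numerical values are chosen for the example, and the row-vector convention is used (row = state at the end of a phase, column = state at the beginning of the next one). It follows the description given in the text: a fault present at the time of the test, whether pre-existing or created by the test, is detected with probability ζ.

```python
def linking_matrix(gamma=0.0, zeta=1.0):
    """Linking matrix of a periodically tested system, state order (A, F, R).

    gamma: probability that the test itself provokes a failure (assumed name)
    zeta:  probability that a fault present at test time is detected (assumed name)
    """
    found, missed = zeta, 1.0 - zeta
    return [
        # from A: test harmless, or test-induced fault (missed or found)
        [1.0 - gamma, gamma * missed, gamma * found],
        # from F: hidden fault missed (stays in F) or found (repair starts)
        [0.0, missed, found],
        # from R: the repair simply continues
        [0.0, 0.0, 1.0],
    ]


def apply_link(p_end, c):
    """Initial probabilities of the next phase from those at the end of the phase."""
    return [sum(p_end[i] * c[i][j] for i in range(len(p_end)))
            for j in range(len(c[0]))]


p_end = [0.90, 0.08, 0.02]                          # illustrative values only
print(apply_link(p_end, linking_matrix()))          # perfect test (Fig. 31.41 case)
print(apply_link(p_end, linking_matrix(0.01)))      # test-induced failures
print(apply_link(p_end, linking_matrix(0.01, 0.9))) # plus human failure
```

With γ = 0 and ζ = 1 the matrix reduces to the simple case where A gives A, F gives R and R gives R.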


Fig. 31.45 Unavailability of a simple periodically tested system with imperfect coverage

This is illustrated by the Markov graph on the left-hand side of Fig. 31.45: when a failure occurs, it leads to a covered fault (Fc) with a probability η and to a non-covered fault (Fnc) with a probability 1 − η. This Markov graph is valid for the successive phases of the model. The linking matrix is similar to that in Fig. 31.41 but the new state (Fnc) has been added. This matrix is also valid to link the successive phases of the model. As the non-covered faults are not detected and, therefore, not repaired, the unavailability curves tend to increase continuously. No asymptotic limit shape is reached: when time goes to infinity, unavailability and average unavailability go to 1.

Nevertheless, a fault which is not covered by one test can be covered by another. For example, this is the case of a valve where partial stroking tests are performed to verify that it is not stuck in position and full stroking tests are performed to verify that it is able to close and is tight when closed. The partial stroking has no impact on production whereas the full stroking stops the production. Therefore, partial stroking tests can be performed more frequently than full stroking tests. This is illustrated in Fig. 31.46. The Markov model is similar to that in Fig. 31.45. The difference is that, for full stroking tests, the linking matrix has to be modified as shown in Fig. 31.46. This figure illustrates the case where the partial stroking tests are performed every three months and the full stroking tests every year. The behaviour is similar to that in Fig. 31.45 during the first year, where only partial stroking tests are performed. When this year has elapsed, a full stroking test is performed and the faults not covered by partial stroking are revealed: repair can start at once and the unavailability decreases. Therefore, as all faults are repaired, the average unavailability now converges toward an asymptotic limit shape.

Fig. 31.46 Unavailability of a simple periodically tested system with imperfect coverage

31.6 Reducing the Size of the Markov Models

A system with n binary components has potentially $2^n$ states. This implies 4 states for 2 components, 32 states for 5 components, 1024 states for 10 components, 1,048,576 states for 20 components, etc. For modelling an industrial system with 300 components, this leads to about $2.04 \times 10^{90}$ states. This is the so-called combinatorial explosion of the number of states. Obviously, the Markovian approach is directly tractable only for small systems with a limited number of components. Fortunately, some techniques are available to decrease the number of states to be handled:
– aggregating similar states;
– using the Markovian approach in combination with Boolean techniques.
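The orders of magnitude quoted above are easy to reproduce. The short sketch below only evaluates $2^n$ for the component counts mentioned in the text; the function name is an assumption made for the illustration.

```python
def state_count(n_components):
    """Potential number of states of a system made of n binary components."""
    return 2 ** n_components


for n in (2, 5, 10, 20, 300):
    print(f"{n:>3} components: {state_count(n):.3e} states")
```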

31.6.1 Aggregation of States

31.6.1.1 Using Symmetries

Contrary to the availability Markov graph (see Fig. 31.6) which is not symmetrical because component B is repaired first, the reliability Markov graph in Fig. 31.9 becomes symmetrical as soon as components A and B have the same failure and repair rates. This symmetry is illustrated in Fig. 31.47. In order to see how the states can be gathered, it is necessary to analyse the set of differential equations related to the underlying Markov process:

Fig. 31.47 Example of transformation of a symmetrical reliability Markov graph

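$$\begin{aligned}
(1)\quad & \frac{dPr_1(t)}{dt} = -2\lambda \cdot Pr_1(t) + \mu \cdot Pr_2(t) + \mu \cdot Pr_3(t) \\
(2)\quad & \frac{dPr_2(t)}{dt} = \lambda \cdot Pr_1(t) - (\lambda + \mu) \cdot Pr_2(t) \\
(3)\quad & \frac{dPr_3(t)}{dt} = \lambda \cdot Pr_1(t) - (\lambda + \mu) \cdot Pr_3(t) \\
(4)\quad & \frac{dPr_4(t)}{dt} = \lambda \cdot Pr_2(t) + \lambda \cdot Pr_3(t)
\end{aligned}$$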

Gathering probabilities $Pr_2(t)$ and $Pr_3(t)$ in the above formulae leads to a new set of differential equations:

$$\begin{aligned}
(1)\quad & \frac{dPr_1(t)}{dt} = -2\lambda \cdot Pr_1(t) + \mu \cdot [Pr_2(t) + Pr_3(t)] \\
(2)\quad & \frac{d[Pr_2(t) + Pr_3(t)]}{dt} = 2\lambda \cdot Pr_1(t) - (\lambda + \mu) \cdot [Pr_2(t) + Pr_3(t)] \\
(3)\quad & \frac{dPr_4(t)}{dt} = \lambda \cdot [Pr_2(t) + Pr_3(t)]
\end{aligned}$$

This set of differential equations describes another Markov process with only 3 states instead of 4. This is the result of the aggregation of states E2 and E3 into a single state E2 where states $\overline{A} \cdot B$ and $A \cdot \overline{B}$ are merged. This new graph is drawn in Fig. 31.47 and, in this graph, A and B are now undifferentiated: 2C means 2 components in up state, 1C means 1 component in up state and 0C means zero component in up state.

The same approach can be applied to a symmetrical availability Markov graph, as illustrated in Fig. 31.48. This example is similar to the availability Markov graph except that it has been considered that the number of repair teams is not limited. Therefore, the systemic dependency between A and B has disappeared in this model. The transition rate from E3 to E2 is equal to twice the repair rate of one component because, when in state E3, the two components are repaired in parallel and this multiplies by 2 the probability that the repair of one of them finishes between t and t + dt.

With the two above examples, the number of states is reduced only by one. This is not very much! The benefit increases when the number of similar components increases. When there is only a single repair team, one way to obtain a symmetrical model is to consider that the first component failed is also the first component repaired (this is the FIFO, first in first out, concept).
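The equivalence between the 4-state reliability graph and its 3-state aggregated version can be checked numerically. The sketch below is a hedged illustration: the rates `LAM` and `MU`, the time horizon and the simple explicit Euler integrator are assumptions chosen for brevity, not the method used in the book. The rows of the matrices follow the two sets of differential equations given above.

```python
def integrate(p0, q, t_end, steps=20000):
    """Explicit Euler integration of dp/dt = p.Q (row-vector convention)."""
    p, h, n = list(p0), t_end / steps, len(p0)
    for _ in range(steps):
        dp = [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]
        p = [x + h * d for x, d in zip(p, dp)]
    return p


LAM, MU = 1e-3, 1e-1   # assumed common failure and repair rates (per hour)

# Full reliability graph: E1 (A and B up), E2 and E3 (one failed), E4 (absorbing)
Q_FULL = [[-2 * LAM, LAM, LAM, 0.0],
          [MU, -(LAM + MU), 0.0, LAM],
          [MU, 0.0, -(LAM + MU), LAM],
          [0.0, 0.0, 0.0, 0.0]]

# Aggregated graph: 2C, 1C, 0C (absorbing)
Q_AGG = [[-2 * LAM, 2 * LAM, 0.0],
         [MU, -(LAM + MU), LAM],
         [0.0, 0.0, 0.0]]

p_full = integrate([1.0, 0.0, 0.0, 0.0], Q_FULL, 1000.0)
p_agg = integrate([1.0, 0.0, 0.0], Q_AGG, 1000.0)

print("Pr2 + Pr3 (full graph)      :", p_full[1] + p_full[2])
print("Pr(1C)    (aggregated graph):", p_agg[1])
print("unreliability, full vs aggregated:", p_full[3], p_agg[2])
```

Both models give the same unreliability, which is the point of the aggregation: the merged state carries the sum of the probabilities of the two symmetrical states.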

Fig. 31.48 Example of transformation of a symmetrical availability Markov graph


Fig. 31.49 Example of an availability Markov graph with a FIFO policy for repairs

Figure 31.49 illustrates a case where the FIFO policy is applied. The down state has been split according to whether A or B has failed first and therefore the original graph now has 5 states. Nevertheless, the aggregated equivalent Markov graph still has 3 states: the only difference is that the transition rate from E3 to E2 is now equal to μ instead of 2·μ.

This principle can be extended to more than two similar components, as illustrated in Fig. 31.50. All the examples in Fig. 31.50 are related to 3 similar components. The whole graph before aggregation then has at least $2^3 = 8$ states (more than that when FIFO repairs are modelled). Therefore, after aggregation, the number of states has been divided by 2. These four Markov graphs model the following cases:
– 3 out of 3 logic with 3 repair teams: the three components are active at the same time and can fail independently from each other (e.g. in state E1, 3 components can fail, then the transition rate is 3·λ). They also can be repaired independently from each other (e.g. in state E3, 2 components can be repaired, then the transition rate from E3 to E2 is 2·μ). The whole system is down as soon as at least 1 component fails and E2, E3 and E4 are down states.
– 2 out of 3 logic with a single repair team: the three components are active at the same time and can fail independently from each other (i.e. this is the same as in the previous case). There is only one single repair team and therefore the transition rates from E4 to E3 and from E3 to E2 are equal to μ. The whole system is down as soon as at least 2 components fail and E3 and E4 are down states.
– 2 out of 3 logic with two active components, one in standby position and 2 repair teams: under the assumption that a component in standby cannot fail, the transition rate from E1 to E2 is 2·λ. As there are two repair teams, only two components can be repaired at the same time and the transition rate from E4 to E3 is 2·μ. The down states are the same as above (E3 and E4).
– 1 out of 3 logic with one active component, two in standby position and a single repair team: only one component can fail at a given time and only one component can be repaired at the same time. Only state E4 is a down state.


Fig. 31.50 Example of availability Markov graph with aggregated states

The principles described above can be easily extended to more than three similar components. Exercise 31.8 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.
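For n similar components which are all active, the transition rates of the aggregated graph follow a simple pattern that can be generated automatically. The sketch below is an illustration under explicit assumptions: all components are active and keep failing independently whatever the system state (as in the first two cases of Fig. 31.50), standby components are not handled, and the function and parameter names are hypothetical.

```python
def aggregated_rates(n, lam, mu, repair_teams):
    """Transition rates of the aggregated graph for n similar active components.

    State j (0..n) is the number of failed components.
    failure[j] is the rate from j to j + 1 failed components;
    repair[j] is the rate from j + 1 back to j failed components.
    """
    failure = [(n - j) * lam for j in range(n)]
    repair = [min(j + 1, repair_teams) * mu for j in range(n)]
    return failure, repair


# 3 components with 3 repair teams, then with a single repair team
print(aggregated_rates(3, lam=1, mu=10, repair_teams=3))
print(aggregated_rates(3, lam=1, mu=10, repair_teams=1))
```

With three repair teams, the repair rate from the state with two failed components is 2·μ, and with a single team every repair rate is μ, in line with the cases listed above.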

31.6.1.2 Fusion of Absorbing States

In the section just above, all the down states have been modelled because the Markov graphs were related to availability calculations.


When dealing with reliability calculations, only the first system failure is of interest and what happens after this first failure does not matter. Therefore, all the down states can be gathered into a single absorbing state. This has been done in the Markov graphs presented in Fig. 31.51, which are the reliability Markov graphs related to the availability Markov graphs presented in Fig. 31.50. The larger the number of down states, the bigger the reduction of the number of states:
– the 3 out of 3 logic reliability Markov graph is reduced to two states;
– the 2 out of 3 logic reliability Markov graphs are reduced to three states;
– the 1 out of 3 logic reliability Markov graph still has 4 states.
Again, the principles described above can be easily extended to more than three similar components.

Fig. 31.51 Example of reliability Markov graph with aggregated states


Fig. 31.52 Macro component made of a series of individual components

31.6.1.3 Creation of Macro Components

Section 31.5.2.4 explains how to calculate the asymptotic equivalent failure and repair rates from a given Markov graph. This can be used to define macro components in order to decrease the size of the Markov graphs.

Components in series (Fig. 31.52)

When the availability is high, the asymptotic unavailability of an individual component is given by $u_i = \lambda_i/(\lambda_i + \mu_i)$. Then the asymptotic unavailability of the macro component is given by $U(\infty) = 1 - \prod_i (1 - u_i)$. In the case of $u_i \ll 1$, then $\prod_i (1 - u_i) \approx 1 - \sum_i u_i$ and $U(\infty) \approx \sum_i u_i$. Similarly, if $\Lambda_{eq}$ is the equivalent failure rate and $M_{eq}$ the equivalent repair rate of the macro component, its asymptotic unavailability is given by $U(\infty) = \Lambda_{eq}/(\Lambda_{eq} + M_{eq})$. The equivalent failure rate of such a system is simply the sum of the failure rates of the individual components:

$$\Lambda_{eq} = \sum_{i=1}^{n} \lambda_i \qquad (31.94)$$

Therefore, the equivalent repair rate can be estimated as $M_{eq} = \Lambda_{eq} \cdot [1 - U(\infty)]/U(\infty)$.

With the assumption of low unavailability, $U(\infty) \ll 1$, the formula can be simplified to

$$M_{eq} \approx \frac{\Lambda_{eq}}{U(\infty)} = \frac{\Lambda_{eq}}{\sum_i u_i} = \frac{\Lambda_{eq}}{\sum_i \frac{\lambda_i}{\lambda_i + \mu_i}}$$

The above assumption implies $\lambda_i \ll \mu_i$ and finally the equivalent repair rate is obtained as:

$$M_{eq} \approx \frac{\Lambda_{eq}}{\sum_{i=1}^{n} \lambda_i/\mu_i} \qquad (31.95)$$
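Formulae (31.94) and (31.95) can be applied directly, as in the sketch below. The function names and the numerical rates are assumptions made for the illustration, and the "exact" value used for comparison is the asymptotic unavailability of independent components in series, $1 - \prod_i (1 - u_i)$, as stated above.

```python
def series_macro(lams, mus):
    """Equivalent rates of a macro component made of components in series."""
    lam_eq = sum(lams)                                       # formula (31.94)
    mu_eq = lam_eq / sum(l / m for l, m in zip(lams, mus))   # formula (31.95)
    return lam_eq, mu_eq


def series_unavailability(lams, mus):
    """Asymptotic unavailability 1 - prod(1 - u_i) of independent components."""
    avail = 1.0
    for l, m in zip(lams, mus):
        avail *= m / (l + m)        # 1 - u_i
    return 1.0 - avail


lams, mus = [1e-4, 2e-4], [1e-2, 5e-2]       # assumed rates (per hour)
lam_eq, mu_eq = series_macro(lams, mus)
u_macro = lam_eq / (lam_eq + mu_eq)
u_ref = series_unavailability(lams, mus)
print(f"Lambda_eq = {lam_eq:.3e}, M_eq = {mu_eq:.3e}")
print(f"U(inf) macro component = {u_macro:.5e}, reference = {u_ref:.5e}")
```

For these assumed values the two asymptotic unavailabilities differ by less than 1%, which illustrates how close the macro component is when $\lambda_i \ll \mu_i$.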

Formula (31.95) is explained in more detail in Pagès and Gondran (1986). It is applied in the example illustrated in Fig. 31.53 for a macro component made of two components A and B organized in series. In the middle of Fig. 31.53 is presented the Markov graph related to the system made of the two components A and B. On the right-hand side of the figure is presented the Markov graph of the equivalent macro component made of these two components in series.

Fig. 31.53 Macro component (Mc) made of a series of 2 components (A and B)

Fig. 31.54 Comparison between exact result and approximations for the macro component presented in Fig. 31.53

Figure 31.54 makes the comparison between the unavailability calculated with the original Markov graph in the middle of Fig. 31.53 (in bold line) and the unavailability calculated by using the Markov graph related to the macro component (in dotted line). The approximation is conservative (which is an important desirable property for an approximation) and the results are very close even in the transient period before the asymptotic values are reached. Therefore, the approximation proposed above works well for components organized in series.

Components in parallel (Fig. 31.55)

As for the series system, when the availability is high, the asymptotic unavailability of an individual component is given by $u_i = \lambda_i/(\lambda_i + \mu_i)$. In case of parallel configuration (hot redundancy), the asymptotic unavailability of the macro component is given by $U(\infty) = \prod_i u_i$. As for the series system, if $\Lambda_{eq}$ is the equivalent failure rate and $M_{eq}$ the equivalent repair rate of the macro component, its asymptotic unavailability is given by $U(\infty) = \Lambda_{eq}/(\Lambda_{eq} + M_{eq})$. The equivalent repair rate of such a system is simply the sum of the repair rates of the individual components:


Fig. 31.55 Macro component made of n redundant sub-components

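$$M_{eq} = \sum_{i=1}^{n} \mu_i \qquad (31.96)$$

Therefore, the equivalent failure rate can be estimated as $\Lambda_{eq} = \frac{U(\infty)}{1 - U(\infty)} \cdot M_{eq}$.

With the assumption of low unavailability, $U(\infty) \ll 1$, then $\Lambda_{eq} \approx U(\infty) \cdot M_{eq}$ and finally:

$$\Lambda_{eq} = M_{eq} \cdot \prod_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \mu_i} \qquad (31.97)$$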
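The parallel case can be sketched in the same way with formulae (31.96) and (31.97). The function name and the numerical rates below are assumptions chosen for the illustration; the reference value is the product of the individual asymptotic unavailabilities given in the text.

```python
def parallel_macro(lams, mus):
    """Equivalent rates of a macro component made of redundant components."""
    mu_eq = sum(mus)                       # formula (31.96)
    u_inf = 1.0
    for l, m in zip(lams, mus):
        u_inf *= l / (l + m)               # product of the u_i
    lam_eq = mu_eq * u_inf                 # formula (31.97)
    return lam_eq, mu_eq, u_inf


lams, mus = [1e-3, 1e-3], [1e-1, 1e-1]     # assumed rates (per hour)
lam_eq, mu_eq, u_inf = parallel_macro(lams, mus)
u_macro = lam_eq / (lam_eq + mu_eq)
print(f"Lambda_eq = {lam_eq:.4e}, M_eq = {mu_eq:.4e}")
print(f"U(inf) macro component = {u_macro:.5e}, product of u_i = {u_inf:.5e}")
```

The sketch only compares asymptotic values; it says nothing about the transient period, where the text notes that the approximation is less accurate for redundant components.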

Formula (31.97) is explained in more detail in Pagès and Gondran (1986). It is applied in the example illustrated in Fig. 31.56 for a macro component made of two components A and B organized in parallel (i.e. they are redundant and operate in hot redundancy). In the middle of Fig. 31.56 is presented the Markov graph related to the system made of the two redundant components A and B. On the right-hand side of the figure is presented the Markov graph of the equivalent macro component made of these two components. Figure 31.57 makes the comparison between the unavailability calculated with the original Markov graph in the middle of Fig. 31.56 (in bold line) and the unavailability calculated by using the Markov graph related to the macro component (in dotted line).

Fig. 31.56 Macro component made of 2 redundant components


Fig. 31.57 Comparison between exact result and approximation for the macro component presented in Fig. 31.56

The approximation is conservative (which is an important desirable property for an approximation) but, contrary to the series system, the results are not close in the transient period before the asymptotic values are reached (after 3 or 4 times the lower individual component overall repair rate). Therefore, the approximation proposed above works well only in the long range, when the asymptotic values have been reached.

Other configurations of components

Pagès and Gondran (1986) provide general information about estimating the equivalent failure and repair rates of macro components to be used when developing Markov graphs related to large systems. The principle is based on the identification and use of the critical states with regard to failure for the equivalent failure rate and of the critical states with regard to repair for the equivalent repair rate. This makes it possible to obtain the formulae described above for series and parallel configurations but also for any other configuration (e.g. components organized in r out of n logic), provided that some conditions of independence, maintainability (quick repair of faulty components) and low failure probability are met.

Gathering two components into a single one divides the size of the graph by two and, therefore, replacing several components by a single macro component with equivalent failure and repair rates is very effective to decrease the size of the resulting Markov graph. However, the approximation is generally more accurate in the long range, when the steady state is reached, than in the transient period.

31.6.2 FT and RBD-Driven Markov Processes

The technique of aggregating the states is effective when the components are reasonably similar, but it contains the combinatorial explosion only to some extent and does not really solve the problem for a large number of components. In this case, it may be necessary to switch to Monte Carlo simulation (Chap. 32) and Petri net modelling (Chap. 33). But before doing that, the use of the Markov approach in combination with the Boolean models has to be considered.


Let us consider the Markov graph on the left-hand side of Fig. 31.58. This Markov graph is not symmetrical because components A and B do not have the same parameters; therefore, the states cannot be aggregated as was done in the previous section. However, component A behaves independently of component B and vice versa. This implies that this Markov graph is the result of the combination of the two simple individual Markov graphs drawn on the right-hand side of Fig. 31.58. It is the combination of such graphs into a global Markov graph which leads to the combinatorial explosion of the number of states described at the beginning of Sect. 31.6. This can be avoided if the combination is achieved in a way where the global Markov graph is not actually built, and this is possible by using a reliability block diagram or a fault tree (see Chap. 27) when the components (i.e. the individual Markov graphs) are independent. Figure 31.59 gives an example of such combinations between the Markovian approach and the Boolean approaches:

Fig. 31.58 Example of Markov graph related to independent components

Fig. 31.59 Examples of RBD and FT-driven Markov processes


– on the left-hand side, a reliability block diagram (RBD) models the logic of the system and the individual Markov processes provide the availabilities, Aa(t) and Ab(t), of blocks A and B of the RBD;
– on the right-hand side, a fault tree (FT) models the logic of the system and the individual Markov processes provide the unavailabilities, Ua(t) and Ub(t), of the primary events A and B of the FT.

This leads to the following definitions:

RBD-driven Markov process: combination of a reliability block diagram modelling the success logic of the modelled system and individual independent Markov processes modelling the availabilities of the RBD blocks.

FT-driven Markov process: combination of a fault tree modelling the failure logic of the modelled system and individual independent Markov processes modelling the unavailabilities of the FT primary events.

These approaches are described in more detail in Chap. 27. They drastically reduce the size of the Markov models and they are very effective to model binary systems made of a large number of reasonably independent components like, e.g., safety systems (see Chap. 36). When the components are not independent or when efficiency calculations have to be made, these techniques are no longer usable but, fortunately, in this case, the Monte Carlo simulation (Chap. 32) and Petri net (Chap. 33) techniques are still available to overcome the difficulty, as the size of the models is linear with regard to the number of components instead of exponential.
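The idea of an FT-driven Markov process can be sketched as follows for independent components. This is an illustration under stated assumptions: each primary event is modelled by a two-state repairable component starting in the up state, for which the unavailability has the well-known closed form used below; the rates, the gate helper names and the two-component examples are chosen for the illustration only.

```python
import math


def unavailability(lam, mu, t):
    """U(t) of a two-state repairable component which is up at t = 0."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))


def ft_and(*u):
    """AND gate: all the independent inputs are failed."""
    prod = 1.0
    for x in u:
        prod *= x
    return prod


def ft_or(*u):
    """OR gate: at least one of the independent inputs is failed."""
    prod = 1.0
    for x in u:
        prod *= 1.0 - x
    return 1.0 - prod


t = 100.0                                  # assumed time (hours)
u_a = unavailability(1e-3, 1e-1, t)        # assumed rates for component A
u_b = unavailability(2e-3, 5e-2, t)        # assumed rates for component B
print("redundant system (AND gate):", ft_and(u_a, u_b))
print("series system (OR gate)    :", ft_or(u_a, u_b))
```

The global Markov graph with its four states is never built: only the two individual models are evaluated and the Boolean logic combines their results.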

31.7 Specific Modelling

31.7.1 CCF Modelling

31.7.1.1 Beta-Factor Model

A common cause failure (CCF) leading to the failure of several components at the same time is easy to model within a Markov graph under the assumption that this failure occurs with a constant failure rate λccf. This is illustrated in Fig. 31.60 with the transition from E1 to E4 where both components A and B fail due to a common cause failure. In order to simplify the model, the assumption is that, with regard to the repair operations, it does not matter if the components fail due to independent or common causes. When this assumption is not realistic, the distinction has to be made between the faults due to independent causes and the faults due to common causes and more states have to be considered in the Markov graph.


Fig. 31.60 Simple common cause failure modelling (β-factor model)

The common cause failure can also occur in states E2 and E3, as represented in the figure, where it contributes to the transitions from E2 to E4 and from E3 to E4. The simple CCF model proposed in Fig. 31.60 is similar to the well-known β-factor model (see Chap. 5) widely used for similar components: if the failure rate of one item A is λ, then λa = (1 − β) · λ is the independent failure rate of this item and λccf = β · λ is the common cause failure rate for several items similar to A. Exercise 31.2 related to this subsection is described in Sect. 31.9 and its solution can be found in Chap. 34.
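The β-factor split is a one-line calculation, sketched below. The function name and the numerical values of λ and β are assumptions chosen for the illustration.

```python
def beta_factor_rates(lam, beta):
    """Split a failure rate into its independent and common cause parts."""
    lam_ind = (1.0 - beta) * lam     # independent failure rate
    lam_ccf = beta * lam             # common cause failure rate
    return lam_ind, lam_ccf


lam_ind, lam_ccf = beta_factor_rates(lam=1e-4, beta=0.1)   # assumed values
print(f"independent: {lam_ind:.2e} per hour, common cause: {lam_ccf:.2e} per hour")
```

The two parts always add up to the original failure rate λ of the item, so introducing β redistributes the failures between the independent and common cause transitions without changing the total failure rate of one item.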

31.7.1.2 Shock Model

Figure 31.61 illustrates another popular common cause failure model: the shock model (see Chap. 5). This approach splits the CCFs between lethal and non-lethal failures. In case of lethal failure, the impacted components fail immediately and this is similar to the β-factor model analysed above. In the case of non-lethal failure, the impacted components receive a shock (hence the name of the approach) which makes them fail with a constant conditional failure probability γ. Then, when the shock occurs, an impacted component remains in up state with the constant conditional probability (1 − γ).

Fig. 31.61 Common cause failure modelling: non-lethal shock model



Non-lethal shocks with an arrival rate λsh are represented in Fig. 31.61 by using a zero-duration state: when the non-lethal common cause failure occurs from state E1, both components A and B receive a shock and the result is threefold:
– both A and B fail and the system jumps from E1 to E4 with the probability γ²;
– A fails and B does not fail and the system jumps from E1 to E3 with the probability γ · (1 − γ);
– A does not fail and B fails and the system jumps from E1 to E2 with the probability γ · (1 − γ).
With the remaining probability (1 − γ)², neither component fails and the system stays in E1.

When the non-lethal CCF occurs from state E3, this can make component B fail with a failure rate λsh · γ. The situation is similar from state E2: component A fails with a failure rate λsh · γ.
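The outcomes of a non-lethal shock received in state E1 can be enumerated as in the sketch below. The function names and the numerical values of λsh and γ are assumptions made for the illustration; multiplying the shock arrival rate by each probability gives the equivalent transition rate out of E1 once the zero-duration state is eliminated.

```python
def shock_outcomes(gamma):
    """Probabilities of the possible results of a shock received in state E1."""
    return {
        "E4 (A and B fail)": gamma ** 2,
        "E3 (only A fails)": gamma * (1.0 - gamma),
        "E2 (only B fails)": gamma * (1.0 - gamma),
        "E1 (no failure)": (1.0 - gamma) ** 2,
    }


def shock_transition_rates(lam_sh, gamma):
    """Equivalent transition rates from E1 induced by the non-lethal shocks."""
    return {state: lam_sh * prob
            for state, prob in shock_outcomes(gamma).items()
            if not state.startswith("E1")}


for state, rate in shock_transition_rates(lam_sh=1e-4, gamma=0.3).items():
    print(f"E1 -> {state}: {rate:.2e} per hour")
```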

31.7.1.3 Semi-catastrophic Model

Figure 31.62 illustrates the case of smoother common cause failures which do not lead to the immediate failure of the impacted components but only to the immediate increase of their failure rates (see Chap. 5). In Fig. 31.62, the failure rate of A is multiplied by α when B fails and vice versa. This can be used to model, for example, two similar photocopiers: when A fails, the workload of B is multiplied by 2 and the resulting failure rate is also likely to be multiplied by 2. This is a way to model dependent failures and the model presented in Fig. 31.62 is sometimes named semi-catastrophic model (see Chap. 5). Such a model can be used, for example, to model the increase of the failure rate of electronic components within a computer when the fan fails and the temperature increases. It has to be noted that using a Markov graph is the simplest way to model this kind of dependent failures. This is impossible with the Boolean models and requires the introduction of special mechanisms when Monte Carlo simulation is implemented (see Chap. 32).

Fig. 31.62 Common cause failure modelling: semi-catastrophic model


Fig. 31.63 Change in logic instead of repair of a failure

31.7.2 Maintenance Modelling

Throughout the previous sections, the maintenance has been taken into consideration thanks to the repair rate and the linking matrices between the phases of multiphase models. Nevertheless, the Markovian approach makes it possible to model more complicated maintenance operations, as illustrated in the following examples.

31.7.2.1 Change of Operating Logic

Let us consider a simple instrumented system made of a block comprising three similar flow transmitters and a logic solver like this boxed by a dotted line in Fig. 31.63. This system is operated in the following way: – 2 out of 3 logic (noted 2oo3) as nominal operating state; – periodically tested to detect potential faults; – switched to 1 out of 3 logic (noted 1oo3) when a pressure transmitter is detected to be faulty (this allows to delay the repair without impeding the safety but this increases the risk of spurious actions); – repaired when two pressure transmitters are detected to be faulty; – repair achieved by replacing the whole block by a new one. This is typically a small complex system for which Markov modelling is very effective: the change of logic from 2oo3 to 1oo3 (i.e. to 1oo2 for the two remaining pressure transmitters) and the systemic dependency due to the repair of several blocks at the same time cannot be modelled by a static model like RBD or FT. The Markov graph modelling the above system is presented on the left-hand side of Fig. 31.64. It is split into three parts: – states E 1 to E 4 where it is operated in 2oo3 before the failures are detected by test. In this case, E 3 and E 4 are down states; – states E 5 to E 7 where it is operated in 1oo3 before a double failure has been detected. In this case, only E 7 is a down state;


31 Markovian Modelling

Fig. 31.64 2 out of 3 logic changed to 1 out of 2 logic instead of repair

– state E8 where at least two failures have been detected and the system is under repair.
The linking matrix is on the right-hand side of the figure. When the failure of one of the sensors is detected by a test, the system is switched from 2oo3 to 1oo3 logic and this is done by the jump from state E2 to state E5. When the faults of 2 or 3 sensors are detected by a test (states E3, E4, E6, E7, E8), the repair is undertaken and this is achieved by replacing the whole block by a new one (transition from E8 to E1), which implies that the logic is switched back from 1oo3 to 2oo3.

31.7.2.2 Maintenance Support Mobilisation

The example in Fig. 31.65 is related to a subsea production platform with 4 producing wells and a control unit exporting the production outside the platform. It is operated as follows:

Fig. 31.65 Subsea production platform and maintenance support


– the 4 wells are similar and the production is proportional to the number of wells in up state;
– failures are remotely detected and a dynamic positioning vessel (rig) is needed to repair faulty components;
– when a failure is detected, the rig has to be mobilised and this takes time (several days or weeks);
– the platform being located in a rough environment (e.g. North Sea), no repair can occur in winter as the dynamic positioning vessel cannot stay above the platform due to rough sea conditions;
– when a repair is performed, it is done by standard exchange of components and its duration is independent of the number of components to be repaired.
Therefore, this system is a typical multistate multiphase system. When analysing the system in detail, 4 phases can be identified:
– phase 1: when a failure occurs, it cannot be repaired;
– phase 2: when a failure occurs, it cannot be repaired but the mobilisation procedure can be launched in order to be ready to repair as soon as the dynamic positioning vessel is able to operate;
– phase 3: when a failure occurs, it can be repaired and there is enough time to mobilise the rig;
– phase 4: the repair of a failure which occurred in the previous phase can continue but a new failure cannot be repaired because there is not enough time left to mobilise the rig.
The Markov graphs related to these 4 phases are presented in Fig. 31.66. In these Markov graphs, λ is the failure rate of a well, λc the failure rate of the central unit, ω the mobilisation rate of the rig (i.e. 1/ω is the mean time of mobilisation) and μ the repair rate of the system as a whole. The lengths of the phases are shown in Fig. 31.67.

Fig. 31.66 Multiphase Markov process related to the production subsea platform


Fig. 31.67 The four phases related to the multiphase Markov process in Fig. 31.66

– Phase 1 starts at the beginning of winter and lasts until the rig can be mobilised to be ready to undertake the maintenance at the beginning of spring.
– Phase 2 takes place at the end of winter and lasts 1/ω (i.e. the mean time to mobilise the rig).
– Phase 3 starts at the beginning of spring and lasts until there is no longer enough time to mobilise the rig and achieve a repair before winter begins.
– Phase 4 takes place at the end of autumn and lasts 1/ω + 1/μ (i.e. the sum of the mean time to mobilise the rig and of the mean time to repair).
Operating in this way is a good compromise: at the beginning of phase 3, the rig is on hand to repair a failure which occurred in winter as soon as spring begins, and the probability of having the rig ready too early (states 3Wm, 2Wm, 1Wm and 0Wm in phase 2) is minimized. This also minimizes the risk of mobilising a rig in phase 4 and being unable to finish the repair before winter begins. The linking matrices are represented in Fig. 31.68. According to the previous explanations, states 4W to 2W give the same states, states 3Wm to 0Wm give state R when the repair becomes possible, and R gives state R from phase 3 to phase 4. The same state R gives state 0W from phase 4 to the following phase 1 because the platform is left in safe condition (i.e. shut down) when a repair in progress is not finished when winter begins. From a production point of view, the efficiency is 100% in state 4W, 75% in states 3W and 3Wm, 50% in states 2W and 2Wm, 25% in states 1W and 1Wm and 0% in states 0W, 0Wm and R. Therefore, the above development about efficiency and production availability can be applied here, and the multistate multiphase Markov graph described in Figs. 31.66, 31.67 and 31.68 makes it possible to model a subsea production platform in a rather realistic manner and to calculate its production availability over several sequences of recurring phases (i.e. several years).

Fig. 31.68 The four phases related to the multiphase Markov process in Fig. 31.66
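As a minimal sketch of the efficiency weighting mentioned above, the production availability at a given instant is the sum of the state probabilities weighted by the state efficiencies. The efficiencies below are those quoted in the text, reading the 0% group as 0W, 0Wm and R (3Wm being already listed at 75%). The state probabilities and the function name are hypothetical and only serve the illustration; in a real study they would come from solving the multiphase Markov process.

```python
# Production efficiency attached to each state (values quoted in the text).
EFFICIENCY = {
    "4W": 1.00,
    "3W": 0.75, "3Wm": 0.75,
    "2W": 0.50, "2Wm": 0.50,
    "1W": 0.25, "1Wm": 0.25,
    "0W": 0.00, "0Wm": 0.00, "R": 0.00,
}


def production_availability(state_probabilities):
    """Expected production level: sum over states of efficiency x probability."""
    total = sum(state_probabilities.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(EFFICIENCY[s] * p for s, p in state_probabilities.items())


# Hypothetical state probabilities at some instant of phase 3 (illustration only).
example = {"4W": 0.80, "3W": 0.12, "2W": 0.03, "R": 0.05}
pa = production_availability(example)
print(f"production availability = {pa:.3f}")
```

Averaging this quantity over the duration of the phases would give the average production availability over a year.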

31.7.3 Cold, Hot and Mixed Redundancy

31.7.3.1 Modelling of Redundancy

Examples of cold and hot redundancies have already been given above when the failure on demand (31.5.1) and the aggregation of states (31.6.1) were described. Figure 31.69 illustrates the example of a system made of two redundant components A and B, A being in cold standby position and B being normally active and having the priority for repair. When B fails, A is started and this is successful with a probability (1 − γ) and unsuccessful with a probability γ: a zero-duration state has been used to model this behaviour. When A and B are both in up state, A systematically comes back to the cold standby position. Then, the Markov process is not symmetrical with regard to A and B and the states cannot be aggregated.

Fig. 31.69 Cold standby redundancy and probability of failure on demand

When A and B are similar (same failure and repair rates and same probability of failure on demand) and when it does not matter which of them is operating or in cold standby position, the corresponding Markov graph becomes symmetrical. Then, states E2 and E3 can be aggregated, as has been done on the left-hand side of Fig. 31.70. This Markov graph can be compared to the aggregated graph in Fig. 31.49 where the two components are operated in hot redundancy. The difference is that the transition rate from E1 to E2 is now equal to λ·(1 − γ) instead of 2·λ and that the transition rate from E1 to E3 is now equal to λ·γ instead of 0. Beyond the cold and hot standby redundancies, a third option is possible: the mixed redundancy, which can be implemented when more than 2 components are redundant. This is illustrated on the right-hand side of Fig. 31.70 in the case of three redundant components: one component is actually operating, one component is in hot standby (i.e. ready to start) and one component is in cold standby position. When the operating component fails, it is replaced by the hot standby component and the cold standby component is started to reach the hot standby position. When the cold standby component is started, this is successful with a probability (1 − γ) and unsuccessful with a probability γ, and the modelling principle is the same as above.

Fig. 31.70 Aggregated Markov graph for 1oo2 and 1oo3 logics

31.7.3.2 Discussion About the Interest of Redundancy for Non-repaired Components

Hot, cold and mixed redundancy are expected to decrease the probability of failure of a system and are widely used for this purpose. Nevertheless, this is mainly effective when the failures are quickly detected and repaired. The usefulness of redundancy is analysed hereafter in the case of non-repaired components. Figure 31.71 describes the Markov graphs related to a single component, a system operated in 1oo2 logic with one component in hot redundancy and a system operated in 1oo2 logic with one component in cold redundancy. For the cold redundancy, a probability of failure to start on demand, γ , has been introduced for the component in standby position.

Fig. 31.71 Non-repaired 1oo1, 1oo2 (hot standby redundancy) and 1oo2 (cold standby redundancy) Markov graphs


Fig. 31.72 Comparison of the failure rates of 1oo1 and 1oo2 logics: non-repaired components

The question is whether it is worthwhile, from the probability of failure point of view, to replace a single non-repaired item by 2 similar non-repaired items operated in 1oo2 logic with cold or hot redundancy. The comparison can be done by considering the failure rates related to the three Markov processes. This comparison is done in Fig. 31.72. For the redundant systems (1oo2 logic), the failure rates are lower than the failure rate λ of the single component but, in the long run, they converge toward this failure rate λ. This convergence occurs once the probability of having lost one component is high and it is faster for the hot standby than for the cold standby, even in the case of a very high probability of failure on demand (γ = 0.5). Therefore, the benefit of this redundancy is not really obvious: after a duration equal to twice the MTTF of the individual component (MTTF = 1/λ), the failure rate of the hot standby system is about 0.93 λ and the failure rate of the cold standby system (with γ = 0) is about 0.67 λ. The benefit is greater for the cold standby system but this does not really change the order of magnitude of the failure rate.

Then the idea can be to implement a 2oo3 logic, which is a popular solution often used to improve reliability, to see if better results are obtained. This is done in Fig. 31.73 which describes the Markov graphs related to a single component, a system operated in 2oo3 logic with one component in hot redundancy and a system operated in 2oo3 logic with one component in cold redundancy.

Fig. 31.73 Non-repaired 1oo1, 2oo3 (hot redundancy) and 2oo3 (cold redundancy) Markov graphs

Again, the comparison can be done by considering the failure rates related to the three Markov processes. This comparison is done in Fig. 31.74.

Fig. 31.74 Comparison of the failure rates of 1oo1 and 2oo3 logics: non-repaired components

As shown in Fig. 31.74, in the long term both 2oo3 logics are worse than the 1oo2 logics and even than the simple 1oo1 logic. Both of them converge toward a failure rate equal to twice the failure rate of the single component (2λ): this is normal as, when one of the components has failed, the system becomes equivalent to a system made of two components in series with the same failure rate λ. In the short term, the 2oo3 with cold redundancy is better than the single component for a longer time than the 2oo3 with hot redundancy: about half the MTTF for the cold redundancy and only one third of the MTTF for the hot redundancy. Therefore, except for short periods of time, it is not a good idea to replace a non-repaired single component by a more sophisticated redundant system made of non-repaired components. This implies that, when dealing with non-repaired items, it is better to rely on the intrinsic reliability of the components rather than on redundancy. Exercises 31.3, 31.5, 31.9 and 31.10 related to this subsection are described in Sect. 31.9 and their solutions can be found in Chap. 34.
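The figures quoted in this subsection can be checked with a short sketch. The closed-form failure rates below are not taken from the book: they were derived here from the non-repaired Markov graphs as described in the text, under the assumptions of a constant failure rate λ for every running component, no failure while in cold standby, and γ = 0 for the 2oo3 cold case. Each function returns the system failure rate divided by λ as a function of x = λ·t, and the function names are illustrative only.

```python
import math


def rate_1oo2_hot(x):
    """1oo2 hot standby: R(t) = 2exp(-x) - exp(-2x), rate = -R'(t)/R(t), in units of lambda."""
    e = math.exp(-x)
    return 2.0 * (1.0 - e) / (2.0 - e)


def rate_1oo2_cold(x, gamma=0.0):
    """1oo2 cold standby with probability gamma of failure to start on demand."""
    return (gamma + (1.0 - gamma) * x) / (1.0 + (1.0 - gamma) * x)


def rate_2oo3_hot(x):
    """2oo3 with three running components: R(t) = 3exp(-2x) - 2exp(-3x)."""
    e = math.exp(-x)
    return 6.0 * (1.0 - e) / (3.0 - 2.0 * e)


def rate_2oo3_cold(x):
    """2oo3 with two running components and one cold spare (gamma = 0 assumed)."""
    return 4.0 * x / (1.0 + 2.0 * x)


# Values after twice the MTTF of a single component (x = lambda * t = 2).
print(f"1oo2 hot : {rate_1oo2_hot(2.0):.2f} lambda")
print(f"1oo2 cold: {rate_1oo2_cold(2.0):.2f} lambda")

# Dates at which the 2oo3 systems become worse than a single component (rate = lambda).
print(f"2oo3 hot  crosses lambda at x = {math.log(4.0 / 3.0):.2f} MTTF")
print(f"2oo3 cold crosses lambda at x = {0.5:.2f} MTTF")
```

Under these assumptions the 1oo2 rates tend to λ and the 2oo3 rates tend to 2λ when x grows, which is consistent with the long-term behaviour discussed above.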

31.8 Limitation and Conclusions

Although very flexible and powerful, the Markovian approach is unfortunately limited by the use of exponential laws only and, above all, by the combinatorial explosion of the number of states which occurs when the number of components of the modelled system increases: 4 states for 2 binary components, 1024 states for 10 components, more than 1 billion states for 30 components. Approximations are available but it is illusory to think of replacing a billion states by only a few states. Therefore, this technique is mainly useful for modelling small complex systems and also for explaining some basic dependability concepts (reliability, availability, failure frequency, failure rate, failure intensity, etc.).


However, when the components of a system are independent, it becomes possible to mix the Markovian approach (to model the component availabilities) with other techniques like fault trees (FTs) or reliability block diagrams (RBDs) (to model the logic of the whole system). The FT-driven Markov processes or the RBD-driven Markov processes make it possible to overcome the combinatorial explosion of the number of states and to handle industrial size systems (from dozens to hundreds of components). But this works only under the strong assumption of independent components. When this assumption does not hold, the stochastic Petri nets described in Chap. 33 and the Monte Carlo simulation described in Chap. 32 have to be used.

31.9 Associated Exercises

Thirteen exercises related to this chapter are proposed in Chap. 34. They are based on a pumping system with two redundant pumps and a valve in series:
• Exercise 31.1 related to Sects. 31.1.2 and 31.3.3: identify the various system states, split the states between up and down state classes and build the corresponding reliability Markov graph of the pumping system when there is no limitation with regard to the number of repair teams and when there is only a single repair team. Write the equations for assessing the unreliability of the system.
• Exercise 31.2 related to Sect. 31.7.1.1: same exercise as exercise 31.1 with a common cause failure on the pumps.
• Exercise 31.3 related to Sect. 31.7.3: same exercise as exercise 31.1 when pump P1 is running and pump P2 is kept in standby position.
• Exercise 31.4 related to Sect. 31.5.4: same exercise as exercise 31.3 when P1 and P2 are alternately running and in standby position. The change occurs every month.
• Exercise 31.5 related to Sects. 31.5.1 and 31.7.3: same exercise as exercise 31.3 with a probability γ that P2 fails to start on demand when P1 fails.
• Exercise 31.6 related to Sect. 31.3.2: same exercise as exercise 31.1 but for the availability Markov graph of the pumping system when there is no limitation with regard to the number of repair teams. Write the equations for assessing the availability and the unavailability of the system.
• Exercise 31.7 related to Sect. 31.3.2: same exercise as exercise 31.6 but when there is a single repair team repairing in priority valve V, pump P1 and pump P2.
• Exercise 31.8 related to Sect. 31.6.1: same exercise as exercise 31.6 but considering that P1 and P2 are similar, so that their states can be aggregated to simplify the graph.
• Exercise 31.9 related to Sect. 31.7.3: same exercise as exercise 31.8 with repair priority as for exercise 31.6.
• Exercise 31.10 related to Sect. 31.7.3: same exercise as exercise 31.9 when one of the pumps is operated in standby position and can fail to start on demand.
• Exercise 31.11 related to Sect. 30.3.3: extend exercise 31.1 to calculate the unreliability over 20 years and the failure rate of the pumping system. Compare the asymptotic failure rate to the approximation provided in this chapter.
• Exercise 31.12 related to Sect. 31.3.2: extend exercise 31.9 to calculate the unavailability and the failure frequency over 500 h and the average unavailability and failure frequency over 1 year of the pumping system. Compare the asymptotic unavailability to the approximation provided in this chapter.
• Exercise 31.13 related to Sect. 31.5.3: extend exercise 31.7 to calculate the production availability and the average production availability of the pumping system when the production capacity (efficiency) of P1 is 90% and that of P2 is 10%.


Chapter 32

Monte Carlo Simulation

32.1 Introduction to Monte Carlo Simulation

Some modelling difficulties encountered when dealing with complex industrial systems have already been identified in Chap. 30. They can be linked to the dynamic aspects of system operation and maintenance, systemic dependencies between the parts, spare part provisioning, etc. (see Chap. 30, Figs. 30.3 to 30.8). The list of such difficulties is virtually endless. If, to some extent, some of them can be covered by the Markovian approach, the limits are reached as soon as non-exponential laws have to be considered or when the number of components exceeds a few units, because of the explosion of the number of states. Therefore, in many cases, the analytical approach is completely overwhelmed and no longer able to handle industrial systems without unreasonable approximations. Fortunately, in such a case the situation is not hopeless, provided that one accepts the qualitative jump of abandoning the comforting analytical calculations for the mysterious and tormented world of random generators and statistical estimations known as Monte Carlo simulation. Its position within the corpus of dependability methods and tools is illustrated in Fig. 32.1. It is an efficient way to overcome the shortcomings of the analytical approach, provided that an accurate behavioural model is available to support the simulation. Any model can be used to this end: e.g. reliability block diagrams (Chap. 15), fault trees (Chap. 16) or even Markov graphs (Chap. 31). However, the Petri net models described in Chap. 33 have proven to be very effective for this purpose.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_32

Fig. 32.1 Monte Carlo simulation within the corpus of methods and tools (diagram labels: probabilistic models; static models: analytical approaches (Taylor expansion, simplified formulae, specific formulae) and Boolean approaches (RBD, FT, ET); dynamic models: Markovian approaches and Monte Carlo simulation; behavioural approaches: state-transition models (finite state automata) such as Markov graphs and Petri nets; generic tools; state of the art)

32.2 History and Principle

Named Monte Carlo by Von Neumann (1951) in reference to games of chance, the Monte Carlo simulation is rather simple: it consists in replacing a single analytical calculation by a statistic on a large number of random histories of the modelled system. A nut tree which is shaken with a stick to harvest nuts is a good picture of the process: the nut tree represents the behavioural model, the nuts represent the events which can occur and the stick represents the random numbers used to animate the behavioural model. Therefore, as the riper nuts fall down first, the most probable events also occur first in the simulation. This highlights an interesting property of the Monte Carlo simulation: as the most probable events arise first, the simulation is self-approximating. There is no need, as for the other approaches, to remove the negligible parts of the model to make it manageable: with Monte Carlo simulation the negligible events just do not appear, or seldom appear, during the simulation.

The idea of replacing an analytical calculation by a statistical process is not really new: it was invented in the eighteenth century by the French naturalist Buffon to evaluate the number π by throwing a lot of needles onto a floor made of wooden strips (Fig. 32.2). It is possible to demonstrate that the probability of a hit (the needle crosses the joint between two wooden strips) is equal to $p = \frac{2L}{\pi D}$, where $L$ is the length of the needle and $D$ the width of the strips, with $L \le D$ (see Wikipedia Buffon (2020)). Therefore, if $n$ needles have been thrown and $k$ hits have been observed, $p$ can be estimated as $p \approx \frac{k}{n}$. This results in $\pi \approx 2 \cdot \frac{n}{k} \cdot \frac{L}{D}$. This experiment works but converges slowly and only a few significant digits of π can be obtained when it is undertaken manually. However, it can be implemented on computers, and simulators which show how this works are available on the Internet (search for “Buffon needle”).
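Buffon's experiment can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name, the default dimensions and the seed are hypothetical, and the random direction of the needle is drawn by rejection in a quarter disc so that the value of π is never used to produce its own estimate.

```python
import math
import random


def buffon_pi(n, needle=1.0, strip=1.0, seed=0):
    """Estimate pi by throwing n needles of length `needle` on strips of width `strip`."""
    if needle > strip:
        raise ValueError("the formula p = 2L/(pi*D) requires L <= D")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Distance from the needle centre to the nearest joint, uniform on [0, D/2].
        centre = rng.uniform(0.0, strip / 2.0)
        # Uniform point in the unit quarter disc: its polar angle is uniform
        # on [0, pi/2], which gives the angle between the needle and the joints.
        while True:
            u, v = rng.random(), rng.random()
            r2 = u * u + v * v
            if 0.0 < r2 <= 1.0:
                break
        sin_theta = v / math.sqrt(r2)
        if centre <= (needle / 2.0) * sin_theta:
            hits += 1  # score 1 for a hit, 0 otherwise
    if hits == 0:
        raise ValueError("no hit observed: throw more needles")
    return 2.0 * (n / hits) * (needle / strip)


for n in (100, 10_000, 200_000):
    print(n, buffon_pi(n))
```

Running the loop with increasing values of n illustrates the slow convergence mentioned above: the statistical error decreases only as the inverse of the square root of the number of needles.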


Fig. 32.2 Buffon’s needles thrown on a wooden floor (labels: wooden strips, needle, hit)

Mathematically speaking, the principle is the following:
• Building a statistical process (game) governed by rules where randomness takes place (e.g. the wooden floor and the needles);
• Performing the simulation (e.g. throwing one needle on the floor);
• Attributing values (scores) to the random variables (e.g. one for a hit and zero otherwise) which depend on the course of the game.
Each realization of the above statistical process is called a history (or trajectory) of the underlying random process. When a sufficient number of such histories are gathered, a statistical sample is obtained which can be used to estimate statistical results (e.g. the estimation of p ≈ k/n which, in turn, leads to an estimation of π). As said above, a Markov graph can be used to support a Monte Carlo simulation and, in order to illustrate the principle in more detail, the Markov graph presented in Fig. 32.3 is used hereafter. This Markov graph is extracted from Chap. 31 and it is related to a system made of two components organized in parallel (e.g. two pumps with different pumping capacities).

Fig. 32.3 Example of Markov graph as support for Monte Carlo simulation (two pumps A and B in parallel between flow in and flow out; states E1 to E4 with production levels of 100%, 70%, 50% and 0%; transition rates λa, μa, λb and μb)

Fig. 32.4 Example of Monte Carlo simulation of one history from the Markov graph in Fig. 32.3 (random delays of A and B, firings at times T1 to T8 and resulting chronogram of the production level: 100%, 70%, 50% and 0%)

The realization of one history of the system (i.e. a trajectory of the random process modelled by the Markov graph) over a period of time [0, T] is illustrated in Fig. 32.4 and performed as follows:
• Determine the initial conditions: state E1 and T0 = 0.
=> The current state is E1, the current time is T0 and the timetable is equal to [T0, T].
• From state E1 two events can occur: the failure of A and the failure of B:
– Select a random number and, according to the failure rate λa, calculate the delay δ1A before the failure of A which, then, occurs at T1A = T0 + δ1A;
– Select a random number and, according to the failure rate λb, calculate the delay δ1B before the failure of B which, then, occurs at T1B = T0 + δ1B.
=> Remove T0 from the timetable and input T1A and T1B. As T1A < T1B the timetable becomes [T1A, T1B, T].
• Choose the first time in the timetable (T1A) and trigger the corresponding event (failure of A). Then state E3 is reached.
=> T1A becomes the current time.
• From state E3, as the date of the failure of B is already known, only one new event has to be considered: the repair of A:
– Select a random number and, according to the repair rate μa, calculate the delay δ2A before the repair of A which, then, occurs at T2A = T1A + δ2A.


=> Remove T1A from the timetable and input T2A. As T2A < T1B the timetable becomes [T2A, T1B, T].
• Choose the first time in the timetable (T2A) and trigger the corresponding event (repair of A). Then state E1 is reached.
=> T2A becomes the current time.
• From state E1, as the date of the failure of B is already known, only one new event has to be considered: the failure of A:
– Select a random number and, according to the failure rate λa, calculate the delay δ3A before the failure of A which, then, occurs at T3A = T2A + δ3A.
=> Remove T2A from the timetable and input T3A. As T1B < T3A the timetable becomes [T1B, T3A, T].
• Choose the first time in the timetable (T1B) and trigger the corresponding event (failure of B). Then state E2 is reached.
• Etc.
The rest of the simulation is illustrated in Fig. 32.4. The simulation goes on until the first time in the timetable is equal to T, as for example [T, T5A, T5B] in Fig. 32.4. Then, a history is obtained and it can be represented by the chronogram at the bottom of Fig. 32.4.

A single history is not representative from a statistical point of view and many such histories (10^2, 10^3, 10^4, etc.) have to be simulated until a representative sample of the event of interest is obtained. This is illustrated in Fig. 32.5 where, due to the lack of space, only three histories are presented: this is undoubtedly insufficient from a statistical point of view but sufficient to explain the principle of the various possible statistical estimations. In addition, in order to simplify the explanations, the figures have been rounded, while in the actual simulation they are obtained with many significant digits (e.g. 198.546825719 instead of 200).

Fig. 32.5 Example of histories obtained from Monte Carlo simulation (three histories of 2000 h each, showing the time spent at the production levels 100%, 70%, 50% and 0%)

From the histories (i.e. random process trajectories) illustrated in Fig. 32.5, the classical parameters over the time interval [0, T] can be estimated from simple statistical calculations:
• Unreliability: before T, the system has had a first failure in two of the three histories ⇒ F(T) ≈ 2/3 = 0.67.
• Instantaneous unavailability: at time T, the system is unavailable in one of the three histories ⇒ U(T) ≈ 1/3 = 0.33.
• Average unavailability: the system is unavailable during 500 + 300 + 100 = 900 h over the three histories of 2000 h ⇒ $\bar{U}(T)$ ≈ 900/(3 × 2000) = 0.15.
• Mean number of failures: the system has had 3 failures over the three histories ⇒ Nf(T) ≈ 3/3 = 1.
• Mean failure frequency: the system has had 3 failures over the three histories of 2000 h ⇒ $\bar{w}(T)$ ≈ 3/(3 × 2000) = 5.0 × 10^−4 h^−1.
• Mean down time: over the three histories the system has had three failures for an accumulated down time of 500 + 300 + 100 = 900 h ⇒ MDT ≈ 900/3 = 300 h.
• Mean up time: over the three histories the system has had three failures for an accumulated up time of 6000 − 900 = 5100 h ⇒ MUT ≈ 5100/3 = 1700 h.
Beyond the classical parameters, other parameters can be estimated like:
• Average production availability: the system has spent 3200 h at 100%, 900 h at 70%, 1000 h at 50% and 900 h at 0% over the three histories of 2000 h ⇒ PA(T) ≈ (3200 × 100% + 900 × 70% + 1000 × 50%)/(3 × 2000) = 72%.
• Average production unavailability: ⇒ PU(T) = 1 − PA(T) ≈ 28%.
• Maintenance load: the system has spent 2800 h in maintenance operations and this leads to a maintenance load of ML(T) = 2800/(3 × 2000) = 47%.
In addition, it has to be noted that the same kinds of results are also obtained for components A and B individually.
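The estimators listed above can be reproduced with a short sketch. The tallies are the rounded figures quoted in the text for the three histories of Fig. 32.5; the variable names are illustrative only. Note that 3 failures over 3 × 2000 h evaluate to a mean failure frequency of 5.0 × 10^−4 per hour.

```python
# Tallies read from the three rounded histories described in the text (Fig. 32.5).
N_HISTORIES = 3
T = 2000.0                                # length of each history (h)
histories_with_a_failure = 2              # histories where the system failed before T
histories_down_at_T = 1                   # histories where the system is down at T
down_durations = [500.0, 300.0, 100.0]    # one entry per system failure (h)
hours_at_level = {1.00: 3200.0, 0.70: 900.0, 0.50: 1000.0, 0.00: 900.0}
maintenance_hours = 2800.0                # time spent in maintenance operations (h)

total_time = N_HISTORIES * T
n_failures = len(down_durations)
total_down = sum(down_durations)

estimates = {
    "unreliability F(T)": histories_with_a_failure / N_HISTORIES,
    "unavailability U(T)": histories_down_at_T / N_HISTORIES,
    "average unavailability": total_down / total_time,
    "mean number of failures": n_failures / N_HISTORIES,
    "mean failure frequency (1/h)": n_failures / total_time,
    "MDT (h)": total_down / n_failures,
    "MUT (h)": (total_time - total_down) / n_failures,
    "production availability": sum(level * hours for level, hours in hours_at_level.items()) / total_time,
    "maintenance load": maintenance_hours / total_time,
}

for name, value in estimates.items():
    print(f"{name:30s} {value:.4g}")
```

With only three histories these figures are of course not statistically meaningful; the same tallies accumulated over thousands of histories give the actual estimates.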
Therefore, contrary to analytical calculations, which are focused on single results, the Monte Carlo simulation provides plenty of various results at the same time. However, in order to be able to generate the above histories, it remains to find a way to generate the random delays in accordance with the probabilistic laws governing the related events. This is explained in the section hereafter.
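The timetable mechanism described in this section can be sketched as follows for the two-component Markov graph. This is a minimal illustration under stated assumptions, not the implementation of any particular tool: all delays are exponential and drawn with the standard library function `random.expovariate`, each component has its own repair resource, the system is considered down only when both components are failed, and the function names and numerical rates are hypothetical.

```python
import random


def simulate_history(lam_a, mu_a, lam_b, mu_b, horizon, rng):
    """Simulate one history over [0, horizon]; return (system down time, number of system failures)."""
    # Rate of the next event of each component, when it is up (failure) or down (repair).
    rates = {"A": {True: lam_a, False: mu_a},
             "B": {True: lam_b, False: mu_b}}
    up = {"A": True, "B": True}                      # initial state E1
    now, down_time, n_failures = 0.0, 0.0, 0
    # Timetable: date of the next scheduled event of each component.
    timetable = {c: rng.expovariate(rates[c][True]) for c in up}
    while True:
        comp = min(timetable, key=timetable.get)     # earliest scheduled event
        next_time = min(timetable[comp], horizon)
        if not (up["A"] or up["B"]):                 # both failed: system down
            down_time += next_time - now
        now = next_time
        if now >= horizon:
            return down_time, n_failures
        up[comp] = not up[comp]                      # fire the failure or the repair
        if not (up["A"] or up["B"]):
            n_failures += 1
        # Only the component which has just changed state gets a new delay.
        timetable[comp] = now + rng.expovariate(rates[comp][up[comp]])


def average_unavailability(lam_a, mu_a, lam_b, mu_b, horizon, n_histories, rng):
    """Average unavailability estimated over n_histories simulated histories."""
    total_down = sum(simulate_history(lam_a, mu_a, lam_b, mu_b, horizon, rng)[0]
                     for _ in range(n_histories))
    return total_down / (n_histories * horizon)


rng = random.Random(2021)
estimate = average_unavailability(0.01, 0.1, 0.01, 0.1, 2000.0, 1000, rng)
print(f"estimated average unavailability: {estimate:.4f}")
```

With independent components, the estimate can be compared to the asymptotic value obtained as the product of the component unavailabilities, λ/(λ + μ) for each of them, which gives a simple sanity check of the simulation.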


32.3 Generation of Probabilistic Laws

32.3.1 General Principle for Generating Random Delays

Even if the mathematics related to the Monte Carlo simulation are rather simple compared to analytical calculation, it is however necessary to introduce some probabilistic concepts to explain how the Monte Carlo simulation works. The process starts by considering a random variable $X$ and its CDF (cumulative distribution function) $F_X(x) = \Pr(X \le x)$. In our examples, this can be the unreliability function $F(t)$ which is the CDF of the random variable TTF (time to failure): $F(t) = \Pr(\mathrm{TTF} \le t)$. Then $F(t)$ could be written $F(t) \equiv F_{\mathrm{TTF}}(t)$. The Monte Carlo simulation process starts with the following change of random variable: $Z = F_X(X)$. This implies that:
– $Z \in [0, 1]$
– $\Pr(Z \le z) = \Pr(F_X(X) \le z)$
$F_X$ is a probability distribution function: it is therefore an increasing (or rather a non-decreasing) function which can be inverted, and this gives, with $x = F_X^{-1}(z)$:
– $\Pr(Z \le z) = \Pr[X \le F_X^{-1}(z)]$
– $\Pr(Z \le z) = \Pr[X \le x]$
– $\Pr(Z \le z) = F_X(x) = z$
– $F_Z(z) = z$
The last result, $F_Z(z) = z$, is the characteristic of a random variable uniformly distributed over the interval [0, 1]. It has to be noted that if $Z$ is uniformly distributed over the interval [0, 1], the complementary value $1 - Z$, which is also uniformly distributed over the same interval, can be used instead. This is useful to simulate samples of times to failure of an item: the unreliability $F(t)$ being the CDF of the time to failure (TTF) (see Chap. 4), the reliability $R(t) = 1 - F(t)$ can be used instead. This is the core of the Monte Carlo simulation principle: a sample of $n$ values $x_i = F_X^{-1}(z_i)$ of any random variable $X$ can be obtained from a sample of $n$ values $z_i$ of a random variable $Z$ uniformly distributed over [0, 1]. This is illustrated in Fig. 32.6. Therefore, the Monte Carlo simulation consists merely in generating uniformly distributed samples which are, in turn, transformed into samples of the wanted distribution. This is extremely simple, especially when the inverse distribution is easy to find.
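As an illustration of this principle, the sketch below draws times to failure by inverting the CDF. The exponential case follows directly from F(t) = 1 − exp(−λt); the Weibull case, with scale η and shape β, is added here only as an example of a non-exponential law whose CDF is easy to invert. Function names and parameter values are assumptions made for the illustration.

```python
import math
import random


def sample_exponential(lam, rng):
    """Time to failure with F(t) = 1 - exp(-lam*t), obtained as t = F^-1(z)."""
    z = rng.random()                     # uniform on [0, 1)
    return -math.log(1.0 - z) / lam      # 1 - z lies in (0, 1], so the log is defined


def sample_weibull(eta, beta, rng):
    """Time to failure with F(t) = 1 - exp(-(t/eta)**beta), obtained as t = F^-1(z)."""
    z = rng.random()
    return eta * (-math.log(1.0 - z)) ** (1.0 / beta)


rng = random.Random(42)
n = 100_000
mean_exp = sum(sample_exponential(1.0e-3, rng) for _ in range(n)) / n
mean_wei = sum(sample_weibull(1000.0, 2.0, rng) for _ in range(n)) / n
print(f"exponential: sample mean = {mean_exp:.0f} h, theoretical MTTF = {1 / 1.0e-3:.0f} h")
print(f"Weibull    : sample mean = {mean_wei:.0f} h, "
      f"theoretical MTTF = {1000.0 * math.gamma(1.5):.0f} h")
```

Comparing the sample means with the theoretical mean times to failure gives a quick check that the generated delays follow the intended laws.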

Fig. 32.6 Principle to obtain a sample distributed according to a given CDF from uniformly distributed random numbers (the uniformly distributed random numbers z1, z2, z3, read on the CDF FZ(z), are mapped through the CDF FX(x) to the values x1, x2, x3 of the sample distributed according to FX(x); the corresponding probability density functions of Z and X are also shown)

32.3.2 Random Number Generation

As explained above, the Monte Carlo simulation of a random variable X is done through a uniformly distributed random variable Z. Of course, the quality of the resulting sample of X values is closely linked to the quality of the generated sample of Z values. The design of uniformly distributed random number generators has been, and still is, a subject of research. Many techniques have been tried, based on:
• physical random phenomena like a radioactive source or the white noise of a Zener diode;
• games like coin toss (heads or tails), dice, roulette;
• digits of π (of which several billion are known nowadays) as illustrated in Fig. 32.7: 1 billion decimals provide 83 million random numbers with 12 significant digits;
• and, of course, computer-based random number generation.

3.141592653589 793238462643 383279502884 197169399375 105820974944 592307... gives the random numbers 0.141592653589, 0.793238462643, 0.383279502884, etc.

Fig. 32.7 The decimals of π used as random generator
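The grouping shown in Fig. 32.7 can be reproduced with the short sketch below. The helper name `numbers_from_digits` is hypothetical, and only the first 36 decimals quoted in the figure are used.

```python
# First 36 decimals of pi, copied from Fig. 32.7.
PI_DECIMALS = "141592653589793238462643383279502884"


def numbers_from_digits(digits, width=12):
    """Cut a digit string into consecutive groups of `width` digits and read
    each complete group as a number in [0, 1)."""
    usable = len(digits) - len(digits) % width
    return [int(digits[i:i + width]) / 10 ** width
            for i in range(0, usable, width)]


print(numbers_from_digits(PI_DECIMALS))
# Number of 12-digit groups contained in one billion decimals.
print(10 ** 9 // 12)
```

The last line gives 83,333,333 groups, which is the order of magnitude quoted in the list above.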


Even if some techniques like the decimals of π provide random numbers of very good quality, Von Neumann has tried to generate them directly from computers using formulae of the form $z_{i+1} = g(z_i)$ and this has led to the design of the congruential pseudo random number generators. They are of the form:

$$z_{i+1} = (a \cdot z_i + c) \bmod m \tag{32.1}$$

In this formula, mod m denotes the congruence modulo m, i.e. the remainder of the division of $(a \cdot z_i + c)$ by m. This is a mathematical equivalent of the roulette: in this game, when the ball is launched, it makes several turns before it lands in one of the 37 slots. With such a device, if it is not rigged, it is impossible to predict in which slot the ball is going to land and the probability of reaching any of the slots is the same (i.e. 1/37 = 2.7%): this gives a good example of a discrete random variable with an equidistribution of the 37 possible outcomes. The greater the number of turns, the more uniform the distribution. In Formula 32.1, the length of the ball trajectory is given by $(a \cdot z_i + c)$ and m plays the role of the number of slots: the quotient of the division is the number of complete turns and the remainder is the slot in which the ball lands. This provides integer values in the interval [0, m − 1] which can be normalized by dividing by m to obtain uniformly distributed random values within the interval [0, 1]. By similarity with the roulette, the greater $(a \cdot z_i + c)$, the more uniform the distribution. It has to be noted that, when using a computer, an equivalent formula can be used:

$$z_{i+1} = \mathrm{FRAC}(a \cdot z_i + c) \tag{32.2}$$

This formula directly gives the remainder, within [0, 1], of the division of $(a \cdot z_i + c)$ by the greatest number that the computer can code which, therefore, plays the role of m. The value of this number depends on the number of bits (e.g. 32 or 64) implemented in the computer. The quality of the random numbers generated in this way is closely linked to the values of a, c and m, to the choice of the seed $z_0$ (i.e. the first value used to initialize the generator) and to the format (number of bytes) used to code the numbers in the computer on which they are implemented. In addition, such generators have cycles: as soon as a number already found is found again, the same series of numbers is generated and the numbers are no longer uniformly distributed. For all these reasons, they are called pseudo random generators, as a reminder that they are not perfect.

Most computing devices (PCs, mainframes, pocket calculators or even mobile phones) provide random number generators through functions like RND, RAND, RANDOM,… The quality of such generators is generally sufficient to cover the needs of simple Monte Carlo simulations. However, statistical tests should be performed to verify the uniformity of the generated random numbers and the absence of cycles before using them for performing Monte Carlo simulations on complex industrial systems. These tests are beyond the scope of this book; for more information see Law et al. (1991) or L’Ecuyer (1990).
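Formula 32.1 and the notion of cycle can be illustrated by the sketch below. The toy parameters (m = 16) are assumptions chosen so that the cycles can be followed by hand, and the 32-bit constants used at the end are commonly quoted values taken as an assumption, not a recommendation of the text.

```python
def lcg(seed, a, c, m):
    """Yield z_{i+1} = (a * z_i + c) mod m, as in Formula 32.1."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z


def cycle_length(seed, a, c, m):
    """Length of the cycle into which the generator eventually falls."""
    first_seen = {}
    z, step = seed, 0
    while z not in first_seen:
        first_seen[z] = step
        z = (a * z + c) % m
        step += 1
    return step - first_seen[z]


# Toy parameters (assumed): every value of [0, 15] is visited before repeating.
print(cycle_length(seed=0, a=5, c=3, m=16))
# A poor choice of a and c: the sequence gets stuck on a single value.
print(cycle_length(seed=0, a=4, c=2, m=16))

# Normalized draws in [0, 1) with commonly quoted 32-bit constants (assumed).
gen = lcg(seed=2021, a=1664525, c=1013904223, m=2 ** 32)
uniform_draws = [next(gen) / 2 ** 32 for _ in range(5)]
print(uniform_draws)
```

The second call shows how quickly a badly parameterized generator degenerates, which is why the choice of a, c, m and of the seed matters.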


32.3.3 Simulation of Typical Probabilistic Laws

32.3.3.1 Simulation of Random Delays

The following paragraphs are devoted to the simulation, from a sample of values $z_i$ of a uniformly distributed random variable $Z$, of samples of values $\delta_i$ of a random variable $\Delta$ governed by various distributions $F_\Delta(\delta)$.

Exponential Law

The exponential law being the most used law within the dependability field, it seems natural to begin with it to show how to obtain an exponentially distributed sample from a uniformly distributed sample. Let us consider an item with a failure rate $\lambda$: its unreliability is given by $F(t) = 1 - e^{-\lambda t}$ which is also, as said above in 32.3.1, the distribution of its time to failure, $F_{\mathrm{TTF}}(t)$. Therefore, according to its definition above, the random variable $\Delta$ is equal to the time to failure and, in this case, $F_\Delta(\delta) \equiv F_{\mathrm{TTF}}(t)$. If $Z$ is a uniformly distributed random variable, the change of random variable $Z = F_\Delta(\Delta)$ leads to $z_i = 1 - e^{-\lambda \delta_i}$ and then to $1 - z_i = e^{-\lambda \delta_i}$. Taking the logarithm gives $\ln(1 - z_i) = -\lambda \delta_i$ and then the delay before the item failure is obtained as $\delta_i = -\ln(1 - z_i)/\lambda$. However, if $z_i$ is uniformly distributed over [0, 1], so is $1 - z_i$. Therefore, after another change of random variable from $Z$ to $1 - Z$, the exponential distribution can be simulated by:

$$\delta_i = -\frac{\ln(z_i)}{\lambda} \tag{32.3}$$
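A minimal sketch of Formula 32.3 is given below. The failure rate value and the function name `exponential_delay` are assumptions made for the illustration. As Python's `random()` returns values in [0, 1), the sketch uses 1 − z so that the logarithm of zero is never taken.

```python
import math
import random


def exponential_delay(lam, rng):
    """Formula 32.3: delta = -ln(z) / lam, with z uniform over (0, 1]."""
    z = 1.0 - rng.random()  # random() is in [0, 1); 1 - random() avoids ln(0)
    return -math.log(z) / lam


rng = random.Random(2021)
lam = 1.0e-3  # assumed failure rate, in h^-1
delays = [exponential_delay(lam, rng) for _ in range(100_000)]
mean_delay = sum(delays) / len(delays)
print(f"mean simulated delay = {mean_delay:.1f} h (theory 1/lam = {1 / lam:.1f} h)")
```

The simulated mean should be close to the mean time to failure 1/λ, which gives a quick sanity check of the sampler.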

Uniform Law

The uniform law is useful to simulate the occurrence of an event which can occur, without other information, within a given interval $[T_1, T_2]$. In this case, the random variable $\Delta$ related to its time of occurrence is uniformly distributed between $T_1$ and $T_2$ and a sample of its distribution $F_\Delta(\delta)$ can be obtained by using the following formula:

$$\delta_i = (T_2 - T_1) \cdot z_i + T_1 \tag{32.4}$$

When $z_i$ evolves from 0 to 1, $\delta_i$ evolves from $T_1$ to $T_2$.

Weibull Law

The formula of a Weibull distribution with a shape parameter equal to $\beta$ and a scale parameter $\eta = (1/\lambda)^{1/\beta}$ is given by $F_\Delta(\delta) = 1 - e^{-\lambda \delta^{\beta}}$. Therefore, a sample of a random variable $\Delta$ governed by this distribution can be obtained by using the following formula: $z_i = 1 - e^{-\lambda \delta_i^{\beta}}$. Then $1 - z_i = e^{-\lambda \delta_i^{\beta}}$ and the random variable $Z$ can be replaced by the random variable $1 - Z$ which is also uniformly distributed: $z_i = e^{-\lambda \delta_i^{\beta}}$. Then $\ln(z_i) = -\lambda \delta_i^{\beta}$ and $\delta_i^{\beta} = -\ln(z_i)/\lambda$ and finally the sample of $F_\Delta(\delta)$ can be obtained by using the following formula:

$$\delta_i = \left(\frac{-\ln(z_i)}{\lambda}\right)^{1/\beta} \tag{32.5}$$

Erlang Law

A generalized Erlang law is related to a sum of $n$ exponentially distributed random variables: $\Delta = \sum_{k=1}^{n} \Delta_k$ where $\Delta_k$ is governed by $F_{\Delta_k}(\delta) = 1 - e^{-\lambda_k \delta}$. Therefore, a sample of any of the random variables $\Delta_k$ can be obtained by using Formula 32.3 established for the exponential law: $\delta_{k,i} = -\ln(z_{k,i})/\lambda_k$. Then, the sample of $F_\Delta(\delta)$ can be obtained by using the following formula:

$$\delta_i = \sum_{k=1}^{n} \delta_{k,i} = -\sum_{k=1}^{n} \frac{\ln(z_{k,i})}{\lambda_k} \tag{32.6}$$

For an ordinary Erlang law, all the $\lambda_k$ are equal to a common value $\lambda$ and the above formula becomes: $\delta_i = -\sum_{k=1}^{n} \frac{\ln(z_{k,i})}{\lambda} = -\frac{\sum_{k=1}^{n} \ln(z_{k,i})}{\lambda}$. And finally, the sample of $F_\Delta(\delta)$ can be obtained by using the following formula:

$$\delta_i = -\frac{\ln(z_{1,i} \cdot z_{2,i} \cdot \ldots \cdot z_{n,i})}{\lambda} \tag{32.7}$$
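Formulae 32.4, 32.5 and 32.7 can be sketched together as follows. All numerical values and function names are assumptions made for the illustration. The theoretical means used for comparison are (T1 + T2)/2 for the uniform law, η·Γ(1 + 1/β) for the Weibull law and n/λ for the ordinary Erlang law.

```python
import math
import random


def unit_open(rng):
    """Uniform draw over (0, 1], so that ln(z) is always defined."""
    return 1.0 - rng.random()


def uniform_delay(t1, t2, rng):
    """Formula 32.4."""
    return (t2 - t1) * rng.random() + t1


def weibull_delay(lam, beta, rng):
    """Formula 32.5, for F(delta) = 1 - exp(-lam * delta**beta)."""
    return (-math.log(unit_open(rng)) / lam) ** (1.0 / beta)


def erlang_delay(lam, n, rng):
    """Formula 32.7: one delay consumes n uniform random numbers."""
    product = math.prod(unit_open(rng) for _ in range(n))
    return -math.log(product) / lam


rng = random.Random(7)
size = 50_000
mean_uniform = sum(uniform_delay(100.0, 300.0, rng) for _ in range(size)) / size
mean_weibull = sum(weibull_delay(1.0e-6, 2.0, rng) for _ in range(size)) / size
mean_erlang = sum(erlang_delay(1.0e-2, 3, rng) for _ in range(size)) / size
print(f"uniform: {mean_uniform:.1f} (theory 200.0)")
print(f"Weibull: {mean_weibull:.1f} (theory {1000.0 * math.gamma(1.5):.1f})")
print(f"Erlang:  {mean_erlang:.1f} (theory 300.0)")
```

For a large n, the product of uniform numbers in Formula 32.7 may underflow, and summing the logarithms as in Formula 32.6 is then the safer choice.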

Therefore, the calculation of a single value $\delta_i$ of an Erlang law sample requires the generation of $n$ different uniformly distributed random numbers $z_{k,i}$, one for each of the $n$ independent exponentially distributed random variables to which the Erlang law is related.

Lognormal Law

The lognormal distribution is more complicated to simulate than the others as it requires simulating the normal distribution first, and the normal law has no analytical inverse function: it is only tabulated. It is beyond the scope of this book to give the detail of the demonstration (see Pagès and Gondran (1986)) but an element $\nu_i$ of a sample of a random variable governed by a standard normal distribution $N(0, 1)$ can be obtained by generating two uniformly distributed random numbers $z_{1,i}$ and $z_{2,i}$:

$$\nu_i = \sqrt{-2 \cdot \ln(z_{1,i})} \cdot \cos(2\pi \cdot z_{2,i}) \tag{32.8}$$
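Formula 32.8, commonly known as the Box-Muller transform, can be checked numerically with the sketch below. The function name and the sample size are assumptions. The empirical mean and standard deviation should be close to 0 and 1 respectively.

```python
import math
import random


def standard_normal(rng):
    """Formula 32.8 applied to two independent uniform random numbers."""
    z1 = 1.0 - rng.random()  # in (0, 1], so that ln(z1) is defined
    z2 = rng.random()
    return math.sqrt(-2.0 * math.log(z1)) * math.cos(2.0 * math.pi * z2)


rng = random.Random(1986)
values = [standard_normal(rng) for _ in range(100_000)]
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
print(f"mean = {mean:.3f}, standard deviation = {std:.3f}")
```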


A lognormal law can be characterized, for example, by its mean value $m$ and its error factor $q_{5\%}$ (value such that, if $Med$ is the median value, $[Med/q_{5\%}, Med \cdot q_{5\%}]$ is the centered confidence interval at 90%). If the logarithm of the lognormal random variable is governed by a normal law $N(\mu, \sigma^2)$, then the corresponding mean value $\mu$ and standard deviation $\sigma$ can be calculated by $\sigma = \ln(q_{5\%})/1.64$ and $\mu = \ln(m) - \sigma^2/2$. Finally, a sample of the lognormal law whose logarithm is governed by the normal law $N(\mu, \sigma^2)$ can be obtained with the following formula:

$$\delta_i = e^{\sigma \cdot \nu_i + \mu} \tag{32.9}$$

32.3.3.2 Simulation of Constant Delays

When performing a Monte Carlo simulation, constant delays often have to be taken into account to model events which occur in a deterministic way. The following paragraphs are devoted to the simulation of such delays, which do not need the use of uniformly distributed random numbers.

Constant Delays

A constant delay τ is very simple to simulate as its value can be used directly in the simulation:

$$\delta = \tau \tag{32.10}$$

Delays for Periodically Tested Components

Another example of deterministic delays is given by the delays elapsing between the occurrence of a hidden failure and its detection by a proof test. This is illustrated in Fig. 32.8.

Fig. 32.8 Delay to detect a hidden failure for periodically tested items (top: tests performed with a test interval τ; bottom: test staggering, with a first test interval θ followed by test intervals τ; each panel shows the instant t of a failure and the detection of the failure at the next test)

At the top of the figure is illustrated the case of periodical tests performed with a test interval equal to τ. In this case, when a hidden failure occurs at a time t, the delay δ before it is detected is equal to:

$$\delta = \tau - (t \bmod \tau) \tag{32.11}$$

At the bottom of the figure, the current test interval is the same but the first test interval is different (θ instead of τ) in order to stagger the tests of similar items. In this case, when a hidden failure occurs at a time t, the delay δ before it is detected is equal to:

$$\delta = \tau - ((t - \theta) \bmod \tau) \tag{32.12}$$
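Formulae 32.11 and 32.12 can be sketched as follows. The function name `detection_delay` and the numerical values are assumptions made for the illustration.

```python
def detection_delay(t, tau, theta=None):
    """Delay between a hidden failure at time t and the next proof test.

    Without staggering (theta is None) this is Formula 32.11; with a first
    test interval theta it is Formula 32.12.
    """
    offset = 0.0 if theta is None else theta
    # Python's % returns a result in [0, tau) even when t < theta.
    return tau - (t - offset) % tau


tau = 8760.0    # assumed test interval: one year, in hours
theta = 4380.0  # assumed staggering: first test after six months
t = 10000.0     # assumed instant of the hidden failure
print(detection_delay(t, tau))         # tests at 8760, 17520, ...
print(detection_delay(t, tau, theta))  # tests at 4380, 13140, ...
```

With these values the failure is detected at 17520 h without staggering and at 13140 h with staggering. As the formulae are written, a failure occurring exactly at a test instant is only detected at the following test.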

The effect of staggering the tests is illustrated in Fig. 32.8 for two failures occurring at the same time t. It has to be noted that these delays are not constant but, as soon as the hidden failure has occurred at time t, their values become deterministic.

32.3.3.3 Link Between Random and Deterministic Delays

According to what has been explained above, a Monte Carlo simulation handles random and deterministic events in the same way, within the same mathematical framework. This is a very important characteristic of this approach, as this is very difficult and generally impossible with the analytical approaches. In fact, the link between random and deterministic events can be made by assimilating the distribution, $\Pr(\Delta \le \delta)$, of the delay $\Delta$ to a Dirac distribution $\delta(\tau)$ located at τ, which leads to:

$$\Pr(\Delta \le \delta) \equiv \delta(\tau) = \begin{cases} 0 & \text{if } \delta < \tau \\ 1 & \text{if } \delta \ge \tau \end{cases} \tag{32.13}$$

Therefore, within the Monte Carlo simulation approach, the deterministic and random delays can be easily unified by considering the Dirac distribution.

32.4 Accuracy of Results

32.4.1 Accuracy Related to Monte Carlo Itself

A grievance frequently raised about Monte Carlo simulation is that it is not accurate and does not provide relevant results. This is why this technique has been discarded for a long time, mainly in the academic context, and that position is still regularly encountered. However, when analysing the problem in detail, it is easy to demonstrate the inanity of this position, which is more intuitive than rational: in fact, the accuracy of a Monte Carlo simulation can be easily monitored by calculating the confidence interval of the results, which shows clearly whether they are accurate or not. This simply results from elementary statistical considerations.

Let us consider that h histories have been simulated where a random variable X has been observed n times. Therefore, a statistical sample of n different values ($x_i$) is gathered and can be used to estimate the mathematical expectation (mean value) $m_n$ of X as well as its variance $\sigma_n^2$ related to these n observations:

$$m_n = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{32.14}$$

$$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - m_n)^2 \tag{32.15}$$

When the size of sample n increases, $m_n$ and $\sigma_n^2$ become closer and closer to the exact values $m$ and $\sigma^2$ of the random variable X. These values are deterministic and the central limit theorem can be used. This theorem says that $m_n$, considered as a random variable, tends to a normal law $N(m, \sigma^2/n)$. Therefore, when n increases, $m_n$ becomes closer and closer to $m$ and its variance $\sigma^2/n$ decreases (and tends to 0 when the size of the sample goes to infinity). This allows measuring the accuracy of $m_n$ by the probability $C(e)$ that it is located within an interval $[m - e, m + e]$ around the exact value $m$. Mathematically speaking, this is expressed as:

$$C(e) = \Pr\{m_n \in [m - e, m + e]\} = \alpha\% \tag{32.16}$$

The interval $[m - e, m + e]$ is the confidence interval at α% of the estimation $m_n$ and the length of this interval is equal to 2e. As $m - e \le m_n \le m + e$ implies $m_n - e \le m \le m_n + e$, the exact value $m$ has also α% chances to belong to the interval $[m_n - e, m_n + e]$. Parameter e can be calculated by using the error function erf(.), a tabulated function related to the normal law and currently used in statistics: it allows calculating e as a function of n, $\sigma_n^2$ and α%. When α% is equal to 90%, this leads to the following important result:

$$e_{90\%,n} = 1.64 \times \sigma_n / \sqrt{n} \tag{32.17}$$

Therefore, $m$ has a 90% chance to be in the interval $[m_n - 1.64 \cdot \sigma_n/\sqrt{n},\ m_n + 1.64 \cdot \sigma_n/\sqrt{n}]$ and the length of this interval decreases when n increases, but rather slowly, as n has to be multiplied by 4 to divide the confidence interval by 2. This explains why frequent events are easier to handle than rare events: for the same number h of simulations, the number n of observations is very much greater for frequent events (e.g. $n = h$ when the event is observed in each history) than for rare events (e.g. $n = h/10^4$ when the event is observed only once in $10^4$ histories).

Fig. 32.9 Evolution of the accuracy according to the number of histories (confidence interval around the exact result, and pseudo error factor, for h = 100, 400, 1600, 6400 and 25600 histories)

By analogy with the lognormal law, the confidence interval at α% can be measured by a pseudo error factor $q_{\alpha\%,n}$ defined as the square root of the ratio of the upper bound to the lower bound of the interval:

$$q_{\alpha\%,n} = \sqrt{\frac{m_n + e_{\alpha\%,n}}{m_n - e_{\alpha\%,n}}} = \sqrt{\frac{1 + e_{\alpha\%,n}/m_n}{1 - e_{\alpha\%,n}/m_n}} \tag{32.18}$$
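Formulae 32.14, 32.15, 32.17 and 32.18 can be gathered in a small helper, sketched below. The function name `estimate` and the stand-in simulator (each history returns 1 with an assumed probability of 1% and 0 otherwise) are hypothetical and only serve to show how the half-width e evolves when the number of histories is multiplied by 4.

```python
import math
import random


def estimate(sample, k=1.64):
    """Return (m_n, sigma_n, e, q): Formulae 32.14, 32.15, 32.17 and 32.18.

    k = 1.64 corresponds to the 90% confidence level used in the text.
    q is NaN when the lower bound m_n - e is not positive.
    """
    n = len(sample)
    m_n = sum(sample) / n
    sigma_n = math.sqrt(sum((x - m_n) ** 2 for x in sample) / n)
    e = k * sigma_n / math.sqrt(n)
    q = math.sqrt((m_n + e) / (m_n - e)) if m_n - e > 0 else math.nan
    return m_n, sigma_n, e, q


# Hypothetical stand-in for a simulator: each history returns 1.0 with an
# assumed probability of 1% (event observed) and 0.0 otherwise.
rng = random.Random(32)
for h in (100, 400, 1600, 6400, 25600):
    histories = [1.0 if rng.random() < 0.01 else 0.0 for _ in range(h)]
    m_n, sigma_n, e, q = estimate(histories)
    print(f"h={h:6d}  m_n={m_n:.4f}  90% CI=[{m_n - e:.4f}, {m_n + e:.4f}]  q={q:.3f}")
```

When the lower bound of the interval is zero or negative, the pseudo error factor is not defined and the sketch reports NaN rather than a misleading value.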

When n goes to infinity, $e_{\alpha\%,n}$ goes to zero and, therefore, the pseudo error factor goes to 1.

Note: the true error factor q related to the lognormal law is such that the lower bound of the confidence interval is equal to the median value divided by q and the upper bound to the median value multiplied by q. Of course, the relationship with the median value holds only for the lognormal law but this is a practical way to obtain a normalized measure of the confidence interval.

Figure 32.9 illustrates the evolution of the accuracy of the simulation when the number of histories increases. This example is related to the unavailability U(T), at a given instant T, of a repaired item characterized by a constant failure rate λ and a constant repair rate μ. Calculations have been performed by using the Petri net module of GRIF-Workshop (2020). In Fig. 32.9, the number of histories, h, has been multiplied by 4 from a simulation to the next one. In this example, λ and μ have been chosen to have U(T) ≈ 1% and the result is that the number of times the system is observed to be unavailable at T is about n ≈ h/100.


32.4.2 Qualitative Appreciation of the Accuracy

In Fig. 32.9, 100 histories lead to a very inaccurate result but the accuracy progressively increases when the number of histories increases: the 90% confidence interval shrinks gradually and the pseudo error factor tends slowly to 1. Then this can be used to monitor the simulation by observing how the results evolve when the number of simulations increases:
• At the beginning, when h is low, n is low, mn varies very much and the confidence interval is very wide and even unrealistic (the lower bound is negative for 100 histories in the example).
• Then, when h increases, n increases, the variations of mn and the confidence interval become smaller.
• When approaching the good result, the variations of mn become very small from a simulation to another.

Therefore, if h histories are scheduled, it is a good idea to look at the results for e.g. h/2, 2h/3 or 3h/4 histories in order to verify that the results are closer and closer and determine if more simulations are needed or not.
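This monitoring can be sketched as follows. The function name `running_checkpoints` and the simulated observations (exponentially distributed with an assumed mean of 100) are hypothetical. The sketch simply reports the estimate and its 90% half-width after h/2, 2h/3, 3h/4 and h observations.

```python
import math
import random


def running_checkpoints(values, fractions=(1 / 2, 2 / 3, 3 / 4, 1.0), k=1.64):
    """Return (n, m_n, e) after the given fractions of the scheduled histories."""
    results = []
    for fraction in fractions:
        n = max(1, round(len(values) * fraction))
        head = values[:n]
        m_n = sum(head) / n
        sigma_n = math.sqrt(sum((x - m_n) ** 2 for x in head) / n)
        results.append((n, m_n, k * sigma_n / math.sqrt(n)))
    return results


# Hypothetical observations: exponentially distributed with a mean of 100.
rng = random.Random(9)
observations = [rng.expovariate(1 / 100.0) for _ in range(12_000)]
checkpoints = running_checkpoints(observations)
for n, m_n, e in checkpoints:
    print(f"n={n:6d}  m_n={m_n:8.3f}  half-width e={e:.3f}")
```

If the successive estimates still differ by more than the reported half-widths, more histories are probably needed.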

32.5 Uncertainty Propagation

A general source of uncertainty on the results of calculations is the uncertainty of the input data. For example, the failure or repair rates obtained by statistical estimation from the field feedback (see Chap. 38) are subject to exactly the same problem of accuracy as the Monte Carlo simulation results analysed above. A parameter estimated in this way has no deterministic value: it is a random variable which is more or less scattered around an average value and which is characterized by a probability density function (PDF) more or less peaked or flat according to the quantity of information gathered about it. The more peaked the PDF, the more accurate the value and vice versa. Many distributions can be used to model data uncertainty but, in the dependability field, the uniform law over an interval, characterized by a lower bound and an upper bound, and the lognormal law, characterized by an average value and an error factor, are generally used for this purpose. The normal law, very often used in other contexts, is generally not suitable because it encompasses negative values incompatible with the calculation of probabilities.

Taking into account the data uncertainty (see Chap. 38) is a quasi-impossible task with the analytical approaches but fortunately this is rather simple by using the Monte Carlo simulation and this is applicable to any model (see Chap. 25 for an application with Boolean models), as illustrated in Fig. 32.10.

Fig. 32.10 Principle of uncertainty simulation (labels: correlated parameters A and B, parameter C, previous calculation, system model, output X)

Let us consider an analytical model calculating an output x from input values a, b and c. When a, b and c are deterministic, this calculation is done straightforwardly by applying the analytical model. When a, b and c are the values of independent random variables A, B and C with known probabilistic distributions, the process is the following:
1. Generate random numbers and use the respective distributions of the random variables A, B and C to calculate the input values ai, bi and ci.
2. Calculate the outcome xi as a function of ai, bi and ci by applying the model.
3. Redo steps 1 and 2 many times, until the full distribution of X is obtained.

When the random variables are not independent from one another, this is more difficult and beyond the scope of this book. Nevertheless, the case of correlated parameters illustrated in Fig. 32.10 is easy to handle: a single distribution is used to simulate both a and b and, within a given Monte Carlo history, the same random value is used for both of them. This is useful to model the failure rates of items coming from the same providers which are likely to be of the same quality (i.e. excellent, average, poor) at the same time. This is a kind of common cause failure (lineage CCFs) identified in Chap. 5.

When the calculation of the model itself is based on Monte Carlo simulation, this does not change the principle except that two Monte Carlo simulations have to be combined: one to take the uncertainty into account and one to perform the calculation itself. This is a double Monte Carlo simulation. The simplest way to do that is to perform step 1 to obtain a set of input parameters, to perform a single simulation with these parameters at step 2 and then to go back to step 1. This is very effective but, as each history is achieved with a different set of input parameters, the uncertainty on the result depends both on data uncertainties and simulation uncertainties and it may be difficult to distinguish between them. Fortunately, two ways are available to measure the impact of data uncertainties:
• Perform a sufficient number of simulations in order that the confidence interval converges toward a constant value which reflects the data uncertainties.
• Perform two different Monte Carlo simulations with the same number of histories, one with and the other without data uncertainties, and then compare the confidence intervals: the difference is due to data uncertainties.
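The three steps for an analytical model, including the case of correlated parameters, can be sketched as follows. Everything numerical here is an assumption made for the illustration: the model (items a and b in parallel, in series with c), the lognormal inputs and their error factors. The relations σ = ln(q)/1.64 and μ = ln(m) − σ²/2 are those given for the lognormal law in 32.3.3.1, and Python's `lognormvariate` is used as a stand-in for Formulae 32.8 and 32.9.

```python
import math
import random


def lognormal_parameters(mean, error_factor):
    """mu and sigma of the underlying normal law (relations given in 32.3.3.1)."""
    sigma = math.log(error_factor) / 1.64
    return math.log(mean) - sigma ** 2 / 2.0, sigma


def model(a, b, c):
    """Hypothetical analytical model: items a and b in parallel, in series with c."""
    return 1.0 - (1.0 - a * b) * (1.0 - c)


def propagate(n, rng, correlated=False):
    """Return a sorted sample of the output X."""
    mu_ab, sigma_ab = lognormal_parameters(1.0e-2, 3.0)  # assumed inputs
    mu_c, sigma_c = lognormal_parameters(1.0e-4, 3.0)
    outputs = []
    for _ in range(n):
        a = rng.lognormvariate(mu_ab, sigma_ab)  # step 1
        # Correlated case: the same random value is used for a and b.
        b = a if correlated else rng.lognormvariate(mu_ab, sigma_ab)
        c = rng.lognormvariate(mu_c, sigma_c)
        outputs.append(model(a, b, c))           # step 2
    return sorted(outputs)                       # step 3: sample of X


rng = random.Random(25)
for correlated in (False, True):
    xs = propagate(20_000, rng, correlated)
    mean_x = sum(xs) / len(xs)
    low, high = xs[len(xs) * 5 // 100], xs[len(xs) * 95 // 100]
    print(f"correlated={correlated}: mean={mean_x:.3e}, 5%={low:.3e}, 95%={high:.3e}")
```

With these assumed inputs, using the same random value for a and b gives a noticeably higher mean output than independent draws, which shows why the correlation should not be ignored.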


Therefore, the Monte Carlo simulation is very effective to take the data uncertainties into consideration both on analytical models and when it is already used for the main calculations.

32.6 Parameters Changing When Conditions Change

32.6.1 Introduction and Context

In actual life, it often happens that the occurrence of an event impacts the occurrence of another one (Signoret 1997; Labeau 2001). This is, for example, the case of the probability of failure of a computer which increases due to the temperature rise when its fan fails. This is also the case of the probability of failure of a photocopier belonging to a pool of several machines which increases due to the additional workload when one of them fails. This kind of dependent failures has been described in Sect. 5.1.3.2 and a Markov model proposed in Chap. 31 under the name of semi-catastrophic common cause failure model.

Such situations belong to the framework of dynamic transitions and within a Monte Carlo simulation this implies that the instant of occurrence of an event A is modified when another event B occurs. Mathematically speaking, this means that, before the occurrence of B, the time to occurrence of A is governed by a failure distribution, f(t), which is replaced by another distribution, g(t), after B has occurred. The distribution of A can return to f(t) when the effect of B disappears and this process can be re-iterated several times before event A actually occurs. Several approaches have been proposed to link the distributions, as for example the probabilistic continuity and the temporal continuity approaches, which have been analysed in Dutuit et al. (2016) when f(t) and g(t) are modelled by Weibull laws with different parameters. Furthermore, it has been demonstrated in Dutuit et al. (2018) and in Dutuit et al. (2020) that this can be extended to a more general and unified model which is proposed and analysed hereafter.

32.6.2 Updating Occurrence Dates (Principle)

According to 32.3.1, the unreliability, F(t), of an item is also the distribution of its time to failure (TTF). Therefore, a sample of the times to failure of this item can be generated by using the following principle:
• Fire a random number z from a pseudo random number generator.
• Calculate the time to failure as: $t = F^{-1}(z)$.

This is illustrated graphically in Fig. 32.11 where two random numbers z0 and z1 have been fired at time t = 0 for simulating the times to failure of two similar items with the same TTF distribution. From these two random numbers, item 1 is expected to fail at t0 and item 2 at t1.

Fig. 32.11 Simulation of time to failures in case of change of the failure distribution (left: no change, distribution Wb0 only, failures at t0 and t1; right: change from Wb0 to Wb1 at T1, the failure date t1 being updated to t'1)

On the left-hand side of the figure, there is no distribution change. Only one interval (0) is defined where item 1 fails at t0 and item 2 at t1 during the Monte Carlo simulation. On the right-hand side of the figure, a distribution change occurs at time T1 higher than t0 but lower than t1. Two intervals (0) and (1) are defined and within the Monte Carlo simulation:
• Item 1 having already failed during interval (0) at t0, the change at T1 has no impact on its behaviour;
• Item 2 having not failed during interval (0), it is going to fail in interval (1) and its time to failure has to be updated from t1 to t'1 due to the distribution change.

This can model, for example, the behaviour of an item subject to constraints tighter than normal from time T1. Weibull distributions have been chosen hereafter for analysing the distribution changes in more depth because they allow to:
• Model a great variety of situations;
• Easily update the time to failure in an analytical way;
• Encompass the exponential distributions as a particular case;
• Use a simple spreadsheet (e.g. EXCEL) to achieve the calculations.

Let us use the following notations:

• $Wb_i \equiv Wb_i(\lambda_i, \beta_i, t)$ for the Weibull distribution with parameters $\lambda_i$, $\beta_i$ and t;
• $F_i(t) = 1 - e^{-\lambda_i \cdot t^{\beta_i}}$ for the corresponding CDF (see 32.3.3.1);
• $\Lambda_i(t) = \lambda_i \cdot \beta_i \cdot t^{\beta_i - 1}$ for the corresponding failure rate of this distribution.

When the conditions change at an instant $T_i$, the failure distribution changes from $Wb_{i-1}$ to $Wb_i$ and, as the resulting failure distribution, F(t), is continuous, the origin of $Wb_i$ has to be adjusted in order that $Wb_{i-1}$ and $Wb_i$ have the same value at the instant $T_i$. This leads to change the system of reference for drafting $Wb_i$, as illustrated in black for interval (i) in Fig. 32.12. In this example, the origin of $Wb_i$ is equal to $T_i - \Delta_i$ and the current time is equal to $\rho_i = t - T_i + \Delta_i$ in this system of reference. This implies that, in this system of reference, the interval $[T_i, T_{i+1}]$ becomes $[\Delta_i, \omega_i]$ where $\omega_i = T_{i+1} - T_i + \Delta_i$. It has to be noted that the value of $\Delta_i$ depends on the assumptions adopted to link the distributions and this will be analysed in more detail hereafter for several assumptions.

Fig. 32.12 Identification of the relevant system of reference for Wbi in interval (i) (systems of reference Ref0 and Refi, distributions Wbi−1 and Wbi, current time)

Figure 32.11 illustrates an example with one change but a more interesting example, where the conditions change at T1 and come back to the normal situation at T2, is analysed in the remaining part of the chapter (see Fig. 32.13). Then, the item failures are governed by:
• $Wb_0 \equiv Wb_0(\lambda_1, \beta_1, t)$, i.e. $F_0(t) = 1 - \exp(-\lambda_1 \cdot t^{\beta_1})$ within the interval (0), $[0, T_1]$;
• $Wb_1 \equiv Wb_1(\lambda_2, \beta_2, \rho_1)$, i.e. $F_1(t) = 1 - \exp(-\lambda_2 \cdot \rho_1^{\beta_2})$ within the interval (1), $[T_1, T_2]$;
• $Wb_2 \equiv Wb_2(\lambda_1, \beta_1, \rho_2)$, i.e. $F_2(t) = 1 - \exp(-\lambda_1 \cdot \rho_2^{\beta_1})$ within the interval (2), $[T_2, \infty[$.

Fig. 32.13 TTF distribution: probabilistic continuity principle

32.6.3 Various Approaches to Manage the Distribution Changes

32.6.3.1 Probabilistic Continuity

In the probabilistic continuity approach the assumption is that, when the change occurs, the values of the CDFs of the distributions $Wb_0$ and $Wb_1$ remain equal. For a change at $T_1$ this leads to the following equation where $\Delta_1$ is the unknown value:

$$1 - \exp(-\lambda_1 \cdot T_1^{\beta_1}) = 1 - \exp(-\lambda_2 \cdot \Delta_1^{\beta_2}) \tag{32.19}$$

Then $\Delta_1 = \sqrt[\beta_2]{\frac{\lambda_1}{\lambda_2} \cdot T_1^{\beta_1}}$ and defines the origin of $Wb_1$ at $T_1 - \Delta_1$. This implies that, within interval (1), the distribution is $F(t) = Wb_1(\lambda_2, \beta_2, \rho_1)$ with $\rho_1 = t - (T_1 - \Delta_1)$. At the time of the second change, the current time in interval (1) is equal to $T_2 - (T_1 - \Delta_1)$. Then, similarly as above, the link between the distributions $Wb_1$ and $Wb_2$ at $T_2$ can be done according to the following equation:

$$1 - \exp[-\lambda_2 \cdot [T_2 - (T_1 - \Delta_1)]^{\beta_2}] = 1 - \exp(-\lambda_1 \cdot \Delta_2^{\beta_1}) \tag{32.20}$$

This leads to the calculation of $\Delta_2$ and, over $[T_2, \infty[$, the TTF distribution can be calculated as $F(t) = Wb_2(\lambda_1, \beta_1, \rho_2)$ with $\rho_2 = t - (T_2 - \Delta_2)$. This principle is illustrated in Fig. 32.13 where the calculations have been performed with the following parameters: $\lambda_1 = 10^{-4}\ \mathrm{h}^{-1}$ and $\beta_1 = 3$ and $\lambda_2 = 3 \cdot 10^{-4}\ \mathrm{h}^{-1}$ and $\beta_2 = 3$. Parameter $\lambda_2$ has been chosen equal to three times parameter $\lambda_1$ to model tighter conditions within the interval $[T_1, T_2]$.

Note: the system of reference Ref0 and the system of reference Ref1 have been identified on this figure but the system of reference Ref2 has been omitted to keep the figure as simple as possible.
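Under the probabilistic continuity assumption, the update of a failure date can be sketched as below: the same random number z is kept and the date is recalculated in the system of reference of the current interval. The values of λ and β are those quoted for Fig. 32.13, whereas T1 = 10 h and T2 = 15 h, as well as the function names, are assumptions made for the illustration.

```python
import math


def weibull_inverse(z, lam, beta):
    """Inverse of F(t) = 1 - exp(-lam * t**beta)."""
    return (-math.log(1.0 - z) / lam) ** (1.0 / beta)


def link_origin(lam_prev, beta_prev, rho_prev, lam_next, beta_next):
    """Solve Formulae 32.19 / 32.20:
    lam_prev * rho_prev**beta_prev = lam_next * delta**beta_next."""
    return (lam_prev / lam_next * rho_prev ** beta_prev) ** (1.0 / beta_next)


def failure_date(z, lam1, beta1, lam2, beta2, t_1, t_2):
    """Failure date for a random number z when Wb0 applies on [0, T1],
    Wb1 on [T1, T2] and Wb2 (same parameters as Wb0) after T2."""
    t = weibull_inverse(z, lam1, beta1)
    if t <= t_1:
        return t
    delta_1 = link_origin(lam1, beta1, t_1, lam2, beta2)
    t = t_1 - delta_1 + weibull_inverse(z, lam2, beta2)  # updated date t'1
    if t <= t_2:
        return t
    delta_2 = link_origin(lam2, beta2, t_2 - (t_1 - delta_1), lam1, beta1)
    return t_2 - delta_2 + weibull_inverse(z, lam1, beta1)


# lambda and beta from the text; T1 = 10 h and T2 = 15 h are assumptions.
LAM1, BETA1, LAM2, BETA2, T1, T2 = 1.0e-4, 3.0, 3.0e-4, 3.0, 10.0, 15.0
for z in (0.05, 0.30, 0.90):
    print(z, round(failure_date(z, LAM1, BETA1, LAM2, BETA2, T1, T2), 3))
```

For z = 0.30, the failure initially expected after T1 under Wb0 is brought forward by the tighter conditions, which is the update from t1 to t'1 described in 32.6.2.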

The failure rate corresponding to the above failure distribution is illustrated in Fig. 32.14. It is not continuous: it jumps from $\Lambda(T_1^-) = \lambda_1 \beta_1 \cdot T_1^{\beta_1 - 1}$ to $\Lambda(T_1^+) = \lambda_2 \beta_2 \cdot \Delta_1^{\beta_2 - 1}$ when the conditions change at instant $T_1$ and another jump occurs when the normal conditions come back at $T_2$. It has to be noted that it does not come back to the value without distribution changes. Even if this is a classical approach, with regard to the failure rate behaviour, it seems difficult to find a physical example corresponding to the probabilistic continuity model.

Fig. 32.14 Failure rate: probabilistic continuity principle

32.6.3.2 Temporal Continuity

Another classical approach to link the distributions is the temporal continuity approach which is illustrated in Fig. 32.15. In this case, the assumption is that the origins (t = 0) of the Weibull laws $Wb_0$, $Wb_1$ and $Wb_2$ do not change when conditions change at $T_1$ or $T_2$. Therefore, even if the tighter conditions have an impact between $T_1$ and $T_2$, the failure rate comes back to its normal value at $T_2$ exactly as if no event had happened at $T_1$. This leads to the following failure rate jumps:
• From $\Lambda_0(T_1) = \lambda_1 \cdot \beta_1 \cdot T_1^{\beta_1 - 1}$ to $\Lambda_1(T_1) = \lambda_2 \cdot \beta_2 \cdot T_1^{\beta_2 - 1}$ at $T_1$;
• From $\Lambda_1(T_2) = \lambda_2 \cdot \beta_2 \cdot T_2^{\beta_2 - 1}$ to $\Lambda_2(T_2) = \lambda_1 \cdot \beta_1 \cdot T_2^{\beta_1 - 1}$ at $T_2$.

Fig. 32.15 Failure rate: temporal continuity principle

A physical example for the temporal continuity approach may be a system made of two different components C1 and C2 where C1 is replaced by C2 over the interval $[T_1, T_2]$ but without impact on the probability of failure of C1. From the failure rate above, the corresponding TTF distribution, F(t), can be obtained from the general formulation of the item unreliability:

$$F(t) = 1 - R(t) = 1 - \exp\left(-\int_0^t \Lambda(\tau)\, d\tau\right) \tag{32.21}$$

As Weibull laws are considered, the integrals can be easily calculated and the results are illustrated in Fig. 32.16. The envelopes $Wb_0(\lambda_1, \beta_1, t)$ and $Wb_1(\lambda_2, \beta_2, t)$ have been drafted in dotted lines in this figure.

Fig. 32.16 TTF distribution: temporal continuity principle

The comparison between the probabilistic continuity and the temporal continuity approaches points out that they both belong to a more general approach where the TTF distribution is continuous and where the failure rate jumps from one value to another when the conditions change.
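For the temporal continuity approach, Formula 32.21 can be evaluated piecewise because the integral of a Weibull failure rate λ·β·t^(β−1) between two instants is simply the difference of λ·t^β at these instants. The sketch below follows this reasoning; T1 = 10 h and T2 = 15 h are assumptions and the function names are hypothetical.

```python
import math


def cumulative_hazard(t, lam1, beta1, lam2, beta2, t_1, t_2):
    """Integral of the failure rate from 0 to t when Wb1 applies on [T1, T2]
    and Wb0 elsewhere, all the laws keeping their origin at t = 0."""
    h = lam1 * min(t, t_1) ** beta1
    if t > t_1:
        h += lam2 * (min(t, t_2) ** beta2 - t_1 ** beta2)
    if t > t_2:
        h += lam1 * (t ** beta1 - t_2 ** beta1)
    return h


def unreliability(t, *params):
    """Formula 32.21."""
    return 1.0 - math.exp(-cumulative_hazard(t, *params))


# lambda and beta from 32.6.3.1; T1 = 10 h and T2 = 15 h are assumptions.
PARAMS = (1.0e-4, 3.0, 3.0e-4, 3.0, 10.0, 15.0)
for t in (5.0, 10.0, 12.5, 15.0, 20.0):
    print(t, round(unreliability(t, *PARAMS), 4))
```

The resulting F(t) is continuous at T1 and T2 even though the failure rate jumps at these instants.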

32.6.3.3

Failure Rate Continuity and Aging Modelling

In both the probabilistic continuity and the temporal continuity approaches, the failure rate jumps from one value to another when the conditions change, as if the condition change subjected the related item to a kind of shock. When constant failure rates are involved, this is the only way to model the changes but, when non-constant failure rates are involved, smoother solutions without jumps may be analysed.

Let us consider the example of a pool of two photocopiers PC1 and PC2 of different capacities, PC1 having twice the capacity of PC2. When PC1 fails, the workload of PC2 is multiplied by three and this continues until PC1 is repaired. Reciprocally, when PC2 fails, the workload of PC1 is increased by 50% (i.e. multiplied by 1.5). In this case, when a photocopier fails, there is no reason why the failure rate of the other copier would change instantaneously but, as its workload increases, its wear and hence its aging speed certainly increase. This leads to a new approach, where the parameters of the failure distribution are modified to model the workload change while keeping the failure rate continuous. This is the continuous failure rate approach illustrated in Fig. 32.17. Therefore $\Lambda_0(T_1) = \Lambda_1(\varpi_1)$ when the conditions change at $T_1$, where $\varpi_1$ is the position of $T_1$ in the system of reference of $Wb_1$, and the link between $Wb_0$ and $Wb_1$ is made by the following equation:

$$\lambda_1 \cdot \beta_1 \cdot T_1^{\beta_1 - 1} = \lambda_2 \cdot \beta_2 \cdot \varpi_1^{\beta_2 - 1} \tag{32.22}$$

Fig. 32.17 Failure rate: failure rate continuity and aging principle

Fig. 32.18 TTF distribution: failure rate continuity principle

This allows the offset $\varpi_1$ to be calculated, which in turn gives the origin of $Wb_1$ at $T_1 - \varpi_1$, as already explained in Sect. 32.6.3.1. More precisely, Fig. 32.17 models the evolution of the failure rate of the photocopier PC2 when PC1 is faulty over $[T_1, T_2]$. During this interval, the workload is multiplied by three and, therefore, considering that the wear, and hence the aging, of PC2 is multiplied by three seems a realistic assumption. Mathematically speaking, this means that a duration $\rho_1$ spent in this interval is counted as an age $3 \cdot \rho_1$ in the failure rate related to $Wb_1$. More generally, if $v$ is the aging coefficient, the failure rate during $[T_1, T_2]$ is calculated as:

$$\Lambda_1(\rho_1) = \lambda_1 \cdot \beta_1 \cdot (v \cdot \rho_1)^{\beta_1 - 1} \tag{32.23}$$

This is equal to $\lambda_1 \cdot v^{\beta_1 - 1} \cdot \beta_1 \cdot \rho_1^{\beta_1 - 1}$, which gives the parameters of $Wb_1$: $\lambda_2 = \lambda_1 \cdot v^{\beta_1 - 1}$ and $\beta_2 = \beta_1$.

The TTF distribution corresponding to the failure rate illustrated in Fig. 32.17 is represented in Fig. 32.18. In this figure, the envelopes $Wb_0(\lambda_1, \beta_1, t)$ and $Wb_1(\lambda_2, \beta_2, t)$ are drawn in dotted lines and the resulting TTF distribution evolves between these two envelopes. The increasing slope of $\Lambda_1(t)$ between $T_1$ and $T_2$ seems a rather realistic model of the increasing workload of PC2 when PC1 fails.
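The aging relation of Formula 32.23 and the failure rate continuity condition of Formula 32.22 can be checked numerically. The following is a minimal sketch under stated assumptions: the function names and the numerical values are hypothetical, chosen only for illustration, and a shape parameter different from 1 is assumed.

```python
import math


def weibull_rate(lam, beta, age):
    """Weibull failure rate lam * beta * age**(beta - 1)."""
    return lam * beta * age ** (beta - 1)


def aged_weibull(lam1, beta1, v):
    """Parameters of Wb1 when the aging speed is multiplied by v (from Formula 32.23)."""
    return lam1 * v ** (beta1 - 1), beta1


def offset_for_rate_continuity(lam1, beta1, lam2, beta2, t1):
    """Solve Formula 32.22 for the position of T1 in the reference of Wb1 (beta2 != 1)."""
    return (lam1 * beta1 * t1 ** (beta1 - 1) / (lam2 * beta2)) ** (1.0 / (beta2 - 1))


# Illustrative values only (not taken from the text).
lam1, beta1, v, t1 = 1e-4, 3.0, 3.0, 10.0
lam2, beta2 = aged_weibull(lam1, beta1, v)
offset = offset_for_rate_continuity(lam1, beta1, lam2, beta2, t1)

# The failure rate is the same just before and just after the change.
rate_before = weibull_rate(lam1, beta1, t1)
rate_after = weibull_rate(lam2, beta2, offset)
```

With equal shape parameters, the offset found this way reduces to `t1 / v`: the age already accumulated is simply rescaled by the aging coefficient.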

32.6.4 General Approach to Update Failure Dates

32.6.4.1 Principle for Updating Failure Dates on the Fly

An overview of the updating process has been given in the previous subsections but it is necessary to go into more details to describe the overall approach when the item experiences many failure distribution changes before it actually fails. This is illustrated in Fig. 32.19 which presents a general sequence of condition changes.

Fig. 32.19 Example of sequence of condition changes (changes of laws at $T_0 = 0, T_1, T_2, \ldots, T_i, T_{i+1}$, delimiting intervals $0, 1, 2, \ldots, i$)

According to this figure, an item failing in interval $(i)$ has survived until $T_i$ and experienced $i$ condition changes. When the dates of condition changes are known in advance, the simulation of the date of failure is illustrated in Fig. 32.20. The failure distribution $F(t)$ is completely defined at $t = 0$ and simulating the instant of failure, $t_f$, is done just by firing a random number $z$ and calculating $t_f = F^{-1}(z)$. In each of the intervals, the failure distribution $F_i(\rho_i)$ is defined with regard to its own system of reference, Refi (see Figs. 32.21, 32.22, 32.23 and 32.24). Then the failure distribution $F(t)$ with regard to Ref0 (i.e. since $t = T_0 = 0$) is equal to $F_i(\rho_i)$ with regard to Refi. In addition, the distribution is continuous when the conditions change: if $\varpi_i$ (in Refi) corresponds to $T_i$ (in Ref0), then $F(T_i)$ is equal to $F_i(\varpi_i)$. Parameter $\varpi_i$ depends on the $F_i(\rho_i)$ parameters and on the assumptions about the linkage between the distributions.

When the exact dates of condition changes are not known in advance, the failure distribution is not completely known at $t = 0$. As illustrated in Fig. 32.21, at the instant $t = T_0 = 0$, only $F(t) \equiv F_0(t)$ is known (with $\rho_0 = t$ and $\varpi_0 = 0$). The instant of failure can be calculated as $t_{f,0} = F_0^{-1}(z)$ and, if no change occurs before $t_{f,0}$ (i.e. $t_{f,0} < T_1$), the failure will actually occur at this time.

Fig. 32.20 Example of failure distribution with condition changes

Fig. 32.21 Simulation of the time of failure when no change has occurred yet

Fig. 32.22 Simulation of the time of failure when one change has occurred

Fig. 32.23 Simulation of the time of failure when two changes have occurred

Fig. 32.24 Simulation of the time of failure when three changes have occurred


If the conditions change at $T_1 < t_{f,0}$, the parameters of the distribution change and $F_0(t)$ is replaced by $F_1(\rho_1)$ defined in the system of reference Ref1. In interval $i$, $\rho_i = t - T_i + \varpi_i$, where $\varpi_i$ is the position of $T_i$ in Refi, and $\theta_i$ denotes the instant of failure in Refi. As illustrated in Fig. 32.22, the instant of failure has to be updated from $t_{f,0}$ to $t_{f,1} = T_1 - \varpi_1 + \theta_1$ with regard to the system of reference Ref0. If the conditions change at $T_2 < t_{f,1}$, the parameters of the distribution change and $F_1(\rho_1)$ is replaced by $F_2(\rho_2)$ defined in the system of reference Ref2. As illustrated in Fig. 32.23, the instant of failure has to be updated from $t_{f,1}$ to $t_{f,2} = T_2 - \varpi_2 + \theta_2$ with regard to Ref0. Again, if the conditions change at $T_3 < t_{f,2}$, the parameters of the distribution change and $F_2(\rho_2)$ is replaced by $F_3(\rho_3)$ defined in the system of reference Ref3. As illustrated in Fig. 32.24, the instant of failure has to be updated from $t_{f,2}$ to $t_{f,3} = T_3 - \varpi_3 + \theta_3$ with regard to Ref0. If no other change occurs before $t_{f,3}$ (i.e. $t_{f,3} < T_4$), then the failure of the modelled item is actually simulated at $t_f = t_{f,3}$, i.e. at the same instant as the one calculated in Fig. 32.20 where the instants of condition changes were known in advance.

Therefore, the above process allows the actual time of the item failure to be updated on the fly (i.e. according to what happens during the Monte Carlo simulation). The date of failure is simulated exactly as if the instants of condition changes were known in advance. This is very economical in terms of Monte Carlo simulation time, as only one random number $z$ is used for the whole process.

32.6.4.2 Principle for Calculating Failure Dates

Simulating the instant of failure, $t_f$, can be done just by firing a uniformly distributed random number $z$ and calculating $t_f$ such that $z = F(t_f)$, which is equivalent to $1 - z = R(t_f)$, where $R(t_f)$ is the item reliability at time $t_f$. The random number $z$ being uniformly distributed over [0, 1], $(1 - z)$ is uniformly distributed as well. Then $R(t)$ can be used instead of $F(t)$ to generate times to failure:

$$t_f = R^{-1}(z) \tag{32.24}$$

The reliability $R(t)$ being easier to handle, it is used instead of $F(t)$ hereafter in this subsection. $R(t)$ is linked to the failure rate $\Lambda(\tau)$ by the fundamental relationship $R(t) = \exp\left[-\int_0^t \Lambda(\tau)\,d\tau\right]$ and this leads to:

$$\ln(z) = -\int_0^{t_f} \Lambda(\tau)\,d\tau \tag{32.25}$$

For $i$ condition changes, $R(t)$ can be written:

$$R(t) = \exp\left[-\left(\int_0^{T_1} \Lambda(\tau)\,d\tau + \int_{T_1}^{T_2} \Lambda(\tau)\,d\tau + \ldots + \int_{T_{i-1}}^{T_i} \Lambda(\tau)\,d\tau + \int_{T_i}^{t} \Lambda(\tau)\,d\tau\right)\right]$$


In each interval, the failure rate depends on specific parameters which, in turn, depend on the type of failure distribution used to model the time to failure. This leads to:

$$R(t) = R(T_i) \cdot R_i(\rho_i) \tag{32.26}$$

where

$$R_i(\rho_i) = \exp\left[-\int_{\varpi_i}^{\rho_i} \Lambda_i(\tau)\,d\tau\right] \tag{32.27}$$

and where $\varpi_i$ is the position of $T_i$ in the system of reference Refi. Noting

$$R_k = \exp\left[-\int_{T_k}^{T_{k+1}} \Lambda(\tau)\,d\tau\right] = \exp\left[-\int_{\varpi_k}^{\omega_k} \Lambda_k(\tau)\,d\tau\right]$$

leads to:

$$R(T_i) = R_0 \cdot R_1 \cdot \ldots \cdot R_{i-1} \tag{32.28}$$

Formula 32.28 provides a recurrence to compute $R(T_i)$ on the fly each time the conditions change:

$$R(T_i) = R(T_{i-1}) \cdot R_{i-1} \tag{32.29}$$

If $Q_i(\tau)$ is one primitive of $\Lambda_i(\tau)$, then $R_i(\rho_i)$ can be calculated as:

$$R_i(\rho_i) = \exp\left\{-\left[Q_i(\rho_i) - Q_i(\varpi_i)\right]\right\} \tag{32.30}$$

And finally:

$$R(t) = R(T_i) \cdot \frac{\exp\left[-Q_i(\rho_i)\right]}{\exp\left[-Q_i(\varpi_i)\right]} \tag{32.31}$$

Due to the continuity of the distribution, $R(T_i) = \exp\left[-Q_i(\varpi_i)\right]$ when the conditions change at $T_i$, then:

$$R(t) = R_i(\rho_i) = \exp\left[-Q_i(\rho_i)\right] \tag{32.32}$$

And finally, for $z = \exp\left[-Q_i(\theta_i)\right]$, the instant of failure can be found with regard to the system of reference Refi by using the following formula:

$$\ln(z) = -Q_i(\theta_i) \tag{32.33}$$

Formula 32.33 provides the key for updating the failure date when the conditions change, but it is tractable only if $Q_i(\theta_i)$ is easily invertible. As $t_{f,i} - T_i + \varpi_i = \theta_i$, the time of failure of the simulated item with regard to the system of reference Ref0 is found equal to:

$$t_{f,i} = T_i + \theta_i - \varpi_i \tag{32.34}$$
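The product form of the reliability (Formulae 32.26 to 32.30) can be sketched as follows. This is only an illustration under stated assumptions: the function `reliability`, its argument names and the two primitives used in the example are hypothetical, and the offsets are supplied as plain inputs rather than derived from a linking rule.

```python
import math


def reliability(t, change_dates, offsets, primitives):
    """Reliability R(t) in Ref0 built from Formulae 32.26 to 32.30.

    change_dates -- [T_0 = 0, T_1, ..., T_i], dates of condition changes in Ref0
    offsets      -- position of each T_k in its own system of reference Ref_k
    primitives   -- callables Q_k, one primitive of the failure rate of interval k
    """
    # Index of the interval that contains t.
    i = max(k for k, t_k in enumerate(change_dates) if t_k <= t)

    r_at_change = 1.0  # R(T_0) = 1
    for k in range(i):
        # Position of T_{k+1} in Ref_k (end of interval k).
        omega_k = change_dates[k + 1] - change_dates[k] + offsets[k]
        # R(T_{k+1}) = R(T_k) * R_k, Formula 32.29.
        r_at_change *= math.exp(-(primitives[k](omega_k) - primitives[k](offsets[k])))

    rho_i = t - change_dates[i] + offsets[i]
    r_i = math.exp(-(primitives[i](rho_i) - primitives[i](offsets[i])))  # Formula 32.30
    return r_at_change * r_i  # Formula 32.26


# Illustrative values only: two Weibull-type primitives Q(x) = lam * x**beta.
q0 = lambda x: 1e-4 * x ** 3
q1 = lambda x: 9e-4 * x ** 3
r_15 = reliability(15.0, [0.0, 10.0], [0.0, 4.0], [q0, q1])
```

Only the running product has to be stored between two condition changes, which is what makes the recurrence of Formula 32.29 convenient during a simulation.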


32.6.5 Generalities About the Application to Weibull Distributions

In order to apply the results obtained in Sects. 32.6.4.1 and 32.6.4.2, it is necessary to analyse how the distributions $F_{i-1}(\rho_{i-1})$ and $F_i(\rho_i)$ are linked together when a condition change occurs at $T_i$. This is easily tractable when invertible analytical formulae of the distributions are available. The aim of this subsection is to analyse the linkage problem when Weibull distributions are involved.

32.6.5.1 System of Reference Determination After the ith Change

In the general case, the jump, $Jp_i$, of the failure rate at time $T_i$ and the parameters of the Weibull laws $Wb_i$ (i.e. $\lambda_i$ and $\beta_i$) and $Wb_{i-1}$ (i.e. $\lambda_{i-1}$ and $\beta_{i-1}$) have to be considered. The value of the jump and of the parameters are inputs of the model and lead to the following equation, where $\varpi_i$ is the position of $T_i$ in Refi and $\omega_{i-1}$ its position in Refi-1:

$$\lambda_i \cdot \beta_i \cdot \varpi_i^{\beta_i - 1} = Jp_i \cdot \lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1} \tag{32.35}$$

This allows $\varpi_i$ to be calculated as:

$$\varpi_i = \left[\frac{Jp_i \cdot \lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1}}{\lambda_i \cdot \beta_i}\right]^{\frac{1}{\beta_i - 1}} \tag{32.36}$$

The origin of the system of reference Refi of $Wb_i(\lambda_i, \beta_i, \rho_i)$ can be determined at $T_i - \varpi_i$ and this allows the calculations proposed in Sect. 32.6.4 to be completed with regard to the linkage of the distributions at $T_i$. In the case of the exponential law ($\beta_i = 1$), this formula is undetermined but this, fortunately, does not matter as, due to the absence of memory of this law, $\varpi_i$ is equal to zero in this case.

32.6.5.2 Failure Rate Linking: Unification of the Approaches

As explained above and illustrated in Fig. 32.12, the relevant system of coordinates for the Weibull law during interval $(i)$ is such that:
• The current time within the interval is given by $\rho_i = t - T_i + \varpi_i$;
• The item failure during this interval is governed by $Wb_i(\lambda_i, \beta_i, \rho_i)$;
• $F(T_i^+)$ is equal to $F_i(\rho_i = \varpi_i)$ at the beginning of the interval;
• $F(T_{i+1}^-)$ is equal to $F_i(\rho_i = \omega_i)$ with $\omega_i = T_{i+1} - T_i + \varpi_i$ at the end of the interval.

The value of $\varpi_i$ depends only on the assumptions adopted to link the distributions together or, equivalently, on the way the failure rate $\Lambda(T_i^-)$, just before $T_i$, is linked to the failure rate $\Lambda(T_i^+)$, just after $T_i$.

Fig. 32.25 General evolution of the failure rate when conditions change (failure rate versus time: $Wb_{i-1}$ before the change, jump $Jp_i$ at the change, then $Wb_i$)

The single model illustrated in Fig. 32.25 unifies the three approaches analysed above with regard to the condition changes:
• The failure rate is immediately multiplied by $Jp_i$: $\Lambda(T_i^+) = Jp_i \cdot \Lambda(T_i^-)$;
• After the jump, the failure rate evolves according to a Weibull law $Wb_i(\lambda_i, \beta_i, \rho_i)$ with $\rho_i = t - T_i + \varpi_i$, where $\varpi_i$ is such that $\Lambda(T_i^+) = \Lambda_i(\varpi_i) = \lambda_i \cdot \beta_i \cdot \varpi_i^{\beta_i - 1}$.

This general model encompasses, in particular, all the approaches described above without jumps ($Jp_i = 1$) or with jumps ($Jp_i \neq 1$), including the case $Jp_i = 0$ where the item failure is suspended (e.g. disabled by an external event) during the interval $[T_i, T_{i+1}]$.

32.6.5.3 Failure Date Updating

As said above, when a Weibull distribution is involved, $\Lambda_i(t) = \lambda_i \cdot \beta_i \cdot t^{\beta_i - 1}$. Applying the results developed in Sect. 32.6.4.2 gives, with regard to the system of reference Refi:

$$Q_i(\rho_i) = \int_0^{\rho_i} \Lambda_i(\tau)\,d\tau = \lambda_i \cdot \rho_i^{\beta_i}$$

Applying Formula 32.33 gives $\ln(z) = -\lambda_i \cdot \theta_i^{\beta_i}$, which provides the instant of failure $\theta_i$ (with regard to the system of reference Refi) as:

$$\theta_i = \left[-\frac{1}{\lambda_i} \cdot \ln(z)\right]^{\frac{1}{\beta_i}} \tag{32.37}$$

As $\theta_i = t_{f,i} - T_i + \varpi_i$, where $\varpi_i$ is the position of $T_i$ in Refi, the instant of failure in the system of reference Ref0 is equal to $t_{f,i} = T_i - \varpi_i + \theta_i$. Therefore, the date of failure after the ith change can be updated from $t_{f,i-1} > T_i$ to $t_{f,i}$ (i.e. $\theta_i$ in the system of reference Refi), as illustrated in Fig. 32.26.

Fig. 32.26 Specific system of reference (Refi) for $Wb_i(\lambda_i, \beta_i, \rho_i)$

If no other condition change occurs after Ti , then the failure is simulated at tf = tf ,i .
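The updating process can be sketched as follows for Weibull laws. This is a minimal illustration, not the implementation used by any particular tool: the function names are hypothetical, a single random number `z` is reused at each update as described in Sect. 32.6.4.1, and the position of each change date in the new system of reference is assumed to be chosen so that the TTF distribution stays continuous at the change (the hypothesis under which Formula 32.32, and hence Formula 32.37, holds).

```python
import math


def theta_weibull(z, lam, beta):
    """Formula 32.37: failure instant in Ref_i such that z = exp(-lam * theta**beta)."""
    return (-math.log(z) / lam) ** (1.0 / beta)


def offset_probabilistic(lam_prev, beta_prev, omega_prev, lam, beta):
    """Position of T_i in Ref_i keeping the TTF distribution continuous at T_i."""
    return (lam_prev * omega_prev ** beta_prev / lam) ** (1.0 / beta)


def failure_date(z, laws, change_dates):
    """Failure date in Ref0, updated at each condition change met before failure.

    laws         -- [(lam_0, beta_0), (lam_1, beta_1), ...], one Weibull law per interval
    change_dates -- [T_1, T_2, ...], discovered as the simulation proceeds
    """
    lam, beta = laws[0]
    t_start, offset = 0.0, 0.0                # T_0 = 0 and its position in Ref_0
    t_f = theta_weibull(z, lam, beta)         # t_f,0
    for (lam_new, beta_new), t_change in zip(laws[1:], change_dates):
        if t_f <= t_change:                   # the failure occurs before this change
            break
        omega = t_change - t_start + offset   # position of the change in the previous Ref
        offset = offset_probabilistic(lam, beta, omega, lam_new, beta_new)
        lam, beta, t_start = lam_new, beta_new, t_change
        t_f = t_start - offset + theta_weibull(z, lam, beta)  # Formula 32.34
    return t_f


# Illustrative values only: harsher conditions between T1 = 10 and T2 = 15.
laws = [(1e-4, 3.0), (9e-4, 3.0), (1e-4, 3.0)]
t_f = failure_date(0.3, laws, [10.0, 15.0])
```

If the law does not actually change at a change date, the offset equals the elapsed time and the failure date is left untouched, which is a convenient sanity check for this kind of routine.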

32.6.6 Detailed Application to Weibull Distributions

In Formula 32.36 above, the position $\varpi_i$ of $T_i$ in Refi depends on the $i$ changes which have occurred before $T_i$ during the Monte Carlo simulation. However, its calculation also depends on the assumptions adopted to link the distributions, which are discussed in this section.

32.6.6.1 Failure Rate Continuity

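This is a particular case with $Jp_i = 1$. This gives:

$$\varpi_i = \left[\frac{\lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1}}{\lambda_i \cdot \beta_i}\right]^{\frac{1}{\beta_i - 1}} \tag{32.38}$$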

This case exists only when two Weibull laws are linked together or when a Weibull law is linked to an exponential law or vice versa. It does not exist when two exponential laws are linked together.

32.6.6.2 Failure Rate Continuity with Aging

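In this model, the Weibull law $Wb_0(\lambda_0, \beta_0, t)$ is used as reference and the aging parameter $v_i$ changes from one interval to the next. Then Weibull laws $Wb_i(\lambda_0 \cdot v_i^{\beta_0 - 1}, \beta_0, \rho_i)$ are used in interval $i$, where the aging has been multiplied by $v_i$, and $\varpi_i$ can be calculated from Formula 32.38:

$$\varpi_i = \left[\frac{\lambda_0 \cdot v_{i-1}^{\beta_0 - 1} \cdot \beta_0 \cdot \omega_{i-1}^{\beta_0 - 1}}{\lambda_0 \cdot v_i^{\beta_0 - 1} \cdot \beta_0}\right]^{\frac{1}{\beta_0 - 1}}$$

and then:

$$\varpi_i = \left[\frac{v_{i-1}^{\beta_0 - 1} \cdot \omega_{i-1}^{\beta_0 - 1}}{v_i^{\beta_0 - 1}}\right]^{\frac{1}{\beta_0 - 1}} \tag{32.39}$$

which simplifies to $\varpi_i = v_{i-1} \cdot \omega_{i-1} / v_i$.

32.6.6.3 Temporal Continuity

In the case of temporal continuity, $\varpi_i$ is an input as $\varpi_i = T_i$. The corresponding jump $Jp_i$ can be obtained from Formula 32.35:

$$Jp_i = \frac{\lambda_i \cdot \beta_i \cdot T_i^{\beta_i - 1}}{\lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1}}$$

As $\omega_{i-1} = \varpi_{i-1} + T_i - T_{i-1}$ and as $\varpi_{i-1} = T_{i-1}$, parameter $\omega_{i-1} = T_i$ and, finally:

$$Jp_i = \frac{\lambda_i \cdot \beta_i}{\lambda_{i-1} \cdot \beta_{i-1}} \cdot T_i^{\beta_i - \beta_{i-1}} \tag{32.40}$$

32.6.6.4 Probabilistic Continuity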

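In this case, the continuity of the TTF distribution leads to $R_{i-1}(\omega_{i-1}) = R_i(\varpi_i)$, where $\varpi_i$ is the position of $T_i$ in Refi. Using Weibull laws gives:

$$\exp\left(-\lambda_{i-1} \cdot \omega_{i-1}^{\beta_{i-1}}\right) = \exp\left(-\lambda_i \cdot \varpi_i^{\beta_i}\right)$$

Then $\lambda_{i-1} \cdot \omega_{i-1}^{\beta_{i-1}} = \lambda_i \cdot \varpi_i^{\beta_i}$ and, finally:

$$\varpi_i = \left[\frac{\lambda_{i-1} \cdot \omega_{i-1}^{\beta_{i-1}}}{\lambda_i}\right]^{\frac{1}{\beta_i}} \tag{32.41}$$

On the other hand:

$$\Lambda(T_i^-) = \lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1} \qquad \Lambda(T_i^+) = \lambda_i \cdot \beta_i \cdot \varpi_i^{\beta_i - 1}$$

And this provides the resulting failure rate jump:

$$Jp_i = \frac{\lambda_i \cdot \beta_i \cdot \varpi_i^{\beta_i - 1}}{\lambda_{i-1} \cdot \beta_{i-1} \cdot \omega_{i-1}^{\beta_{i-1} - 1}} \tag{32.42}$$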
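The linking rules of this section can be gathered in a few small functions, one per formula. This is a sketch under stated assumptions: all function and argument names are hypothetical, `omega_prev` stands for the position of the change date in the previous system of reference, and the value returned for an exponential law simply follows the convention stated in Sect. 32.6.5.1.

```python
def weibull_rate(lam, beta, age):
    """Weibull failure rate lam * beta * age**(beta - 1)."""
    return lam * beta * age ** (beta - 1)


def offset_with_jump(jump, lam_prev, beta_prev, omega_prev, lam, beta):
    """Formula 32.36: position of T_i in Ref_i for a given failure rate jump."""
    if beta == 1:
        return 0.0  # exponential law: no memory, offset taken equal to zero
    rate_before = weibull_rate(lam_prev, beta_prev, omega_prev)
    return (jump * rate_before / (lam * beta)) ** (1.0 / (beta - 1))


def offset_rate_continuity(lam_prev, beta_prev, omega_prev, lam, beta):
    """Formula 32.38: failure rate continuity (jump equal to 1)."""
    return offset_with_jump(1.0, lam_prev, beta_prev, omega_prev, lam, beta)


def offset_aging(v_prev, omega_prev, v):
    """Formula 32.39 in simplified form: only the aging coefficient changes."""
    return v_prev * omega_prev / v


def jump_temporal_continuity(lam_prev, beta_prev, lam, beta, t_i):
    """Formula 32.40: jump obtained when the offset is imposed equal to T_i."""
    return (lam * beta) / (lam_prev * beta_prev) * t_i ** (beta - beta_prev)


def offset_probabilistic_continuity(lam_prev, beta_prev, omega_prev, lam, beta):
    """Formula 32.41: continuity of the TTF distribution at T_i."""
    return (lam_prev * omega_prev ** beta_prev / lam) ** (1.0 / beta)


def resulting_jump(lam_prev, beta_prev, omega_prev, lam, beta, offset):
    """Formula 32.42: failure rate just after T_i divided by the rate just before."""
    return weibull_rate(lam, beta, offset) / weibull_rate(lam_prev, beta_prev, omega_prev)
```

For instance, `offset_rate_continuity(1e-4, 3.0, 10.0, 9e-4, 3.0)` and `offset_aging(1.0, 10.0, 3.0)` describe the same situation (aging multiplied by three at an age of 10) and should agree.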


32.6.7 Examples of Application

32.6.7.1 Links Between Weibull Distributions

The flexibility of the approach described above is illustrated in Figs. 32.27 and 32.28 with a case mixing jump, aging and continuity: the failure rate of an item is governed by $Wb_0$ until instant $T_1$, where it experiences a change multiplying its failure rate by four (e.g. a non-lethal shock). Then, from $T_1$ to $T_2$, the failure rate is governed by $Wb_1$ with an aging speed multiplied by three ($v = 3$). At time $T_2$ another change occurs where the failure rate remains continuous and is then governed by $Wb_2$ with the same parameters as $Wb_0$ (the aging comes back to $v = 1$). The evolution of the failure rate is illustrated in Fig. 32.27 and the corresponding failure distribution (unreliability) in Fig. 32.28.

It has to be noted that, when several similar items experience the same condition changes, their failure probabilities increase or decrease at the same time. Therefore, there is no immediate effect but the similar items are going to fail within a smaller or a larger time interval. When the failure rate increases, this constitutes a kind of common cause failure between the items. This is equivalent to the semi-catastrophic model described in Chap. 5 with, in addition, a non-lethal shock at time $T_1$.

Fig. 32.27 Failure rate: example of jump, aging and continuity ($Wb_0$, jump of 4 at $T_1$, $Wb_1$ with aging × 3, continuity at $T_2$, then $Wb_2$)

Fig. 32.28 Failure distribution: example of jump, aging and continuity

32.6.7.2 Links Between Exponential Distributions

Fig. 32.29 Linking constant failure rates (exponential laws): $\lambda_1 = 1.5 \cdot 10^{-4}$ (Exp0), jump of 3 at $T_1$, $\lambda_2 = 4.5 \cdot 10^{-4}$ (Exp1), jump of 1/3 at $T_2$

Figure 32.29 presents the first attempt, performed a long time ago, to handle dynamic transitions with constant failure rates: this is a simplified approach to model the overload in the photocopier example or the sharp increase of the temperature of a computer when the air-cooling system fails. In the example in Fig. 32.29, the failure rate is multiplied by three within the interval $[T_1, T_2]$ and, if the item has not failed in the meantime, it comes back to its initial value when the overload (or the temperature) comes back to the initial value. It has to be noted that:
• It is necessary to introduce jumps ($Jp_i \neq 1$), otherwise it would not be possible to change the value of the constant failure rate;
• This is a particular case of the temporal continuity (see 32.6.3.2).

32.6.7.3 Links Between Weibull and Exponential Distributions

Figures 32.30, 32.31, 32.32, 32.33 and 32.34 illustrate models of Weibull-exponential connection (i.e. between non-constant and constant failure rates).

Figure 32.30 illustrates a situation where the aging stops over $[T_1, T_2]$. Then the failure rate becomes constant over this period of time. This can model milder conditions during this period, like the use of the modelled item in hot standby position. This is an example of failure rate continuity.

In Fig. 32.31, the failure rate behaviour is similar to Fig. 32.30 until time $T_2$, where a jump occurs before it increases again. This can model an item placed in hot standby position at $T_1$ and which experiences a shock when it is restarted at $T_2$.

In Fig. 32.32, the failure rate increases until $T_1$, where the item experiences a shock. Then the failure rate becomes constant until $T_2$, where the conditions become harsher and it increases again. This can model an item which is degraded when placed in hot standby position and then restarted.

In Fig. 32.33, the failure rate jumps both at $T_1$ and $T_2$ and is constant between them. This can model an item which is degraded when it is placed in hot standby position. Then this degradation is mitigated to some extent (the item is improved) when it is started again.

Fig. 32.30 Connection Weibull-exponential-Weibull type 1: no aging over $[T_1, T_2]$ ($\lambda_1 = 10^{-4}$, $\beta_1 = 3$, $\lambda_2 = 0.0675$ constant)

Fig. 32.31 Connection Weibull-exponential-Weibull type 2: jump on the right ($\lambda_1 = 10^{-4}$, $\beta_1 = 3$, constant rate 0.0675, $Jp_2 = 3$)

Fig. 32.32 Connection Weibull-exponential-Weibull type 3: jump on the left ($\lambda_1 = 10^{-4}$, $\beta_1 = 3$, $Jp_1 = 4$, $\lambda_2 = 0.27$ constant)

Fig. 32.33 Connection Weibull-exponential-Weibull type 4: jump on the left and the right

Fig. 32.34 Connection Weibull-exponential-Weibull type 5: suspended event ($\lambda_1 = 10^{-4}$, $\beta_1 = 3$, $\lambda_2 = 0$ between the jumps $Jp_1$ and $Jp_2$)

32.6.7.4 Suspended Events

In the examples above (Figs. 32.30, 32.31, 32.32 and 32.33), the constant failure rate between $T_1$ and $T_2$ is higher than or equal to the failure rate at time $T_1$. Similar models can be developed for items where the failure rate becomes lower than the failure rate at time $T_1$. Figure 32.34 is a very interesting particular case where the failure rate drops to zero over $[T_1, T_2]$. It models the situation where an event is suspended over an interval of time. This can model an item used in cold standby condition over $[T_1, T_2]$, if the failure rate can reasonably be considered equal to zero when the system is in this state. The shape of the TTF distribution illustrated in Fig. 32.35 differs from the examples previously analysed because it remains constant between $T_1$ and $T_2$. That means that the updated date $\theta_1$ is pushed to infinity (see Fig. 32.26) because no failure can occur during this interval. This implies a specific mathematical treatment to calculate $\theta_2$ when the event is no longer suspended.

Fig. 32.35 TTF distribution in case of suspended event

32.6.7.5 Links Between Exponential and Weibull Distributions

Figure 32.36 illustrates the connection between exponential and Weibull distributions. This can model an item with a constant failure rate until $T_1$. Then, it experiences harsher conditions implying wear (i.e. aging) over $[T_1, T_2]$. When the harsher conditions terminate, it comes back to a constant failure rate but, due to wear, at a higher level. This model belongs to the failure rate continuity approach but differs from the general model, as the failure distribution after $T_2$ is not an input of the model but depends on the value of the failure rate reached by the Weibull distribution when the harsher conditions terminate at $T_2$.

Fig. 32.36 Connection exponential-Weibull-exponential: aging during $[T_1, T_2]$

32.6.7.6 Links Between Weibull Distributions with Various Shape Parameters

Until now, the shape parameters β of the Weibull distributions have been taken equal to 1 (constant failure rates) or greater than 2 (increasing failure rates). Therefore, the cases with shape parameters β lower than 1, equal to 2 or between 1 and 2 are still to be analysed, and this is done with the following three examples.

Figure 32.37 illustrates an example where the shape parameters β of the Weibull laws are lower than 1. In this case, the failure rates are decreasing and this can be used to model an item in its early life and under debugging conditions. For example, in Fig. 32.37, the debugging conditions are more effective within the interval $[T_1, T_2]$.

Figure 32.38 illustrates an example where the shape parameters β of the Weibull laws are equal to two. In this case, and as shown in the figure, the failure rates increase linearly with time.

Figure 32.39 illustrates an example where the shape parameters β of the Weibull laws are between one and two. In this case, and as shown in the figure, the failure rates increase steeply at the beginning and then less and less. This is the opposite of what happens with the traditional bathtub curve in the wear-out period. This can be used to model a mechanical item in its running-in period. Contrary to the debugging shown in Fig. 32.37, which aims to bring weaknesses to light, the running-in period aims to improve the functioning by finely tuning mechanical parts working together.

Fig. 32.37 Failure rate continuity with Weibull laws shape parameters β lower than 1

Fig. 32.38 Failure rate continuity with Weibull laws shape parameters β equal to 2 ($\lambda_1 = \lambda_2 = 10^{-4}$, $\beta_1 = \beta_2 = 2$)

Fig. 32.39 Failure rate continuity with Weibull laws shape parameters β between 1 and 2 ($\lambda_1 = 3 \cdot 10^{-4}$, $\lambda_2 = 9 \cdot 10^{-4}$, $\beta_1 = \beta_2 = 1.6$)
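The three behaviours described above (decreasing, linear and "less and less" increasing failure rates) can be checked by looking at the successive increments of the Weibull failure rate on a regular time grid. This is only an illustrative sketch: the function names, the time grid and the scale parameters are arbitrary assumptions, not values taken from the figures.

```python
def weibull_rate(lam, beta, t):
    """Weibull failure rate lam * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1)


def rate_increments(lam, beta, times):
    """Differences between successive failure rate values on the given time grid."""
    rates = [weibull_rate(lam, beta, t) for t in times]
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


times = [5.0, 10.0, 15.0, 20.0, 25.0]

# beta < 1: every increment is negative (decreasing failure rate).
decreasing = rate_increments(1e-2, 0.5, times)

# beta = 2: all increments are equal (failure rate linear in time).
linear = rate_increments(1e-4, 2.0, times)

# 1 < beta < 2: increments are positive but shrink (concave increase).
concave = rate_increments(3e-4, 1.6, times)
```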

32.7 Comparison Between Analytic and Monte Carlo Calculations

In practice (see Chaps. 25, 32, 33 and 36), for a given random variable and when n is large enough, the Monte Carlo simulation provides good estimations of:
• The mathematical expectation $m_n$ which measures the mean value;
• The standard deviation $\sigma_n$ which indicates how the random variable is scattered with regard to its mean value;
• The confidence interval at, e.g., 90% which measures the accuracy of the mean value estimation.

In addition, the full distribution of the random variable is also available. Therefore, compared to the analytical calculations which are focused on single results, the Monte Carlo simulation provides many different results at the same time: in fact, the possibilities are virtually endless and only limited by the analyst's imagination. In addition, the accuracy of these results can be easily monitored from a conservativeness point of view, whereas this is sometimes not the case for the analytical approaches, where the impact of approximations may be difficult to estimate. Finally, the widespread prejudice about the inaccuracy of this approach no longer holds: the accuracy of the results is always known and they can be made as accurate as wanted, provided that it is possible to perform a sufficient number of simulations. This is easy for frequent events (e.g. production availability calculations) and harder for rare events (e.g. safety related events) but, in this case, acceleration techniques (beyond the scope of this book) can be implemented (Estecahandy et al. 2014 and 2015) to shorten the computation time. It has to be noted that this is the opposite of the analytical approaches, for which the approximations work better for low than for high probabilities. It also has to be noted that Monte Carlo simulation provides a powerful means to propagate the input uncertainties through conventional models like fault trees, reliability block diagrams or even Markov models. This implies that the two approaches, analytical and Monte Carlo simulation, are complementary and therefore, according to the study to be performed, the reliability analyst should use both of them.

Thanks to the ever-increasing computing power of today's computers, it is less and less difficult to reach low probabilities and/or handle large complex models with Monte Carlo simulation, provided that an efficient behavioural model is implemented. Formal languages (e.g. AltaRica language (Batteux et al. 2019; Aupetit 2020)) can be used for this purpose and, among them, the stochastic Petri nets have proven to be very effective when used as such or as a basis for dynamic reliability block diagrams, dynamic fault trees or dynamic flow diagrams (see Chap. 33).
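The three estimates listed above (mean, standard deviation and 90% confidence interval of the mean) can be obtained from a sample of simulated values as sketched below. The assumptions are loudly hypothetical: the function name `monte_carlo_summary`, the normal approximation with the quantile 1.645 for a two-sided 90% interval, the seed and the Weibull parameters are all chosen only for illustration.

```python
import math
import random
import statistics


def monte_carlo_summary(samples, z_quantile=1.645):
    """Mean, standard deviation and confidence interval of the mean.

    z_quantile = 1.645 gives a two-sided 90% interval under the normal
    approximation, which is reasonable when the sample is large.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    half_width = z_quantile * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Simulated times to failure for R(t) = exp(-lam * t**beta), using t_f = R^-1(z).
rng = random.Random(12345)
lam, beta = 1e-4, 3.0
ttf = [(-math.log(1.0 - rng.random()) / lam) ** (1.0 / beta) for _ in range(20000)]

mean, std, interval = monte_carlo_summary(ttf)

# Analytical mean of the same Weibull law, for comparison with the estimate.
exact_mean = lam ** (-1.0 / beta) * math.gamma(1.0 + 1.0 / beta)
```

The half-width of the interval shrinks as the inverse square root of the number of simulated histories, which is why rare events require many more simulations than frequent ones.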

32.8 Associated Exercises

Several exercises related to Chap. 32 are proposed in Chaps. 29 and 34. They are shared with Chaps. 25 and 33:
• Exercise 25.1: uncertainty propagation into a FT modelling an overpressure protection system to calculate the impact on the PFDavg (average unavailability) and the PFH (average failure frequency).
• Exercise 33.12: Monte Carlo simulation to estimate the number of cars visiting a service station and the number of lost sales due to too long waiting queues. Station closed at night.
• Exercise 33.13: extend exercise 33.12 to draw the curves of the cars in the entrance and exit queues of the service station.
• Exercise 33.14: same exercise as 33.12 when the station is open at night.
• Exercise 33.15: same exercise as 33.13 when the station is open at night.

References

Aupetit B (2020) Calcul d’indicateurs de sûreté de fonctionnement de modèles AltaRica 3.0 par simulation stochastique. Doctoral thesis of the University Paris-Saclay prepared at Centrale Supélec. Paris, France
Batteux M, Prosvirnova T, Rauzy A (2019) AltaRica 3.0 in 10 modeling patterns. Int J Crit Comput-Based Syst 9(1–2):133–165. Inderscience Publishers. https://doi.org/10.1504/ijccbs.2019.098809
Dutuit Y, Signoret J-P, Thomas P (2016) Handling dynamic transitions with stochastic Petri nets. In: Proceedings ESREL 2020 congress
Dutuit Y, Signoret J-P, Thomas P (2018) Dynamic transitions–Changing event occurrence laws at discrete instants. In: Proceedings LM21 congress. France
Dutuit Y, Folleau C, Signoret J-P, Thomas P (2020) Dynamic transitions: simulating changes of probabilistic laws at discrete instants. In: Proceedings LM20 congress. France
Estecahandy M, Bordes L, Collas S, Paroissin C (2014) Fast Monte Carlo simulation methods adapted to simple Petri net models. ARES2014: 370–379. IEEE
Estecahandy M, Bordes L, Collas S, Paroissin C (2015) Some acceleration methods for Monte Carlo simulation of rare events. Reliab Eng Syst Saf (RESS) 144:296–310. Elsevier
GRIF-Workshop (2020) PETRI module. Funded and developed by TOTAL, http://grif-workshop.fr/. Accessed Aug 2020
Labeau PE (2001) Modification des lois de probabilité décrivant les temps de défaillance et de réparation d’un composant en fonction de contraintes externes. Application à la simulation de Monte-Carlo, Personal note
Law AM, Kelton WD (1991) Simulation modeling and analysis, 2nd edn. McGraw Hill International Editions, New York
L’Ecuyer P (1990) Random numbers for simulation. Commun ACM 33(10):85–97. Association for Computing Machinery
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer, Berlin
Signoret JP (1997) Simulation de Monte Carlo de composants dont le taux de défaillance change en fonction de contraintes externes, Étude de faisabilité. Rapport EP/P/SE MRT ARF JPS 97117. Société Nationale Elf Aquitaine (Production), France. Private internal document
Von Neumann J (1951) Various techniques used in connection with random digits. J Res National Bureau of Standards, N° 12, 36–38. Barry N. Taylor
Wikipedia Buffon (2020) https://en.wikipedia.org/wiki/Buffon’s_needle_problem. Accessed Sept 2020

Chapter 33

Petri Net Modelling

33.1 Quest for Complex Behaviour Modelling

Industrial systems have rather complex behaviours, as already discussed in Chap. 30. Despite its great interest, the Markovian approach described in Chap. 32 only partly covers the modelling needs: when the exponential assumption no longer holds, or when the complexity and the number of states of the systems under study increase, the analytical calculations have to be abandoned and replaced by the Monte Carlo simulation described in Chap. 32, which is based on the use of random numbers and statistical estimations. Once this decision has been taken, it remains to choose a relevant model able to represent the behaviour of the modelled system as closely as possible to actual life and to support the Monte Carlo simulation effectively.

As most systems can be described by using discrete states, the quest for a relevant model is immediately oriented toward the finite state automata, which describe how transitions occur between a finite number of states (Lawson 2004; Carroll and Long 1989; Wikipedia FSM 2020a). They are virtual machines described in a very abstract mathematical way and implemented with various formal modelling languages which are beyond the common knowledge of reliability engineers. Fortunately, some graphical representations are available and, among them, the Petri nets (PNs) have proven, since the late seventies, to be very well adapted both to supporting Monte Carlo simulation and to use by engineers.

The location of the PN technique within the corpus of the dependability methods and tools is illustrated in Fig. 33.1. It has to be noted that the Boolean models (e.g. reliability block diagrams, fault trees or event trees) as well as the Markovian models can also be described in terms of finite state automata which, ultimately, constitute the underlying mathematical background of all the techniques used within the safety and dependability (i.e. reliability in the broad sense) field.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_33

Fig. 33.1 Petri net modelling within the corpus of methods and tools (probabilistic models split into analytical approaches and Monte Carlo simulation, static and dynamic models, Boolean approaches (RBD, FT, ET), Markovian approaches (Markov graphs) and behavioural approaches (Petri nets), all resting on state-transition models, i.e. finite state automata)

Exercises related to the Petri net modelling are proposed and developed in Chap. 34. The list of these exercises is provided in Sect. 33.14 with a brief description of each of them and the links toward the relevant sections or subsections are indicated.

33.2 History

Petri nets were introduced in 1962 in a thesis written by Carl Adam Petri (Petri 1962). The original purpose was to represent asynchronous finite state automata in a graphical form and to formally prove that they were free of bugs. This is an important topic for the design of automata, and a particular category of PNs has even been standardized under the name of GRAFCET¹ (IEC 60848 2013) to design the specifications of physical automata used in industry. Nowadays, PNs are still used for this purpose, as a quick Internet search confirms: the great majority of research and development works devoted to PNs are focused on the automation or information technology (IT) domains.

¹ GRAFCET (graphe fonctionnel de commande état transition) is a specification language for the functional description of the behaviour of the sequential part of a control system.

Contrary to synchronous automata, which are triggered at regular intervals by their internal clocks (which generally operate at high frequency), asynchronous automata are triggered by specific external events which arise at random. This is similar to what happens with the random events (e.g. failures or repairs) impacting the system states, and this is why PNs have drawn attention for a potential use in the safety and dependability field. The adoption of this technique has been achieved in two steps:

• At the end of the 1970s and the beginning of the 1980s, a doctoral thesis was prepared and issued in which Petri net models were used to automatically identify the states of a system and to build large Markov graphs calculated in a conventional analytical way (Natkin 1980). The technique was immediately tried by some reliability engineers for generating large Markov graphs related to actual industrial systems. The results were disclosed to the reliability engineering community during the French national reliability congress held in 1980 in Perros-Guirec (Ligeron and Delage 1980). This approach has proven to be effective because the size of the PN increases linearly with the number of modelled components, whereas the size of the equivalent Markov graph increases exponentially. It is therefore easier to produce large error-free models by using PNs than by building the Markov graph directly by hand.

• Following the above works, one of the authors of this book realized that, beyond the Markov graph generation, Petri nets were providing a powerful support for Monte Carlo simulation. This was the starting point of a joint research and development project between ELF and TOTAL started in 1982. This led to the first issue, in 1983, of the computation engine called MOCA-RP, which performs Monte Carlo simulation on Petri net models. Continuously improved and used for actual safety and dependability studies from 1983 until now, it is integrated within the GRIF-workshop (2020) software package.

Used as behavioural models supporting Monte Carlo simulation for more than 35 years now, PNs have proven to be very effective to model the complex behaviour of large industrial systems (e.g. safety and production systems) (Signoret 1998, 2008, 2009; Signoret and Leroy 1985, 1989; Signoret et al. 2002, 2013). It is now a well-recognized technique in France and in more and more other countries, and it has been standardized in IEC 62551 (2012) for use in safety and dependability studies. Unfortunately, it is sometimes discarded because of the allegations of old-school practitioners who profess that Monte Carlo simulation is inaccurate because it uses random numbers. They are wrong (see Chap. 32) and, at the present time, it is certainly one of the techniques offering the best value with regard to the intellectual investment (which is, in fact, very limited) and the modelling capacities (which are virtually endless). However, a warning has to be raised: once you have experienced Petri nets and Monte Carlo simulation, the other techniques may appear so limited in comparison that it may be difficult to come back to them!


33.3 Petri Net Use Within Automation and Dependability Fields

As said above, PNs have been developed within the automation and IT domain (ISO 15909 2019; TGI 2020) to fulfil needs rather different from those of the reliability modelling domain (IEC 62551 2012). Therefore, the expectations are different and the main differences are the following:

• Within the automation and IT domain, the analysts use PNs to design automata which are free of failures (bugs) and which operate properly.
• Within the reliability field, the analysts use PNs to model not only how a system operates properly but also how it fails.

The introduction of failures makes a big difference because all the good properties wanted when developing automata free of failures no longer hold when failures have to be modelled. This is in particular the case for:

• Conflict-freedom: only one event can occur from a given state;
• Liveness: no deadlock (i.e. no blocking state);
• Reachability: ability from any state to reach any other state;
• Boundedness: limited number of tokens in places;
• Safeness: pre- and postconditions of an event are not satisfied at the same time in order to prevent endless loops (Ling and Schmidt 2013).

The absence of conflicts (only one event can occur in a given state) is very important to ensure the deterministic behaviour of an automaton, but it is irrelevant when modelling systems because many events can occur from a given state: for example, all the component failures are valid at the same time. They are in conflict and the future of the system may depend on which failure occurs first.

An automaton is no longer able to operate when a deadlock occurs and it is very important to prevent such a situation but, when PNs are used to model accidents or absorbing states (see Chap. 32), no other event can occur and the resulting PN model is not necessarily live (in the sense used for automata).

Reachability is important for automata to be sure that any state is reachable at any time but, when a system has several functioning phases, it may never come back to a previous state and therefore not all the parts of the resulting PN model are necessarily reachable (in the sense used for automata).

Boundedness is also important for automata (generally no more than one token at a time in a given place) but, when PNs are used to count the number of failures or the number of available spare parts of a modelled system, this requirement cannot be fulfilled.

However, safeness, which prevents endless loops, is an important property both for automata and for PNs modelling the behaviour of physical systems. In both cases, the occurrence of endless loops (which occur, for example, when one transition remains endlessly valid after it is fired) results from mistakes in the PN design.


Therefore, most of the good properties wanted when designing automata are lost when PNs are used to model the behaviour of physical systems: in this case, the mathematical developments available to analyse and design automata are not really useful, and it is the dynamic modelling possibilities which are of utmost interest.

33.4 Basic Principles

33.4.1 Graphical Elements

What makes PNs particularly interesting among the various finite state automata using formal languages is their graphical representation, which is immediately readable and intelligible by the analysts even if they do not know the underlying formal mechanisms. As defined by Carl Adam Petri, PNs are made of two complementary graphical representations:

• A static part (i.e. a simple drawing) which does not change when time elapses (Fig. 33.2 left);
• A dynamic part which defines the PN state at a given instant and which changes when time elapses according to the occurring events (Fig. 33.2 right).

The basic static elements are illustrated on the left-hand side of Fig. 33.2:

• Transitions: drawn as rectangles, they represent potential events which can occur (when the transition is valid) or not (when the transition is inhibited);
• Places: drawn as circles, they represent local states (e.g. component states) of the modelled system. With regard to transitions, they are located upstream, downstream or both;
• Arcs: drawn as segments with arrows, they link places to transitions (upstream arcs) or transitions to places (downstream arcs). They are used to validate the transitions and to update the marking when the corresponding events are triggered (see hereafter).

Fig. 33.2 Static and dynamic parts of a basic Petri net (left: upstream places Pl1 and Pl2, with Pl2 both upstream and downstream, transition Tr1, downstream place Pl3, upstream and downstream arcs; right: the same net with tokens and place markings #1=2, #2=1, #3=1)

The basic dynamic elements are represented by the tokens illustrated on the right-hand side of Fig. 33.2. The location of the tokens in the places constitutes the marking of the PN, and it is this marking which defines the state of the PN at any time. The marking changes according to the events which occur and models the dynamic behaviour of the modelled system. The marking of a given place is noted by using the “#” symbol. Thus, in Fig. 33.2, the markings of the places are #1=2, #2=1 and #3=1.

As many markings (and therefore many states) can be represented on a single static part of the PN, the use of tokens is a very compact way to code the PN states. This explains why the size of the PN models increases linearly with the number of components of the modelled system. This avoids the combinatorial explosion of the number of states occurring with the Markovian approach, and the handling of large industrial systems can be considered with confidence.
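The difference between linear and exponential growth can be made concrete with a little arithmetic. The sketch below is only an illustration under a simplifying assumption which is not taken from the text: each component is supposed to have two local states (up and down), modelled by two places in the PN, and the components are supposed independent so that the equivalent Markov graph has one state per combination of local states. The function names are hypothetical.

```python
def petri_net_places(n_components, places_per_component=2):
    """Number of places of the PN: grows linearly with the number of components."""
    return n_components * places_per_component


def markov_graph_states(n_components, states_per_component=2):
    """Number of states of the equivalent Markov graph: grows exponentially."""
    return states_per_component ** n_components


for n in (1, 5, 10, 20):
    print(n, petri_net_places(n), markov_graph_states(n))
```

With these assumptions, 20 components need 40 places but more than one million Markov states.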

33.4.2 Validation of Transitions and Firing Rules

A given state of the modelled system is represented by a corresponding marking of the PN. Then, simulating the behaviour of this system simply consists in modifying this marking according to the events which can occur from this state. It is done in three steps, in which the transitions (which represent events) play the main role:

• Identifying the transitions related to the events which can occur from the present state: these transitions are said to be valid.
• Choosing one of the valid transitions and simulating the occurrence of the related event: this transition is said to be fired.
• Modifying the PN marking according to the new state which is reached.

Proceeding in this way is the basic principle for simulating the system behaviour from a Petri net. This implies that sound rules have to be defined to determine which transitions are valid from a given marking, which transition is going to be fired first when several of them are valid at the same time, and how the marking is going to change when the transition is actually fired. For basic Petri nets, the validation and firing rules are rather simple:

• Validation of a transition: every upstream place contains at least one token.
• Firing of a transition: one token is removed from every upstream place and one token is added in every downstream place.

Fig. 33.3 Example of transition firings (left: valid transition Tr with upstream places Pl1 and Pl2 and downstream places Pl2 and Pl3; middle: marking after the 1st firing; right: marking after the 2nd firing, inhibited transition)

It has to be noted that, with the above rules, even when the tokens seem to move from place to place when a transition is fired, they do not actually move: they are destroyed in the upstream places and created in the downstream places. Other kinds of PNs (e.g. coloured PNs described in Sect. 33.12) may allow the tokens to move, but they are based on different rules.

With the above rules, the transition on the left-hand side of Fig. 33.3 is valid because it has one or more tokens in each of its upstream places (two tokens in place Pl1 and one token in place Pl2). That means that the attached event is able to occur. When the transition is fired, this simulates the occurrence of this event, and applying the rules leads to the PN illustrated in the middle of Fig. 33.3: one token has been removed from Pl1 and also from Pl2, and one token has been added in Pl2 and also in Pl3. Then, when the firing is completed, one token remains in Pl1, the marking of Pl2 has not changed and two tokens are in Pl3 (middle of Fig. 33.3). Nevertheless, after this first firing, the transition is still valid and it can be fired once more, with the resulting marking illustrated on the right-hand side of the figure. After that, all the tokens in Pl1 have disappeared and the transition is no longer valid: it is inhibited as long as no other token is added in Pl1 thanks to the firing of another transition not represented on this figure.
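The validation and firing rules of the basic PNs, applied to the example of Fig. 33.3, can be sketched in a few lines. This is only a minimal illustration, not the implementation of any particular software package: the marking is stored in a dictionary, the function names are hypothetical, and the initial marking of Pl3 (one token) is inferred from the statement that two tokens are in Pl3 after the first firing.

```python
def is_valid(marking, upstream):
    """A transition is valid when every upstream place holds at least one token."""
    return all(marking[place] >= 1 for place in upstream)


def fire(marking, upstream, downstream):
    """Return the marking reached after firing: tokens are destroyed in the
    upstream places and created in the downstream places (they do not move)."""
    if not is_valid(marking, upstream):
        raise ValueError("the transition is inhibited and cannot be fired")
    new_marking = dict(marking)
    for place in upstream:
        new_marking[place] -= 1
    for place in downstream:
        new_marking[place] += 1
    return new_marking


# Transition Tr of Fig. 33.3: Pl2 is both an upstream and a downstream place.
upstream = ["Pl1", "Pl2"]
downstream = ["Pl2", "Pl3"]

m0 = {"Pl1": 2, "Pl2": 1, "Pl3": 1}   # left-hand side (Pl3 marking inferred)
m1 = fire(m0, upstream, downstream)   # middle: after the 1st firing
m2 = fire(m1, upstream, downstream)   # right-hand side: after the 2nd firing

print(m1)
print(m2)
print(is_valid(m2, upstream))         # the transition is now inhibited
```

Returning a new dictionary rather than modifying the marking in place keeps the successive markings available for inspection, which is convenient when playing the token game step by step.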

33.4.3 Managing Conflicts

When the PN is conflict-free, as in Fig. 33.3, only one transition can be valid at a time and there is no difficulty in selecting the transition to be fired. When several transitions are valid at the same time, however, extra rules have to be introduced in order to select the one which is going to be fired first. In the middle of Fig. 33.4, an example of two conflicting transitions Tr1 and Tr2 is illustrated. The resulting marking shown on the left-hand side of the figure is obtained if Tr1 is fired first, and the resulting marking shown on the right-hand side of the figure is obtained if Tr2 is fired first. The firing of Tr1 inhibits Tr2 and vice versa, and the resulting markings are different. Therefore, the future behaviour of the PN depends on which transition is fired first.

Fig. 33.4 Example of conflicting transitions (middle: transitions Tr1 and Tr2 both valid, places Pl1 to Pl5; left: marking after the firing of Tr1; right: marking after the firing of Tr2)

This ambiguity generally has to be removed, which can be done in several ways:

• Fire the transition with the lowest number first (i.e. fire Tr1 before Tr2). This is the simplest way, but it is likely to introduce a bias in the calculation as the firing of Tr1 is systematically favoured compared to the firing of Tr2.
• Choose at random the transition to be fired first (e.g. generate a random number and, if it is lower than 0.5, fire Tr1; fire Tr2 otherwise).
• Introduce priorities on the transitions (see Sect. 33.6.1) and fire the transition with the highest priority first (e.g. give a priority of 1 to Tr1 and a priority of 0 to Tr2 in order to force the firing of Tr1).
• Inhibit all the transitions except the one which has to be fired first (e.g. modify the PN to inhibit Tr2 when Tr1 has to be fired first).
• Introduce firing delays and fire the transition with the shortest delay first (e.g. if δ1 is the firing delay of Tr1 and δ2 the firing delay of Tr2, fire Tr1 if δ1 < δ2 and fire Tr2 otherwise).

The last way leads to the introduction of a special kind of Petri net:

Temporized (Timed) Petri net: Petri net with delays associated to the transitions.

Temporized PNs make it possible to solve most of the conflicts appearing in the model when they are used for Monte Carlo simulation purposes: in this case, the transition delays related to failures and repairs are calculated by using random numbers, and there is a very low probability that two of them are generated, in the simulations, with exactly the same value (e.g. with 12 identical digits) at the same time. Nevertheless, some events may have deterministic delays (e.g. reconfiguration delays, travel delays), which leads to transitions conflicting with exactly the same delays. In this case, precautions have to be taken and the other techniques (lowest transition number, selection at random, priority, inhibition) have to be implemented according to the modelled behaviours. This is analysed hereafter.

Exercise 33.4 related to this subsection is described in Sect. 33.14 and its solution can be found in Chap. 34.
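The selection rules listed above can be combined as in the following sketch, given only as an illustration. It assumes that the shortest firing delay is applied first, that ties are then broken by the highest priority and that any remaining tie is resolved at random; this ordering of the rules, the function name `select_transition` and its parameters are assumptions made for the example and do not describe the behaviour of a particular tool.

```python
import random


def select_transition(delays, priorities=None, rng=None):
    """Select the transition to be fired first among the valid ones.

    delays:     dict mapping each valid transition to its firing delay
    priorities: optional dict mapping transitions to a priority (default 0)
    rng:        optional random.Random used to break the remaining ties
    """
    rng = rng or random.Random()
    priorities = priorities or {}
    # Rule 1: the shortest firing delay wins.
    shortest = min(delays.values())
    candidates = [t for t, d in delays.items() if d == shortest]
    # Rule 2: among equal delays, the highest priority wins.
    top = max(priorities.get(t, 0) for t in candidates)
    candidates = [t for t in candidates if priorities.get(t, 0) == top]
    # Rule 3: any remaining conflict is resolved at random.
    return rng.choice(sorted(candidates))


# Different delays: no real conflict, Tr2 is fired first.
print(select_transition({"Tr1": 3.2, "Tr2": 1.7}))
# Identical deterministic delays: the priority decides.
print(select_transition({"Tr1": 2.0, "Tr2": 2.0}, priorities={"Tr1": 1, "Tr2": 0}))
# Identical delays and priorities: random choice between Tr1 and Tr2.
print(select_transition({"Tr1": 2.0, "Tr2": 2.0}, rng=random.Random(0)))
```

Choosing at random in the last step avoids the systematic bias introduced by always favouring the transition with the lowest number.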

33.4.4 Introduction of Delays

33.4.4.1 Type of Delay Identification

Section 33.4.2 has shown how a transition can be validated and fired to simulate events. This makes it possible to model when an event is able to occur and what happens when it occurs, but it remains to determine when it is actually going to occur. This implies, for a given transition (i.e. a given event), determining the delay elapsing between the validation and the firing of this transition. As said above, this is in the scope of the temporized Petri nets. This leads to attaching to each transition the delay related to the occurrence of the corresponding event. No further assumption is needed about the nature of the delays, which can be deterministic, stochastic or even physical:

• Deterministic delay: delay related to an event which occurs after a constant duration δ. Among them, the delays equal to zero play an important role as they are used to synchronize the firing of several transitions.
• Stochastic delay: delay related to an event which occurs after a random delay δ. Therefore, this delay is a random variable with a given probabilistic distribution F(δ) which can be simulated, as described in Chap. 32 about Monte Carlo simulation.
• Physical delay: delay related to an event which occurs after a duration governed by the evolution of physical parameters.

33.4.4.2 Deterministic and Stochastic Delays

This chapter mainly deals with deterministic and stochastic delays which, as explained in Chap. 32, can be easily handled within the Monte Carlo simulation framework: the constant delays can be modelled by Dirac distributions and the stochastic delays can be simulated by using their probabilistic distribution F(δ). Therefore, there is no theoretical reason to analyse these two types of delays separately, but it can be interesting to make a graphical difference in order to identify them at once. This has been standardized as shown in Fig. 33.5.

Fig. 33.5 Standardization of timed transitions (IEC 62551 2012) (deterministic delays, i.e. Dirac distributions: δ = 0 and δ; stochastic delays: exponential law of parameter λ and arbitrary distribution)

Four types of transitions have been identified and standardized:

• Constant delay equal to zero (instantaneous transitions): this is useful to synchronize transitions but needs particular attention, as endless loops may easily be introduced by using such transitions;
• Constant delay not equal to zero: this is useful to model events occurring after a constant delay;
• Exponential law: this is useful to make the link with the Markovian approach;
• Arbitrary distribution: this is useful to go beyond the Markovian framework.

The introduction of random delays leads to another kind of Petri net:

Stochastic Petri net: timed Petri net with random delays associated to the transitions.

The stochastic PNs are the basis to model the behaviour of industrial systems and to perform Monte Carlo simulations. Within a Monte Carlo simulation, each time a transition becomes valid, its delay before firing has to be calculated:

• For the transitions with deterministic delays, this is 0 or δ;
• For the transitions with stochastic delays, the firing delay has to be calculated as explained in Chap. 32:
– Generate a random number zi by using a pseudo-random number generator,
– Calculate the firing delay δi by inverting the probabilistic distribution (see Chap. 32): δi = F⁻¹(zi), i.e. δi = −ln(zi)/λ in the case of an exponential law with a parameter λ.

Therefore, a sufficient number of simulations has to be performed so that, for each transition, the whole range of the random variable related to the firing delay is covered.

Exercise 33.2 related to this subsection is described in Sect. 33.14 and its solution can be found in Chap. 34.
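The calculation of the firing delays described above can be sketched as follows. This is a minimal illustration using the standard `random` module of Python; the function names and the way the transition types are encoded as strings are assumptions made for the example. The Weibull law is used here only as one possible instance of an "arbitrary distribution" which can be inverted analytically.

```python
import math
import random


def exponential_delay(lam, rng):
    """Inverse transform for the exponential law: delta = -ln(z) / lambda."""
    z = 1.0 - rng.random()          # z lies in (0, 1], which avoids ln(0)
    return -math.log(z) / lam


def weibull_delay(eta, beta, rng):
    """Inverse transform for a Weibull law of scale eta and shape beta."""
    z = 1.0 - rng.random()
    return eta * (-math.log(z)) ** (1.0 / beta)


def firing_delay(kind, parameters, rng):
    """Delay drawn each time a transition becomes valid (hypothetical encoding)."""
    if kind == "instantaneous":     # Dirac distribution at 0
        return 0.0
    if kind == "constant":          # Dirac distribution at delta
        return parameters
    if kind == "exponential":       # parameters is the rate lambda
        return exponential_delay(parameters, rng)
    if kind == "weibull":           # parameters is the pair (eta, beta)
        return weibull_delay(parameters[0], parameters[1], rng)
    raise ValueError(f"unknown kind of delay: {kind}")


rng = random.Random(12345)
samples = [firing_delay("exponential", 0.5, rng) for _ in range(100_000)]
mean_delay = sum(samples) / len(samples)
print(mean_delay)                   # close to 1 / lambda = 2
```

The sample mean approaches 1/λ only when enough delays have been drawn, which illustrates why a sufficient number of simulations is needed to cover the whole range of each random variable.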

33.4.4.3 Physical Delays

Taking the physical delays into account is beyond the scope of this book and only the principle is going to be explained by using a simple example: the spurious closure of the outlet of a tank. When this occurs, the pressure in the tank increases until it crosses the threshold πH of a pressure sensor (e.g. a PSH) triggering the closure of a valve at the tank inlet. If this works, the pressure stops increasing; if this does not work, another protection system is going to be activated (or the tank is going to burst). The tank pressure depends on the flow at the inlet and on the time elapsing from the outlet closure. If the flow is constant, this can be written π = ϕ(δ), where δ is the delay elapsing since the outlet closure. Then the pressure sensor is triggered when πH = ϕ(δH), i.e. after a delay δH = ϕ⁻¹(πH).

Fig. 33.6 Response surface principle (left: calculated results versus time; middle: response surface fitted through these points; right: interpolation)

Therefore, the process is very similar to the calculation of stochastic delays: the function ϕ(δ) just replaces the distribution F(δ). This seems rather simple but, unfortunately, in actual life, the function ϕ(δ) is generally a system of complicated differential equations which is difficult and very time-consuming to calculate. Then it cannot be directly implemented in a Monte Carlo simulation, which needs fast calculations of the transition delays to be effectively undertaken.

The idea is to solve the system of differential equations apart from the Monte Carlo simulation, for a set of input variable values chosen to cover the range encountered in the Monte Carlo simulation. This provides a set of results (points) which can be used to fit a surface passing through these points, as illustrated on the left-hand side and in the middle of Fig. 33.6. When this is done, the surface can be used instead of the system of differential equations to interpolate the result related to any variable value within the range, as illustrated on the right-hand side of Fig. 33.6. This kind of surface gives very fast responses and, for this reason, is called a response surface. The one presented in Fig. 33.6 is a single-dimension surface (i.e. a line), but in real cases this is generally a hyper-surface with as many dimensions as the number of considered variables (Wikipedia RS 2020b).

Using response surfaces is an effective way to make Monte Carlo simulations tractable when the evolution of physical parameters has to be taken into account. They have been used, for example, in Averbuch et al. (2007) to model the flow assurance (e.g. preventing hydrate or paraffin plugs in pipelines) of an oil and gas production platform.
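The response surface principle can be sketched in one dimension on the tank example. Everything numerical below is hypothetical: `slow_pressure_model` is only a stand-in for the expensive physical calculation (an arbitrary saturating law has been chosen), the grid step and the threshold are illustrative values, and a simple linear interpolation is used in place of a fitted surface.

```python
import bisect
import math


def slow_pressure_model(delay_h):
    """Stand-in for the time-consuming physical calculation pi = phi(delta).
    The law below is purely illustrative (pressure in bar, delay in hours)."""
    return 10.0 + 40.0 * (1.0 - math.exp(-delay_h / 2.0))


# Step 1 (done once, apart from the Monte Carlo simulation):
# calculate phi on a grid covering the range of interest (0 to 10 h).
grid_delays = [0.25 * i for i in range(41)]
grid_pressures = [slow_pressure_model(d) for d in grid_delays]


def delay_to_reach(pressure_threshold):
    """Step 2 (done within the simulation): fast interpolation of
    delta_H = phi^-1(pi_H) from the tabulated points (phi is increasing)."""
    if not grid_pressures[0] <= pressure_threshold <= grid_pressures[-1]:
        raise ValueError("threshold outside the tabulated range")
    i = bisect.bisect_left(grid_pressures, pressure_threshold)
    if i == 0:
        return grid_delays[0]
    p0, p1 = grid_pressures[i - 1], grid_pressures[i]
    d0, d1 = grid_delays[i - 1], grid_delays[i]
    return d0 + (d1 - d0) * (pressure_threshold - p0) / (p1 - p0)


delta_h = delay_to_reach(35.0)           # delay before the sensor is triggered
exact = -2.0 * math.log(0.375)           # analytical inverse of the toy law
print(delta_h, exact)
```

The interpolation error depends on the density of the grid: a finer grid costs more calculations beforehand but none during the Monte Carlo simulation itself.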

33.4.5 Simple Examples

The very simple example illustrated on the left-hand side of Fig. 33.7 has been used by the authors for decades to illustrate the use of the stochastic PNs. It is made of:

• Three places to model the item states: “Up”, “Wait for repair” and “Repair”;
• Three transitions to model three events: “Failure”, “Start of repair” and “End of repair”.

Fig. 33.7 PNs related to a single repaired item (left: places Pl1 “Up”, Pl2 “Wait for repair” and Pl3 “Repair”, transitions Tr1 “Failure” (λ), Tr2 “Start of repair” (δ) and Tr3 “End of repair” (μ); right: the same PN with the additional place Pl4 “Failure count”)

When this item fails, it has to wait for some time before the repair starts. This delay can be due to the time needed for detecting the failure, having a maintenance team available, obtaining a spare part, repairing more critical failures, etc. In Fig. 33.7 it has simply been modelled by a constant delay, δ, which represents the average time needed before starting the repair. The remaining part of this model comprises the usual item states (“Up” and “Repair”), and exponential laws have been used to model both failures and repairs.

The animation of such a Petri net finds its roots in the ancient use of the abacus, where small stones (called calculi) were used to perform calculations. Here, the tokens play the same role as the calculi in old times. Therefore, the readers are invited to get some calculi (coins, small stones, lentils, etc.) and use them to play with the PN and understand the mechanism of creation/destruction of tokens:

• At the beginning, the item is in a perfect operating state and one token is located in place “Up” (#1 = 1).
• From this state, only transition “Failure” is valid.
• When the transition is fired, the item fails: the token is removed from “Up” (#1 = 0) and one token is introduced in place “Wait for repair” (#2 = 1).
• From this state, only transition “Start of repair” is valid.
• When the transition is fired, the repair of the item starts: the token is removed from place “Wait for repair” (#2 = 0) and one token is introduced in place “Repair” (#3 = 1).
• From this state, only transition “End of repair” is valid.
• When the transition is fired, the item is operating again: the token is removed from place “Repair” (#3 = 0) and one token is introduced in place “Up” (#1 = 1).
• This state is identical to the initial state.

On the right-hand side of Fig. 33.7, place “Failure count” has been added in order to count the number of occurring failures. This addition does not change the behaviour of the PN with regard to failure and repair, but one token is added in this new place each time a failure is simulated, and the marking of this place is equal to the number, Nbf, of occurred failures (#4 = Nbf). Therefore, the number of tokens is not kept constant throughout the simulation and it is even unbounded, as an infinite number of failures can occur over an infinite time (T → ∞) of simulation. This is one of the “good” properties desired for automata which is not fulfilled when PNs are used for reliability modelling purposes.

Such a manual animation is very useful because it makes it possible to verify that the PN behaves as expected. When using software, this can be done through what is called a stepper, which allows the valid transitions to be triggered by hand, and it is wise to use a PN software package implementing this feature when developing large PNs.

Only the sequence of events is analysed above, but in an actual Monte Carlo simulation the delays between firings are taken into consideration and the process continues until one history is completed, i.e. when the date of firing of the next valid transition is larger than the period of interest, T. According to the principle of the Monte Carlo simulation, many such histories have to be simulated until representative statistical samples of the parameters of interest are gathered. When this is done on the example in Fig. 33.7, this makes it possible to estimate, for example:

• The probability to have one token in place “Up” at time T (#1 = 1 at time T): availability of the item at time T;
• The mean marking of place “Up” (mean value of #1 over [0, T]): average availability of the item over [0, T];
• Number of histories with 0 token in “Failure count” at T (#4 = 0 at time T) divided by the total number of simulated histories: reliability over [0, T];
• Mean firing number of transition “Failure”, mean number of tokens within place “Failure count” at T (mean value of #4 at time T) or mean number of the marking changes of Pl1 (mean number of changes of #1 from 1 to 0): mean number of failures of the item over [0, T];
• Time spent with one token in place “Up” (accumulated time with #1 = 1): accumulated up time of the item over [0, T];
• Time spent with one token in places “Wait for repair” or “Repair” (accumulated time with #2 = 1 plus accumulated time with #3 = 1), or more simply T minus the accumulated up time: accumulated down time of the item over [0, T];
• Accumulated up time divided by the mean number of failures: mean up time (MUT);
• Accumulated down time divided by the mean number of failures (or more simply MTBF − MUT, with the MTBF as defined in the next item): mean down time (MDT);
• T divided by the number of failures: mean time between failures (MTBF);
• Time spent with one token in place “Repair” (accumulated time with #3 = 1): maintenance load of the item over [0, T].
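Some of the estimations listed above can be reproduced with a short Monte Carlo simulation of the PN of Fig. 33.7. This is only a sketch and not the algorithm of any particular computation engine: since the PN holds a single token circulating between the three places, the marking is reduced to the name of the marked place, and the numerical values (λ = 10⁻³ h⁻¹, δ = 24 h, μ = 0.1 h⁻¹, T = 1000 h, 20,000 histories) as well as the function names are assumptions chosen for illustration.

```python
import math
import random


def simulate_history(lam, delta, mu, horizon, rng):
    """Simulate one history of the single repaired item over [0, horizon]."""
    t, place, failures = 0.0, "Up", 0
    time_in = {"Up": 0.0, "Wait for repair": 0.0, "Repair": 0.0}
    while True:
        if place == "Up":                   # "Failure": exponential law, rate lam
            delay = -math.log(1.0 - rng.random()) / lam
            next_place = "Wait for repair"
        elif place == "Wait for repair":    # "Start of repair": constant delay
            delay, next_place = delta, "Repair"
        else:                               # "End of repair": exponential law, rate mu
            delay = -math.log(1.0 - rng.random()) / mu
            next_place = "Up"
        if t + delay > horizon:             # next firing beyond T: history completed
            time_in[place] += horizon - t
            return place, time_in, failures
        time_in[place] += delay
        t += delay
        if place == "Up":
            failures += 1                   # one token added in "Failure count"
        place = next_place


def estimate(lam, delta, mu, horizon, n_histories, seed=1):
    """Gather the statistics over many histories."""
    rng = random.Random(seed)
    up_at_end = no_failure = total_failures = 0
    total_up_time = 0.0
    for _ in range(n_histories):
        place, time_in, failures = simulate_history(lam, delta, mu, horizon, rng)
        up_at_end += place == "Up"
        no_failure += failures == 0
        total_failures += failures
        total_up_time += time_in["Up"]
    return {
        "availability_at_T": up_at_end / n_histories,
        "average_availability": total_up_time / (n_histories * horizon),
        "reliability": no_failure / n_histories,
        "mean_failures": total_failures / n_histories,
    }


results = estimate(lam=1e-3, delta=24.0, mu=0.1, horizon=1000.0, n_histories=20_000)
print(results)
```

With these values, the estimated reliability should be close to exp(−λT) = exp(−1), which gives a simple check of the simulation.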

Fig. 33.8 PN modelling of a shared maintenance team (places Pl1 “Up”, Pl2 “Wait for repair”, Pl3 “Repair” and Pl4 “Repair team”; transitions Tr1 “Failure” (λ), Tr2 “Start of repair” and Tr3 “End of repair” (μ); place “Repair team” is linked from and to the other items)

The maintenance policy is very much simplified in Fig. 33.7. It has been improved in Fig. 33.8, where a single repair team has been modelled thanks to place “Repair team”, which can be shared between several items modelled in the same way. When the item is waiting for repair (#2 = 1), the repair starts immediately when the repair team is available (#4 = 1). Therefore, transition “Start of repair” is now an instantaneous transition. The waiting time before repair no longer depends on the firing delay of this transition but on the arrival of one token in place “Repair team”. When this transition is fired, the token in place “Repair team” is removed and no other item sharing this place can be repaired until the token comes back. This happens when transition “End of repair” is fired.

It has to be noted that this shared repair team mechanism has been introduced, again, without any change to the basic structure of the PN illustrated on the left-hand side of Fig. 33.7. This gives an idea of the flexibility of the PN modelling. This PN makes it possible to estimate the average waiting time before a repair is undertaken by dividing the accumulated time spent with one token in place “Wait for repair” by the number of occurred failures. This is useful to verify that the number of repair teams is neither under- nor over-dimensioned.

All the exercises mentioned in Sect. 33.14 illustrate counting processes, but exercises 33.1 and 33.2 are more specifically related to this subsection and their solutions can be found in Chap. 34.
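The sharing mechanism of Fig. 33.8 can be replayed as a token game with the basic validation and firing rules. The sketch below is only an illustration: two hypothetical items A and B, both waiting for repair, share a single token in place “Repair team”; the arcs are those described in the text (the token is removed by “Start of repair” and given back by “End of repair”) and the place and transition names are assumptions made for the example.

```python
def is_valid(marking, upstream):
    """Valid when every upstream place holds at least one token."""
    return all(marking[place] >= 1 for place in upstream)


def fire(marking, name, net):
    """Fire the named transition and return the new marking."""
    upstream, downstream = net[name]
    if not is_valid(marking, upstream):
        raise ValueError(f"{name} is inhibited")
    new_marking = dict(marking)
    for place in upstream:
        new_marking[place] -= 1
    for place in downstream:
        new_marking[place] += 1
    return new_marking


# Each transition is described by its (upstream places, downstream places).
net = {
    "Start of repair A": (["Wait A", "Repair team"], ["Repair A"]),
    "End of repair A": (["Repair A"], ["Up A", "Repair team"]),
    "Start of repair B": (["Wait B", "Repair team"], ["Repair B"]),
    "End of repair B": (["Repair B"], ["Up B", "Repair team"]),
}

# Both items are waiting for repair and the single repair team is available.
marking = {"Up A": 0, "Wait A": 1, "Repair A": 0,
           "Up B": 0, "Wait B": 1, "Repair B": 0, "Repair team": 1}

marking = fire(marking, "Start of repair A", net)
b_can_start_during_repair_of_a = is_valid(marking, net["Start of repair B"][0])

marking = fire(marking, "End of repair A", net)
b_can_start_after_repair_of_a = is_valid(marking, net["Start of repair B"][0])

print(b_can_start_during_repair_of_a, b_can_start_after_repair_of_a)
```

Item B remains in place “Wait B” as long as the repair of item A is in progress, which is exactly the waiting time that the simulation accumulates in place “Wait for repair”.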

33.5 Extensions of the Basic PNs

The basic PNs described above already have a modelling power greater than that of any other reliability modelling technique but, over time, some extensions have proven to be very useful to ease the modelling of complex behaviours and to manage PNs related to large industrial systems. These extensions are illustrated in Fig. 33.9.

Fig. 33.9 PN extensions of the static (left-hand side) and dynamic (right-hand side) parts (labels: repeated place, weighted arc of weight 3, reset arc of weight 0, inhibitor arc of weight −2, downstream arc of weight 2, tokens, predicate, local assertion, global assertions)

33.5.1 Weighted Arcs, Inhibitor Arcs and Repeated Places The introduction of weighted arcs and inhibitor arcs are rather common extensions of the basic Petri nets. This leads to the so-called generalized Petri nets: Upstream weighted arc: similar to an ordinary upstream arc but it validates the corresponding transition when the number q of tokens in its attached upstream place is greater or equal to its weight p i.e. when q ≥ p. Downstream weighted arc: similar to the ordinary downstream arc but which places a number p of tokens equal to its weight in its attached downstream place when the transition is fired. This is illustrated in Fig. 33.9 where an upstream arc of weight 3 and a downstream arc of weight 2 are represented. The weight of the arcs is indicated in the middle of them. It is equal to 1 for ordinary arcs and it is omitted in the graphical representation. Inhibitor arc: upstream arc which inhibits a transition when the number of tokens in its attached upstream place is greater to its weight. This is the counterpoint of the downstream weighted arcs as they inhibit the corresponding transition instead of validating it. An inhibitor arc of weight 2 is illustrated in Fig. 33.9. A minus sign has been added in order to easily distinguish the inhibitor arcs from the others. Therefore, an arc of weight -p is an inhibitor arc of weight p. According to the standard IEC 62551 (2012), the inhibitor arcs can be represented by a dotted line or by a solid line with a little circle (which is a reminder of the NOT gate used to represent logic models) in place of an arrow. The dotted line is adopted in this book as it is more easily readable. An ordinary inhibitor arc has also a weight equal to 1 (noted -1) which is generally omitted in the graphic representations. With the above convention, the weights of normal arcs are strictly positive and the weights of inhibitor arcs are strictly negative: this leaves room for a kind of arcs with a weight equal to zero. 
This is an opportunity to represent the reset arc:


Reset arc: upstream arc playing no role in validation or inhibition but removing all the tokens in its upstream place when the transition is fired.

Therefore, this arc resets the number of tokens to zero when the corresponding transition is fired. It has also been represented with a dotted line in Fig. 33.9.

Another very simple but very important extension is the introduction of the concept of repeated places:

Repeated place: copy of a place used in another location of the Petri net.

As illustrated in Fig. 33.9, a repeated place is represented by a square instead of a circle: this makes repeated places easy to identify and gives clearer models by avoiding arcs running throughout the PNs. Most of the exercises described in Sect. 33.14 involve inhibitor arcs and repeated places. Solutions can be found in Chap. 34.

33.5.2 Predicates and Assertions/Messages

The use of predicates and assertions, which allow information to be exchanged between the transitions, is less common than the use of weights on the arcs. PNs using predicates and assertions are called interpreted stochastic PNs.

Predicate: any formula which can be “true” or “false”.

As illustrated in Fig. 33.9, the predicates are identified by using a double question mark (“??”). They are used to validate a transition by testing the value of a given variable. Examples of predicates are the following:
– ??Mes1 == “true” which tests whether the value of the Boolean variable Mes1 is equal to “true”. In this case, the writing can be simplified to ?Mes1.
– ??Mes1 == “false” which tests whether the value of the Boolean variable Mes1 is equal to “false”. In this case, the writing can be simplified to ?-Mes1.
– ??X == Y which is “true” when X is actually equal to Y.
– ??X > 0 which is “true” when X is greater than 0.
– ??2 == 3 which is “false”.

Assertion: any formula allowing a variable used in the PN to be updated.

As illustrated in Fig. 33.9, the assertions are identified by using a double exclamation mark (“!!”). There are two kinds of assertions: the local assertions, which are linked to a given transition and update a variable when this transition is fired, and the general assertions, which are automatically and immediately updated as soon as one of the variables in the formula changes. Examples of assertions are the following:
– !!Mes2 = “true” which updates the value of the Boolean variable Mes2 to “true”. In this case, the writing can be simplified to !Mes2.
– !!Mes2 = “false” which updates the value of the Boolean variable Mes2 to “false”. In this case, the writing can be simplified to !-Mes2.
– !!X = 2 which assigns the value 2 to X.
– !!X = Y which assigns the value of Y to X.
– !!X = Y + Z which assigns the value of Y + Z to X.

33.5.3 New Validation of Transitions and Firing Rules

The introduction of new elements implies, of course, a change in the validation and firing rules. A transition is now valid when all the following conditions are met:
• Each upstream place contains a number of tokens at least equal to the weight of its upstream arc;
• Each place attached to an inhibitor arc contains a number of tokens lower than the weight of this inhibitor arc;
• The predicates are “true”.

Therefore, with 3 tokens in Pl1 and 1 token in Pl3 and provided that the message Mes1 (i.e. the only predicate) is “true”, the transition presented on the left-hand side of Fig. 33.10 is valid.

When a transition is fired, the firing rules are modified as follows:
• Remove from each upstream place a number of tokens equal to the weight of its upstream arc.
• Remove all the tokens from the upstream places connected by a reset arc.
• Add in each downstream place a number of tokens equal to the weight of its downstream arc.
• Execute all the assertions.

For the transition presented in Fig. 33.10, the result of the firing is represented on the right-hand side: 3 tokens removed from Pl1, 2 tokens removed from Pl2 and one added, 2 tokens added in Pl3 and the message Mes2 (i.e. the only assertion) becomes “true”. Therefore, whatever the number of tokens in Pl2 before the firing of the transition, it is always equal to 1 after the firing.

Fig. 33.10 Example of transition firing for a PN with extensions

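[Fig. 33.10 content: transition Tr, with predicate ?Mes1 and assertion !Mes2, connected to places Pl1, Pl2 and Pl3 through arcs of weights 3, -2, 0 and 2, shown before (left) and after (right) firing]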
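The validation and firing rules of Sect. 33.5.3 can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any PN tool: the class and field names are hypothetical, and the arc layout used in the example (inhibitor arc attached to Pl3, reset arc attached to Pl2, one token sent back to Pl2) is an assumption chosen to reproduce the token counts quoted in the text for Fig. 33.10.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """Transition of a generalized PN; all field names are illustrative."""
    upstream: dict = field(default_factory=dict)    # place -> weight p of a normal arc
    inhibitors: dict = field(default_factory=dict)  # place -> weight p of an inhibitor arc (drawn as -p)
    resets: set = field(default_factory=set)        # places emptied by a reset arc (weight 0)
    downstream: dict = field(default_factory=dict)  # place -> number of tokens added
    predicates: list = field(default_factory=list)  # functions(variables) -> bool
    assertions: list = field(default_factory=list)  # functions(variables) -> None


def is_valid(tr, marking, variables):
    """Validation rule: enough tokens upstream, no inhibition, all predicates true."""
    return (
        all(marking[p] >= w for p, w in tr.upstream.items())
        and all(marking[p] < w for p, w in tr.inhibitors.items())
        and all(pred(variables) for pred in tr.predicates)
    )


def fire(tr, marking, variables):
    """Firing rule: remove tokens, apply resets, add tokens, execute assertions."""
    if not is_valid(tr, marking, variables):
        raise ValueError("transition is not valid")
    for p, w in tr.upstream.items():
        marking[p] -= w
    for p in tr.resets:
        marking[p] = 0
    for p, w in tr.downstream.items():
        marking[p] += w
    for assertion in tr.assertions:
        assertion(variables)


def set_mes2(variables):
    variables["Mes2"] = True  # local assertion !Mes2


# Assumed layout: Pl1 -(3)-> Tr, inhibitor arc of weight 2 from Pl3,
# reset arc from Pl2, Tr -(1)-> Pl2 and Tr -(2)-> Pl3.
tr = Transition(
    upstream={"Pl1": 3},
    inhibitors={"Pl3": 2},
    resets={"Pl2"},
    downstream={"Pl2": 1, "Pl3": 2},
    predicates=[lambda v: v["Mes1"]],  # predicate ?Mes1
    assertions=[set_mes2],
)
marking = {"Pl1": 3, "Pl2": 2, "Pl3": 1}
variables = {"Mes1": True, "Mes2": False}
fire(tr, marking, variables)
print(marking, variables)
```

With this assumed layout, Pl2 always holds exactly one token after the firing, whatever its content before, which is the behaviour described above for the reset arc.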

[Fig. 33.11 content: five conflicting transitions, Tr1 (Prio = -2), Tr2 (Prio = -1), Tr3 (Prio = 0), Tr4 (Prio = 1) and Tr5 (Prio = 2)]

Fig. 33.11 Example of transitions with priority

33.6 Other Extensions

33.6.1 Priority of the Transitions

When several transitions are conflicting (see Fig. 33.11), a simple way to solve the conflict is to introduce priorities on these transitions. This avoids introducing extra places, predicates or inhibitor arcs to force the transitions to be fired in the correct order. When priorities are implemented, the transition with the highest priority is fired first: in Fig. 33.11, Tr5 is therefore fired first. When it has been fired, Tr5 is no longer valid and Tr4 now has the highest priority and can be immediately fired. When it has been fired, Tr4 is no longer valid either and Tr3 now has the highest priority, etc. Finally, the transitions are fired in the order Tr5, Tr4, Tr3, Tr2 and Tr1. It has to be noted that the normal priority of a transition is zero (Prio = 0) and that it can be omitted on graphic representations of PNs. Exercises 33.5 and 33.10 related to this subsection are described in Sect. 33.14 and their solutions can be found in Chap. 34.
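The conflict resolution described above amounts to repeatedly picking the valid transition with the highest priority. The following sketch illustrates this under simplifying assumptions: the function names are hypothetical, a transition is assumed to become invalid once fired, and ties between equal priorities are not addressed.

```python
def next_to_fire(valid, priority):
    """Pick the valid transition with the highest priority (default priority: 0)."""
    return max(valid, key=lambda name: priority.get(name, 0))


def firing_sequence(valid, priority):
    """Fire the conflicting transitions one by one, highest priority first."""
    remaining = set(valid)
    order = []
    while remaining:
        name = next_to_fire(remaining, priority)
        order.append(name)
        remaining.remove(name)  # assumed: a fired transition is no longer valid
    return order


# Priorities of Fig. 33.11; Tr3 is omitted because its priority is the default one.
prio = {"Tr1": -2, "Tr2": -1, "Tr4": 1, "Tr5": 2}
print(firing_sequence(["Tr1", "Tr2", "Tr3", "Tr4", "Tr5"], prio))
```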

33.6.2 Suspended Events (Transition with Memory)

In the previous examples (e.g. Figs. 33.7 or 33.8), repairs have been modelled just by using a simple repair rate but the organization of maintenance has not been taken into account at all. In real life, a repair team typically works only during the day and stops the repair operations during the night. Therefore, the repair operations follow the alternation of nights and days and are suspended during the night. This is a typical example of suspended events:

Suspended event: event whose occurrence is delayed due to the occurrence of a given state but which goes back to its normal course when this state disappears.

This is the case of the repairs illustrated in Fig. 33.12:

[Fig. 33.12 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 λ, SoR Tr2 δ, EoR Tr3 μ with memory) coupled with a Day/Night cycle (Tr4, δ = 8 h; Tr5, δ = 16 h), and a timeline showing the validation, inhibition, re-validation and firing of EoR with the total delay δR and the remaining delay α]

Fig. 33.12 Example of a repair suspended during the night (i.e. non-working hours)

• When a token arrives in place Rep (repair) during the day (one token in place “Day”), transition EoR (end of repair) is validated.
• Then a random number z is generated and, μ being a constant repair rate, the firing delay δR = −ln(z)/μ is calculated.
• If the night occurs before the firing of EoR, this transition is inhibited until the next day (i.e. the repair is suspended during the night). As shown in Fig. 33.12, the part of the work already done is equal to δR − α and the time needed to complete the repair operation is α.
• When the night finishes, transition EoR becomes valid again but it would be irrelevant to calculate a new random firing delay as if the repair were beginning from scratch: the remaining part of the delay, α, has to be used instead.

This raises a general question about a transition which is inhibited before being fired and which is validated again after some time:
• Either the revalidation is due to a new event and, in this case, a new firing delay has to be generated by using random numbers;
• Or the revalidation is due to the removal of the causes suspending the event which, then, can occur again and, in this case, the firing delay is equal to the remaining part of the time delay not previously consumed.

In the second case, the remaining time before firing has to be stored in memory to be used when the transition is validated again and this leads to the concept of transition with memory:


Transition with memory: transition storing in memory the remaining firing delay when it is inhibited before being fired. This property is indicated by “(Mem)”.

Transition EoR in Fig. 33.12 is a typical transition with memory which, when inhibited, “keeps in mind” how much time is needed to complete the repair operation. Suspended events, which are impossible to model with analytical models, are on the contrary very easy to take into account by introducing transitions with memory. Such transitions are, for example, useful to model counters like the one illustrated in Fig. 33.13, which triggers the preventive maintenance of the item after 5000 h spent in the up state. Transitions with memory are also very useful to model aging items which are used on an intermittent basis, as illustrated in Fig. 33.14. Note in this figure the message “?Idle” which is used to stop the item and the message “?-Idle” which is used to restart the item. These messages come from another part of the model and the related transitions are triggered instantaneously when they become “true”. With this model, the delay before the failure is drawn only once and the item is stopped and restarted until the failure actually occurs. Such a mechanism is useless when exponential laws are implemented as, by definition, they are memoryless (see Chap. 31 about Markov processes).

Fig. 33.13 Example of a counter for triggering the preventive maintenance

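[Fig. 33.13 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 λ, SoR Tr2 δ, EoR Tr3 μ) with a counter transition with memory (Mem, δ = 5000 h) triggering the preventive maintenance PM, which ends after the delay δPM (End of PM)]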

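Fig. 33.14 Example of an aging component used on an intermittent basis

[Fig. 33.14 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 with memory, SoR Tr2 δ, EoR Tr3 μ) with a place Idle reached through transition Stop (?Idle) and left through transition Start (?-Idle)]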

It has to be noted that, again, the above mechanisms have been introduced without perturbing the basic model of a repaired item. Exercises 33.8, 33.9 and 33.10 related to this subsection are described in Sect. 33.14 and their solution can be found in Chap. 34.
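The behaviour of a transition with memory can be illustrated with a small calculation for the suspended repair of Fig. 33.12. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, time 0 is taken as the start of a working day, and the working day is assumed to last 8 h followed by 16 h of night (the figure shows both durations, but their assignment to day and night is assumed here).

```python
import math
import random


def sample_repair_delay(mu, rng):
    """Firing delay of EoR drawn once: delta_R = -ln(z)/mu, with z in (0, 1]."""
    z = 1.0 - rng.random()
    return -math.log(z) / mu


def completion_time(start, work_needed, day=8.0, night=16.0):
    """Date at which a repair needing `work_needed` working hours is completed.

    The repair is suspended at night; the remaining delay (alpha in the text)
    is kept in memory and resumed the next morning instead of being redrawn.
    """
    cycle = day + night
    t = start
    remaining = work_needed
    while True:
        phase = t % cycle
        if phase >= day:              # night: EoR is inhibited until the morning
            t += cycle - phase
            continue
        available = day - phase       # working hours left today
        if remaining <= available:
            return t + remaining      # EoR is fired
        remaining -= available        # store the remaining delay in memory
        t += available


print(completion_time(6.0, 5.0))
print(completion_time(0.0, sample_repair_delay(0.1, random.Random(0))))
```

A repair needing 5 h of work and starting 6 h into the day is interrupted after 2 h and completed 3 h into the next working day. Without memory, a new delay would be drawn each morning, which is harmless for exponential delays but wrong for any other distribution.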

33.6.3 Probabilistic Switches

When an item in standby position is started, the start succeeds only with a given probability, and this kind of event cannot be modelled by a random delay as has been done above. However, using the properties of the exponential law, it is possible to model an item which starts or fails to start upon demand, as illustrated on the left-hand side of Fig. 33.15 where two exponential transitions are in conflict (i.e. valid at the same time):
• When the message “?Dem” becomes “true”, the two transitions Tr1 and Tr2 become valid and they are in conflict.
• The firing delay of Tr1 can be calculated by δ1 = −ln(z1)/λ1 and the firing delay of Tr2 by δ2 = −ln(z2)/λ2 where z1 and z2 are two random numbers.
• Then, when δ1 < δ2, Tr1 is fired first and vice versa.

According to the exponential law properties, Tr1 is fired first (i.e. δ1 < δ2) with the probability λ1/(λ1 + λ2) and Tr2 is fired first (i.e. δ2 < δ1) with the probability λ2/(λ1 + λ2). In addition, the mean delay before one of these transitions is fired is equal to δ = 1/(λ1 + λ2). Therefore, it is possible to model an event which occurs almost instantaneously with a probability (1 − γ) and does not occur with a probability γ by selecting two exponential delays governed by parameters λ1 and λ2 such that:
• δ = 1/(λ1 + λ2) is negligible with regard to the other delays considered in the PN;
• γ = λ2/(λ1 + λ2) and 1 − γ = λ1/(λ1 + λ2).

Fig. 33.15 Two ways to model success or failure upon demand

[Fig. 33.15 content: left, place SB (standby position) with two conflicting transitions Tr1 (?Dem, λ1, toward OK) and Tr2 (?Dem, λ2, toward KO); right, place SB with a probabilistic switch Tr3 (?Dem) leading to OK with probability 1−γ and to KO with probability γ]

The above technique works and can be used with the basic PNs but, nevertheless, it is time-consuming: two random numbers have to be generated and the exponential formula has to be inverted twice, and it has to be remembered that, in a Monte Carlo simulation, hundreds of such calculations have to be made. For more efficiency, the probabilistic switch illustrated on the right-hand side of Fig. 33.15 can be used instead:
• When the message “?Dem” becomes “true”, transition Tr3 becomes valid;
• Then a random number z3 is generated;
• If z3 > γ, a token is added in place OK when the transition is fired;
• If z3 ≤ γ, a token is added in place KO when the transition is fired.

From the Monte Carlo simulation point of view, this is far more efficient than the technique implementing exponential delays as it needs only a single random number. The drawback is that the firing rules have to be modified: when the transition is fired, only the downstream arc corresponding to the value of the generated random number is considered and the other is ignored. Therefore, after the firing of Tr3, one token is added in place OK or in place KO but not in both places at the same time.

A probabilistic switch can be easily added to transform the PN used to model a repaired item into a model for a repaired item operated in standby position. This is done in Fig. 33.16 where, when a demand occurs, the item starts with a probability 1 − γ and fails with a probability γ. In this figure, when the demand disappears (message “?-Dem” becomes “true”), the item goes back to the standby position. Again, the addition of the “Demand” and “End of demand” transitions has been made without modifying the basic modelling of the repaired item.

Both models in Fig. 33.15 can be easily extended to more than two possibilities. This is illustrated in Fig. 33.17 in the case where n states can be reached with the probabilities γ1, γ2, ..., γn when a triggering event (modelled by the message “?Dem”) occurs. Of course, the sum of these probabilities has to be equal to 1.

Fig. 33.16 Example of an item operated in standby position

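[Fig. 33.16 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 λ, SoR Tr2 δ, EoR Tr3 μ) with a standby place SB, a probabilistic switch “Demand” (?Dem, probability 1−γ toward Up) and a transition “End of demand” (?-Dem)]

[Fig. 33.17 content: left, place SB with n conflicting transitions Tr1, Tr2, ..., Trn (?Dem, rates λ1, λ2, ..., λn) leading to places E1, E2, ..., En; right, place SB with a single probabilistic switch Tr (?Dem) leading to E1, E2, ..., En with probabilities γ1, γ2, ..., γn]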

Fig. 33.17 Modelling of several events occurring on demand

Using the exponential delays as in the model on the left-hand side of the figure implies defining γi as $\gamma_i = \lambda_i / \sum_{k=1}^{n} \lambda_k$ and generating n different random numbers zi to simulate the n delays δi = −ln(zi)/λi:
• Select n values λi such that $\gamma_i = \lambda_i / \sum_{k=1}^{n} \lambda_k$ and $\delta = 1 / \sum_{k=1}^{n} \lambda_k$ is negligible.
• Draw n different random numbers zi to simulate the n delays δi = −ln(zi)/λi.
• Fire transition k with the shortest delay δk = Min(δi), which removes the token from place SB and inhibits all the other transitions.

Compared to the above model, the efficiency of the probabilistic switch is obvious in this case, as only one single random number z is needed to do the same thing:
• If z ≤ γ1, a token is added in place E1 when the transition is fired;
• If γ1 < z ≤ γ1 + γ2, a token is added in place E2 when the transition is fired;
• If γ1 + γ2 < z ≤ γ1 + γ2 + γ3, a token is added in place E3 when the transition is fired; etc.
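The two techniques compared in this subsection can be put side by side in a short sketch. The function names and the numerical values are assumptions made for illustration only; the random numbers are drawn in (0, 1] to avoid taking the logarithm of zero.

```python
import math
import random


def rates_for(gammas, mean_delay):
    """Rates lambda_i such that gamma_i = lambda_i / sum(lambda_k) and 1/sum(lambda_k) = mean_delay."""
    return [g / mean_delay for g in gammas]


def racing_exponentials(rates, rng):
    """Conflicting exponential transitions: n random numbers, the shortest delay wins."""
    delays = [-math.log(1.0 - rng.random()) / lam for lam in rates]
    k = min(range(len(rates)), key=delays.__getitem__)
    return k, delays[k]


def probabilistic_switch(gammas, z):
    """Probabilistic switch: a single random number z selects the downstream place."""
    cumulative = 0.0
    for k, g in enumerate(gammas):
        cumulative += g
        if z <= cumulative:
            return k
    return len(gammas) - 1  # guard against rounding when the gammas sum to slightly less than 1


gammas = [0.2, 0.3, 0.5]                     # illustrative probabilities of E1, E2, E3
rates = rates_for(gammas, mean_delay=1e-3)   # mean delay assumed negligible in the PN
rng = random.Random(1)
print(racing_exponentials(rates, rng))
print(probabilistic_switch(gammas, rng.random()))
```

Both functions select outcome i with probability γi, but the first one needs n random numbers and n logarithms per demand whereas the second needs a single random number.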

33.6.4 Dynamic Transitions

When a Monte Carlo simulation is in progress, the parameters of some transitions can change according to the states visited by the modelled system: for example, the failure rate of a computer increases very much when its ventilation fails. Modelling such dynamic changes requires dynamic transitions to be implemented within the PN:

Dynamic transition: transition whose parameters dynamically change according to the states visited during the Monte Carlo simulation.

Such a dynamic transition is illustrated in Fig. 33.18 where the distribution of the firing delay of Tr1 depends on variables (ν and ξ) which are updated by assertions executed in other parts of the PN:

Fig. 33.18 Modelling dynamic transitions

[Fig. 33.18 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1, SoR Tr2 δ, EoR Tr3 μ) where the firing delay distribution of Tr1 depends on variables updated by assertions]

• When such a change occurs while the transition is inhibited, there is no problem: the parameters of the distribution just have to be updated.
• When such a change occurs while the transition is already validated, the parameters of the distribution have to be updated but the firing delay which has already been calculated also has to be updated.

In the general case, updating the firing delay is not an easy task. However, as described in Chap. 32, it is rather easy when Weibull distributions are implemented. Such dynamic transitions are useful to model the semi-catastrophic common cause failure (see Chap. 5) analysed in Sect. 33.7.1.4.
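As a simple illustration of the second case, consider the exponential distribution only (the Weibull treatment of Chap. 32 is not reproduced here). Because the exponential law is memoryless, the remaining delay of an already validated transition is itself exponential with the old rate, so multiplying it by the ratio of the old rate to the new rate gives a remaining delay that follows the exponential law of the new rate. The function names below are hypothetical and this sketch is not claimed to be the method used in the book.

```python
def rescale_remaining_delay(remaining, old_rate, new_rate):
    """Remaining delay after a rate change, exponential case only.

    If the remaining delay follows Exp(old_rate), then
    remaining * old_rate / new_rate follows Exp(new_rate).
    """
    if old_rate <= 0 or new_rate <= 0:
        raise ValueError("rates must be strictly positive")
    return remaining * old_rate / new_rate


def update_firing_date(now, firing_date, old_rate, new_rate):
    """New firing date of a validated transition whose rate changes at time `now`."""
    remaining = firing_date - now
    return now + rescale_remaining_delay(remaining, old_rate, new_rate)


# Illustrative values: the failure rate is multiplied by 5 at t = 100 h,
# so the 500 h still to run are shortened to about 100 h.
print(update_firing_date(100.0, 600.0, 1e-4, 5e-4))
```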

33.7 Miscellaneous Modelling Techniques

33.7.1 Common Cause Failure Modelling

Safety and dependability of redundant systems can be impaired by the occurrence of common cause failures (CCFs) leading to the failure of several components at the same time. Therefore, the modelling of CCFs is an important topic to take into consideration when modelling industrial systems by using PNs. The use of predicates and assertions makes it rather easy to model the various kinds of common cause failures (see Chap. 5).

33.7.1.1 Breaking Model and Beta-Factor Model

When it occurs, this kind of CCF actually breaks (i.e. leads to the failure of) the exposed items. This is, for example, the case of an overvoltage, a fire or a flooding which leads to the failure of the exposed items.

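[Fig. 33.19 content: sub-PN of a repaired item (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1, SoR Tr2 δ, EoR Tr3 μ) with an additional transition Tr4 (?Ccf, Prio = 0), repeated for several similar items; CCF sub-PN with places OK and CCF, transition Tr5 (CCF occurrence, !Ccf) and transition Tr6 (!-Ccf, Prio = -1, low priority)]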

Fig. 33.19 Modelling CCF with the breaking model

This model is implemented in Fig. 33.19 for one of the items liable to be failed by this CCF. The other items have to be modelled in the same way with the sub-PN on the left-hand side of the figure:
• Transition Tr1 models the independent failures and Tr5 the common cause failure.
• The CCF occurs (firing of Tr5) with a delay governed by any kind of distribution; when it occurs, the logic variable Ccf becomes “true” (message “!Ccf” attached to Tr5).
• Then the predicate “?Ccf” becomes “true”, transition Tr4 is fired and the item enters a state where it waits to be repaired.

In this model, the repair of the CCF consists in repairing all the item failures individually and transition Tr6 is used only to reset the logic variable Ccf to “false”. A lower priority has been attached to Tr6 in order to avoid conflicts and force all the Tr4-like transitions (i.e. all the similar transitions related to the items exposed to this CCF) to be fired first.

A particular case of this model is the popular beta-factor model where the item failures are modelled by using a failure rate λ split into a common cause part, β · λ, and an independent part, (1 − β) · λ. With the beta-factor model, transition Tr5 is governed by an exponential law of parameter β · λ and the transitions Tr1 of the similar items by an exponential law of parameter (1 − β) · λ.
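The split of the failure rate used by the beta-factor model can be sketched as follows. The function names and numerical values are illustrative assumptions; the sketch only draws the first failure date of each item (no repair) to show that a single CCF delay is shared by all the exposed items while the independent delays are drawn item by item.

```python
import math
import random


def beta_factor_rates(lam, beta):
    """Split lam into (common cause rate beta*lam, independent rate (1-beta)*lam)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    return beta * lam, (1.0 - beta) * lam


def exp_delay(rate, rng):
    """Exponential delay -ln(z)/rate with z in (0, 1]; infinite when the rate is zero."""
    if rate == 0.0:
        return math.inf
    return -math.log(1.0 - rng.random()) / rate


def first_failure_times(n_items, lam, beta, rng):
    """First failure date of each item: earliest of its own Tr1 delay and the shared Tr5 delay."""
    lam_ccf, lam_ind = beta_factor_rates(lam, beta)
    t_ccf = exp_delay(lam_ccf, rng)  # one draw shared by all the exposed items
    return [min(exp_delay(lam_ind, rng), t_ccf) for _ in range(n_items)]


# Illustrative values: lam = 1e-4 per hour, beta = 10 %.
print(beta_factor_rates(1e-4, 0.1))
print(first_failure_times(3, 1e-4, 0.1, random.Random(42)))
```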

33.7.1.2 Disabling Model

In this CCF model, the CCF does not actually break (i.e. lead to the failure of) the exposed items but only disables them as long as it has not been repaired. This is, for example, the case of the loss of utilities providing power (electricity, fuel, etc.) or command-control which disables the connected items. In this case, the CCF can be modelled by a sub-PN similar to the one used for the items, as done in Fig. 33.20. Again, only one item is presented in this figure but all the items liable to be disabled by this CCF have to be modelled in the same way:

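[Fig. 33.20 content: item sub-PN with places Up (Pl1), Wait (Pl2) and Rep (Pl3), transitions Fail (Tr1, Mem), SoR (Tr2, ??NbR>0, !!NbR=NbR-1) and EoR (Tr3, !!NbR=NbR+1), and transitions Stop (?Ccf) and Start (?-Ccf), repeated for several similar items; CCF sub-PN with places OK (Pl4), Wait (Pl5) and Rep (Pl6), transitions Fail (Tr4, !Ccf), SoR (Tr5, ??NbR>0, !!NbR=NbR-1, Prio = 1) and EoR (Tr6, !-Ccf, !!NbR=NbR+1)]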

Fig. 33.20 Modelling a CCF disabling the exposed items

• Transition Tr4 models the occurrence of the CCF.
• When the CCF occurs (firing of Tr4) with a delay governed by any distribution, the logic variable Ccf becomes “true” (message “!Ccf”).
• When Ccf becomes “true”, transition “Stop” is validated if the item is in the Up state.
• Then it is immediately fired and the token is removed from place “Up”.
• When the CCF has been repaired, Tr6 is fired and the logic variable becomes “false” (message “!-Ccf”).
• When Ccf becomes “false”, transition “Start” is validated and it is immediately fired.
• A token is added in place “Up” and the item is operating again.

It has to be noted that transition “Fail” (Tr1) is now a transition with memory to take the intermittent functioning into account and that the repair teams have been represented by a logic variable NbR (number of repair teams) which is decremented by one when a repair starts and incremented by one when a repair ends. Note also the priority equal to 1 for Tr5, so that the repair of the CCF begins before that of the items waiting for repair (Tr2-like transitions).

33.7.1.3 Shock Model

This CCF model is complementary to the breaking model which, in fact, models the occurrence of a lethal shock. When the CCF of the shock model occurs, it provokes a non-lethal shock which leads to the failure of each exposed item only with a given probability γ. This is, for example, a sudden change in the temperature, a mechanical shock or, more generally, an event increasing the load of the exposed items and then their probabilities of failure. The shock model is commonly used as a complement to the beta-factor model. This model is implemented in Fig. 33.21. Only one item is presented in this figure but all the items liable to be failed by this CCF have to be modelled in the same way. The functioning of this PN is similar to the beta-factor model except that:

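[Fig. 33.21 content: item sub-PN (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 λind, SoR Tr2 δ, EoR Tr3 μ) with a probabilistic switch Tr4 (?Nls, Prio = 0) leading to Wait with probability γ and to the auxiliary place Pl4 with probability 1−γ, and a transition Tr5 (Prio = -2, lowest priority), repeated for several similar items; CCF sub-PN with places OK and CCF, transition Tr6 (non-lethal shock, λNls, !Nls) and transition Tr7 (!-Nls, Prio = -1, low priority)]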

Fig. 33.21 Modelling a CCF inducing a shock on the exposed items

• The logic variable Nls (non-lethal shock) is used to model the occurrence of the non-lethal shock;
• Transition Tr4 is now a probabilistic switch;
• An auxiliary place (Pl4) and an auxiliary transition (Tr5) have been added to solve the conflict between Tr4 and Tr7.

Several conflicts have to be solved in this PN when the non-lethal shock occurs (firing of Tr6):
• Nls becomes “true” and Tr4 is validated; one token is placed in place CCF and Tr7 is also validated.
• Then Tr4 and Tr7 are in conflict but, as Tr4 has to be fired first, the priority of Tr7 is lower than that of Tr4.
• When Tr4 is fired, one token appears in place “Wait” with a probability γ or in the auxiliary place Pl4 with a probability 1 − γ.
• When a token appears in Pl4, Tr5 becomes valid and it is in conflict with Tr7.
• Tr7, which resets Nls to “false”, has to be fired before Tr5 in order to prevent Tr4 from being validated again.

Therefore, the priority of Tr5 has to be lower than the priority of Tr7, which has to be lower than the priority of Tr4. This is why, in Fig. 33.21, the priority of Tr4 is 0, the priority of Tr7 is -1 and that of Tr5 is -2.

33.7.1.4 Semi-Catastrophic Model

Another modelling approach, close to the shock model, is to consider that an event occurring somewhere in the PN increases the failure probability of several items. This can be modelled by implementing dynamic transitions (see Sect. 33.6.4 and Chap. 32) as illustrated in Fig. 33.22:

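[Fig. 33.22 content: item sub-PN (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 as a dynamic transition, SoR Tr2 δ, EoR Tr3 μ), repeated for several similar items; CCF sub-PN with places OK and CCF and transition Tr4 carrying an assertion (!!), with links from and toward other parts of the PN]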

Fig. 33.22 Modelling a CCF with dynamic transitions

• On the right-hand side of the figure, the firing of transition Tr4 updates variable ν thanks to assertion !!ν = F(...) where F(...) is any function.
• This modifies the firing delay distribution, F(δ, ν), of the transitions Tr1 (Fail) of the several items which are exposed to this change.
• Then all the firing delays depending on variable ν are updated (i.e. shortened in case of CCF) according to its new value (see Sect. 33.6.4).

This model is useful to change the failure rate of several items when some external event occurs: for example, if the air conditioning fails in a given room, the electronic devices in this room do not all fail immediately but their failure rates increase according to the temperature. In the same way, if corrosion, humidity, load, etc. increase, the exposed devices do not fail immediately but their failure rates increase. It has to be noted that the same model can also be used to decrease the failure probability of several items at the same time. In this case, it models a common cause of improvement rather than a common cause of failure.

33.7.2 Modelling Maintenance and Maintenance Supports

33.7.2.1 Maintenance Tools Mobilisation

The maintenance teams have already been modelled in the previous subsections of this chapter:
• In Figs. 33.7, 33.12, 33.13, etc. with a simple waiting delay δ;
• In Fig. 33.8 with an auxiliary place “Repair team”;
• In Fig. 33.20 with a logic variable NbR.

[Fig. 33.23 content: item sub-PN (places Up Pl1, Wait Pl2, Rep Pl3; transitions Fail Tr1 with !!NbF=NbF+1, SoR Tr2, EoR Tr3 with !!NbF=NbF-1) using the repeated places OL and Mob, repeated for several similar items; mobilisation sub-PN with places nM (Pl4, not mobilised), Mob (Pl5, mobilised) and OL (Pl6, on location) and transitions SoM (Tr4, ??NbF>0, δM, mobilisation starts), TtL (Tr5, δT, travel to location) and EoM (Tr6, ??NbF=0, mobilisation ends)]

Fig. 33.23 Modelling the mobilisation of a maintenance support

This is sufficient to explain the basic mechanisms of PNs and to undertake basic modelling with PNs. However, complements are needed when maintenance operations have to be described in more detail, for instance to describe the mobilisation of the maintenance team or of the maintenance support (see Fig. 33.23). This PN functions as follows:
• When an item needing the maintenance support (e.g. a rig to maintain a subsea platform) fails (firing of Tr1), variable NbF (number of failures) is incremented by 1.
• As soon as this variable is greater than zero, transition Tr4 (SoM: start of mobilisation) is validated (predicate “??NbF > 0”); it is fired when some administrative delay δM (deterministic or random, depending on the nature of Tr4) has elapsed and the maintenance support is mobilised (one token in place Mob).
• Transition Tr5 (TtL: travel to location) is validated and it is fired after a given delay δT (deterministic or random, depending on the nature of Tr5).
• When Tr5 is fired, one token appears in place OL (on location) and the maintenance support is ready to be used.
• At this time, all the transitions similar to SoR (start of repair) of the items waiting for maintenance are validated.
• One of them is fired, the token is removed from OL and the maintenance begins for the item which has caught the token.
• When the repair is finished, the corresponding transition EoR (end of repair) of the repaired item is fired, tokens are placed in places Mob and Up and NbF is decremented by 1.
• Then transition Tr5 is validated again, the maintenance support travels toward the new item to repair and, after another delay δT, Tr5 is fired and one token appears in place OL.
• From this state, if another item needs to be repaired (NbF > 0), the same process as above is repeated.
• If no other item needs to be repaired, “??NbF = 0” becomes “true” and Tr6 (EoM: end of mobilisation) is validated and immediately fired.
• The maintenance support is demobilised and needs to be re-mobilised if a further failure occurs.

With this model, the maintenance support is mobilised as soon as one item failure occurs, the time needed to reach the location is taken into account, only one fault is repaired at a time and the demobilisation occurs as soon as no more repairs have to be done. If several resources have to be mobilised to perform the maintenance, similar mechanisms can be implemented for each of them.

33.7.2.2 Spare Part Modelling

Most of the maintenance interventions cannot be undertaken without having the right spare parts on hand. However, storing many spare parts of any kind is costly and takes space; this is why only a minimum of spare parts is available at any time and spare parts need to be reordered when they have been used. Figure 33.24 proposes two ways to model the spare part provisioning. In each case, a new spare part is ordered when the stock falls below sp units. On the left-hand side, the sub-PN works as follows:
• As long as the number of tokens in place SP is greater than or equal to the weight sp of the inhibitor arc, transition SpO (spare part order) is inhibited.
• When the number of tokens becomes lower than sp, SpO is validated and, after a delay δSP, it is fired and a new token (i.e. a new spare part) is added in place SP.

The PN on the right-hand side of the figure behaves exactly in the same way: when variable NbSP becomes lower than sp, SpO is validated and, when it is fired, NbSP is incremented by 1. This is an opportunity to illustrate a transition with neither arcs nor places, used only to update a variable by using predicates and assertions. In both cases, SpO is inhibited as soon as the number of spare parts in stock becomes equal to sp. The value of sp (i.e. the size of the stock) can be tuned according to the failure frequency and the provisioning delay in order to avoid shortages (and, therefore, maintenance delays) when a spare part is needed. Exercise 33.11 related to this subsection is described in Sect. 33.14 and its solution can be found in Chap. 34.

Fig. 33.24 Modelling the spare part provisioning

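[Fig. 33.24 content: left, transition SpO (spare part order, delay δSP) feeding place SP, which is attached to SpO by an inhibitor arc of weight -sp and leads toward the spare part users; right, transition SpO (delay δSP) with neither arcs nor places]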
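The reordering rule of the spare part sub-PN can be explored with a small deterministic sketch. It rests on several assumptions that are not in the text: the function name is hypothetical, the stock initially holds sp parts, the provisioning delay is constant, orders are placed one after the other (a single transition SpO), and a demand arriving when the stock is empty is simply counted as a shortage and dropped.

```python
def simulate_stock(sp, delay, demand_times):
    """Return (final stock, number of shortages) for demands given in increasing order."""
    stock = sp
    order_due = None  # firing date of SpO, None while SpO is inhibited
    shortages = 0
    for t in demand_times:
        # Fire SpO as many times as needed before the demand at time t.
        while order_due is not None and order_due <= t:
            stock += 1
            # SpO is validated again only if the stock is still below sp.
            order_due = order_due + delay if stock < sp else None
        if stock > 0:
            stock -= 1
        else:
            shortages += 1  # simplification: the unmet demand is counted and dropped
        if stock < sp and order_due is None:
            order_due = t + delay  # SpO is validated: a new part is ordered
    # Let the pending orders complete after the last demand.
    while order_due is not None:
        stock += 1
        order_due = order_due + delay if stock < sp else None
    return stock, shortages


# Illustrative values: stock of 2 parts, 10 h provisioning delay, three close demands.
print(simulate_stock(sp=2, delay=10.0, demand_times=[1.0, 2.0, 3.0]))
```

Running such a sketch for several values of sp gives a first idea of how the stock size trades storage against shortages, which is the tuning mentioned above.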

33.7.2.3 Repair Priorities

According to the impact of failures on a system, the repair is more or less urgent and this leads to the definition of several levels of priority with regard to repair:
• Urgent repair: an important function is immediately lost and has to be re-established as soon as possible. This is, for example, the case of a production valve which spuriously closes. Then the production is immediately lost and the repair has to be done as soon as possible.
• Non-urgent repair: an important function is lost but the repair can be delayed until the urgent repairs have been performed. This is, for example, the case of a production valve which is stuck open. There is no impact on the production but this can have an impact on safety. The repair has to be done as soon as possible but this is less urgent than for the spurious failures.
• Opportunistic repair: an auxiliary function is lost and the repair can be delayed until the maintenance support has been mobilised to repair urgent and/or non-urgent failures. This is, for example, the case of the failure of auxiliary components which do not prevent a system from operating. In this case, the repair is not urgent and can be done when the maintenance team is on location for any other reason.

These three kinds of repairs are illustrated in Fig. 33.25. The urgent repairs are presented on the left-hand side of the figure:
• When the item fails, it enters a state where it waits for an available spare part (place WSP).
• When the spare part is available (one token in SPU), Tr2 becomes valid and it is immediately fired: one token is added in place Urg (which counts the number of urgent failures to be repaired by the maintenance support), variable NbF is incremented by one and the item enters a state where it waits for the maintenance support (one token in place WMS).
• When the maintenance support is on location (one token in place OL), the repair can start and SoR (Tr3) is fired.

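[Fig. 33.25 content: three similar sub-PNs for urgent, non-urgent and opportunistic repairs, each with places Up (Pl1), WSP, WMS and Rep (Pl3), transitions Fail (Tr1), Tr2, SoR (Tr3) and EoR (Tr4), the repeated places Mob and OL and a spare part place (SPU, SPnU or SPOp); the urgent and non-urgent sub-PNs carry the assertions !!NbF=NbF+1 on Tr2 and !!NbF=NbF-1 on EoR; the repeated places Urg, nUrg and Opp count the failures of each kind, and Urg and nUrg are used to inhibit the start of lower-priority repairs]

Fig. 33.25 Modelling repair priority

[Fig. 33.26 content: mobilisation sub-PN with places nM (Pl4, not mobilised), Mob (Pl5, mobilised) and OL (Pl6, on location), transitions SoM (Tr4, ??NbF>0, δM), TtL (Tr5, δT, travel to location) and EoM (Tr6), and the repeated places Urg, nUrg and Opp attached to EoM]

Fig. 33.26 Maintenance support mobilisation modelling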

• When the repair is completed, transition EoR (Tr4) is fired: one token is removed from place Urg, tokens are added in places Mob and Up, and variable NbF is decremented by one.
This sub-PN is designed to work with the spare part model presented in Fig. 33.24 through the use of the repeated place SPU, and with the mobilisation model presented in Fig. 33.23 through the repeated places OL, Mob, Urg and variable NbF.
The non-urgent repairs are modelled in exactly the same way as the urgent repairs, except that the start of a non-urgent repair is inhibited as long as an urgent repair remains to be done. The opportunistic repairs are also modelled in the same way, but variable NbF is not modified and the start of an opportunistic repair is inhibited as long as an urgent or a non-urgent repair remains to be done.
With regard to the maintenance support mobilisation, the sub-PN proposed in Fig. 33.23 has to be slightly adapted so that the mobilisation ends only when all the failures (urgent, non-urgent and opportunistic) are repaired. This is illustrated in Fig. 33.26, where the repeated places Urg, nUrg and Opp are used to inhibit transition EoM (end of mobilisation). It has to be noted that variable NbF (number of failures) counts only the urgent and non-urgent failures and, therefore, the opportunistic failures do not contribute to the validation of SoM (start of mobilisation).
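The priority rules above boil down to a few guards on the markings of the repeated places. The following Python sketch is only an illustration: the function names and the argument layout are assumptions introduced here, not part of the PN formalism, and the guards are reduced to the inhibition conditions stated in the text.

```python
def can_start_repair(kind, on_location, urg, n_urg):
    """Guard of transition SoR for an item waiting for the maintenance support.

    kind        -- "urgent", "non_urgent" or "opportunistic" (hypothetical labels)
    on_location -- True when place OL holds a token
    urg, n_urg  -- markings of the repeated places Urg and nUrg
    """
    if not on_location:
        return False
    if kind == "urgent":
        return True
    if kind == "non_urgent":
        return urg == 0                    # inhibited by pending urgent repairs
    if kind == "opportunistic":
        return urg == 0 and n_urg == 0     # inhibited by urgent and non-urgent repairs
    raise ValueError(f"unknown repair kind: {kind}")


def can_start_mobilisation(nbf):
    """Guard ??NbF>0 of SoM: NbF counts urgent and non-urgent failures only."""
    return nbf > 0


def can_end_mobilisation(urg, n_urg, opp):
    """EoM is inhibited while a failure of any priority remains to be repaired."""
    return urg == 0 and n_urg == 0 and opp == 0
```

With such guards, an opportunistic failure alone never triggers a mobilisation, but it keeps the maintenance support on location once it has been mobilised for another reason.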

33.8 Undertaking System Modelling
33.8.1 Modelling of the System
The system made of two similar components with a single repair team, already analysed in Chap. 31 with the Markovian approach, is modelled in Fig. 33.27. Components A and B are modelled by implementing sub-PNs already analysed above:

Fig. 33.27 Modelling a redundant system of two components with a single repair team (component A: places Up1 (Pl1), Wait1 (Pl2), Rep1 (Pl3) and transitions Fail1 (Tr1), SoR1 (Tr2), EoR1 (Tr3); component B: places Up2 (Pl4), Wait2 (Pl5), Rep2 (Pl6) and transitions Fail2 (Tr4), SoR2 (Tr5), EoR2 (Tr6); shared variable MT)

• Each of them has
 – Three states (places): Up (operating), Wait (waiting for the maintenance team) and Rep (under repair);
 – Three transitions: Fail (failure), SoR (start of repair) and EoR (end of repair).
• The single repair team is modelled by the Boolean variable MT. When it is “true” (message “?MT”) the repair of one component can start; it becomes “false” (message “!-MT”) when one repair actually starts, to avoid having two repairs at the same time.
• When the repair ends, variable MT becomes “true” again (message “!MT”).
Exponential distributions have been chosen for failures and repairs in order to simplify the figure, but the development hereafter is valid for any kind of distribution.

33.8.2 Monte Carlo Simulation of the Model
The Monte Carlo simulation of this model is similar to that described in Sect. 32.2. The difference is that, now, there is a single repair team. The realization of one history of the system (i.e. a trajectory of the random process modelled by the Markov graph) over a period of time [0, T] is illustrated in Fig. 32.4 and performed as follows:
• Determine the initial conditions: one token in Up1, one token in Up2 and T0 = 0.
 – The initial state is [Up1, Up2], i.e. one token in Up1 and one token in Up2;
 – The initial time is T0 = 0;
 – The initial value of variable MT is “true”;
 – The initial timetable is [T0, T] where T is the duration of interest for the simulation.

• From state [Up1, Up2] two transitions are valid: the failures of A (Tr1) and B (Tr4).
 – By using a random number, calculate the firing delay δ1A of Tr1;
 – By using a random number, calculate the firing delay δ4B of Tr4;
 => Tr1 will occur at T1A = T0 + δ1A = δ1A.
 => Tr4 will occur at T4B = T0 + δ4B = δ4B.
 => The timetable becomes [T1A, T4B, T] if T1A < T4B and Tr1 is the next transition to be fired.
• Fire Tr1 at T1A: state [Wait1, Up2] is reached and, as MT is “true”, Tr2 (start of repair of A) becomes valid.
 – Tr2 being an instantaneous transition, its firing delay is δ2A = 0:
 => Tr2 will occur at T2A = T1A + 0 = T1A.
 => The timetable becomes [T2A, T4B, T] and Tr2 is the next transition to be fired.
• Fire Tr2 at T2A: state [Rep1, Up2] is reached, MT becomes “false” and Tr3 (end of repair of A) becomes valid.
 – By using a random number, calculate the firing delay δ3A of Tr3:
 => Tr3 will occur at T3A = T2A + δ3A.
 => If δ3A is short enough, the timetable becomes [T3A, T4B, T] and Tr3 (end of repair of A) is the next transition to be fired.
• Fire Tr3 at T3A: state [Up1, Up2] is reached, MT becomes “true” and Tr1 (failure of A) becomes valid again.
 – By using a random number, calculate the firing delay δ1A of Tr1:
 => Tr1 will occur at T1A = T3A + δ1A.
 => As it is likely that T4B < T1A, the timetable becomes [T4B, T1A, T] and Tr4 (failure of B) is the next transition to be fired.
• Fire Tr4 at T4B: state [Up1, Wait2] is reached and, as MT is “true”, Tr5 (start of repair of B) becomes valid.
 – Tr5 being an instantaneous transition, its firing delay is δ5B = 0:
 => Tr5 will occur at T5B = T4B + 0 = T4B.
 => The timetable becomes [T5B, T1A, T] and Tr5 is the next transition to be fired.
• Fire Tr5 at T5B: state [Up1, Rep2] is reached, MT becomes “false” and Tr6 (end of repair of B) becomes valid.
 – By using a random number, calculate the firing delay δ6B of Tr6:
 => Tr6 will occur at T6B = T5B + δ6B.
 => If T6B > T1A, the timetable becomes [T1A, T6B, T] and Tr1 (failure of A) is the next transition to be fired.


• Fire Tr1 at T1A: state [Wait1, Rep2] is reached and, as MT is “false”, transition Tr2 (start of repair of A) is inhibited as long as Tr6 (end of repair of B) has not been fired.
• Etc.
When the earliest firing date remaining in the timetable is greater than T, one history of the modelled system (i.e. one trajectory of the underlying stochastic process) has been completed. Performing a large number of such histories makes it possible to compute statistical estimates of the parameters of interest, as explained in Chap. 32.
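The walk-through above can be condensed into a small event-driven simulator. The following Python sketch is an illustration under stated assumptions: exponential delays for failures and repairs, identical rates for A and B, a dictionary used as timetable, and a FIFO queue for the waiting components. The function names and the choice of returning the time spent with both components down are hypothetical conveniences, not the implementation of any particular software package.

```python
import random


def simulate_history(lam, mu, horizon, rng):
    """Simulate one history over [0, horizon]; return the time spent with A and B both down."""
    state = ["Up", "Up"]          # states of components A (index 0) and B (index 1)
    mt = True                     # Boolean variable MT: repair team available
    waiting = []                  # components in state Wait, oldest first
    # Timetable: firing date of each valid timed transition.
    timetable = {("Fail", i): rng.expovariate(lam) for i in (0, 1)}
    now, down_time = 0.0, 0.0
    while True:
        (kind, i), date = min(timetable.items(), key=lambda item: item[1])
        end = min(date, horizon)
        if "Up" not in state:                 # system unavailable since the last firing
            down_time += end - now
        if date >= horizon:                   # next firing date beyond T: history completed
            return down_time
        now = date
        del timetable[(kind, i)]
        if kind == "Fail":                    # Tr1 or Tr4
            state[i] = "Wait"
            waiting.append(i)
        else:                                 # EoR: Tr3 or Tr6
            state[i] = "Up"
            mt = True
            timetable[("Fail", i)] = now + rng.expovariate(lam)
        if mt and waiting:                    # instantaneous SoR: Tr2 or Tr5
            j = waiting.pop(0)
            state[j] = "Rep"
            mt = False
            timetable[("EoR", j)] = now + rng.expovariate(mu)


def average_unavailability(lam, mu, horizon, histories, seed=0):
    """Estimate the average unavailability over [0, horizon] from many histories."""
    rng = random.Random(seed)
    total = sum(simulate_history(lam, mu, horizon, rng) for _ in range(histories))
    return total / (histories * horizon)
```

For instance, `average_unavailability(1e-3, 0.1, 1000.0, 2000)` runs 2000 histories with a failure rate of 10⁻³/h and a repair rate of 0.1/h; the estimate fluctuates with the seed and the number of histories.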

33.8.3 Timetable
In the above simulation, the concept of timetable has been introduced. According to the fired transitions, this table has successively comprised several values like [T1A, T4B, T], [T2A, T4B, T] or [T1A, T6B, T] indicating, at any time, which transitions were valid and when they were expected to be fired. In these tables, a value like T4B indicates that transition Tr4 is expected to be fired at date T4B. The timetable [T4B, T1A, T] indicates, for example, that the next transition to be fired is Tr4 and that it will be fired at date T4B. Therefore, T4B is, in fact, a shortcut covering a pair of values: the number of the transition and its expected firing date. The timetable [T4B, T1A, T] also indicates that Tr1 is expected to be fired in second position at date T1A and that no more transitions are expected before the duration of interest, T, has elapsed.
It has to be noted that such a timetable can be implemented in different ways in a software package in order to keep track of the transitions to be fired, their firing dates and the correct sequencing. The updating of this table is essential to properly undertake a Monte Carlo simulation and it has to be done each time a transition is fired. The above example shows how to remove the firing date of the fired transition and to add the firing dates of the newly validated transitions. However, this has to be completed by considering the transitions inhibited when another transition is fired: for example, when variable Ccf becomes “true” in the PN illustrated in Fig. 33.20, transition “Stop” is fired and this inhibits transition Tr1 (Fail). Then the firing date of Tr1 has to be removed from the timetable because it is no longer valid.
Finally, the updating of the timetable is performed in four steps:
• Remove the date of the fired transition;
• Remove the firing dates of the inhibited transitions;
• Calculate the firing dates of the newly validated transitions;
• Introduce them in the right places of the timetable, i.e. by ascending order of the firing dates, if they are lower than the period of interest T.

Proceeding in this way, the first transition in the list is always the next one to be fired and, when the list is empty, the simulation is completed. When such a table is implemented, the time progresses step by step from one firing date to the next, and this speeds up the simulation considerably compared to techniques handling the time on a continuous basis (e.g. proceeding by constant time steps).
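The four updating steps can be sketched with a sorted list of (firing date, transition) pairs. This is only one possible implementation among the "different ways" mentioned above: the function name, the argument layout and the use of Python's standard `bisect` module are assumptions made for illustration, and the horizon T is passed separately instead of being stored as the last entry of the table.

```python
import bisect


def update_timetable(timetable, fired, inhibited, newly_valid, horizon):
    """Return the updated timetable after transition `fired` has been fired.

    timetable   -- list of (firing_date, transition) sorted by ascending date
    inhibited   -- set of transitions invalidated by the firing
    newly_valid -- dict {transition: firing_date} of newly validated transitions
    horizon     -- period of interest T
    """
    # Steps 1 and 2: drop the fired transition and the inhibited ones.
    updated = [(date, tr) for date, tr in timetable
               if tr != fired and tr not in inhibited]
    # Steps 3 and 4: insert the new firing dates in ascending order, if before T.
    for tr, date in newly_valid.items():
        if date < horizon:
            bisect.insort(updated, (date, tr))
    return updated


timetable = [(120.0, "Tr1"), (310.0, "Tr4")]
# Firing Tr1 validates the instantaneous transition Tr2 at the same date.
timetable = update_timetable(timetable, "Tr1", set(), {"Tr2": 120.0}, horizon=1000.0)
```

The first element of the returned list is always the next transition to be fired, and an empty list means that the history is completed.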

33.8.4 Pre-Processing and Table of Impacted Transitions
When a transition is fired, a new state is reached and, from this new state, some transitions may be inhibited and some new transitions may be validated. It is by following what happens in this way that the graph of the successive markings described in the previous section can be obtained.
In Sect. 33.8.2, each time a transition was fired, the resulting state was analysed in order to identify which other transitions had become valid. Proceeding that way requires the same analysis to be repeated each time the same transition is fired, as for example for Tr1 (failure of A), which becomes valid twice in the above simulation. Repeating the same analysis is a waste of time, and this is especially costly in Monte Carlo simulation where it is performed thousands of times. Therefore, it is a good idea to analyse which transitions are impacted by the firing of another one and to build a table of the potentially impacted transitions to keep this information in memory.
For example, in Fig. 33.27 the firing of Tr1 impacts the validation of itself and of Tr2; the firing of Tr5 impacts the validation of itself, of Tr6 and of Tr2; the firing of Tr3 impacts the validation of itself, of Tr1, Tr2 and Tr5; etc. Then, when Tr1 is fired, only Tr1 and Tr2 have to be analysed to see whether they have become valid or not, and when Tr5 is fired, only Tr5, Tr6 and Tr2 have to be analysed. This is economical from a computation time point of view as only two or three transitions have to be analysed instead of six.
A simple pre-processing of the PN makes it easy to identify which transitions are impacted when a given transition is fired. These are the transitions
• having an upstream place in common with the upstream and downstream places of the fired transition;
• which use in their predicates some variables updated by the fired transition (by local and global assertions).
It has to be noted that the fired transition belongs to its own list of impacted transitions because it is not necessarily inhibited when it is fired. The pre-processing establishes the list of impacted transitions once and for all. Each time a transition is fired, this list can be used to identify the candidate transitions for inhibition or validation: they, and only they, have to be verified, as no change has occurred for the other ones in the overall PN. When such a table is implemented, the simulation is considerably sped up, as only a few transitions have to be verified each time a firing occurs instead of all the transitions of the overall PN.
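The two pre-processing rules can be applied mechanically to a description of the PN. In the Python sketch below, the dictionary layout (`upstream`, `downstream`, `reads`, `writes`) is an assumption chosen for illustration; the content of `PN` follows the places, transitions and messages of Fig. 33.27 as described in the text.

```python
# Structure of the PN of Fig. 33.27 (the dictionary layout is a hypothetical encoding).
PN = {
    "Tr1": {"upstream": {"Up1"},   "downstream": {"Wait1"}, "reads": set(),  "writes": set()},
    "Tr2": {"upstream": {"Wait1"}, "downstream": {"Rep1"},  "reads": {"MT"}, "writes": {"MT"}},
    "Tr3": {"upstream": {"Rep1"},  "downstream": {"Up1"},   "reads": set(),  "writes": {"MT"}},
    "Tr4": {"upstream": {"Up2"},   "downstream": {"Wait2"}, "reads": set(),  "writes": set()},
    "Tr5": {"upstream": {"Wait2"}, "downstream": {"Rep2"},  "reads": {"MT"}, "writes": {"MT"}},
    "Tr6": {"upstream": {"Rep2"},  "downstream": {"Up2"},   "reads": set(),  "writes": {"MT"}},
}


def impacted_transitions(pn):
    """Build, once and for all, the table {fired transition: set of impacted transitions}."""
    table = {}
    for fired, f in pn.items():
        touched = f["upstream"] | f["downstream"]      # places whose marking may change
        table[fired] = {
            name for name, t in pn.items()
            if t["upstream"] & touched                  # rule 1: shares a touched upstream place
            or t["reads"] & f["writes"]                 # rule 2: predicate uses an updated variable
        }
    return table


TABLE = impacted_transitions(PN)
```

With this encoding, `TABLE["Tr1"]` contains Tr1 and Tr2, `TABLE["Tr5"]` contains Tr2, Tr5 and Tr6, and `TABLE["Tr3"]` contains Tr1, Tr2, Tr3 and Tr5, which matches the examples given in the text.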


33.8.5 Preventing Endless Loops
The presence of loops is a normal situation in a PN. For example, in Fig. 33.27, transitions Tr1 (Fail1), Tr2 (SoR1) and Tr3 (EoR1) form a loop which is normally executed several times during a Monte Carlo simulation. With such loops, there is no problem as the firing dates increase progressively until the observation time T is exceeded. A problem occurs with a loop when the first firing date in the timetable is stuck at a value which does not change anymore. In this case, the observation time T cannot be reached and the PN enters an endless loop which cannot be stopped. This is a common problem which arises with automata, and a PN is actually a finite state automaton.
Figure 33.28 gives several examples of loops. On the far left-hand side, the sub-PN Loop 1 illustrates a token generator which places a new token in place Pl1 at a regular interval δ. When transition Tr1 is fired, it remains valid and generates a loop Tr1-Tr1-Tr1 etc. This gives a finite loop because the firing date is incremented by δ each time Tr1 is fired: after some firings, it reaches T and the simulation stops.
The sub-PN Loop 2 seems similar to Loop 1 and, as for Tr1, the firing of Tr2 generates a loop Tr2-Tr2-Tr2 etc. However, the firing delay being equal to 0, this gives an endless loop because the firing date does not change: the limit time T is never reached and the simulation never stops on this criterion. In addition, it continuously adds new tokens in Pl2 and the number of tokens in this place goes to infinity, which is likely to cause computation problems (overflow).
The sub-PN Loop 3 involves two transitions which go around in circles: the firing of Tr3 validates Tr4 and the firing of Tr4 validates Tr3. As both Tr3 and Tr4 are instantaneous transitions, the firing date does not change and the simulation cannot stop on the limit time T criterion.
The sub-PN Loop 4 is a single transition intended to update the value of message Mes. Again, this transition is continuously valid and, as its firing delay is equal to zero, it is continuously fired.

Fig. 33.28 Example of endless loops (sub-PNs Loop 1 to Loop 4)

Fig. 33.29 Example of endless loop prevention (sub-PNs Loop 1 to Loop 4)

Therefore, such endless loops have to be prevented, and this can be done by adding conditions to validate the transitions. This is done in Fig. 33.29:
• Loop 1: it is an endless loop only when δ = 0. In this case, the number of tokens in Pl1 is limited to three and Tr1 has to wait for the removal of at least one token before being fired again; when δ ≠ 0, it is similar to the sub-PN used to model the spare parts provisioning previously analysed in Sect. 33.7.2.2.
• Loop 2: the number of tokens in Pl2 is limited to one and Tr2 has to wait for the removal of this token before being fired again.
• Loop 3: Tr4 is now inhibited as long as the token in Pl5 has not been removed.
• Loop 4: now Tr5 is valid only if the logic variable Mes is “false”. When this occurs, Tr5 is immediately fired, Mes becomes “true” and Tr5 is inhibited and cannot be fired any more.
The occurrence of endless loops is a problem when performing Monte Carlo simulations, and preventing them is therefore a major concern when building PNs. Nevertheless, despite these precautions, some endless loops can remain in large PNs, especially when numerous instantaneous transitions are implemented (e.g. to model complicated system reconfigurations after failure), and several mechanisms can be introduced to stop the simulation when this happens:
1. Stop the computation when a preset limit computation time Tc is exceeded;
2. Stop the computation when a preset number nTr of transitions has been fired without modification of the first firing date in the timetable.
Defining an allowed computation time Tc and interrupting the simulation when it is reached is not specifically dedicated to detecting endless loops. It is especially useful when computing a PN for the first time, to see how many histories can be performed within a computing time equal to Tc. When an endless loop occurs, the number of histories is reduced and anomalies can be detected in the results.
The second solution is specifically dedicated to the detection of endless loops: it stops the simulation as soon as an endless loop occurs and also indicates which loop has occurred by printing the list of the nTr last transitions fired without change in the firing date. If nTr = 10, this leads to Tr2-Tr2-Tr2-Tr2-Tr2-Tr2-Tr2-Tr2-Tr2-Tr2 for Loop 2 and to Tr3-Tr4-Tr3-Tr4-Tr3-Tr4-Tr3-Tr4-Tr3-Tr4 for Loop 3. This is generally sufficient to clearly identify the loops and it is helpful for modifying the PN in order to prevent them. When nTr is too small, the detection can be triggered by sequences of instantaneous transitions which are not loops. This may arise with large PNs implementing many instantaneous transitions (e.g. for system reconfiguration modelling) and, in this case, nTr has to be increased until the false loops are no longer detected.
Besides the endless loops, the analysts using PNs for designing automata are also concerned with liveness (no deadlocks) and reachability (ability to come back to any state). These properties, which are essential for designing automata that operate properly, are not really useful in the reliability modelling context, where final states without transitions can be considered (e.g. absorbing state for reliability purposes or accident) and where a system leaving a functioning phase does not necessarily come back to it.
Exercises 33.1, 33.6, 33.7 and 33.12 to 33.15 related to this subsection are described in Sect. 33.14 and their solutions can be found in Chap. 34.
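The second mechanism (stopping after nTr firings without change in the firing date) can be sketched as a small watchdog called each time a transition is fired. The class name `LoopDetector` and its interface are hypothetical; the sketch tracks the date of the fired transitions, which coincides with the first firing date of the timetable at the moment of firing.

```python
from collections import deque


class LoopDetector:
    """Report the last n_tr fired transitions when they all share the same firing date."""

    def __init__(self, n_tr):
        self.n_tr = n_tr
        self.recent = deque(maxlen=n_tr)   # transitions fired at the current date
        self.last_date = None

    def record(self, transition, date):
        """Register one firing; return the suspected loop as a list, or None."""
        if date != self.last_date:         # time has progressed: no loop so far
            self.recent.clear()
            self.last_date = date
        self.recent.append(transition)
        if len(self.recent) == self.n_tr:
            return list(self.recent)
        return None


detector = LoopDetector(n_tr=10)
loop = None
for k in range(10):                         # Loop 3: Tr3 and Tr4 fire alternately at the same date
    loop = detector.record("Tr3" if k % 2 == 0 else "Tr4", 5.0)
```

In a simulator, a non-`None` return value would be used to interrupt the history and print the sequence, here Tr3-Tr4 repeated five times. As noted above, `n_tr` has to be large enough not to flag legitimate chains of instantaneous transitions.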

33.8.6 Markov Graph Generation
After having shown how a PN works, this is the right place to explain how it can be used to generate an equivalent Markov graph: this was the first use of Petri nets within the safety and dependability field and it is still useful for building large Markov graphs. As illustrated in Fig. 33.30, this is done by identifying the successive possible markings of the PN which are reachable from an initial state:
1. Initialize the list of reachable states with the initial state.
2. Identify the transitions valid from this state and add the new reachable states to the list.
3. Choose one of the reachable states not analysed yet and repeat step 2.
4. Stop when no unanalysed reachable state remains in the list.
Fig. 33.30 Markov graph (right) generated by the successive markings of a PN (left) (states E1 to E5; zero-duration states [Wait1, Up2] and [Up1, Wait2]; branches "A failed first" and "B failed first")
This makes it possible to build the marking graph related to the analysed Petri net, which can be defined as follows:
Marking graph: set of all reachable markings of a Petri net associated with the transitions leading from one marking to another.
It has to be noted that, in the general case, the marking graph depends on the initial state used to implement the above algorithm. Applying it to the PN in Fig. 33.27 and starting with state [Up1, Up2], states [Wait1, Up2] and [Up1, Wait2] are reachable. From [Wait1, Up2], state [Rep1, Up2] is reachable, and so on until all the states have been identified. This leads to the marking graph presented on the left-hand side of Fig. 33.30, where 7 states have been identified: 5 permanent states and 2 zero-duration states.
This process explores all the transitions able to be fired during a Monte Carlo simulation starting with the selected initial state. Therefore, it can be used, at the same time, to establish the table of impacted transitions described in Sect. 33.8.4.
As the failure and repair distributions are exponential, this marking graph is equivalent to the Markov graph presented on the right-hand side of the figure, where the zero-duration states can be eliminated. This models the FIFO (first in, first out) maintenance policy, which has also been identified in Chap. 31 (Fig. 31.48).
With this example, the 6 places and 6 transitions of the PN have been replaced by the 5 states and 8 transitions of the Markov graph. Therefore, the benefit in terms of model size is not obvious. It becomes more evident when 3 redundant components are considered: 9 places and 9 transitions are replaced by 11 states and 19 transitions.
And so on: when the number of components increases, the size of the PN increases linearly whereas the size of the equivalent Markov graph increases exponentially. Therefore, a PN comprising only exponential and instantaneous transitions is equivalent to a Markov graph and can be used as an interface for generating this Markov graph. The main interest of this approach is that it reduces the risk of errors when building the Markov graph and makes it possible to perform the calculations analytically. Of course, this is no longer possible as soon as non-exponential distributions (e.g. constant delays different from zero) are introduced in the PN.
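The exploration of the reachable markings can be sketched as a breadth-first search. The Python code below is an illustrative encoding of the PN of Fig. 33.27, not a general PN engine: a state is the hypothetical triple (state of A, state of B, value of MT), and it is assumed that an enabled instantaneous transition (SoR) is always fired before any timed transition, which is what makes a marking a zero-duration state.

```python
from collections import deque


def _move(state, i, comp, mt):
    """Return a copy of `state` with component i set to `comp` and MT set to `mt`."""
    nxt = list(state)
    nxt[i], nxt[2] = comp, mt
    return tuple(nxt)


def successors(state):
    """Return (transition, next_state, is_instantaneous) for every firing allowed in `state`."""
    mt = state[2]
    moves = []
    for i, name in ((0, "1"), (1, "2")):
        if state[i] == "Up":
            moves.append(("Fail" + name, _move(state, i, "Wait", mt), False))
        elif state[i] == "Wait" and mt:
            moves.append(("SoR" + name, _move(state, i, "Rep", False), True))
        elif state[i] == "Rep":
            moves.append(("EoR" + name, _move(state, i, "Up", True), False))
    instantaneous = [m for m in moves if m[2]]
    return instantaneous or moves          # instantaneous transitions take precedence


def marking_graph(initial):
    """Steps 1 to 4: explore every marking reachable from `initial`."""
    states, edges = [initial], []
    todo = deque([initial])
    while todo:
        state = todo.popleft()
        for transition, nxt, _ in successors(state):
            edges.append((state, transition, nxt))
            if nxt not in states:
                states.append(nxt)
                todo.append(nxt)
    return states, edges


STATES, EDGES = marking_graph(("Up", "Up", True))
ZERO_DURATION = [s for s in STATES if successors(s)[0][2]]
```

Under these assumptions the search finds 7 markings, 2 of which are zero-duration states, and 8 timed arcs, in line with the 5 states and 8 transitions of the Markov graph quoted in the text.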


33.9 Undertaking System Calculations
33.9.1 Availability and Unavailability
In the PN modelled in Fig. 33.27, the whole system is available when component A is available (#1 = 1) or when component B is available (#4 = 1). Then a classical IF-THEN-ELSE coding structure can be used to express a discrete random variable Av equal to 1 when the system is available and equal to 0 when it is unavailable:
IF #1 + #4 > 0 THEN Av=1 ELSE Av=0
More simply, an if-then-else operator, ite(.), can be used: Av = ite(#1 + #4 > 0; 1, 0).
The various values of the random variable Av can be used to perform statistical calculations during a Monte Carlo simulation exactly as for #1 in Sect. 33.4.5. With regard to availability and unavailability, a given history, h, can provide:
• Instantaneous availability $Av_h(T)$: value (1 or 0) of Av at time T;
• Instantaneous unavailability $Uv_h(T)$: value (0 or 1) of (1 − Av) at time T;
• Average availability $\overline{Av}_h$: mean value of Av over [0, T];
• Average unavailability $\overline{Uv}_h$: mean value of (1 − Av) over [0, T].

The average availability and unavailability mentioned above are the average values obtained for a given history, h, not the average values over the whole Monte Carlo simulation. As for $Av_h(T)$ or $Uv_h(T)$, each history provides single values $\overline{Av}_h$ and $\overline{Uv}_h$, and many such histories have to be performed to gather significant samples allowing relevant statistical estimations.
A more general way to perform the above calculations is to introduce the sub-PN illustrated in Fig. 33.31, which is linked to the modelled system through variable Av:
• When the system fails, Av becomes equal to 0 and transition “Failure” is immediately fired (=> #Up = 0, #Dwn = 1).
• When the system fails for the first time, transition “Unreliability” is immediately fired (=> #Fault = 1, #TTF = 0 and #Dwn remains equal to 1).
• When the system is repaired (restored), Av becomes equal to 1 and transition “Restoration” is immediately fired (=> #Dwn = 0, #Up = 1).
Fig. 33.31 Auxiliary sub-PN for availability, reliability and frequency calculations (places Up "Available", Dwn "Unavailable", TTF and Fault "Faulty"; transitions Failure with guard ??Av==0, Restoration with guard ??Av==1, and Unreliability)
From this auxiliary sub-PN the following statistics can be obtained:
• Average value of #Up at T: instantaneous availability;
• Average value of #Dwn at T: instantaneous unavailability;
• Mean marking of #Up over [0, T]: average availability;
• Mean marking of #Dwn over [0, T]: average unavailability;
• Mean number of firings of transition “Failure” over [0, T]: mean number of failures;
• Mean number of failures over [0, T] divided by the duration T: failure frequency.
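The statistics listed above can be computed from the successive values taken by Av during each history. In the Python sketch below, the representation of a history as a list of (date, Av) change points and the names of the functions and dictionary keys are assumptions made for illustration.

```python
def history_statistics(changes, horizon):
    """Statistics of one history.

    changes -- list of (date, Av) pairs sorted by date, starting at date 0
    horizon -- duration of interest T
    """
    up_time, failures = 0.0, 0
    ends = changes[1:] + [(horizon, changes[-1][1])]
    for (start, av), (end, next_av) in zip(changes, ends):
        up_time += av * (end - start)          # time spent with Av = 1
        if av == 1 and next_av == 0:           # one firing of transition "Failure"
            failures += 1
    return {
        "availability_at_T": changes[-1][1],
        "average_availability": up_time / horizon,
        "failures": failures,
    }


def monte_carlo_estimates(histories, horizon):
    """Average the per-history values over all the simulated histories."""
    stats = [history_statistics(h, horizon) for h in histories]
    n = len(stats)
    mean_failures = sum(s["failures"] for s in stats) / n
    return {
        "instantaneous_availability": sum(s["availability_at_T"] for s in stats) / n,
        "average_availability": sum(s["average_availability"] for s in stats) / n,
        "mean_number_of_failures": mean_failures,
        "failure_frequency": mean_failures / horizon,
    }


# Two made-up histories over [0, 1000 h], for illustration only.
HISTORIES = [
    [(0.0, 1), (400.0, 0), (450.0, 1), (900.0, 0)],
    [(0.0, 1)],
]
ESTIMATES = monte_carlo_estimates(HISTORIES, 1000.0)
```

The unavailability figures follow as one minus the corresponding availability; meaningful estimates require far more than the two histories used here.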

This is illustrated in Fig. 33.32 for unavailability and failure frequency. These results are obtained with failure rates equal to 10⁻³/h, repair rates equal to 0.1/h and 10⁶ histories (10 s of simulation on a PC). The curve for the instantaneous unavailability is more rugged than those related to the average unavailability or the failure frequency because it is obtained from a discrete random variable (only 1 or 0) whereas the others are real random variables. Fortunately, the instantaneous unavailability U(t) is generally not really interesting compared to the average unavailability Ū(0, t) or the failure frequency w(t). The curves related to these last two parameters are similar to those obtained with analytical calculation even if they are a little bit rougher (see Chaps. 22 or 31).

Fig. 33.32 Example of unavailability and failure frequency calculations (10⁶ histories; panels: unavailability U(t) and Ū(0, t) versus time, failure frequency w(t) versus time)

33.9.2 MTBF, MUT and MDT
In addition to availability and unavailability, the following statistical results can be obtained:
• T divided by the mean number of failures: mean time between failures (MTBF);
• The mean accumulated time with #Up = 1: mean accumulated up time;
• The mean accumulated time with #Dwn = 1: mean accumulated down time;
• The mean accumulated up time divided by the mean number of failures: mean up time (MUT);
• The mean accumulated down time divided by the mean number of failures: mean down time (MDT).
It has to be noted that the mean number of firings can also be obtained from the average number of changes from 1 to 0 of variable Av, and that the accumulated up time (respectively down time) can be obtained from the accumulated time spent with Av = 1 (respectively Av = 0).
This is illustrated in Fig. 33.33. MTBF(0, t) is drawn on the left-hand side and MUT(0, t) and MDT(0, t) are drawn on the right-hand side. As usual, MDT(0, t) […]
Prod = ite(70 × #1 + 40 × #4 > 90; 90, 70 × #1 + 40 × #4)
The production availability PA is equal to Prod/90 in order to have a result in percentage:
PA = ite(0.778 × #1 + 0.444 × #4 > 1; 1, 0.778 × #1 + 0.444 × #4)
Then the production availability can be calculated as the mean value of PA over a given duration of interest. This is the same kind of calculation as that performed above with Av to calculate the average availability. The results are illustrated in Fig. 33.35, where the average availability and the production availability are compared. The production availability is lower because the production capacity is only 122% (i.e. 110 m³/h / 90 m³/h) compared to 200% for the availability calculations.
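The expression of PA can be checked numerically for the four possible markings of places 1 and 4. The Python sketch below assumes, consistently with the coefficients 0.778 ≈ 70/90 and 0.444 ≈ 40/90, that component A produces 70 m³/h, component B produces 40 m³/h and the production is capped at 90 m³/h; the function and parameter names are hypothetical.

```python
def production_availability(n1, n4, capacity_a=70.0, capacity_b=40.0, demand=90.0):
    """PA = Prod / demand, the production being capped at the demand.

    n1, n4 -- markings #1 and #4 (1 when the component is up, 0 otherwise)
    """
    prod = min(demand, capacity_a * n1 + capacity_b * n4)   # ite(x > 90; 90, x)
    return prod / demand


# Value of PA for each combination of markings (#1, #4).
PA_TABLE = {(n1, n4): production_availability(n1, n4) for n1 in (0, 1) for n4 in (0, 1)}
```

Writing the cap with `min` is equivalent to the ite(.) form used in the text. The mean value of PA over [0, T], averaged over the histories, then gives the production availability in the same way as the mean value of Av gives the average availability.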

Fig. 33.35 Comparison between average availability and production availability (10⁶ histories; curves: average availability A(0, t) with asymptote Aas, production availability PA(0, t) with asymptote PAas, versus time)

33.10 Accuracy of Results and Data Uncertainty Handling
The simple PNs in Fig. 33.36 model repaired items with a failure rate and a repair rate. On the left-hand side of the figure, the failure rate is considered to be perfectly known and it is modelled by a constant value λ. On the right-hand side, the failure rate is considered as a random variable governed by the lognormal law LnN(λ, 3). With such models, the item unavailability, U(t), is obtained by observing whether one token is present in place “Down” at a given time t. The sub-PN on the left-hand side is used to illustrate how the Monte Carlo accuracy increases when the number of simulated histories increases, and the sub-PN on the right-hand side to analyse the impact of the uncertainty on the failure rate.

Fig. 33.36 PN modelling a repaired item without (left) and with (right) data uncertainty

Figure 33.37 shows how the width of the confidence interval (drawn in dotted lines) decreases when the number of simulated histories is multiplied by 10. It is divided by 3.16, which is exactly what is expected (i.e. √10 ≈ 3.16) according to Chap. 32, and the confidence in the estimation of U(t) increases.

Fig. 33.37 Confidence interval as a function of the number of simulated histories (simulation accuracy; panels: 5.0 × 10⁴ and 5.0 × 10⁵ histories)

The same calculations have been performed to obtain the results presented in Fig. 33.38. The difference is that the failure rate, λ, is now a random variable governed by a lognormal distribution with an average value equal to λ and an error factor equal to 3. In order to compare the results with those of Fig. 33.37, the same numbers of histories have been simulated but they have been split between 50 different values of the random variable. For example, the 5.0 × 10⁴ histories on the left-hand side of Fig. 33.37 have been split into 50 times 1000 histories on the left-hand side of Fig. 33.38.

Fig. 33.38 Confidence interval taking data uncertainties into account (data uncertainty; panels: 1.0 × 10³ × 50 and 1.0 × 10⁴ × 50 histories)

The behaviour of the confidence interval is very different compared to the previous case: it does not decrease but remains rather stable, and it reflects the impact of the uncertainty of the input data (i.e. the failure rate λ). When the number of histories is small, this confidence interval is the combination of the simulation accuracy and the data uncertainties. When the number of histories increases, the impact of the simulation accuracy decreases. For example, on the right-hand side of the figure, the confidence interval is mainly due to data uncertainties.
It has to be noted that the average value (in solid black line) in Fig. 33.38 is slightly below the average value in Fig. 33.37. This has already been brought to light in Chap. 25 with the Boolean models and is also analysed in Chap. 36 for safety systems related calculations. When the input data accuracy decreases (e.g. when error factors increase), the underestimation of the average values increases and non-conservative results can be obtained. When this is not acceptable (e.g. when dealing with safety systems), precautions have to be taken, for example those proposed in IEC 61508 (2010): perform point calculations with the 70% confidence values of the input data, or perform Monte Carlo simulations using the distributions of the input data and retain the 90% confidence upper bound of the result rather than the average value (see Chap. 36).
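The two effects discussed above can be sketched numerically. The Python code below relies on assumptions that are not stated in this section: the confidence interval is taken as the usual normal approximation for an estimated probability (two-sided 90%, z = 1.645), and the error factor is taken as the ratio of the 95th percentile to the median of the lognormal distribution, so that σ = ln(EF)/1.645. The function names and the numerical values are hypothetical.

```python
import math
import random


def half_width(p, n, z=1.645):
    """Half-width of the confidence interval of a probability p estimated from n histories."""
    return z * math.sqrt(p * (1.0 - p) / n)


def lognormal_parameters(mean, error_factor, z=1.645):
    """Parameters (mu, sigma) of ln(lambda) for a given mean value and error factor."""
    sigma = math.log(error_factor) / z
    mu = math.log(mean) - sigma ** 2 / 2.0     # so that exp(mu + sigma^2 / 2) = mean
    return mu, sigma


# Multiplying the number of histories by 10 divides the interval width by sqrt(10).
RATIO = half_width(0.01, 5.0e4) / half_width(0.01, 5.0e5)

# 50 failure rates drawn from the lognormal law; each one would then be used
# for its own batch of histories (e.g. 1000 histories per value).
MU, SIGMA = lognormal_parameters(mean=1.0e-3, error_factor=3.0)
rng = random.Random(0)
LAMBDAS = [rng.lognormvariate(MU, SIGMA) for _ in range(50)]
```

`RATIO` is close to 3.16 whatever the value of p. Increasing the number of histories per sampled failure rate narrows only the simulation part of the interval; the spread of `LAMBDAS` remains, which is why the interval stays rather stable when data uncertainty is taken into account.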


33.11 Building PNs Related to Large Systems 33.11.1 Main Drawback: Legibility Problem For more than 35 years, the Petri nets (PNs) have proven to be very powerful for safety/dependability modelling and calculations but, nevertheless, the dissemination is rather slow/and the PNs are not yet currently used by reliability engineers. Beyond the grievances from people reluctant to Monte Carlo simulation, the dissemination seems also impeded by the graphical model itself which often looks complicated and is difficult to understand as soon as more than two or three interacting components are modelled. This is a little bit surprising for a model expected to handle large industrial systems. A simple example can be used here to illustrate the problem: two redundant components periodically tested with a test interval τ and with a single repair team: therefore, a failure remains hidden until it is revealed by a test and can be repaired provided that the repair team is available. This simple system is modelled in Fig. 33.39. The resulting PN has been drawn by using the conventional graphic elements and in particular the conventional inhibitor arcs represented by a small circle at the end. It looks like many PNs proposed in publications with numerous parallel or crossing arcs difficult to follow. The result is a “spaghetti-like” PN whose underlying logic is difficult to understand without a great effort. This obliges the reader to enlarge the PN and put colours to follow

Fig. 33.39 Example of spaghetti-like PN


each arc from its beginning to its end! This is time-consuming, boring and, in a way, discouraging. The structure of this PN is not complicated. Nevertheless, it is made of the following elements:

• Components modelling:
  – Places modelling the states of component 1 (respectively component 2):
    • Up1 (Up2): component in good operating state.
    • Hid1 (Hid2): component failed, undetected.
    • Wait1 (Wait2): component failed, failure revealed (and waiting for repair).
    • Rep1 (Rep2): component under repair.
  – Transitions modelling the events occurring on component 1 (respectively component 2):
    • Tr1 (Tr5): failure of the component. The firing delay is exponentially distributed, failure rate λ1 (respectively λ2).
    • Tr2 (Tr6): detection of the failure of the component. This is an instantaneous transition.
    • Tr3 (Tr7): start of repair of the component. This is an instantaneous transition.
    • Tr4 (Tr8): end of repair of the component. The firing delay is governed by a general distribution F(δ) which is the same for both components.

• Test modelling:
  – Places modelling the test operations:
    • WT: the system is waiting for tests (i.e. within the test interval and the test occurs at the end of this period).
    • D: auxiliary place allowing the occurred failures to be detected by the tests before entering the following test interval.
  – Transitions modelling the test events:
    • Tr9: test performance after one test interval has elapsed. This is a deterministic transition with a firing delay equal to τ, which is the test interval duration. Tr9 models both the end of test intervals and the performance of tests.
    • Tr10: return to the test interval period. This is an instantaneous transition which is fired when all the hidden failures have been revealed.

• Detection of the first failure for unreliability calculations:
  – F: place indicating that the system has failed at least once.
  – Tr11: detection of the first overall system failure. This is an instantaneous transition.


• Auxiliary places:
  – RT: repair team. When a repair starts, the token is removed from this place and no other repair can take place until the token comes back into this place when the repair is completed.
  – CF: counting failures. This place receives one token when one failure occurs and loses one token when one repair is completed. It makes it possible to know how many items are failed at any time.

In this PN, several instantaneous transitions introduce some conflicts which have been solved by using inhibitor arcs. For example, let us consider that component 1 has failed (Tr1 has been fired and there is one token in Hid1):

• The detection of this failure cannot be done before the next test has been performed: therefore, Tr9 must be fired before Tr2 is enabled. This is achieved through the inhibitor arc between WT and Tr2.
• When the test occurs, Tr9 is fired, one token is added in D and Tr2 is enabled.
• Tr2 must be fired before Tr10 (otherwise the failure would not be detected), so Tr10 must be inhibited as long as there is one token in Hid1. This is achieved through the inhibitor arc between Hid1 and Tr10.
• Then, provided that Hid2 is empty (no failure of component 2), Tr10 is enabled and can be fired immediately to come back to the waiting-for-test conditions.

These mechanisms are implemented in Fig. 33.39 for component 1 as well as for component 2. The PN in Fig. 33.39 also contains a mechanism to detect the first failure of the redundant system. This is done through transition Tr11 and place F. Tr11 works as a logic AND gate: it is inhibited as long as at least one of the two components is available, and it is fired as soon as the two components are unavailable at the same time. When Tr11 is fired, one token arrives in place F which, thanks to the inhibitor arc, prevents this transition from being fired twice. Therefore, Tr11 is a single-shot transition which is fired only once during one history if the system has had at least one failure. Then, if Tr11 has been fired k times over n histories, the unreliability of the system can be estimated by k/n which is, in fact, the average firing frequency of this transition.

Therefore, the explanations of the model are rather simple for a model which looks rather complicated. This kind of representation contributes to the idea that PNs are esoteric and difficult to handle whereas, actually, they are rather simple to use. Moreover, it has to be noted that only two items are represented above: this gives an idea of what happens when dozens of interacting items have to be modelled.
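The single-shot behaviour of Tr11 described above can be sketched with a very small token game. This is only an illustration under stated assumptions: the `Transition` class, the function names and the dictionary-based marking are hypothetical and do not come from any PN tool; the inhibitor arcs of Tr11 are assumed to come from places Up1, Up2 and F, each with a threshold of one token.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str
    inputs: dict = field(default_factory=dict)      # place -> tokens consumed
    inhibitors: dict = field(default_factory=dict)  # place -> token count that blocks firing
    outputs: dict = field(default_factory=dict)     # place -> tokens produced


def is_enabled(tr, marking):
    """A transition is enabled if its input places hold enough tokens
    and none of its inhibitor places has reached its threshold."""
    has_inputs = all(marking.get(p, 0) >= w for p, w in tr.inputs.items())
    not_inhibited = all(marking.get(p, 0) < w for p, w in tr.inhibitors.items())
    return has_inputs and not_inhibited


def fire(tr, marking):
    """Return the new marking obtained by firing an enabled transition."""
    if not is_enabled(tr, marking):
        raise ValueError(f"{tr.name} is not enabled")
    new_marking = dict(marking)
    for p, w in tr.inputs.items():
        new_marking[p] = new_marking.get(p, 0) - w
    for p, w in tr.outputs.items():
        new_marking[p] = new_marking.get(p, 0) + w
    return new_marking


# Tr11: inhibited while Up1 or Up2 is marked, and by F once it has fired.
TR11 = Transition("Tr11", inhibitors={"Up1": 1, "Up2": 1, "F": 1}, outputs={"F": 1})


def estimate_unreliability(k, n):
    """Unreliability estimate: Tr11 fired in k histories out of n."""
    return k / n


if __name__ == "__main__":
    both_failed = {"Up1": 0, "Up2": 0, "F": 0}
    after = fire(TR11, both_failed)
    print(after["F"], is_enabled(TR11, after))
```

In a Monte Carlo run, counting the histories that end with a token in F gives k, and `estimate_unreliability(k, n)` gives the corresponding estimate.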

33.11.2 Increasing Legibility of Large PNs

As done in Fig. 33.40, one way to make the PN presented in Fig. 33.39 easier to read is to draw in dotted lines (see Fig. 33.9) the inhibitor arcs previously drawn in solid lines. This makes it possible to distinguish at a glance between ordinary arcs and


Fig. 33.40 Use of inhibitor arcs in dotted lines for easier identification

inhibitor arcs. This is an improvement but, obviously, it would not be enough to keep control of the building of PNs related to large systems. The solution is to split the overall PN into easily intelligible sub-PNs and to link them in some way. The first step is then to identify the potential sub-PNs, as illustrated in Fig. 33.40 where the various parts have been made apparent. The second step is to remove all the crossing arcs running throughout the model and to replace them with simple links, as done in Fig. 33.41 by implementing the repeated places introduced in 33.5.1. This leads to splitting the overall PN into six individual sub-PNs:

• Two sub-PNs modelling the behaviour of the components themselves: Component 1 and Component 2,
• Two auxiliary sub-PNs: “Unreliability” and “Test operations”,
• Two auxiliary places: RT and CF (sub-PNs reduced to single places).

The intricate arcs have disappeared, the six sub-PNs now communicate through the repeated places, and the overall PN is readily understandable and can easily be extended to three or more redundant components. Therefore, splitting large PNs into smaller sub-PNs which are easy to develop and which communicate with each other in some way (repeated places as above, or messages) seems to be the key to keeping control when building PNs related to large industrial systems.

Fig. 33.41 Splitting of the model into modules

33.11.3 Modularization of Large PNs

The need for splitting the overall PN into smaller sub-PNs which are easy to understand and can be developed separately leads immediately to the idea of modularization of the overall PN, and this implies the notion of modules. Figure 33.42 shows a sub-PN extracted from Fig. 33.41 and modelling an item with hidden failures, periodically tested to be revealed and repaired. Such a sub-PN can be used as a module and constitutes an interesting way to build a PN in a modular way. In addition, it makes a clear distinction between the essential part of the model (the skeleton) and the auxiliary elements used to synchronize this essential part with the rest of the overall PN. However, in Fig. 33.41, the use of repeated places in the “Unreliability” sub-PN (Up1 and Up2) and in the “Test operations” sub-PN (Hid1 and Hid2) requires these sub-PNs to be adapted to the number of modelled items. A more general model can be obtained by using predicates and assertions, as illustrated in Fig. 33.43. In this figure, the repeated place RT is replaced by the integer variable NbR (number of repair teams available), the repeated place CF by the integer variable NbF (number of faulty items) and the repeated place WT by the logic variable Tst (test in progress). These variables are used in turn within predicates and assertions as follows:

• When the item fails, assertion “!!NbF = NbF + 1” increments the number of faulty items by 1. When the item is repaired, assertion “!!NbF = NbF − 1” decrements the number of faulty items by 1. Therefore, NbF is equal to the number of faulty items at any time.

Fig. 33.42 Generic module for an item subject to hidden failures detected by periodic tests

Fig. 33.43 Generic modules of items with hidden failures: use of predicates and assertions

• As soon as NbF becomes greater than k, the unreliability transition is fired thanks to the predicate “??NbF > k”. This makes it possible to model a system of n items operated with a k/n logic. E.g. k is equal to 2 if 3 items operated in 2 out of 3 are modelled by implementing three times the module on the left-hand side of the figure.


• When a failure occurs, it remains hidden as long as no test is performed. This is done thanks to assertion “!Tst” emitted by transition “Test starts” and the predicate “?Tst” received by transition “Detection”.

In the model proposed in Figs. 33.39 to 33.41, all the tests are performed at regular intervals even when no faults to be detected are present. As in this model the tests only reveal the faults but have no other impact on the item states, performing a test when NbF is equal to 0 is useless: it is going to reveal nothing! During Monte Carlo simulations, these tests are going to be performed many times: if the test interval is short, such useless calculations should be avoided. Although the faulty states are unknown in real life, they are perfectly known during the simulation, and therefore the idea is to trigger a test only when it is sure that it is going to reveal something. This can be done by using one of the periodic test laws introduced in Chap. 32 to calculate the firing delay of the next test: for a failure occurring at time t, the delay will be, for example, δ = τ − (t mod τ). This simplifies the model and introduces the periodic test directly on transition “Detection” of the tested modules, as done in Fig. 33.43. Therefore, the first occurring failure determines the date of the next test. Such a trick speeds up the simulation but, of course, if the tests have impacts to be considered beyond the simple fault detection (e.g. failures due to tests, costs, production losses, etc.), all the tests have to be modelled as done in Figs. 33.39, 33.40 and 33.41.

In the item module, a logic variable Ci has also been introduced to model the item state at any time. When the item fails, Ci is updated to “false” by using assertion “!-Ci” and when it is repaired it is updated to “true” by using assertion “!Ci”. This state variable is useful when implementing the RBD (reliability block diagram)-driven PNs or dynamic fault trees described in 33.11.4.

The example above has been developed for periodically tested items but, since the beginning of this chapter, the PNs have been organized in sub-PNs which can easily be used as modules. This is illustrated in Fig. 33.44, where a module related to items with revealed failures and disabling CCF is represented on the left-hand side and a module related to items with revealed failures and maintenance support mobilisation is represented on the right-hand side. In these modules, the logic state variable Ci has been replaced by an integer variable with two values, 0 and 1, which plays exactly the same role. When modelling production systems, this variable can be replaced by a real variable with two values, 0 and Ki, where Ki is the nominal production capacity of the considered item. This variable is useful when implementing the FD (flow diagram)-driven PNs described in 33.11.5.
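The delay rule δ = τ − (t mod τ) mentioned above for transition “Detection” can be sketched as follows. This is a minimal illustration which assumes that the tests take place at times τ, 2τ, 3τ, etc., counted from time 0; the function name is hypothetical.

```python
def next_test_delay(t, tau):
    """Delay between a failure occurring at time t and the next periodic test.

    Assumption: tests are performed at tau, 2*tau, 3*tau, ...
    A failure occurring exactly at a test date is revealed by the following test.
    """
    if tau <= 0:
        raise ValueError("the test interval must be positive")
    return tau - (t % tau)


if __name__ == "__main__":
    t_failure = 2500
    delay = next_test_delay(t_failure, tau=1000)
    # Date at which the hidden failure is revealed.
    print(delay, t_failure + delay)
```

With this rule, only the tests which actually reveal a failure are simulated, which is the purpose of the trick described above.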

Fig. 33.44 Generic modules of items with revealed failures (left-hand side: generic item with revealed failures and disabling CCF; right-hand side: generic item with revealed failures and maintenance support mobilisation)

33.11.4 Modelling of Binary Systems

33.11.4.1 RBD-Driven PNs (Dynamic Reliability Block Diagrams)

Basic Elements of RBD-Driven PNs

When a PN is used for classical availability or reliability calculations, the modelled system is a binary system with only two states. Then, its failure or success can be modelled by a logic combination of the state variables of its components which, in turn, can be modelled by one of the Boolean models (see Part 3). This leads to the idea of using sub-PN modules to model the behaviour of the components and an RBD to model the logic linking the states of the overall system to the component states. The result is an RBD-driven PN, which is one of the simplest ways to obtain dynamic RBDs (see Chap. 27). The elements of such RBD-driven PNs are presented in Fig. 33.45:

• Blocks: modelled by sub-PN modules governing the value of the state variable Ci of the modelled item. This state variable is the core of the RBD-driven PN modelling. In Fig. 33.45, an integer variable has been chosen and the output CiOut of block Ci is equal to 1 if its input, CiIn, is equal to 1 (the system is not failed upstream) and its state variable, Ci, is also equal to 1 (the block itself is not failed). Translated into an assertion, this gives !!CiOut = CiIn × Ci.
• Serial node: links the output of one block to one or several other nodes. Its output is equal to its input. For node Ns1, this gives, for example, !!Ns1 = In1.

Fig. 33.45 Elements of an RBD-driven PN

• Majority vote (r/m) node: links the outputs of several blocks/nodes to another block. The output of this node is equal to 1 if at least r of its m inputs are equal to 1. For node Np, this gives !!Np = ite($\sum_{k=1}^{m} In_k \geq r$; 1, 0).
• Parallel node: particular case of a majority vote logic where r is equal to 1 (i.e. r/m = 1/m).

Each of the assertions mentioned above is a global assertion which updates the system state as soon as an item state changes.
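The block and node assertions listed above can be sketched in a few lines. This is only an illustration of the logic with integer state variables (1 = not failed, 0 = failed); the function names are hypothetical and `ite` simply mimics the “if-then-else” operator used in the assertions.

```python
def ite(condition, then_value, else_value):
    """If-then-else operator, as used in the assertions."""
    return then_value if condition else else_value


def block_output(c_in, c):
    """!!CiOut = CiIn x Ci: the output is 1 only if the input and the block are both 1."""
    return c_in * c


def serial_node(value_in):
    """A serial node simply propagates its input."""
    return value_in


def majority_vote_node(inputs, r):
    """r/m node: 1 if at least r of the m inputs are equal to 1."""
    return ite(sum(inputs) >= r, 1, 0)


def parallel_node(inputs):
    """Parallel node: particular case r = 1."""
    return majority_vote_node(inputs, 1)


if __name__ == "__main__":
    # Three blocks fed by the same input; the second block is failed.
    states = [1, 0, 1]
    outputs = [serial_node(block_output(1, c)) for c in states]
    print(majority_vote_node(outputs, r=2), parallel_node(outputs))
```

Re-evaluating these functions each time a state variable changes plays the role of the global assertions.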

Example of RBD-Driven PNs

Figure 33.46 gives an example of such an RBD-driven PN modelling a redundant system made of two similar components A and B subject to a breaking common cause failure, using spare parts to be repaired and sharing the same maintenance team, which needs a delay δ before arriving on location. This RBD-driven PN is made of two parts:

• A virtual RBD on the left-hand side of the figure;
• Three auxiliary sub-PNs on the right-hand side:
  – CCF modelling,
  – Spare parts provisioning,
  – Availability/reliability calculations (similar to Fig. 33.31).

The module used for modelling the individual components is slightly different from the module discussed above. This has been done to highlight the flexibility of the approach, which allows the same things to be modelled in many different ways. The different parts of the overall PN communicate by using predicates, assertions (messages) and repeated places:

Fig. 33.46 Example of an RBD-driven PN

• As soon as the CCF occurs, both A and B fail (note the priority on the transition, which has been explained in 33.6.1).
• When an item fails, it can be repaired when the maintenance team is available (message “?MT” is “true”) and when one spare part is available (at least one token in place SP).
• The repair starts after a constant delay corresponding to the time needed by the maintenance team to reach the location of the failure. When it starts, the message “!-MT” makes the maintenance team unavailable for another repair at the same time.
• When the repair is completed, the message “!MT” becomes “true” and the maintenance team is available again.

The underlying virtual RBD is modelled by the following assertions:

• Input of the RBD (“true” all the time): !!In = 1;
• Serial node N1 at the output of module A: !!N1 = In × A;
• Serial node N2 at the output of module B: !!N2 = In × B;
• Parallel node S at the output of the RBD: !!S = ite(N1 + N2 ≥ 1; 1, 0).

Such an RBD-driven PN can easily be extended to more than two redundant components just by adding more item modules. For example, extending this model to 3 identical items operating in 2 out of 3 only requires the following:

• Add a third module C identical to A and B, with a state variable C;
• Add assertion !!N3 = In × C;
• Replace the assertion on S by !!S = ite(N1 + N2 + N3 ≥ 2; 1, 0).

The auxiliary sub-PN introduced in Fig. 33.31 completes the RBD-driven PN to perform the reliability/availability calculations as described in 33.9. In this case,


variable Av has only been changed to variable S in order to trigger the system failure thanks to predicate “??S == 0” and the system restoration by predicate “??S == 1”. It has to be noted that, when safety instrumented systems are modelled (see Chap. 36), the average unavailability provides the PFDavg (average probability of failure on demand) and the average failure frequency provides the PFH (probability of failure per hour).

RBD-Driven PNs as an Interface to Hide PNs

As shown above, the design and the use of pre-established modules is very helpful for the analyst who wants to build large PNs. This is rather simple and makes it possible to save time, to limit the risk of error and to use PNs in their native form without losing any part of the modelling power of this approach. Nevertheless, the implementation of RBD-driven PNs is a good way to design user-friendly RBD-like interfaces allowing the analysts to benefit from the modelling power of PNs without needing to be very familiar with them. According to the philosophy developed above, this can be done by developing a library of sub-PN modules which can be used like a kind of “Lego” to build an overall PN. This library must comprise three types of modules:

• Modules modelling the blocks and governing the state variables of these blocks and taking into account, for example:
  – The self-revealed (Fig. 33.44) or periodically tested failures (Figs. 33.42 or 33.43);
  – The unlimited (Fig. 33.7) or limited number of repair teams and with (Figs. 33.8, 33.23 and 33.44 right-hand side) or without (Fig. 33.44 left-hand side) mobilisation;
  – The various types of common cause failures (Figs. 33.19, 33.20 and 33.21);
  – The failure when started on demand (Fig. 33.16);
  – The failure priority (Fig. 33.25);
  – The preventive maintenance (Fig. 33.13);
  – The spare parts (Fig. 33.46);
  – The presence or absence of transitions with memory (Figs. 33.12, 33.13, 33.14 and 33.20), etc.;
• Modules modelling external events impacting the behaviour of the blocks, like:
  – Maintenance support mobilisation (Figs. 33.23 and 33.26);
  – Preventive maintenance (when it is related to several items at the same time);
  – Test operations (Fig. 33.41);
  – Spare parts provisioning (Fig. 33.24);
  – Common cause failures (Figs. 33.19, 33.20, 33.21 and 33.46), etc.;
• Modules for availability/reliability calculations (Figs. 33.31 and 33.46).

Fig. 33.47 Example of an RBD-driven PN

Developed on a case-by-case basis, the number of potential modules is virtually endless. Fortunately, it is possible to build the modules step by step, as has been done above, by starting with a simple sub-PN like the one on the left-hand side of Fig. 33.7 and adding various mechanisms (e.g. failure counting, CCFs, maintenance support mobilisation, etc.) according to the needs for modelling the behaviour of a specific item. Figure 33.47 illustrates the principle of using an RBD as an interface in order to build an RBD-driven PN modelling a typical safety instrumented system made of three sensors organized in 2 out of 3 (2/3), one logic solver and two redundant safety valves:

• Build the reliability diagram (in bold line in Fig. 33.47) as an ordinary RBD.
• For each of the blocks, select the relevant sub-PN module in the library and attach it to the block.
• Select the auxiliary modules (utilities, common cause failures, repair teams, etc.) not belonging to the core of the RBD and link them to the relevant blocks.

Most of these stages can be partly or completely automated in order to facilitate the analyst’s work: for example, the logic of the diagram (as done in Fig. 33.46) can be created automatically from the structure of the RBD, and the calculation module can be introduced systematically and automatically linked to the output value of the RBD. In this process, the RBD is used as an interface to build large PNs. Therefore, the PNs themselves remain hidden behind the RBD and the analyst does not even need to look at them or have knowledge about them. This appears to be a good solution to satisfy analysts reluctant to handle PNs! Nevertheless, this is not a simple mapping of the RBD to a PN: the overall PN (i.e. the RBD-driven PN) remains in the background and is actually used to perform the Monte Carlo simulations. Therefore, it can be made visible if this is explicitly required by the analyst and then possibly modified and completed if need be.

Fig. 33.48 Elements of a FT-driven PN

The BStoK (STochastic BlocK diagram) module included in GRIF-Workshop (2020) has been developed for this purpose. It can be used to handle the RBD-driven PNs and the flow diagram-driven PNs described hereafter. Its built-in library, designed according to the principles described above, provides all the sub-PN modules needed to model the most usual component behaviours. In addition, it contains features allowing the users to create specific sub-PN modules to satisfy their particular needs. If needed, when a model is built, an equivalent PN can be generated and modified: therefore, the modelling power is not limited by the use of a library of pre-established modules.

FT-Driven PNs (Dynamic Fault Trees)

When a PN is used for classical availability or reliability calculations, the modelled system has only two states and what has been done above for modelling success can also be done for modelling failures. This leads to the idea of using a fault tree to model the logic linking the faults of the items (primary events of the FT) to the overall system state (top event of the FT) and sub-PN modules to model the item faults (primary events of the FT). The result is a FT-driven PN, which is one of the simplest ways to obtain dynamic fault trees (DFTs, see Chap. 22). The principle is exactly the same as for the RBD-driven PNs. The difference is that the state variables, Ēi, are now “true” when the items are failed and “false” when they are in up state. The basic elements for building a FT-driven PN are illustrated in Fig. 33.48. The sub-PN modules are the same as for the RBD-driven PNs except that the state

Fig. 33.49 Example of an FT-driven PN

variable is inverted to cope with the FT philosophy, which handles failures rather than successes. The serial, parallel and r/m nodes are replaced by OR, AND and k/m gates where k = m − r + 1. The Boolean formulae for !!G1, !!G2 and !!G3 at the top of the logic gates in Fig. 33.48 provide global assertions to be used within the Petri net modelling these gates. Figure 33.49 gives a FT-driven PN related to the same example of safety instrumented system analysed by an RBD-driven PN above: three sensors organized in 2 out of 3 (2/3), one logic solver and two redundant safety valves. The principle for building such a model is exactly the same as above and is not repeated here.

Again, in this process, the FT is used as an interface to build a large PN which remains hidden behind the FT. The analyst does not even need to look at it or have knowledge about it, and this appears to be a good solution to cope with the problems encountered by analysts familiar with the use of FT analysis and reluctant to handle PNs! Nevertheless, as previously mentioned for the RBD-driven PNs, this is not a simple mapping of the FT to a PN: the overall PN (i.e. the FT-driven PN) remains in the background and is actually used to perform the Monte Carlo simulations. Therefore, it can be made visible if this is explicitly required by the analyst and then possibly modified and completed if need be. The RBD-driven PN illustrated in Fig. 33.47 and the FT-driven PN illustrated in Fig. 33.49 are equivalent. In fact, they are dual, i.e. it is possible to translate the RBD-driven PN into the FT-driven PN and vice versa.

As explained in Chap. 27, specific dynamic gates are used when dynamic fault trees (DFTs) are implemented. Two of them are illustrated in Fig. 33.50 (priority AND gate, PAND) and Fig. 33.51 (sequential gate, SEQ). The PAND gate implies no dependencies between the modelled events. Therefore, it can be introduced anywhere within a Petri net. This can be done by using the three transitions illustrated in Fig. 33.50:

• Tr1: this transition detects if Ē1 occurs before Ē2. If this is the case, Tr1 is fired and the variable E1_first becomes “true” (i.e. equal to 1).

Fig. 33.50 Modelling a priority AND gate (PAND) within a Petri net

Fig. 33.51 Modelling a sequential gate (SEQ) within a Petri net

• Tr2: this transition is fired when Ē2 occurs and Ē1 has occurred first (i.e. according to the rules of the PAND gate);
• Tr3: this transition resets the variable E1_first when Ē1 gets back to “false” (i.e. equal to 0) before Ē2 occurs.

The SEQ gate implies dependencies between the modelled events and it is difficult to implement anywhere within a Petri net because these dependencies can impact many transitions which need to be inhibited. However, it is easy to implement when primary events are involved, as illustrated in Fig. 33.51 for a SEQ gate with n inputs:

• Ē1: the sub-PN related to this event behaves independently from the other events as this is the first occurring in the sequence.
• Ē2: the sub-PN related to this event is dependent on Ē1 as the transition is inhibited as long as Ē1 is “false” (i.e. equal to 0).
• …
• Ēn: the sub-PN related to this event is dependent on Ē1, Ē2, …, Ēn−1 as the transition is inhibited as long as Ē1, Ē2, …, Ēn−1 are “false” (i.e. equal to 0).

As in Chap. 22, the SEQ gate has been developed for non-repaired items because its actual behaviour when repaired items are involved needs the introduction of more assumptions and because the purpose, here, is only to illustrate the principle.
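The three PAND transitions described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the class and method names are hypothetical, the events are assumed never to change at exactly the same instant, and the reset performed by Tr3 is applied here whenever Ē1 is back to “false”, which is a simplification of the rule given above.

```python
class PandGate:
    """Sketch of a priority AND gate driven by two failure variables e1 and e2."""

    def __init__(self):
        self.e1 = False        # E1 failed?
        self.e2 = False        # E2 failed?
        self.e1_first = False  # memory: E1 occurred while E2 had not occurred

    def update(self, e1=None, e2=None):
        """Apply a change of e1 and/or e2, then return the gate output."""
        if e1 is not None:
            self.e1 = e1
        if e2 is not None:
            self.e2 = e2
        if self.e1 and not self.e2 and not self.e1_first:
            self.e1_first = True    # Tr1: E1 has occurred before E2
        elif not self.e1 and self.e1_first:
            self.e1_first = False   # Tr3: E1 is back to "false", reset the memory
        # Tr2: both events present and E1 occurred first.
        return self.e1 and self.e2 and self.e1_first


if __name__ == "__main__":
    gate = PandGate()
    gate.update(e1=True)
    print(gate.update(e2=True))   # E1 then E2

    gate = PandGate()
    gate.update(e2=True)
    print(gate.update(e1=True))   # E2 then E1
```

The memory variable is what distinguishes the PAND gate from an ordinary AND gate: the same final combination of Ē1 and Ē2 gives different outputs depending on the order of occurrence.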


In conclusion, Petri nets prove to be an effective technique to model dynamic fault trees.
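The conversion rule k = m − r + 1 given earlier in this subsection for replacing an r/m node by a k/m gate can be checked exhaustively on small cases. The following sketch uses hypothetical function names; it compares the success logic applied to the state variables with the failure logic applied to the inverted state variables.

```python
from itertools import product


def success_node(states, r):
    """r/m node of an RBD: 1 if at least r of the m items are up."""
    return 1 if sum(states) >= r else 0


def failure_gate(faults, k):
    """k/m gate of a fault tree: 1 if at least k of the m items are failed."""
    return 1 if sum(faults) >= k else 0


def dual_threshold(m, r):
    """Threshold of the FT gate which is dual to an r/m RBD node."""
    return m - r + 1


def check_duality(m, r):
    """Check that the FT gate is true exactly when the RBD node is 0."""
    k = dual_threshold(m, r)
    for states in product((0, 1), repeat=m):
        faults = [1 - c for c in states]  # inverted state variables
        if failure_gate(faults, k) != 1 - success_node(states, r):
            return False
    return True


if __name__ == "__main__":
    # Serial node (r = m) -> OR gate (k = 1); parallel node (r = 1) -> AND gate (k = m).
    print(dual_threshold(3, 3), dual_threshold(3, 1), dual_threshold(3, 2))
    print(all(check_duality(m, r) for m in range(1, 6) for r in range(1, m + 1)))
```

For the 2/3 sensors of the example, the rule gives k = 2, i.e. a 2/3 gate on the failures.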

33.11.5 Modelling of Multistate Systems

33.11.5.1 From Binary to Multistate PN Modelling

The modelling of binary systems described above is very powerful for dealing with the classical availability/reliability calculations, and especially with safety systems, but it is no longer sufficient as soon as production systems are considered. Even if a production system is often made of binary components (with two states, up and down), the system itself generally has more than two states as, between its perfect state and its complete failure, there are some degraded states which have to be taken into account (see Chap. 5). This is a typical multistate item whose production capacity is a combination of the production capacities of its components. Therefore, the production flow has to be modelled, and this leads to the idea of using sub-PN modules to model the production capacities of the components and a flow diagram (FD) to model the circulation of the flow throughout the overall system from the flows circulating in the individual components. The result is an FD-driven PN, which is one of the simplest ways to obtain dynamic flow diagrams. Flow diagrams and reliability block diagrams are similar models if it is noted that it is a logic flow which circulates throughout an RBD. Therefore, it seems legitimate to extend to flow diagrams what has been developed above for RBDs.

33.11.5.2 Basic Elements of FD-Driven PNs (Dynamic Flow Diagrams)

Figure 33.52 shows the evolution of the state variable from logic values (“true”/“false”) to two integer values 1 and 0 for binary items, then to two real values Ki and 0 for binary production items and finally to n real values for multistate production items. The sub-PN in Fig. 33.53 shows a simple binary production sub-PN module which is directly derived from the module presented in Fig. 33.45. The only difference is in

Fig. 33.52 State variable: from binary to multistate items

Fig. 33.53 Example of modelling of a simple binary production item

variable Ci which, as explained just above, is now a real variable which represents the production capacity of the item. Handling such production capacities is the core of this model. The functioning of this sub-PN is very simple: the production capacity Ci drops from the nominal capacity Ki to 0 when the item fails and rises from 0 to the nominal capacity Ki when it is repaired. It has to be noted that a breaking common cause failure has been introduced in the model: the production capacity drops to 0 when it occurs. For such an item, the maximum possible flow out CiOut can be calculated as the minimum between the input flow and the capacity of the item: CiOut = min(CiIn, Ci). When several items run in parallel, the actual flow out is lower as the input flow is split between several items. A step forward is made by introducing multistate items with more than two production capacities. An example is given in Fig. 33.54 where a production well is activated

Gas-lift

Production Well

OK

!!Cw=90

Fail_GL

!GL

!-GL

Rep_GL KO

Auxiliary module

P90 ?-GL

?GL

??CwIn

!!Cw=50

P50

Fail1 !!Cw=0

!!Cw=50

Rep

Fail2

!!Cw=0

Flow diagram module

P0

Fig. 33.54 Example of modelling of a multistate production item

Capacity

33.11 Building PNs Related to Large Systems

651

by a gas-lift system (gas injection at the well bottom resulting in production improvement). This increases the production capacity from the natural production capacity of 50 m³/h to the nominal production capacity of 90 m³/h. Therefore, this is a multistate item with 3 production levels: 90 m³/h (nominal), 50 m³/h (degraded) and 0 (failed). The production capacity depends on the message “?GL” which comes from the auxiliary sub-PN on the left-hand side of the figure, which provides a simplified model of the gas-lift system. The module functions as follows:
• In the nominal conditions, the well produces 90 m³/h.
• When the gas-lift fails, the message “?-GL” becomes “true” and the production drops to 50 m³/h.
• When transition Fail1 or Fail2 occurs, the production capacity drops to 0 and the well has to be repaired.
• When the repair is completed, the production capacity rises to 50 m³/h.
• It stays in this state until the message “?GL” becomes “true”, at which point the nominal capacity of 90 m³/h is restored (i.e. instantaneously when the repair is completed if “?GL” is already “true”).
In this module, neither the CCF nor the maintenance team mobilisation nor the spare part provisioning has been represented but, of course, they can be added when needed, as explained for binary items.
When the modules have been designed, it is necessary to define the behaviour of the nodes which link them together in order to propagate the flows properly. This is more complicated than in the RBD case but the solution provided in Fig. 33.55 makes it possible to model simple production systems. It requires the corresponding nodes to be identified: in the example, A is a divergent node which splits the flow into several branches and B is the corresponding convergent node which gathers all the flows originating from A. Therefore, the flow in B cannot be greater than the flow in A. This technique makes it possible to manage the overcapacities which can occur between nodes A and B.
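As a minimal sketch of the capacity levels described above, the following Python fragment maps the state of the well and of the gas-lift to the capacity Cw, and applies the rule Ciout = min(Ciin, Ci) given earlier for a single item. The function names `well_capacity` and `flow_out` are illustrative assumptions, and the timing of the transitions (failure and repair delays) is deliberately left out: only the static relation between states and capacity is shown.

```python
def well_capacity(well_up: bool, gas_lift_up: bool) -> float:
    """Production capacity Cw in m3/h for the three levels of the well module."""
    if not well_up:
        return 0.0  # well failed (Fail1 or Fail2 has occurred)
    # Well available: nominal capacity with gas-lift, natural capacity without it.
    return 90.0 if gas_lift_up else 50.0


def flow_out(flow_in: float, capacity: float) -> float:
    """Maximum possible flow out of one item: min(input flow, item capacity)."""
    return min(flow_in, capacity)


# Hypothetical usage: 80 m3/h requested from a well whose gas-lift has failed.
degraded_flow = flow_out(80.0, well_capacity(well_up=True, gas_lift_up=False))
```

In a real FD-driven PN these values would be set by the assertions of the sub-PN (e.g. “!!Cw=50”) rather than by a function call; the sketch only makes the three levels explicit.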
Fig. 33.55 Example of corresponding nodes

Fig. 33.56 Flow repartition proportionally to flow capacities

When the flow is split between parallel trains as shown in Fig. 33.56, a more sophisticated repartition can be implemented. In this figure, the total capacity of the n trains is equal to $\sum_{k=1}^{n} C_k$ and the contribution of train i to this total capacity is equal to $\alpha_i = C_i / \sum_{k=1}^{n} C_k$. Then, the idea is to split the flow coming from the divergent node proportionally to $\alpha_i$. In this case, train i processes a flow equal to $I_i = \alpha_i A$ and, at its output, the flow is equal to $O_i = \min(I_i, C_i)$ in order to model the case where $C_i$ is insufficient to process $I_i$. The flow at the convergent node B is simply equal to the sum of the flows coming from the n trains: $B = \sum_{k=1}^{n} O_k$. This is equivalent to the formula $B = \min(A, \sum_{k=1}^{n} O_k)$ introduced in the general model described in Fig. 33.55. This model is realistic when the trains have no lower limit to the flow that they can process. When such a limit exists, rules must be introduced to take it into account.
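The equivalence stated above can be checked in a few lines. The derivation below is a sketch using only the quantities already defined (A, αi, Ii, Oi, Ci); it assumes that at least one train has a non-zero capacity, so that the αi are defined.

```latex
\begin{align*}
\sum_{i=1}^{n} \alpha_i &= \sum_{i=1}^{n} \frac{C_i}{\sum_{k=1}^{n} C_k} = 1
  \quad\Rightarrow\quad \sum_{i=1}^{n} I_i = \sum_{i=1}^{n} \alpha_i A = A, \\
O_i &= \min(I_i, C_i) \le I_i
  \quad\Rightarrow\quad \sum_{k=1}^{n} O_k \le \sum_{k=1}^{n} I_k = A, \\
B &= \min\Bigl(A, \sum_{k=1}^{n} O_k\Bigr) = \sum_{k=1}^{n} O_k .
\end{align*}
```

In other words, with a proportional split the "min" with A is never the active constraint, which is why the simple sum is sufficient.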

33.11.5.3 Example of FD-Driven PNs

The elements developed above make it possible to build the FD-driven PN illustrated in Fig. 33.57, which models the example proposed in Fig. 30.4 of Chap. 30:
• The production wells on the left-hand side have three different capacity levels (90, 50 and 0 m³/h).
• This production is processed by two treatment units of different capacities (70 and 40 m³/h).
• When it has been processed, the production is sent to the consumers by a unit with a capacity of 100 m³/h.
The modules have been simplified in order to keep clarity and highlight the resulting virtual FD-driven PN, which is linked through messages to three auxiliary sub-PNs not explicitly represented but modelling the gas-lift for activating the wells, the CCF for the treatment units and the demand from customers at the input of the virtual FD-driven PN.

Fig. 33.57 Example of simple FD-driven PN (Production model)

Let us consider the nominal situation where every item is in its nominal state and the demand is equal to the maximum production capacity of the wells:
• Dem = 90;
• The flow at node A is equal to min(90, 90) = 90;
• It is split into α1 = 70/110 and α2 = 40/110;
• The flow at the T1 output is equal to min(90 × 70/110, 70) = 57.3;
• The flow at the T2 output is equal to min(90 × 40/110, 40) = 32.7;
• The flow at node B is equal to 57.3 + 32.7 = 90;
• The flow at the expedition output is equal to S = min(90, 100) = 90;
• The loss (i.e. the difference between the demand and the actual production) is equal to zero.

Let us now consider a situation where the demand has dropped to 80 m³/h and the treatment unit T2 has failed:
• Dem = 80;
• The flow at node A is equal to min(80, 90) = 80;
• It is split into α1 = 70/70 and α2 = 0/70;
• The flow at the T1 output is equal to min(80 × 70/70, 70) = 70;
• The flow at the T2 output is equal to min(0, 0) = 0;
• The flow at node B is equal to 70 + 0 = 70;
• The flow at the expedition output is equal to S = min(70, 100) = 70;
• The demand is not fully satisfied and the loss is equal to 80 – 70 = 10 m³/h.

The reader can try several other situations to verify that this simple model works. It can be easily extended to the situation with more process units (production, treatment, expedition, etc.) provided that corresponding nodes can be identified.
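To try other situations quickly, the flow propagation used in the two cases above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (demand, wells, proportional split between treatment units, expedition), not the PN itself; the function name `production` and its argument names are assumptions introduced here.

```python
def production(demand, wells, treatment_caps, expedition):
    """Return (S, loss) for the simple production model, all values in m3/h.

    demand          -- customer demand Dem
    wells           -- current capacity W of the wells (90, 50 or 0)
    treatment_caps  -- current capacities of the treatment units, e.g. [70, 40]
    expedition      -- current capacity E of the expedition unit
    """
    flow_a = min(demand, wells)  # flow at the divergent node A
    total = sum(treatment_caps)
    if total == 0:
        outputs = [0.0 for _ in treatment_caps]  # every treatment unit is down
    else:
        # Split A proportionally to the capacities, then cap each train by its capacity.
        outputs = [min(flow_a * c / total, c) for c in treatment_caps]
    flow_b = sum(outputs)         # flow at the convergent node B
    s = min(flow_b, expedition)   # flow actually delivered
    return s, demand - s


nominal = production(demand=90, wells=90, treatment_caps=[70, 40], expedition=100)
t2_failed = production(demand=80, wells=90, treatment_caps=[70, 0], expedition=100)
```

Here `nominal` corresponds to the first case (S close to 90, loss close to zero, up to floating-point rounding) and `t2_failed` to the second case (S = 70, loss = 10). Changing the capacities in the call reproduces any other combination of item states.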

The production availability of the modelled system can be calculated from the value of variable S:
– Instantaneous production availability: value of S at time t;
– Average production availability: average value of S over [0, t].

33.11.5.4 FD-Driven PNs as an Interface to Hide PNs

When the capacities are limited to the values 1 and 0, the FD-driven PNs described above behave exactly as the RBD-driven PNs analysed in the previous subsection. An RBD-driven PN is then a particular case of FD-driven PN, and everything that has been explained about the realization of interfaces hiding a PN behind an RBD remains valid for hiding a PN behind a flow diagram:
• Gather sub-PN modules related to various component behaviours and auxiliary modules related to external events into a library;
• Build the flow diagram as an ordinary flow diagram;
• For each of the blocks, select the relevant sub-PN module in the library and attach it to the block;
• Select the auxiliary modules (utilities, common cause failures, repair teams, etc.) not belonging to the core of the flow diagram and link them to the relevant blocks.
As said for the RBD-driven PNs, most of these stages can be partly or completely automated in order to facilitate the analyst’s work, in particular with regard to the flow circulation and the production availability calculations. Again, the resulting PN is hidden behind the flow diagram but it remains available in case modifications for adaptation or improvement are needed. As already mentioned above in section “RBD-driven PNs as an Interface to Hide PNs”, the BStoK module included within the GRIF-Workshop (2020) software package handles flow diagram-driven PNs, including RBD-driven PNs as a particular case.
The production model described in Fig. 33.57 has been greatly simplified in order to highlight the modelling principle but it can be considerably improved to become more realistic. For example, the similarities between RBD-driven PNs and FD-driven PNs provide an opportunity to mix them together to model, for instance, the command-control system of the process units (RBD-driven PNs) and the process units themselves (FD-driven PNs).
In fact, this is already the case when auxiliary sub-PNs are used to govern logic variables impacting the behaviour of the FD-driven PNs. A step further is to model these logic variables by complete RBD-driven PNs. This is illustrated in Fig. 33.58 where a production system modelled by an FD-driven PN is protected by a safety instrumented system modelled by an RBD-driven PN. In this case, several different flows circulate throughout the model (logic flow and production flow): this is a multiple flow model. Such multi-flow models can be handled by the FLEX module, which is an extension of the BStoK module included within the GRIF-Workshop (2020) software package.

Fig. 33.58 Example of mixing FD-driven and RBD-driven PNs (elements shown: sensors S1 and S2, logic solver LS, SDV1, SDV2, wells, tank; logic flow and production flow)

Fig. 33.59 Example of model implementing a composite flow (gas, oil and water sub-flows; elements shown: utilities, filter, separator, scrubber, gas treatment, water treatment, divergent node)

A step forward is to introduce:
• Composite flows made of several parts (e.g. water, oil and gas) following different paths after they have been separated (see Fig. 33.59);
• Rules to split or gather these flows;
• The fact that some units cannot operate below a minimum flow;
• Icons to visually identify the related components (pumps, separator, valves, etc.);
• Complex test, maintenance and spare part policies;
• etc.
In fact, all the modelling needs for developing realistic and comprehensive interfaces can be covered in one way or another by implementing PNs. The PETRO module, which is an extension of FLEX and BStoK within the GRIF-Workshop (2020) software package, provides a realistic (see Fig. 33.59) and powerful interface to model large production systems. With this interface, the entered model is very close to the process flow diagram of the production system to be modelled and the software provides all the tools needed for the production availability calculations. It has been designed for engineers without particular knowledge of probabilistic modelling and calculations but, again, the underlying PN remains available for modifications if wanted. However, the counterpart of user-friendliness is slower calculations.

33.12 Coloured Petri Nets

Until now, only the ordinary tokens as defined by Carl Adam Petri have been considered. Even when they seem to circulate, they are actually destroyed and created by the firings of the transitions. This is not the case for all kinds of Petri nets: some attempts have been made to attach a given token to a given element of a model and to make this token circulate throughout the PN when the transitions are fired. This is the case of the coloured Petri nets (Jensen 1996). Contrary to the ordinary PNs where the tokens are anonymous, in coloured PNs the tokens are individualized and this is useful to:
• Follow them when they move on the Petri net,
• Change their properties when they move according to the Monte Carlo simulation,
• And, above all, validate the same transition from several different tokens.
Therefore, the same static structure can be used with several coloured tokens as illustrated in Fig. 33.60, where colours have been used for modelling different components: black, white, grey and hatched. In this model:
• The black and white components are in up state;
• The grey component is waiting to be repaired;
• The hatched component is under repair.
This is a very compact representation as it replaces 4 sub-PNs like the one represented on the left-hand side of Fig. 33.7. It is as if the 4 sub-PNs had been folded into a single one. Whereas the ordinary PNs belong to the low-level models, the coloured PNs belong to the high-level models. Therefore, this seems very promising for reliability engineers to achieve even larger models than those analysed above. This is also very interesting from a conceptual point of view.

Fig. 33.60 Example of coloured PN

However, when trying to design and use coloured PNs, it quickly becomes obvious that this is not as easy as that! For example, transition
Tr1 (failure) in Fig. 33.60 is valid from the point of view of the black or white tokens but inhibited from the point of view of the grey or hatched tokens. The firing delay for the black token is different from that for the white token. This results in a high degree of abstraction which reduces the direct understanding of the PN and can make it very difficult to debug when behaviour anomalies are detected. What is gained in model size is paid for by a loss of intelligibility. The Monte Carlo simulation performances are also generally reduced. This is mainly due to the pre-processing identifying which transitions are impacted when another transition is fired (see Sect. 33.8.4), which is less effective than for the ordinary PNs. However, the performances can be improved when the coloured PN is unfolded (i.e. transformed into an ordinary PN) before being simulated, but this transformation can be time-consuming. Such PNs should be used only when it is necessary to follow the components all along the production process. In the oil and gas industry, as a barrel of oil is identical to any other barrel, this has not been really useful for the authors of this book as, for more than 35 years now, the needs have been covered by the ordinary PNs! Nevertheless, they have made some attempts to mix ordinary tokens and coloured tokens within the same framework in order to keep the calculation speed of the ordinary PNs and the improved modelling power of the coloured PNs. This work remains at the idea stage at the present time but it seems an interesting solution to be further investigated in the future. Anyway, in other industries, this can be useful, for example to model the expiry date of some components or products, to associate a traveller with their baggage when they are routed in different ways or to match several pieces manufactured on different chains but which have to be mounted together in the end.
In this case, coloured PNs or similar object-oriented models seem a way forward, especially if the computer speed continues to increase. As mentioned in Sect. 33.4.4.3, it has to be noted that, beyond the simple coloured tokens, the Petri net models can be improved to model systems mixing discrete and continuous states (e.g. to model physical evolutions like the increase of pressure, of temperature or of fluid level in a tank as a continuous function of time). This can be achieved by using response surfaces (Averbuch et al. 2007) or hybrid PNs (Villani et al. 2005; David and Alla 2004), which are beyond the scope of this book.
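To make the folding idea of Fig. 33.60 more concrete, here is a minimal Python sketch in which a single static structure (three places, three transitions) is shared by four individualized tokens. It is an illustration only and not the formalism of Jensen (1996): the mapping of Pl1/Pl2/Pl3 and Tr1/Tr2/Tr3 to "up", "wait for repair" and "repair" is a reading of the figure and of the text, the failure rates attached to the tokens are made-up values, and the repair-team resource is ignored.

```python
from dataclasses import dataclass

# One static structure shared by every token: transition -> (input place, output place).
TRANSITIONS = {
    "Tr1": ("Pl1", "Pl2"),  # failure: up -> wait for repair
    "Tr2": ("Pl2", "Pl3"),  # start of repair (SoR)
    "Tr3": ("Pl3", "Pl1"),  # end of repair (EoR)
}


@dataclass
class Token:
    colour: str
    place: str
    failure_rate: float  # hypothetical colour-specific parameter (per hour)


def enabled(transition: str, token: Token) -> bool:
    """A transition is valid for a token only if the token sits in its input place."""
    return token.place == TRANSITIONS[transition][0]


def fire(transition: str, token: Token) -> None:
    """Move the token itself instead of destroying and re-creating an anonymous one."""
    if not enabled(transition, token):
        raise ValueError(f"{transition} is not enabled for the {token.colour} token")
    token.place = TRANSITIONS[transition][1]


# Marking described in the text: black and white up, grey waiting, hatched under repair.
tokens = {
    "black": Token("black", "Pl1", 1.0e-4),
    "white": Token("white", "Pl1", 5.0e-5),
    "grey": Token("grey", "Pl2", 1.0e-4),
    "hatched": Token("hatched", "Pl3", 1.0e-4),
}

# Tr1 is valid for the black and white tokens only.
valid_for_tr1 = [colour for colour, tok in tokens.items() if enabled("Tr1", tok)]
```

The sketch also shows where the loss of intelligibility comes from: the validity of Tr1 is no longer a property of the net alone but has to be evaluated token by token.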

33.13 Conclusion About PNs

Now that the Monte Carlo simulation has become operational and can be performed even by using personal computers, the readers of this chapter should be convinced of the simplicity, flexibility and modelling power of the Petri nets used in conjunction with this calculation method:
• There are no real difficulties in undertaking PN modelling provided that the simple principle of modularity is implemented and that the model is properly documented to remember what has been done.
• The model intelligibility is made easy thanks to the graphical aspects of the PNs, which allow a relevant presentation of information.
• The debugging of the models is facilitated by the implementation of the principle mentioned above and also by the possibility to animate the PN step by step when the software used provides a stepper allowing the transitions to be triggered by hand and the correctness of the behaviour to be checked immediately.
• The calculation time decreases continuously thanks to the improvement of computers and the Monte Carlo simulation is going to be as fast as analytical calculations. This already occurs with powerful computers having tera-, peta- and now exa-flops of computing power and there is no doubt that simulation will be increasingly used in the future.
Since their first attempts in the early eighties to use PNs for supporting Monte Carlo simulations, the authors of this book have never been disappointed by this technique. They have always managed to find a solution to achieve relevant models. When needed and after a period of reflection (e.g. to be sure to keep an ascending compatibility), improvements have been made to the GRIF-Workshop (2020) software package. The increasing complexity of the systems to be studied has, for example, required the introduction of the predicates and assertions but this has been done in the continuity of the work undertaken since the origin and none of the previous choices has had to be questioned. This proves once more the great flexibility of the approach. Nowadays, the mix between Petri nets and Monte Carlo simulation certainly offers one of the best price/quality ratios, given the rather low intellectual investment required and the modelling power obtained. Nevertheless, the users should be careful not to become addicted to the point of discarding the analytical approaches which, in contrast, could appear to be extremely limited.
They are still very useful in many cases, especially when very low probabilities have to be estimated.
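The last remark can be made quantitative with the standard sampling error of a proportion. The sketch below assumes a crude Monte Carlo simulation with N independent histories and no variance reduction: the relative standard error of the estimate of a probability p is then sqrt((1 − p)/(N p)), which gives the number of histories needed for a target accuracy. The function name `histories_needed` is an illustrative assumption.

```python
import math


def histories_needed(p: float, rel_err: float) -> int:
    """Histories N such that sqrt((1 - p) / (N * p)) <= rel_err (crude Monte Carlo)."""
    if not 0.0 < p < 1.0 or rel_err <= 0.0:
        raise ValueError("p must be in (0, 1) and rel_err must be positive")
    return math.ceil((1.0 - p) / (p * rel_err ** 2))


# Hypothetical targets: 10 % relative standard error on two different probabilities.
n_frequent = histories_needed(1.0e-2, 0.1)  # about 1e4 histories
n_rare = histories_needed(1.0e-6, 0.1)      # about 1e8 histories
```

The required number of histories grows roughly as 1/p, which is why analytical calculations remain attractive when very low probabilities are at stake.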

33.14 Associated Exercises

Fifteen exercises related to Chap. 33 are proposed in Chap. 34. They are based on the description of the behaviour of a service station. This is an opportunity to illustrate the use of PNs to model and analyse queuing processes:
• Exercise 33.1 (Sects. 33.4.5, 33.5.1 and 33.8.5): model the service station with basic PNs (i.e. without predicates or assertions) and the queuing and refuelling processes under the following assumptions: no failure for the fuel pumps, no limitations with regard to the size of the entrance and cash queues and no difference between night and day.
• Exercise 33.2 (Sects. 33.5.1 and 33.4.5): model the fuel pump failure and repair processes under the assumption of a single repair team and no difference between night and day.
• Exercise 33.3 (Sect. 33.5.1): link the models developed in exercises 33.1 and 33.2 under the assumption that, when a pump fails, the ongoing refuelling is stopped and the car goes to the cash desk to pay.
• Exercise 33.4 (Sect. 33.4.3): think about the model developed in exercise 33.3 and identify the potential simulation difficulties.
• Exercise 33.5 (Sects. 33.4.3, 33.5.1 and 33.6.1): model the fact that, when the entrance queue is greater than or equal to 10 cars, a newly arriving car gives up refuelling, and count the number of lost sales.
• Exercise 33.6 (Sects. 33.5.1 and 33.8.5): model the fact that, when the exit queue is greater than or equal to 4 cars, the cars which have finished refuelling wait in front of the fuel pumps until one place becomes free in the queue.
• Exercise 33.7 (Sects. 33.5.1 and 33.8.5): gather the improved queuing models developed in exercises 33.5 and 33.6 and build the overall queuing and refuelling processes for the overall service station.
• Exercise 33.8 (Sects. 33.5.1 and 33.6.2): model the night and day cycles and divide the arrival rate by 4 at night.
• Exercise 33.9 (Sects. 33.5.1 and 33.6.2): extend exercise 33.8 under the assumption that only one pump is open at night.
• Exercise 33.10 (Sects. 33.5.1, 33.6.1 and 33.6.2): extend exercise 33.8 under the assumption that the repair team does not work at night.
• Exercise 33.11 (Sects. 33.5.1 and 33.7.2.2): extend exercise 33.10 to the mobilisation of the repair team and to the spare part provisioning. Use predicates and assertions to model the number of faulty pumps at any time.
• Exercise 33.12 (Chap. 32 and Sects. 33.5.1 and 33.8.5): reduce exercise 33.10 to the case where only one pump is open during the daytime and closed at night and perform a Monte Carlo simulation to calculate how many cars refuel over 1 month and how many sales are lost due to too long a queue at the service station entrance. GRIF-Workshop (2020) can be used to do that.
• Exercise 33.13 (Chap. 32 and Sect. 33.8.5): extend the Monte Carlo simulation developed in exercise 33.12 to obtain the curves related to the evolution of the number of cars in the queues at the entrance and exit of the service station.
• Exercise 33.14 (Chap. 32 and Sects. 33.5.1 and 33.8.5): same exercise as 33.12 when the single pump is open night and day. Perform a Monte Carlo simulation in order to calculate how many cars refuel over 1 month and how many sales are lost due to too long a queue at the service station entrance.
• Exercise 33.15 (Chap. 32 and Sect. 33.8.5): extend the Monte Carlo simulation developed in exercise 33.14 to obtain the curves related to the evolution of the number of cars in the queues at the entrance and exit of the service station.

References

Averbuch D, Dejean J-P, Gainville L, Guet S, Johnsen O, Maurel P (2007) FAMUS I: Risk-based design of offshore oil and gas production system - management of uncertainties by integrating flow assurance and reliability aspects into a stochastic Petri nets model. In: Proceedings of the safety and reliability conference ESREL 2007. Stavanger, Norway
Carroll J, Long D (1989) Theory of finite automata with an introduction to formal languages. Prentice Hall, Englewood Cliffs, USA
David R, Alla H (2004) Discrete, continuous and hybrid Petri nets. Springer Berlin and Heidelberg GmbH and co, Germany
GRIF-Workshop (2020) PETRI, BStoK, FLEX and PETRO modules. Funded and developed by TOTAL, http://grif-workshop.fr/. Accessed Sept 2020
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60848 Ed. 1.0 (2013) GRAFCET specification language for sequential function charts. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62551 Ed. 1.0 (2012) Analysis techniques for dependability. Petri net techniques. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 15909 Ed. 1.0 (2019) Systems and software engineering - high level Petri nets (2 parts). International Organization for Standardization (ISO), Geneva, Switzerland
Jensen K (1996) Coloured Petri nets - basic concepts, analysis methods and practical use. Springer, Berlin
Lawson MV (2004) Finite automata. Chapman & Hall/CRC
Ligeron J-C, Delage A (1980) Fiabilité du métro de Caracas. Second national congress of reliability and maintainability. Perros-Guirrec, France
Ling S, Schmidt HW (2013) A notion of safeness in time for Petri nets. Monash University, Caulfield East, Australia
Natkin S (1980) Les réseaux de Petri stochastiques. Thesis of doctor engineer, CNAM, Paris
Petri CA (1962) Kommunikation mit Automaten. University of technology of Darmstadt, Germany
Signoret J-P (1998) Modeling the behavior of complex industrial systems with stochastic Petri nets. Proceedings ESREL 1998, Trondheim, Norway
Signoret J-P (2008) Analyse de risque des systèmes dynamiques - Réseaux de Petri. SE 4072 and SE 4073. Techniques de l’Ingénieur. Paris, France
Signoret J-P (2009) Dependability & safety modeling and calculation: Petri nets. In: 2nd IFAC workshop on Dependable Control of Discrete Systems. Bari, Italy
Signoret J-P, Leroy A (1985) Probabilistic calculations of the production of a subsea production cluster. In: Proceeding of safety and reliability society annual symposium, Southport, 1985
Signoret J-P, Leroy A (1989) Use of Petri nets in availability studies, Reliability “89”. Brighton, UK
Signoret J-P, Chabot J-L, Hutinet T (2002) Hiding a stochastic Petri net behind a Reliability Block Diagram. ESREL 2002/λμ13 congress, Lyon
Signoret JP, Dutuit Y, Cacheux PJ, Folleau C, Collas S, Thomas P (2013) Make your Petri nets understandable: Reliability block diagrams driven Petri nets. Reliab Eng Syst Saf (RESS), 113:61–75. Elsevier
TGI (2020) http://www.informatik.uni-hamburg.de/TGI/PetriNets/index.php. Accessed Sept 2020
Villani E, Pascal JC, Miyagi PE, Valette R (2005) A Petri net-based object-oriented approach for the modelling of hybrid productive systems. In: Nonlinear Analysis, Volume 62, issue 8, Special issue on Hybrid Systems and Applications, Elsevier, pp 1394–1418
Wikipedia RS (2020b) https://en.wikipedia.org/wiki/Response_surface_methodology. Accessed Sept 2020
Wikipedia FSM (2020a) https://en.wikipedia.org/wiki/Finite-state_machine. Accessed Sept 2020

Chapter 34

Dynamic Modelling Exercises

34.1 Markovian Approach Exercises

34.1.1 Example: Pumping System

34.1.1.1 Description of the Pumping System

Let us consider the pumping system illustrated in Fig. 34.1, made of two redundant pumps P1 and P2 and one valve V. This system is derived from the simple example analysed in Chap. 31 devoted to the Markovian approach by adding valve V, and only the parts boxed in dotted lines are to be considered in order to limit the size of the corresponding Markov graphs. With regard to the operating philosophy:
• The pumps can fail only when running.
• The valve can fail at any time.
• The pumps are stopped when the valve is in down state.
• All considered failures are self-revealed and repairs start as soon as possible.

34.1.1.2 Reliability Data

See Table 34.1.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_34

Fig. 34.1 Schematic of the pumping system to be studied (tank, pumps P1 and P2, both running, and valve V)

Table 34.1 Reliability data for the pumping system
Item     Failure rate (λ)            Repair rate (μ)             Probability of failure on demand
Pump     λp = 6.5 × 10⁻⁵ h⁻¹         μp = 2.0 × 10⁻² h⁻¹         γ = 1.0 × 10⁻²
Valve    λv = 6.8 × 10⁻⁶ h⁻¹         μv = 8.0 × 10⁻² h⁻¹         /

34.1.2 Description of the Exercises Related to the Pumping System

Take time to try to find the solution of the exercises before looking at the solutions!

Exercise 31.1 (Sects. 31.1.2 and 31.3.3): identify the various system states, split the states between up and down state classes and build the corresponding reliability Markov graph of the pumping system when there is no limitation with regard to the number of repair teams and when there is only a single repair team. Write the equations for assessing the unreliability of the system.
Exercise 31.2 (Sect. 31.7.1.1): same exercise as Exercise 31.1 with a common cause failure on the pumps.
Exercise 31.3 (Sect. 31.7.3): same exercise as Exercise 31.1 when pump P1 is running and pump P2 is kept in standby position with a perfect starting on demand (γ = 0).
Exercise 31.4 (Sect. 31.5.4): same exercise as Exercise 31.3 when P1 and P2 are alternately running and in standby position. The change occurs every month.
Exercise 31.5 (Sects. 31.5.1 and 31.7.3): same exercise as Exercise 31.3 with a probability of failure, γ, to start on demand of P2 when P1 fails.
Exercise 31.6 (Sect. 31.3.2): same exercise as Exercise 31.1 but for the availability Markov graph of the pumping system when there is no limitation with regard to the number of repair teams. Write the equations for assessing the availability and the unavailability of the system.
Exercise 31.7 (Sect. 31.3.2): same exercise as Exercise 31.6 but when there is a single repair team repairing in priority valve V, then pump P1 and then pump P2.
Exercise 31.8 (Sect. 31.6.1): same exercise as Exercise 31.6 but considering that, P1 and P2 being similar, their states can be aggregated to simplify the graph.
Exercise 31.9 (Sect. 31.7.3): same exercise as Exercise 31.8 with repair priority as for Exercise 31.6 when there is a single repair team.
Exercise 31.10 (Sect. 31.7.3): same exercise as Exercise 31.9 when one of the pumps is operated in standby position and can fail to start on demand.
Exercise 31.11 (Sect. 30.3.3): extend Exercise 31.1 to calculate the unreliability over 20 years and the failure rate of the pumping system. Compare the asymptotic failure rate to the approximation provided in Chap. 31.
Exercise 31.12 (Sect. 31.3.2): extend Exercise 31.9 to calculate the unavailability and the failure frequency over 500 h and the average unavailability and failure frequency over 1 year of the pumping system. Compare the asymptotic unavailability to the approximation provided in Chap. 31.
Exercise 31.13 (Sect. 31.5.3): extend Exercise 31.7 to calculate the production availability and the average production availability of the pumping system when the production capacity (efficiency) of P1 is 90% and that of P2 is 10%. It has to be noted that such a design covers situations (not considered in the exercise) where the production demand becomes too low for P1 to continue to run and where P2 has to take over alone.

Exercises 31.1–31.10 are designed to be done by hand and Exercises 31.11–31.13 need a Markovian software package. They have been dimensioned to be achievable by using the free demo version of the GRIF-Workshop (2020) software package.

34.1.3 Solutions of the Exercises Related to the Pumping System

34.1.3.1 Exercise 31.1 - Reliability: Hot Redundancy

The aim of the exercise is fourfold: identify the various system states, split them between up and down state classes, build the corresponding reliability Markov graph of the pumping system (when there is no limitation with regard to the number of repair teams and when there is only a single repair team) and write the equations for assessing the unreliability of the system. The simple way to identify the system states is to proceed systematically by using a truth table, as has been done in Fig. 34.2: three binary items lead to 2³ = 8 states, E1 to E8. According to its description, the pumping system is in down state if the valve is in down state (failed closed) or if both pumps P1 and P2 are failed. This allows the system states to be split into two classes:
• Up state class: E1 to E3;
• Down state class: E4 to E8.
The 8 states identified above provide the basis to build the corresponding Markov graph. Fortunately, for a reliability calculation, all the down states can be gathered into a single absorbing state Eabs which is not repaired. Therefore, only 4 states remain and this leads to the Markov graph presented in Fig. 34.3. Only one repair team is used at any one time (at most one item is failed in each up state): therefore, this Markov graph is valid in both cases, unlimited number of repair teams and single repair team. If Pi(t) is the probability for the pumping system to be in state Ei, the formula giving the reliability is the following:
R(t) = P1(t) + P2(t) + P3(t)
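The truth-table identification described above can be sketched in a few lines of Python. The code enumerates the 2³ combinations of item states and applies the failure logic of the pumping system (down if the valve is down or if both pumps are failed). The names `system_up`, `up_states` and `down_states` are illustrative assumptions, and the order of enumeration is not claimed to match the numbering E1 to E8 of Fig. 34.2.

```python
from itertools import product


def system_up(valve_up: bool, p1_up: bool, p2_up: bool) -> bool:
    """The system is up if the valve is up and at least one pump is up."""
    return valve_up and (p1_up or p2_up)


# Truth table: every combination of (V, P1, P2), True meaning "up".
all_states = list(product([True, False], repeat=3))
up_states = [s for s in all_states if system_up(*s)]
down_states = [s for s in all_states if not system_up(*s)]
```

The enumeration gives 3 up states and 5 down states, in agreement with the two classes identified in the text; for larger systems the same loop quickly shows why the number of states (2 to the power n) becomes the limiting factor of the Markovian approach.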

Fig. 34.2 Identification of the states of the pumping system (truth table with columns V, P1, P2, States and Class)

Fig. 34.3 Reliability Markov graph of the pumping system

34.1.3.2 Exercise 31.2 - Reliability: Hot Redundancy + CCF

The aim of the exercise is to extend Exercise 31.1 with a common cause failure on pumps P1 and P2. When a CCF occurs on the pumps from state E1, they both stop running at the same time and state E4 is reached. But the same CCF can also occur when only one of the pumps is running, which leads to a jump from E2 to E4 or from E3 to E4. Gathering the above considerations leads to the Markov graph drafted in Fig. 34.4, which takes the CCF on the pumps into account. The only difference with the previous Markov graph in Fig. 34.3 is related to the transition rates from E1 to E4, from E2 to E4 and from E3 to E4.

Fig. 34.4 Reliability Markov graph of the pumping system with CCF on the pumps

If Pi(t) is the probability for the pumping system to be in state Ei, the formula giving the reliability is the following:
R(t) = P1(t) + P2(t) + P3(t)

34.1.3.3

Exercise 31.3—Reliability: Standby Redundancy

The aim of the exercise is to extend Exercise 31.1 to the case where pump P1 is running while pump P2 is kept in standby position, with a perfect start on demand ($\gamma = 0$). Therefore, by comparison with the Markov graph in Fig. 34.3, state $E_1 = V.P1.P2$ becomes $E_1 = V.P1.P2_{SB}$, where P1 is running and P2 is in standby position. According to the assumptions, when P1 and P2 are both in up state, P2 goes back to the standby position. This leads to the Markov graph in Fig. 34.5. If $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formula giving the reliability is the following:

$R(t) = P_1(t) + P_3(t)$

34.1.3.4

Exercise 31.4—Reliability: Alternated Standby Redundancy

The aim of the exercise is to extend Exercise 31.3 to the case where P1 and P2 alternate between running and standby position, the change occurring every month. The first step is to realize that a multiphase Markov model has to be developed because the configuration changes every month. The second step is to develop the Markov graph modelling what happens when P2 is in standby position and the Markov graph modelling what happens when P1 is in standby position. The third step is to define the linking matrices used to jump from one phase to the next.

Fig. 34.5 Reliability Markov graph of the pumping system (P2 in standby position)

With regard to the Markov graphs, a new state $E_5 = V.P1_{SB}.P2$, where P1 is in standby and P2 is running, has to be introduced. This makes it possible to draw the Markov graph when P2 is in standby position, which is presented on the left-hand side of Fig. 34.6. It is derived from Figs. 34.3 and 34.5:
• The new state $E_5$ cannot exist in this phase (it is drafted in dotted line).
• State $E_2$ cannot be reached in this phase but can, nevertheless, exist from the previous phase when P1 was in standby position.

Note that, in the first phase (starting at t = 0), the probability of being in this state is equal to 0 and the Markov graph in Fig. 34.6 is equivalent to the one illustrated in Fig. 34.5. In the same way, the Markov graph when P1 is in standby position is presented on the right-hand side of Fig. 34.6. It is similar to the previous one: state $E_1$ cannot exist and state $E_3$ cannot be reached but can still exist from the previous phase.

Fig. 34.6 Reliability Markov graphs of the pumping system (alternate standby position; one graph for P2 in standby and one for P1 in standby)

Figure 34.7 presents the linking matrices used to link the Markov graphs when the pump kept in standby position changes. These matrices are very simple as only the probabilities of $E_1$ and $E_5$ change. If $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formula giving the reliability is the following:

$R(t) = P_1(t) + P_2(t) + P_3(t) + P_5(t)$

Fig. 34.7 Linking of the Markov graphs to alternate which pump is kept in standby position

34.1.3.5

Exercise 31.5—Reliability: Probability of Failure on Demand

The aim of the exercise is to extend Exercise 31.3 to the case where P2 has a probability $\gamma$ of failing to start on demand when P1 fails. In this case, when P1 fails, P2 starts with a probability of success $1 - \gamma$ and fails to start with a probability $\gamma$. This can be modelled with a zero-duration state (in dotted line), as done in Fig. 34.8: when P1 fails from state $E_1$, either P2 starts, with probability $1 - \gamma$, and state $E_3$ is reached, or it fails to start, with probability $\gamma$, and state $E_4$ is reached. Again, if $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formula giving the reliability is the following:

$R(t) = P_1(t) + P_3(t)$

34.1.3.6

Exercise 31.6—Availability: Hot Redundancy

The aim of the exercise is to extend Exercise 31.1 to the availability Markov graph of the pumping system when the number of repair teams is unlimited. The aim is also to derive the formulae giving the system availability and unavailability.

Fig. 34.8 Reliability Markov graph of the pumping system (P2 fails to start with a probability γ)

Contrary to the previous exercises dealing with reliability Markov graphs, the down states cannot be gathered into a single state and the eight states identified in Fig. 34.2 have to be considered. Eight states is not a large number, but it is already difficult to manage by hand. Therefore, this is an opportunity to implement a technique that keeps the hand-building of Markov graphs under control when the number of states increases:
1. Identify the states as done in Fig. 34.2.
2. Take the first state, draw it, identify which other states are reachable and draw the corresponding transitions.
3. Take the next state, draw it, identify which other states are reachable and draw the corresponding transitions.
4. Continue until all the states have been covered.

The process is illustrated in Fig. 34.9 for the 8 states of the pumping system. This leads to 8 sub-Markov graphs with the selected state located in the middle, the failures/repairs related to the pumps drafted on the right and the failure/repair related to the valve drafted on the left. This non-conventional presentation avoids crossing arcs, which make a graph difficult to read. It allows the jumps of the system to be analysed state by state, which is a good way to avoid mistakes. In Fig. 34.9, the background of up states is white whereas the background of down states is grey. Note that the pumps in up state can fail only when the valve is itself in up state ($E_1$ to $E_3$) and not when the valve is in down state ($E_5$ to $E_7$). The number of repair teams being unlimited, the items can be repaired from any state.

Fig. 34.9 Availability Markov graph of the pumping system split into its various states


If $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formulae giving the availability and unavailability are the following:

$A(t) = P_1(t) + P_2(t) + P_3(t)$

$U(t) = P_4(t) + P_5(t) + P_6(t) + P_7(t) + P_8(t)$

34.1.3.7

Exercise 31.7—Availability: Single Repair Team

The aim of the exercise is to extend Exercise 31.6 to the case where there is only one repair team and the priority for repair is V, P1 and P2. In this exercise, only one repair can be in progress at a time. Therefore, compared to the previous exercise in Fig. 34.9, some transitions are forbidden and have to be removed. For example, from state $E_4$ the transition to $E_3$ (repair of P2) is no longer possible as pump P1 has to be repaired first. In the same way, for the states where the valve is in down state ($E_5$ to $E_8$), the repairs of the pumps are no longer possible because the valve has to be repaired first. Removing all the forbidden transitions leads to the Markov graph presented in Fig. 34.10. Again, if $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formulae giving the availability and unavailability are the following:

$A(t) = P_1(t) + P_2(t) + P_3(t)$

$U(t) = P_4(t) + P_5(t) + P_6(t) + P_7(t) + P_8(t)$
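The state-by-state construction used for the availability graphs can be sketched programmatically. The snippet below is a minimal illustration, not the book's method: it assumes the state numbering $E_1 = V.P1.P2$, $E_2$ = P2 failed, $E_3$ = P1 failed, $E_4$ = both pumps failed, $E_5$ to $E_8$ = the same with the valve failed, and it assumes that the valve can fail whatever the state of the pumps. The function and variable names are hypothetical.

```python
from itertools import product

ITEMS = ("V", "P1", "P2")                        # also the repair priority order
STATES = list(product((True, False), repeat=3))  # True means "item in up state"

def name(state):
    return "E%d" % (STATES.index(state) + 1)

def transitions(single_repair_team):
    """List (origin, destination, event) arcs, visiting the states one by one."""
    arcs = []
    for state in STATES:
        valve_up = state[0]
        repair_assigned = False
        for i, item in enumerate(ITEMS):
            flipped = state[:i] + (not state[i],) + state[i + 1:]
            if state[i]:
                # A pump is stopped, hence cannot fail, while the valve is down.
                if item == "V" or valve_up:
                    arcs.append((name(state), name(flipped), "failure of " + item))
            elif not (single_repair_team and repair_assigned):
                # Single team: only the first failed item in priority order is repaired.
                arcs.append((name(state), name(flipped), "repair of " + item))
                repair_assigned = True
    return arcs

UNLIMITED = transitions(single_repair_team=False)
SINGLE = transitions(single_repair_team=True)
print(len(UNLIMITED), len(SINGLE))
```

Under these assumptions the unlimited-team graph has 20 arcs and the single-team graph 15, the difference being the repair transitions removed by the priority rule.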

34.1.3.8

Exercise 31.8—Availability: Aggregated States

The aim of the exercise is to extend Exercise 31.6 by aggregating the states related to the similar pumps P1 and P2. The two pumps being similar, it does not matter which one is running or failed: what matters is whether at least one of them is running (system up state) or both are failed (system down state). Therefore, the group of the two pumps can be considered as an item with 3 states: 2P (two pumps running), 1P (one pump running) and 0P (no pump running). This is one state less than in the previous cases and, combined with the states of the valve, this leads to a graph with 6 states instead of 8.

Fig. 34.10 Availability Markov graph of the pumping system (single repair team)

The 6 system states are represented in Fig. 34.11. For example:
• $E_1$ is equivalent to $E_1$ in the previous model.
• $E_2$ gathers states $E_2$ and $E_3$ of the model in Fig. 34.10.

Fig. 34.11 Identification of the states of the pumping system (aggregation of pump states; truth table with columns V, Pumps, States and Class; classes Up and Down)

The corresponding Markov graph is drafted in Fig. 34.12:
• As two pumps can fail from state $E_1$, the transition rate has been multiplied by two ($2\lambda_p$).
• In the same way, as two pumps can be repaired from $E_3$ or $E_6$, the transition rate has been multiplied by two ($2\mu_p$).
• The pumps being stopped in states $E_4$ and $E_5$ (due to the failure of the valve), they cannot fail.

Fig. 34.12 Availability Markov graph of the pumping system (aggregation of pump states)



If $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formulae giving the availability and unavailability are the following:

$A(t) = P_1(t) + P_2(t)$

$U(t) = P_3(t) + P_4(t) + P_5(t) + P_6(t)$

34.1.3.9

Exercise 31.9—Availability: Repair Priority

The aim of the exercise is to extend Exercise 31.8 to take into consideration the repair priority V, P1 and P2 when there is a single repair team. The repair priority can be implemented directly on the graph in Fig. 34.12:
• In states $E_5$ or $E_6$, the repair of the valve has priority and the transitions related to pump repairs have to be removed.
• In state $E_3$, only one pump can be repaired at a time, so the transition rate has to be changed from $2\mu_p$ to $\mu_p$.

The result is given in Fig. 34.13. Again, if $P_i(t)$ is the probability for the pumping system to be in state $E_i$, the formulae giving the availability and unavailability are the following:

$A(t) = P_1(t) + P_2(t)$

$U(t) = P_3(t) + P_4(t) + P_5(t) + P_6(t)$


Fig. 34.13 Availability Markov graph of the pumping system (aggregation of pump states + repair priority)
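The formulae of Exercise 31.9 can be evaluated at steady state by solving the balance equations of the 6-state graph. The sketch below is an illustration under stated assumptions: the numerical rates are not copied from Table 34.1 but back-calculated from the path contributions quoted in Tables 34.3 and 34.4, the valve is assumed able to fail from $E_3$ (transition $E_3 \to E_6$), and the solver is a plain Gaussian elimination written for the occasion.

```python
# Assumed rates per hour (back-calculated from Tables 34.3 and 34.4, not from Table 34.1).
LAMBDA_P, MU_P = 6.5e-5, 0.02
LAMBDA_V, MU_V = 6.8e-6, 0.08

# Indices 0..5 stand for E1..E6 of the aggregated graph with repair priority.
RATES = {
    (0, 1): 2 * LAMBDA_P, (0, 3): LAMBDA_V,
    (1, 0): MU_P, (1, 2): LAMBDA_P, (1, 4): LAMBDA_V,
    (2, 1): MU_P, (2, 5): LAMBDA_V,          # E3 -> E6 is an assumption
    (3, 0): MU_V, (4, 1): MU_V, (5, 2): MU_V,
}

def steady_state(rates, n):
    """Solve the balance equations, the last one being replaced by sum(pi) = 1."""
    a = [[0.0] * n for _ in range(n)]
    for (i, j), q in rates.items():
        a[j][i] += q          # inflow into j coming from i
        a[i][i] -= q          # outflow from i
    a[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):      # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (b[row] - tail) / a[row][row]
    return x

pi = steady_state(RATES, 6)
availability = pi[0] + pi[1]
unavailability = sum(pi[2:])
# Flow from the up states to the down states: E1->E4, E2->E3 and E2->E5.
failure_frequency = pi[0] * LAMBDA_V + pi[1] * (LAMBDA_P + LAMBDA_V)
print(availability, unavailability, failure_frequency)
```

With the assumed rates, the computed asymptotic unavailability and failure frequency agree to three significant figures with the values $1.06 \times 10^{-4}$ and $7.22 \times 10^{-6}$ failures per hour reported in Exercise 31.12.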

34.1.3.10

Exercise 31.10—Availability: Probability of Failure on Demand

The aim of the exercise is to extend Exercise 31.9 to the case where one of the pumps is operated in standby position and can fail to start on demand.

Compared to the Markov graph in Fig. 34.13, state $E_1$ becomes $V.P.P_{SB}$, where one of the pumps is running and the other is in standby position. Therefore, from this state, only one pump can fail and the transition rate is only $\lambda_p$. When the running pump fails, the standby pump is started: the start succeeds with probability $(1 - \gamma)$ and fails with probability $\gamma$. This has been implemented in the Markov graph drafted in Fig. 34.14 where, in addition, compared to the Markov graph in Fig. 34.13, state $E_4$ becomes $\overline{V}.P.P_{SB}$.

Fig. 34.14 Availability Markov graph of the pumping system (aggregation of pump states + repair priority and failure on demand)


34.1.3.11

Exercise 31.11: Unreliability and Failure Rate

The aim of the exercise is to extend Exercise 31.1 to calculate the unreliability of the pumping system over 20 years and its failure rate, and to compare the asymptotic failure rate to the approximation provided in Chap. 31. Contrary to the previous exercises, which can be done by hand, the probabilistic calculations require a Markovian package able to perform time-dependent calculations of the unreliability and of the system failure rate, as explained in Chap. 31. The results presented in Fig. 34.15 have been obtained from the Markov graph presented in Fig. 34.3 with the Markov module of the free demo version of the GRIF-Workshop (2020) software package, which provides them straightforwardly.

The unreliability increases slowly from 0 toward 1, while the failure rate starts at $6.8 \times 10^{-6}$ h$^{-1}$ (the failure rate of the valve) at time t = 0 and converges quickly (between 200 and 300 h, i.e. about 4 times the mean overall repair time of the pumps) toward an asymptotic value equal to $7.22 \times 10^{-6}$ h$^{-1}$. According to Sect. 31.5.2.3, the asymptotic failure rate can be approximated by identifying the shortest paths running from the perfect state, $E_1$, to the down state, $E_4$, as shown in Table 34.2. The sum of the paths gives $\tilde{\Lambda} = 7.27 \times 10^{-6}$ h$^{-1}$, which is very close to $7.22 \times 10^{-6}$ h$^{-1}$ and slightly conservative.

Fig. 34.15 Unreliability and failure rate of the pumping system (left panel: unreliability versus time, 0 to 80000 h; right panel: failure rate versus time, 0 to 500 h)
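The time-dependent behaviour of the failure rate can be reproduced without a dedicated package by integrating the state equations of the reliability graph of Fig. 34.3. The sketch below uses a hand-written fourth-order Runge-Kutta scheme; the rates are assumptions back-calculated from the contributions listed in Tables 34.2 to 34.4 (they are not copied from Table 34.1), and all function names are hypothetical.

```python
# Assumed rates per hour (back-calculated, not taken from Table 34.1).
LAMBDA_P, MU_P, LAMBDA_V = 6.5e-5, 0.02, 6.8e-6

def derivative(p):
    """State equations of the up states E1, E2, E3 (the absorbing state is implicit)."""
    p1, p2, p3 = p
    leave_degraded = MU_P + LAMBDA_P + LAMBDA_V
    return (
        -(2 * LAMBDA_P + LAMBDA_V) * p1 + MU_P * (p2 + p3),
        LAMBDA_P * p1 - leave_degraded * p2,
        LAMBDA_P * p1 - leave_degraded * p3,
    )

def rk4_step(p, dt):
    k1 = derivative(p)
    k2 = derivative(tuple(x + 0.5 * dt * k for x, k in zip(p, k1)))
    k3 = derivative(tuple(x + 0.5 * dt * k for x, k in zip(p, k2)))
    k4 = derivative(tuple(x + dt * k for x, k in zip(p, k3)))
    return tuple(x + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4))

def failure_rate(p):
    """Flow into the absorbing state divided by the reliability R(t)."""
    p1, p2, p3 = p
    flow_to_down = LAMBDA_V * p1 + (LAMBDA_P + LAMBDA_V) * (p2 + p3)
    return flow_to_down / (p1 + p2 + p3)

def solve(t_end, dt=1.0):
    p = (1.0, 0.0, 0.0)               # the system starts in the perfect state E1
    for _ in range(int(round(t_end / dt))):
        p = rk4_step(p, dt)
    return p

p_500 = solve(500.0)
unreliability_500 = 1.0 - sum(p_500)
rate_0 = failure_rate((1.0, 0.0, 0.0))
rate_500 = failure_rate(p_500)
print(unreliability_500, rate_0, rate_500)
```

Under these assumptions the failure rate starts at the valve failure rate and has settled at its asymptotic value well before 500 h, as described for Fig. 34.15. Calling `solve` with a longer horizon gives the unreliability over 20 years.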

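Table 34.2 Failure rate approximation from the reliability Markov graph

| Paths | Formulae | Failure rate contribution |
|---|---|---|
| $E_1 \to E_4$ | $\lambda_v$ | $6.80 \times 10^{-6}$ h$^{-1}$ |
| $E_1 \to E_2 \to E_4$ | $\lambda_p \dfrac{\lambda_p + \lambda_v}{\lambda_p + \lambda_v + \mu_p}$ | $2.33 \times 10^{-7}$ h$^{-1}$ |
| $E_1 \to E_3 \to E_4$ | $\lambda_p \dfrac{\lambda_p + \lambda_v}{\lambda_p + \lambda_v + \mu_p}$ | $2.33 \times 10^{-7}$ h$^{-1}$ |
| Failure rate (approx.) | $\tilde{\Lambda}$ | $7.27 \times 10^{-6}$ h$^{-1}$ |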
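As a worked check of Table 34.2, the three path contributions can be summed explicitly. The expression below only restates the table; $\lambda_p$, $\lambda_v$ and $\mu_p$ are taken to denote the pump failure rate, the valve failure rate and the pump repair rate.

```latex
\tilde{\Lambda}
  = \lambda_v + 2\,\lambda_p\,\frac{\lambda_p + \lambda_v}{\lambda_p + \lambda_v + \mu_p}
  \approx 6.80\times10^{-6} + 2 \times 2.33\times10^{-7}
  \approx 7.27\times10^{-6}\ \mathrm{h}^{-1}
```

Each two-step path multiplies the rate of entering the degraded state by the probability of leaving it toward the down state rather than being repaired.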

Fig. 34.16 Unavailability and failure frequency of the pumping system (left panel: unavailability versus time, 0 to 500 h, with the value 9.96E-05 marked; right panel: failure frequency in failures per hour versus time, 0 to 500 h, with the value 7.18E-06 marked)

34.1.3.12

Exercise 31.12: Unavailability and Failure Frequency

The aim of the exercise is to extend Exercise 31.9 to calculate the unavailability and the failure frequency of the pumping system over 500 h, as well as the average unavailability and failure frequency over 1 year, and to compare the asymptotic unavailability to the approximation provided in Chap. 31. Again, this exercise cannot be performed by hand and a Markovian software package is needed. The results presented in Fig. 34.16 have been obtained with the Markov module of the free demo version of GRIF-Workshop (2020), which provides them straightforwardly.

Convergence of the Unavailability/Failure Frequency

Both the unavailability and the failure frequency converge rather quickly toward asymptotic values (between 200 and 300 h, i.e. about 4 times the mean overall repair time of the pumps). This gives:
• Asymptotic unavailability: $U_{as} = 1.06 \times 10^{-4}$.
• Asymptotic failure frequency: $w_{as} = 7.22 \times 10^{-6}$ failures per hour.

The average values over [0, 500 h] can be calculated from the curves. As shown in Fig. 34.16, they are slightly lower than the asymptotic values because the effect of the transient period is still perceptible.

Again, according to Chap. 31, the failure rate can be approximated by identifying the shortest paths running from the perfect state, $E_1$, to one of the down states, as shown in Table 34.3. The sum of the paths gives $\tilde{\Lambda} = 7.27 \times 10^{-6}$ h$^{-1}$, which is very close to the failure frequency $7.22 \times 10^{-6}$ h$^{-1}$, which is itself very close to the failure rate calculated with the reliability Markov graph. This is due to the quick repairs of the failed items.

According to Chap. 31, the asymptotic unavailability can be approximated in the same way from the paths running from $E_1$ to a down state and coming back to an up state (see Table 34.4). The sum of the paths gives $\tilde{U}_{as} = 1.07 \times 10^{-4}$, which is very close to the exact result $U_{as} = 1.06 \times 10^{-4}$ but slightly conservative.

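Table 34.3 Failure rate approximation from the availability Markov graph

| Paths | Formulae | Failure rate contribution |
|---|---|---|
| $E_1 \to E_4$ | $\lambda_v$ | $6.80 \times 10^{-6}$ h$^{-1}$ |
| $E_1 \to E_2 \to E_3$ | $2\lambda_p \dfrac{\lambda_p}{\lambda_p + \lambda_v + \mu_p}$ | $4.21 \times 10^{-7}$ h$^{-1}$ |
| $E_1 \to E_2 \to E_5$ | $2\lambda_p \dfrac{\lambda_v}{\lambda_p + \lambda_v + \mu_p}$ | $4.4 \times 10^{-8}$ h$^{-1}$ |
| Failure rate (approx.) | $\tilde{\Lambda}$ | $7.27 \times 10^{-6}$ h$^{-1}$ |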

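Table 34.4 Asymptotic unavailability approximation from the availability Markov graph

| Paths | Formulae | Asymptotic unavailability contribution |
|---|---|---|
| $E_1 \to E_4 \to E_1$ | $\lambda_v / \mu_v$ | $8.50 \times 10^{-5}$ |
| $E_1 \to E_2 \to E_3 \to E_2$ | $2\lambda_p \dfrac{\lambda_p}{\lambda_p + \lambda_v + \mu_p} \Big/ \mu_p$ | $2.10 \times 10^{-5}$ |
| $E_1 \to E_2 \to E_5 \to E_2$ | $2\lambda_p \dfrac{\lambda_v}{\lambda_p + \lambda_v + \mu_p} \Big/ \mu_v$ | $5.51 \times 10^{-7}$ |
| Asymptotic unavailability (approx.) | $\tilde{U}_{as}$ | $1.07 \times 10^{-4}$ |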
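The path approximations of Tables 34.3 and 34.4 amount to simple arithmetic, sketched below. The numerical rates are assumptions: they are back-calculated from the contributions quoted in the two tables rather than copied from Table 34.1, and the variable names are hypothetical.

```python
# Assumed rates per hour (back-calculated from Tables 34.3 and 34.4).
LAMBDA_P, MU_P = 6.5e-5, 0.02
LAMBDA_V, MU_V = 6.8e-6, 0.08

leave_e2 = LAMBDA_P + LAMBDA_V + MU_P   # total rate out of the degraded state E2

# (path, frequency of reaching the down state, repair rate leaving that down state)
PATHS = [
    ("E1->E4", LAMBDA_V, MU_V),
    ("E1->E2->E3", 2 * LAMBDA_P * LAMBDA_P / leave_e2, MU_P),
    ("E1->E2->E5", 2 * LAMBDA_P * LAMBDA_V / leave_e2, MU_V),
]

failure_rate_approx = sum(freq for _, freq, _ in PATHS)
# Each path contributes its frequency times the mean time spent in the down state.
unavailability_approx = sum(freq / mu for _, freq, mu in PATHS)

for path, freq, mu in PATHS:
    print(path, freq, freq / mu)
print(failure_rate_approx, unavailability_approx)
```

Dividing a path frequency by the repair rate of the down state it reaches is what turns the failure rate contributions of Table 34.3 into the unavailability contributions of Table 34.4.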

Average Unavailability over One Year

One way to calculate the average availability/unavailability of the system is, as above, to run the same model over one year instead of 500 h and to average the curves. Another way is to use the accumulated sojourn times spent in the various states of the system over one year of operation. They are provided by the same calculations and the results are given in the third column of Table 34.5:
• Adding them leads to the accumulated sojourn times in up and down states (column number 4).
• Dividing the results by T = 8760 h gives the average availability and unavailability in the last column.

This leads to an average unavailability equal to $\bar{U}(T) = 1.06 \times 10^{-4}$, which is equal to the asymptotic unavailability $U_{as} = 1.06 \times 10^{-4}$ calculated above. This means that the memory of the transient period (from 0 to about 300 h) has been lost by the Markov process.

Table 34.5 Accumulated sojourn times and average availability/unavailability (T = 1 year)

| Class | States | Accumulated sojourn times over 1 year (h) | Total uptime/downtime (h) | Average availability/unavailability |
|---|---|---|---|---|
| Up states | $E_1$ | 8702.830327 | 8759.075 | $\bar{A}(T) = 0.999894$ |
| Up states | $E_2$ | 56.244453 | | |
| Down states | $E_3$ | 0.181745 | 0.925 | $\bar{U}(T) = 1.06 \times 10^{-4}$ |
| Down states | $E_4$ | 0.738685 | | |
| Down states | $E_5$ | 0.004774 | | |
| Down states | $E_6$ | 0.000015 | | |

Average Failure Frequency over One Year

The average failure frequency over [0, 8760 h] can be obtained just by extending the calculation interval used in Fig. 34.16 from 500 to 8760 h. This gives $\bar{w}(T) = 7.22 \times 10^{-6}$ failures per hour. Again, $\bar{w}(T)$ is equal to the asymptotic value, $w_{as}$, calculated above. This confirms that, over one year of operation, the Markov process has lost the memory of the transient period.
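The sojourn-time calculation of Table 34.5 can be checked directly. The sketch below only re-adds the figures of the table; the dictionary and variable names are hypothetical.

```python
# Accumulated sojourn times in hours over one year, copied from Table 34.5.
SOJOURN_HOURS = {
    "E1": 8702.830327, "E2": 56.244453,                   # up states
    "E3": 0.181745, "E4": 0.738685,                       # down states
    "E5": 0.004774, "E6": 0.000015,
}
UP_STATES = ("E1", "E2")
T = 8760.0   # one year in hours

uptime = sum(SOJOURN_HOURS[s] for s in UP_STATES)
downtime = sum(h for s, h in SOJOURN_HOURS.items() if s not in UP_STATES)

average_availability = uptime / T
average_unavailability = downtime / T
print(uptime, downtime, average_availability, average_unavailability)
```

The sojourn times add up to the full year, which is a useful sanity check on the output of any Markov package before the averages are computed.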

34.1.3.13

Exercise 31.13: Production Availability

The aim of the exercise is to extend Exercise 31.7 to calculate the production availability of the pumping system when the production capacity (efficiency) of P1 is 90% and that of P2 is 10%. In this exercise, the overall pumping capacity is 100% and the pumps are no longer redundant. The efficiency of the states (production capacity) is given in Table 34.6: $E_1$ is the perfect state, $E_2$ and $E_3$ are degraded states and in the other states the production is completely lost. Therefore, it is no longer possible to aggregate the states and the 8 states identified in Fig. 34.10 have to be considered. This leads to the complete Markov graph drawn in Fig. 34.17. Thanks to the analysis done in Sect. 34.1.3.7, it is rather easy to build this graph because the analyst can focus on the layout of the graph rather than on the transitions, which have been previously identified.

Running the Markov graph in Fig. 34.17 leads to the production availability curve drafted in Fig. 34.18. Like the availability, the production availability converges toward an asymptotic value ($Pdy_{as} = 99.67\%$) between 200 and 300 h of operation. Over the 500 h period shown in Fig. 34.18, the average production availability is equal to $\overline{Pdy}(T) = 99.7\%$ (i.e. an average production unavailability of $1 - \overline{Pdy}(T) = 0.3\%$).

Table 34.6 State efficiency

| State | Efficiency (%) | State | Efficiency (%) |
|---|---|---|---|
| $E_1 = V.P1.P2$ | 100 | $E_5 = \overline{V}.P1.P2$ | 0 |
| $E_2 = V.P1.\overline{P2}$ | 90 | $E_6 = \overline{V}.P1.\overline{P2}$ | 0 |
| $E_3 = V.\overline{P1}.P2$ | 10 | $E_7 = \overline{V}.\overline{P1}.P2$ | 0 |
| $E_4 = V.\overline{P1}.\overline{P2}$ | 0 | $E_8 = \overline{V}.\overline{P1}.\overline{P2}$ | 0 |

Fig. 34.17 Production availability Markov graph of the pumping system

Fig. 34.18 Production availability and average production availability of the pumping system over 500 h (production availability between 0.996 and 1 versus time, 0 to 500 h; average value 99.7% marked)

The average production unavailability of $3.0 \times 10^{-3}$ can be compared to the average unavailability of $9.96 \times 10^{-5}$ obtained when the pumps are fully redundant (i.e. when the pumping capacity is equal to 200%).
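For reference, the production availability can be written as the efficiency-weighted sum of the state probabilities. The expression below assumes this usual definition (expected production capacity) and uses the efficiencies $e_i$ of Table 34.6; the symbol $e_i$ is introduced here for illustration only.

```latex
Pdy(t) = \sum_{i=1}^{8} e_i \, P_i(t)
       = P_1(t) + 0.9\, P_2(t) + 0.1\, P_3(t)
```

Only the three states with a non-zero efficiency contribute, which is why a short stay in $E_3$ (10% capacity) costs far more production than the same stay in $E_2$ (90% capacity).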

Table 34.7 Basic operational data for the service station

| Operation | Mean time | Hazard rate |
|---|---|---|
| Car arrival | Every 2 min | $\eta = 30$ h$^{-1}$ |
| Refuelling from P1 or P2 | 3 min | $\varepsilon_1 = \varepsilon_2 = 20$ h$^{-1}$ |
| Refuelling from P3 | 4 min | $\varepsilon_3 = 15$ h$^{-1}$ |
| Cash operation | 30 s | $\xi = 120$ h$^{-1}$ |

34.2 Petri Net Approach Exercises

34.2.1 Example: Service Station

34.2.1.1

Description of the Service Station System

Let us consider a service station as illustrated in Fig. 34.19:
• Three fuel pumps deliver any kind of fuel.
• On arrival at the service station, a car takes one of the free fuel pumps.
• If no fuel pump is free, the cars queue until a fuel pump becomes free.
• At the exit of the station, the cars wait to pay at the cash.

Fig. 34.19 Schematic of the service station to be studied (entrance queue, fuel pumps P1, P2 and P3, cash queue and cash)

34.2.1.2

Assumptions and Numerical Data

Basic modelling and calculations: the car arrival, refuelling and cash operations are governed by exponential distributions (noted $\exp(\lambda)$, with cumulative distribution function $1 - e^{-\lambda t}$, where $\lambda$ is the hazard rate) and the data to be used for basic calculations are given in Tables 34.7 and 34.8.

Advanced modelling and calculations: the data to be used for advanced calculations are given in Tables 34.9, 34.10 and 34.11.
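The hazard rates of Table 34.7 are simply the reciprocals of the mean times, and exponentially distributed delays with these rates are what a Monte Carlo simulation draws at each firing. A minimal sketch, with hypothetical names and an arbitrary seed:

```python
import random

# Mean times in minutes, from Table 34.7.
MEAN_MINUTES = {
    "car arrival": 2.0,
    "refuelling P1/P2": 3.0,
    "refuelling P3": 4.0,
    "cash operation": 0.5,
}
# Hazard rate per hour = 60 / mean time in minutes.
RATES_PER_HOUR = {name: 60.0 / minutes for name, minutes in MEAN_MINUTES.items()}

def sample_delay_hours(rate_per_hour, rng):
    """Draw one exponentially distributed delay (in hours) for the given rate."""
    return rng.expovariate(rate_per_hour)

rng = random.Random(1)   # arbitrary seed, for repeatability only
samples = [sample_delay_hours(RATES_PER_HOUR["car arrival"], rng) for _ in range(20000)]
mean_minutes = 60.0 * sum(samples) / len(samples)
print(RATES_PER_HOUR, mean_minutes)
```

The empirical mean of the sampled inter-arrival delays should come out close to the 2 min of Table 34.7.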

Table 34.8 Basic repair data for the fuel pumps

| Item | MTTF | Failure rate (λ) | MTTR | Repair rate (μ) | Size |
|---|---|---|---|---|---|
| Fuel pumps | 10 days | $\lambda = 4.17 \times 10^{-3}$ h$^{-1}$ | 4 h | $\mu = 0.25$ h$^{-1}$ | / |
| Nb of repair teams | / | / | / | / | 1 |

Table 34.9 Advanced operational data for the service station

| Item | Mean time | Rate | Maximum size |
|---|---|---|---|
| Car arrival (night) | Every 8 min | $\eta_N = 7.5$ h$^{-1}$ | / |
| Refuelling from P2 and P3 (night) | Closed | / | / |
| Entrance queue | / | / | 10 |
| Cash queue | / | / | 4 |
| Day | 12 h | / | / |
| Night | 12 h | / | / |

Table 34.10 Spare part management

| Item | Maximum stock | Re-provisioning | Provisioning delay |
|---|---|---|---|
| Spare parts | 3 | 1 (one at a time) | 168 h |

Table 34.11 Advanced repair data for the fuel pumps

| Item | Delay |
|---|---|
| Mobilisation | 1 h |

34.2.2 Description of the Exercises Related to the Service Station

Take time to try to find the solution of the exercises before looking at the solutions! The aim of the exercises is to build and/or perform Monte Carlo simulations on Petri net models as described in Chaps. 32 and 33.

Exercise 33.1 (Sects. 33.4.5, 33.5.1 and 33.8.5): model with basic PNs (i.e. without predicates or assertions) the queuing and refuelling processes under the following assumptions (see Table 34.7): no failure of the fuel pumps, no limitation on the size of the entrance and cash queues and no difference between night and day.

Exercise 33.2 (Sects. 33.5.1 and 33.4.5): model with basic PNs (i.e. without predicates or assertions) the fuel pump failure and repair processes under the assumption of a single repair team (see Table 34.8) and no difference between night and day.

Exercise 33.3 (Sect. 33.5.1): link the models developed in Exercises 33.1 and 33.2 under the assumption that when a pump fails, the ongoing refuelling is stopped and the car goes to the cash to pay.

Exercise 33.4 (Sect. 33.4.3): think about the model developed in Exercise 33.3 with regard to event frequencies and simulation difficulties.

Exercise 33.5 (Sects. 33.4.3, 33.5.1 and 33.6.1): improve the queuing model at the service station entrance to a more realistic situation: when the queue is greater than or equal to 10 cars, a newly arriving car gives up refuelling here and drives on to another service station. Count the number of lost sales in order to assess whether the service station is well dimensioned for the consumer demand.

Exercise 33.6 (Sects. 33.5.1 and 33.8.5): improve the queuing model at the cash to a more realistic situation: when the queue is greater than or equal to 4 cars (one at the cash and 3 waiting to pay), the cars which have finished refuelling wait in front of the fuel pumps until one place becomes free in the queue at the cash.

Exercise 33.7 (Sects. 33.5.1 and 33.8.5): gather the improved queuing models developed in Exercises 33.5 and 33.6 with the refuelling model developed in Exercise 33.1 and build the overall queuing and refuelling processes for the whole service station.

Exercise 33.8 (Sects. 33.5.1 and 33.6.2): model the night and day cycles so that they can be used in combination with the queuing, refuelling and failure/repair processes developed in the previous exercises. Apply this model to the case where the arrival rate is divided by 4 at night.

Exercise 33.9 (Sects. 33.5.1 and 33.6.2): extend Exercise 33.8 to the queuing model developed in Exercise 33.7 under the assumption that pumps P1 and P2 are closed at night.


Exercise 33.10 (Sects. 33.5.1, 33.6.1 and 33.6.2): extend Exercise 33.8 to the failure/repair model developed in Exercise 33.2 under the assumption that the repair team does not work at night.

Exercise 33.11 (Sects. 33.5.1 and 33.7.2.2): extend Exercise 33.10 to the mobilisation of the repair team and to the spare part provisioning as described in Tables 34.10 and 34.11. Use predicates and assertions to model the number of faulty pumps at any time.

Exercise 33.12 (Chap. 32 and Sects. 33.5.1 and 33.8.5): reduce Exercise 33.10 to the case where only one pump is open during the daytime and closed at night, in order to obtain a more tractable model for a Monte Carlo simulation. Calculate how many cars refuel over 1 month and how many sales are lost because the queue at the service station entrance is too long. Note that this exercise has been dimensioned to be performed with the Petri module of the free demo version of the GRIF-Workshop (2020) software package.

Exercise 33.13 (Chap. 32 and Sect. 33.8.5): extend the Monte Carlo simulation developed in Exercise 33.12 to obtain the curves related to the evolution of the number of cars in the queues at the entrance and exit of the service station, which is closed at night.

Exercise 33.14 (Chap. 32 and Sects. 33.5.1 and 33.8.5): same exercise as 33.12 when the single pump is open night and day. Perform a Monte Carlo simulation in order to calculate how many cars refuel over 1 month and how many sales are lost because the queue at the service station entrance is too long.

Exercise 33.15 (Chap. 32 and Sect. 33.8.5): extend the Monte Carlo simulation developed in Exercise 33.14 to obtain the curves related to the evolution of the number of cars in the queues at the entrance and exit of the service station.

Exercises 33.1–33.11 are designed to be done by hand, whereas Exercises 33.12–33.15 need a Petri net software package. They have been dimensioned to be achievable with the free demo version of the GRIF-Workshop (2020) software package.

34.2.3 Solutions of the Exercises Related to the Service Station

34.2.3.1

Exercise 33.1—Queuing and Refuelling Process

The aim of the exercise is threefold: identify the various processes involved in the queuing and refuelling of the cars, draw them in basic Petri net form and link them together. This has to be done under the following simplifying assumptions: no failure of the fuel pumps, no limitation on the size of the entrance and cash queues and no difference between night and day.


According to the description of the service station, three processes are involved in the PN modelling:
• The queuing process at the service station entrance.
• The car refuelling process at one pump.
• The queuing process to pay at the cash and exit the service station.

Once modelled, these three processes lead to three sub-PN modules which are then linked together to model the complete service station.

A token generator (see Sect. 33.8.5) can be used for modelling the car queue at the service station entrance and this can be done in different ways (see Fig. 34.20):
• Classical way, on the left-hand side of the figure.
• With a double arc, in the middle of the figure.
• Without upstream place, on the right-hand side of the figure.

These three models are equivalent. Transition AoC (arrival of a car) is valid again after being fired. It is therefore fired again and again and, thanks to the use of an exponential distribution, the average delay between the firings is equal to $\delta = 1/\eta = 2$ min. Therefore, one new token (i.e. a new car) arrives in place Qin every 2 min on average.

Fig. 34.20 Car arrival process at the service station entrance (three equivalent car "generators" feeding place Qin, cars queuing for a pump, through transition AoC)

The refuelling process of one car at one of the fuel pumps is modelled in Fig. 34.21:
• When the pump is not busy (no token in place PB) and at least one token is present in Qin (at least one car is waiting for refuelling), transition SoRF (start of refuelling) is valid and can be immediately fired.
• One token is removed from Qin and one token arrives in PB. This inhibits transition SoRF and validates transition EoRF (end of refuelling). Therefore, only one car can refuel at a time.
• The refuelling delay is generated at random according to an exponential distribution with an average delay equal to $\delta = 1/\varepsilon$, which depends on the pump (see Table 34.7).
• When transition EoRF is fired, one token is removed from PB (the pump is free again) and one token is added in place Qout, which models the car queue at the cash.

Fig. 34.21 Refuelling car process at one of the fuel pumps (places Qin and PB, transitions SoRF and EoRF)

The payment process at the cash (cash operations) is modelled in Fig. 34.22:
• Place Qout contains as many tokens as cars waiting to pay.
• When at least one token is present in Qout, transition EoC (end of cash operation) is valid.
• The payment delay is generated at random according to an exponential distribution with an average delay equal to $\delta = 1/\xi = 30$ s (see Table 34.7).
• When transition EoC is fired, one token is removed from Qout.

Fig. 34.22 Payment process at the cash (places Qout and Wout, transition EoC)

The last step is to combine the sub-PNs developed above in order to encompass the three fuel pumps available in the service station. This is done in Fig. 34.23: the three fuel pumps share the same Qin and Qout places, which therefore provide the links between the individual sub-PN models.

34.2.3.2

Exercise 33.2—Failure and Repair Process

The aim of the exercise is to model with basic PNs the fuel pump failure and repair processes under the assumption of a single repair team (see Table 34.8) and no difference between night and day.

Fig. 34.23 Overall modelling of queuing and refuelling processes of the service station (shared places Qin, Qout and Wout; transitions SoRFi and EoRFi and places PBi for pumps P1 to P3; transition EoC)
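The behaviour of the queuing and refuelling model of Fig. 34.23 can be explored with a small Monte Carlo sketch. This is not the GRIF implementation: because all delays are exponential, the code simply draws the next firing among the valid timed transitions (a race of exponentials), then fires the immediate transitions SoRFi. The rule that a waiting car takes the free pump with the lowest index is an assumption, as are the function name and the seed.

```python
import random

ETA = 30.0                      # car arrivals per hour (Table 34.7)
EPSILON = (20.0, 20.0, 15.0)    # refuelling rates of P1, P2, P3 per hour
XI = 120.0                      # cash operations per hour

def simulate(hours, seed=0):
    rng = random.Random(seed)
    q_in, q_out = 0, 0          # tokens in places Qin and Qout
    busy = [0, 0, 0]            # tokens in places PB1, PB2, PB3
    fired = {"AoC": 0, "EoRF": 0, "EoC": 0}
    t = 0.0
    while True:
        # Timed transitions that are currently valid, with their rates.
        enabled = [("AoC", -1, ETA)]
        enabled += [("EoRF", i, EPSILON[i]) for i in range(3) if busy[i]]
        if q_out:
            enabled.append(("EoC", -1, XI))
        total = sum(rate for _, _, rate in enabled)
        t += rng.expovariate(total)
        if t > hours:
            return fired, q_in, busy, q_out
        # Choose the transition that fires, proportionally to its rate.
        pick = rng.uniform(0.0, total)
        for kind, pump, rate in enabled:
            pick -= rate
            if pick <= 0.0:
                break
        fired[kind] += 1
        if kind == "AoC":
            q_in += 1
        elif kind == "EoRF":
            busy[pump] = 0
            q_out += 1
        else:
            q_out -= 1
        # Immediate transitions SoRFi: a waiting car takes a free pump
        # (lowest index first, which is an assumption).
        for i in range(3):
            if q_in and not busy[i]:
                q_in -= 1
                busy[i] = 1

FIRED, Q_IN, BUSY, Q_OUT = simulate(hours=200.0)
print(FIRED, Q_IN, BUSY, Q_OUT)
```

The token counts obey two conservation rules that are handy for checking a model: every car that arrived is either queuing, refuelling or has finished refuelling, and every car that finished refuelling is either waiting at the cash or has paid.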

The sub-PNs involving a single repair team proposed in Sects. 33.4.5 (Fig. 33.7) can be used as a basis to model the failure/repair process of the fuel pumps. This leads to the model in Fig. 34.24 where the single repair team is modelled by an auxiliary place RT (repair team) with a single token in it: • When the pump is in up state, there is one token in place PU. • From this state, the pump can fail through transition PF (pump failure). • This is governed by an exponential distribution with a failure rate λ (see Table 34.8). • When the pump fails, PF is fired, the token is removed from PU, one token arrives in PW and the pump waits for repair as long as the repair team is not available (no token in place RT). • As soon as one token is available in RT, transition SoR (start of repair) is fired. • This results in removing the token from RT and this prevents any other repair at the same time. • This results also in introducing one token in PR (pump repair) and transition EoR (end of repair) is validated. • This is governed by an exponential distribution with a repair rate μ (see Table 34.8).


34 Dynamic Modelling Exercises

Fig. 34.24 Modelling of the failure/repair process related to a fuel pump (places: PU pump up, PW pump waits for repair, PR pump repair, RT repair team (auxiliary place); transitions: PF pump failure, SoR start of repair, EoR end of repair)

• When the repair is completed, EoR is fired and one token is placed in PU (the pump is available again) and in RT (the repair team is available again for another repair task).

The previous sub-PN makes it possible to model the failure/repair process of the whole service station, as done in Fig. 34.25.
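The behaviour of this shared repair team can be sketched with a few lines of Python. This is only an illustrative stand-in for the Petri net, not the GRIF model: the function name `simulate_station` and its parameters are hypothetical, and the default values (mean time to failure of 240 h, mean time to repair of 4 h) are assumptions standing in for the data of Table 34.8. Because all delays are exponential, the next event can be drawn from the sum of the currently active rates.

```python
import random


def simulate_station(n_pumps=3, mttf=240.0, mttr=4.0, horizon=720.0, seed=0):
    """Return the mean fraction of time a pump is up over [0, horizon].

    mttf and mttr (hours) are assumed values, not taken from Table 34.8.
    """
    rng = random.Random(seed)
    lam, mu = 1.0 / mttf, 1.0 / mttr
    up = n_pumps        # tokens in the PU places
    waiting = 0         # tokens in the PW places
    in_repair = 0       # tokens in the PR places (0 or 1: single repair team)
    team_free = True    # token in the auxiliary place RT
    t, up_time = 0.0, 0.0
    while True:
        # SoR fires immediately when a pump waits and RT holds its token.
        if waiting > 0 and team_free:
            waiting -= 1
            in_repair = 1
            team_free = False
        # Competing exponential transitions: one PF per up pump, plus EoR.
        rate = up * lam + in_repair * mu
        dt = rng.expovariate(rate)
        if t + dt >= horizon:
            up_time += up * (horizon - t)
            break
        up_time += up * dt
        t += dt
        if rng.random() < up * lam / rate:
            up -= 1             # PF fires: token moves from PU to PW
            waiting += 1
        else:
            in_repair = 0       # EoR fires: tokens go back to PU and RT
            team_free = True
            up += 1
    return up_time / (n_pumps * horizon)
```

Calling `simulate_station(horizon=100000.0)` gives a long-run estimate of the availability per pump; with a single team it is expected to sit slightly below the value obtained when every repair starts immediately.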

Fig. 34.25 Overall modelling of the failure/repair process of the service station

34.2.3.3

Exercise 33.3—Link Between Queuing and Failure/Repair Processes

The aim of the exercise is twofold: link the models developed in Exercises 33.1 and 33.2 for one pump, under the assumption that when a pump fails the ongoing refuelling is stopped and the car goes to the cash to pay, and build the model for the overall service station.

As shown in Fig. 34.26, the link between the queuing process (Fig. 34.21) and the repair/failure process (Fig. 34.24) can be made through place PU:
• Transition SoRF is valid only if the pump is in the up state (one token inside PU).
• When SoRF is fired, one token is removed from Qin but no change occurs for PU due to the double arc (the token removed from PU is immediately put back).
• Refuelling is possible as long as the fuel pump is in the up state and it stops immediately when the pump fails. This is modelled by transition StopRF (stop of refuelling), which is inhibited as long as the fuel pump is available.

The overall model for the service station is obtained by:
• Replacing in Fig. 34.23 the sub-PN of Fig. 34.21 by the sub-PN of Fig. 34.26, which gives the global PN illustrated in Fig. 34.27;
• And gathering this PN with the PN presented in Fig. 34.25. The link between the PNs is made through the places PU related to each of the fuel pumps.

Fig. 34.26 Link between the refuelling car process and the failure/repair process for one fuel pump
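The role of the double arc and of the inhibitor arc in Fig. 34.26 can be illustrated with a minimal enabling/firing rule. The sketch below is an assumption-laden toy, not the GRIF data model: the helper names `is_enabled` and `fire`, the dictionary layout and the unit arc weights are all hypothetical, delays are ignored, and the mechanism of Fig. 34.21 that keeps one car at a time at the pump is not reproduced.

```python
def is_enabled(marking, transition):
    """A transition is valid if its input and read places are marked
    and its inhibitor places are empty (all arc weights assumed to be 1)."""
    return (all(marking[p] >= 1 for p in transition.get("inputs", ()))
            and all(marking[p] >= 1 for p in transition.get("reads", ()))
            and all(marking[p] == 0 for p in transition.get("inhibitors", ())))


def fire(marking, transition):
    """Return the new marking; read and inhibitor places are left untouched."""
    new = dict(marking)
    for p in transition.get("inputs", ()):
        new[p] -= 1
    for p in transition.get("outputs", ()):
        new[p] += 1
    return new


# Double arc on PU: SoRF tests the token in PU without consuming it.
SORF = {"inputs": ("Qin",), "reads": ("PU",), "outputs": ("PB",)}
# Normal end of refuelling.
EORF = {"inputs": ("PB",), "outputs": ("Qout",)}
# Inhibitor arc from PU: StopRF is valid only when the pump is no longer up.
STOPRF = {"inputs": ("PB",), "inhibitors": ("PU",), "outputs": ("Qout",)}

marking = {"Qin": 2, "PU": 1, "PB": 0, "Qout": 0}
marking = fire(marking, SORF)        # a car starts refuelling, PU keeps its token
marking["PU"] = 0                    # the pump fails (PF fired in the other sub-PN)
if is_enabled(marking, STOPRF):
    marking = fire(marking, STOPRF)  # the car leaves the pump and goes to the cash
```

Modelling the double arc as a "read" place is a design shortcut: removing the token and putting it back immediately, as described above, leaves the marking unchanged in the same way.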


Fig. 34.27 Link between the overall queuing and failure/repair processes of the service station

34.2.3.4

Exercise 33.4—Simulation Difficulty Identification

The aim of the exercise is to examine the model developed in Exercise 33.3 with regard to event frequencies and simulation difficulties. The question may be surprising, but it should be considered every time a Monte Carlo simulation model is built: compare the frequencies of the events to see whether they are within the same order of magnitude. In the exercise, the queuing process is mainly governed by the car arrivals (one every 2 min on average) whereas the failure/repair process is mainly governed by the pump failures (one every 10 days = 14,400 min for a given pump). Therefore, on average, 7200 cars arrive between two failures of a given pump. In other words, with three pumps, roughly 2400 car arrivals (and then 2400 refuelling and cash queuing operations) have to be simulated on average before one pump failure is observed. Obviously, a large number of events must be simulated before any impact of the pump failures on the queuing process is observed. This is a typical problem which appears every time frequent events are merged with rare events in the same model.
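The orders of magnitude quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numerical inputs (2 min between arrivals, 10 days between failures of one pump, three pumps) are those given in the text.

```python
car_interarrival_min = 2.0            # one car every 2 min on average
pump_mttf_min = 10 * 24 * 60          # 10 days = 14,400 min for one pump
n_pumps = 3

# Car arrivals between two failures of a given pump.
arrivals_per_pump_failure = pump_mttf_min / car_interarrival_min

# With three independent pumps, failures are three times as frequent.
station_failure_interval_min = pump_mttf_min / n_pumps
arrivals_per_station_failure = station_failure_interval_min / car_interarrival_min

print(arrivals_per_pump_failure, arrivals_per_station_failure)
```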

34.2.3.5

Exercise 33.5—Limited Queue at the Entrance

The aim of the exercise is twofold: develop a more realistic queuing model at the service station entrance (queue limited to 10 cars) and count the number of sales lost because arriving cars give up refuelling here and leave to look for another service station.

In order to limit the size of the entrance queue to 10, the first idea is to inhibit the token generator developed in Fig. 34.20 when Qin contains 10 tokens. Unfortunately, this does not make it possible to count the lost sales, which is why another solution is proposed in Fig. 34.28:
• The token generator works as usual on the left-hand side of the figure.
• When a car arrives, one token is placed in place VC (visiting cars).
• This token is immediately removed by one of the transitions WfRF (wait for refuelling) or ABd (abandonment).
• Transition WfRF is inhibited when the number of tokens in Qin reaches 10.
• When ABd and WfRF are conflicting (i.e. valid at the same time), WfRF is fired first thanks to its higher priority (1) over ABd (priority of 0). Priority on transitions has been introduced in Sect. 33.6.1.

Therefore, the sub-PN proposed in Fig. 34.28 both limits the number of waiting cars to 10 and counts the number of cars that have given up refuelling at this service station.

Fig. 34.28 Car arrival/renouncement process at the service station entrance (places: VC visiting cars, LS lost sales, Qin; transitions: WfRF wait for refuelling, inhibited when Qin holds 10 tokens, and ABd abandonment)
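The conflict resolution between WfRF and ABd can be sketched as follows. This is a simplified illustration rather than the PN itself: the function name `route_visiting_car` and the dictionary used for the marking are hypothetical, and only the priorities (1 for WfRF, 0 for ABd) and the threshold of 10 come from the text.

```python
def route_visiting_car(marking, capacity=10):
    """Fire WfRF or ABd for one token in VC and return the fired transition."""
    if marking["VC"] < 1:
        raise ValueError("no visiting car to route")
    candidates = [(0, "ABd")]               # ABd is always valid, priority 0
    if marking["Qin"] < capacity:           # WfRF is inhibited once Qin is full
        candidates.append((1, "WfRF"))      # priority 1
    _, fired = max(candidates)              # the highest priority wins the conflict
    marking["VC"] -= 1
    marking["Qin" if fired == "WfRF" else "LS"] += 1
    return fired


# Thirteen cars arrive while no refuelling takes place:
marking = {"VC": 0, "Qin": 0, "LS": 0}
for _ in range(13):
    marking["VC"] += 1                      # the token generator adds a visiting car
    route_visiting_car(marking)
```

In this scenario the first ten cars join Qin and the last three end up in LS, which is exactly the counter that a simple inhibition of the token generator would not provide.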

Fig. 34.29 Limitation of the queue size to 4 at the cash (new place QaP: queuing at pump; new transition EoQ: end of queuing at pump)

34.2.3.6

Exercise 33.6—Limited Queue at the Exit

The aim of the exercise is to develop a more realistic queuing model at the cash before leaving the service station: the queue is limited to 4 cars and the other cars are blocked in front of the fuel pumps until a place becomes free in the queue at the cash.

When place Qout is full with four tokens in it, a mechanism blocking the other cars in front of the pumps has to be implemented. This can be done as illustrated in Fig. 34.29 by adding a new place QaP (queuing at pump) and a new transition EoQ (end of queuing at the pump):
• When EoRF is fired (the tank has been refuelled), one token is added to the new place QaP; this inhibits transition SoRF and it is no longer possible to start a new refuelling.
• If Qout contains fewer than 4 tokens, transition EoQ is valid. It is fired immediately, the token in QaP is removed and transition SoRF is no longer inhibited (a new refuelling can start).
• If Qout contains 4 tokens, transition EoQ is inhibited. The token remains in QaP and SoRF remains inhibited, so it is not possible to refuel a new car until the queue at the cash is reduced by one.

34.2.3.7

Exercise 33.7—Overall Queuing and Refuelling Model

The aim of the exercise is to gather the improved queuing models developed in Exercises 33.5 and 33.6 with the refuelling model developed in Exercise 33.1 and to build the queuing and refuelling model of the overall service station. The Petri net modelling the overall service station when the size of the queues is limited both at the entrance and at the exit is very easy to build by combining the sub-PNs developed in the previous exercises. Again, the link is made through the repeated places modelling the size of the queues at the entrance (Qin) and at the cash (Qout): this is done in Fig. 34.30.

Fig. 34.30 Overall PN with limitations of the queue size at the entrance and at the exit

Fig. 34.31 Night and day cycles and arrival rates during days and nights (left: day/night cycle with places D and N; right: car arrival process with transitions AaD arrival at day and AaN arrival at night)

34.2.3.8

Exercise 33.8—Night and Day Model

The aim of the exercise is twofold: model the night and day cycles so as to use them in combination with the queuing, refuelling and failure/repair processes developed in the previous exercises, and apply this model to the case where the arrival rate is divided by 4 at night.

The day/night alternation is easy to model and has already been analysed in Sect. 33.6.2. This leads to the sub-PN on the left-hand side of Fig. 34.31. According to Table 34.9, days and nights both last 12 h. The application of this model to the car arrival process is illustrated on the right-hand side of Fig. 34.31. It is derived from the sub-PN in Fig. 34.28, where the token generator has been divided into two transitions:
• Transition AaD (arrival at day), which is inhibited at night;
• Transition AaN (arrival at night), which is inhibited during the daytime and where the arrival rate has been divided by 4, according to Table 34.9.

34.2.3.9

Exercise 33.9—P1 and P2 Closed at Night

The aim of the exercise is to extend Exercise 33.8 to the queuing model developed in Exercise 33.7 under the assumption that pumps P1 and P2 are closed at night. This exercise simply consists in introducing the closure of the pumps at night in the PN developed in Fig. 34.30 and in connecting it to the car arrival process developed in Fig. 34.31. The result is shown in Fig. 34.32, where transitions SoRF1 and SoRF2 are inhibited at night.

Fig. 34.32 Overall modelling of queuing and refuelling processes with night and day alternation

34.2.3.10

Exercise 33.10—Repair Team Unavailable at Night

The aim of the exercise is to extend Exercise 33.8 to the failure/repair model developed in Exercise 33.2 under the assumption that the repair team does not work at night. The first step to solve this exercise is to analyse what happens for the failure/repair process of one pump which is closed at night and to modify the sub-PN illustrated in Fig. 34.24 accordingly:

Fig. 34.33 Modelling of the failure/repair process when the repair team is not available at night (left: pump Pi closed at night; right: pump Pi running at night)

• When pump Pi is closed at night, it seems realistic to consider that it cannot fail at night (i.e. that the failure process is suspended at night). This leads to inhibiting transition PF (pump failure) at night and to introducing “memory” in this transition (see Sect. 33.6.2). Of course, if the pump is not closed at night, it can continue to fail and PF is not inhibited.
• If the pump has already failed and is waiting for repair when night falls (pump closed at night), or if it fails at night (pump not closed at night), the repair cannot be started before the next day. This implies inhibiting transition SoR (start of repair) during the whole night.
• If the pump is under repair when night falls, the repair is suspended until the next day. This implies inhibiting transition EoR (end of repair) and introducing “memory” in this transition.

This leads to the sub-PNs shown in Fig. 34.33, which model the failure/repair processes of pumps according to their status at night: closed on the left-hand side or open on the right-hand side. These sub-PNs are similar to the one illustrated in Fig. 34.24 but inhibitor arcs have been added:
• When the pump is closed at night, it cannot fail and a repair cannot be started or completed at night. Transitions PF (pump failure), SoR (start of repair) and EoR (end of repair) are therefore inhibited when one token is present in place N (night).
• When the pump is open at night, it can fail but cannot be repaired at night. Transition PF can therefore be enabled, but SoR and EoR are still inhibited when one token is present in place N (night).

The model of the overall service station with no repairs during the night is presented in Fig. 34.34.
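The effect of “memory” on an inhibited transition can be illustrated by computing when a suspended delay actually elapses. The sketch below is a hypothetical helper, not a GRIF feature: the name `completion_time` and its arguments are assumptions, and it supposes that each 24 h cycle starts with 12 h of day followed by 12 h of night, that the delay only progresses during the day, and that the time already spent is kept when night falls.

```python
def completion_time(start, work, day_length=12.0, cycle=24.0):
    """Time at which a delay of `work` hours, started at `start`, has elapsed
    when the clock of the transition is frozen at night (memory kept)."""
    t = start
    remaining = work
    while True:
        phase = t % cycle
        if phase >= day_length:
            # Night: the transition is inhibited, jump to the next morning.
            t += cycle - phase
            continue
        available = day_length - phase      # daytime left before nightfall
        if remaining <= available:
            return t + remaining
        remaining -= available              # progress made so far is remembered
        t += available                      # now exactly at nightfall


# A 4 h repair started at 10:00 is interrupted at nightfall (12 h)
# and resumes the next morning (24 h).
print(completion_time(10.0, 4.0))
```

Without memory, the delay would be drawn again in the morning; with memory, only the 2 h still outstanding remain to be spent on the following day.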

Fig. 34.34 Failure/repair process of the overall service station with no repairs during nights (P3 open 24/24)

34.2.3.11

Exercise 33.11—Mobilisation and Spare Part Modelling

The aim of the exercise is twofold: extend Exercise 33.10 to the mobilisation of the repair team and to the spare part provisioning as described in Tables 34.10 and 34.11, and use predicates and assertions to model the number of faulty pumps at any time.

According to Table 34.10, the maximum stock of spare parts is equal to three. This can be interpreted as follows: there are 3 parts in stock and 1 spare part is ordered each time one of them has been used to repair a pump. The spare part provisioning process is illustrated on the left-hand side of Fig. 34.35:
• Place SP (spare parts) contains a number of tokens equal to the number of available spare parts.
• When this number becomes lower than three, transition SPP (spare part provisioning) becomes valid and, 168 h later, one new spare part is available.
• When SPP is fired, the token is removed from and immediately given back to place TG (token generator); SPP is therefore ready to be fired again as soon as the stock of spare parts goes below 3 units.

According to Table 34.11, when the repair team is mobilised, a delay of 1 h is needed before it is ready to start the repair. This requires a more sophisticated model than in the previous exercise. As this is not easy to achieve with basic PNs alone, this is an opportunity to use the assertions and predicates described in Sect. 33.5.2.

Fig. 34.35 Spare part provisioning and mobilisation processes (left: spare part provisioning; right: mobilisation process)

The mobilisation process of the repair team is illustrated on the right-hand side of Fig. 34.35:
• Place nMb (not mobilised) contains one token when the repair team is not mobilised.
• From this state, transition Mob (mobilisation) is inhibited as long as the number, NF, of faulty pumps is equal to 0 (i.e. the predicate ??NF > 0 is “false”) or as long as no spare part is available.
• When a pump fails, NF becomes positive (see Fig. 34.36) and Mob is fired if, in addition, at least one spare part is available.
• One token appears in SMb (start of mobilisation), the mobilisation proper starts and transition TtL (time to be on location) becomes valid.
• When the delay has elapsed, transition TtL is fired and one token arrives in place RT (repair team), which can be used as in the previous models.
• When the faulty pump is repaired, the token comes back to place RT (see Fig. 34.36).
• From this state, the repair team is demobilised either if no more pumps are faulty (transition dMob1) or if no more spare parts are available (transition dMob2).

To take into account the spare part and repair team procedures, the PN in Fig. 34.34 has to be adapted as done in Fig. 34.36:
• Each time one of the pumps fails, variable NF is incremented by 1 thanks to the assertions !!NF = NF + 1, which are updated when transitions PF1, PF2 and PF3 are fired.
• Each time one of the pumps is repaired, variable NF is decremented by 1 thanks to the assertions !!NF = NF − 1, which are updated when transitions EoR1, EoR2 and EoR3 are fired.
• In addition, each time one of the pumps is repaired, one token is removed from place SP (spare parts) when transitions EoR1, EoR2 and EoR3 are fired.

Fig. 34.36 Failure/repair process of the overall service station with no repairs during nights (P3 open 24/24) and spare part and mobilisation procedures

Note that, when place SP is empty, transitions EoR1, EoR2 and EoR3 are inhibited until a new spare part has been provisioned, and the repair team is demobilised in the meantime (see above).
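The interplay between the predicate on NF, the assertions updating it and the spare part stock can be sketched with a small state holder. This is only an illustration under simplifying assumptions: the class `RepairLogistics` and its method names are hypothetical, the 1 h mobilisation delay, the 168 h provisioning delay and the night inhibition are ignored, and only the guards and updates described above are reproduced.

```python
from dataclasses import dataclass


@dataclass
class RepairLogistics:
    NF: int = 0               # variable NF: number of faulty pumps
    SP: int = 3               # tokens in place SP (available spare parts)
    mobilised: bool = False   # False while the token sits in place nMb

    def pump_failure(self):
        """Assertion !!NF = NF + 1 attached to a transition PFi."""
        self.NF += 1

    def end_of_repair(self):
        """Transition EoRi: !!NF = NF - 1 and one token removed from SP."""
        if self.SP == 0:
            raise RuntimeError("EoR cannot fire without a spare part")
        self.NF -= 1
        self.SP -= 1

    def mob_enabled(self):
        """Mob needs the predicate ??NF > 0 and at least one spare part."""
        return (not self.mobilised) and self.NF > 0 and self.SP >= 1

    def demob_enabled(self):
        """dMob1 (no faulty pump left) or dMob2 (no spare part left)."""
        return self.mobilised and (self.NF == 0 or self.SP == 0)


station = RepairLogistics()
station.pump_failure()
if station.mob_enabled():
    station.mobilised = True      # Mob then TtL fire, the team reaches RT
station.end_of_repair()           # the pump is repaired with one spare part
```

After this sequence the predicate of dMob1 is satisfied (NF is back to 0), so the team can be demobilised while one spare part is on order.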

34.2.3.12

Exercise 33.12—Monte Carlo Simulation with One Pump Closed at Night

The aim is to restrict Exercise 33.10 to the case where only one pump is available, open during the daytime and closed at night, and to perform a Monte Carlo simulation to calculate how many cars refuel their tanks over 1 month and how many sales are lost due to a too long queue at the service station entrance and to the closing at night.

The first step is to draw the PN related to the arrival process for a service station which is open during the daytime and closed at night. This has been done in Fig. 34.37 on the basis of the sub-PN presented in Fig. 34.31, modified by introducing new places and transitions:
• Places NTd and NTn count how many cars arrive at the station during the daytime and at night, respectively.
• Place VCd models the visiting cars during the daytime.
• Place NCd counts how many cars refuel their tanks during the daytime.
• Place LSd (lost sales during day) counts the number of cars abandoning due to a too long queue or to the closure of the service station.
• Transition PC (pump closure) transforms into lost sales the cars waiting in Qin at the end of the day, when the pump closes for the night.

Fig. 34.37 Car arrival process with a single pump closed at night (left: car arrival process; right: day/night process)

This sub-PN works as in the previous case: Qin is limited to 10 tokens (10 cars in the entrance queue). Beyond this value, the cars abandon and the fuel sale is lost. The day/night process is similar to the one in Fig. 34.31 and is presented as a reminder.

The second step is to draw the PN related to a single fuel pump open during the daytime and closed at night under the same assumptions as in Exercise 33.10. Figure 34.38 models such a single pump operating during the day and closed at night:
• The failure/repair process for a pump closed at night is modelled on the right-hand side of this figure and is similar to the sub-PN shown on the left-hand side of Fig. 34.33.

Fig. 34.38 PN model of a single pump closed at night (left: refuelling and cash processes; right: failure/repair process)

• The refuelling process for a pump closed at night is modelled on the left-hand side of this figure and it combines the sub-PNs proposed in Figs. 34.27 and 34.32 for pump P1 or P2.
• The payment process at the cash is similar to the sub-PN developed in Fig. 34.21, with the queue limited to 4.

The next step, before launching the Monte Carlo simulation proper, is to use, when available, the stepper (i.e. the possibility to animate the simulated model by hand), which is often a feature of Monte Carlo simulation software packages. This provides a good opportunity to verify that the simulated model behaves as expected. The Petri net module of GRIF-Workshop (2020) provides such a stepper, which makes it possible to play with the PN by triggering the transitions by hand and seeing how the marking is modified. This can be used, for example, to verify that the queuing processes (Fig. 34.37) and the refuelling and failure/repair processes (Fig. 34.38) work correctly, in particular with regard to the tuning of the priority of conflicting transitions.

The final step is to proceed to the Monte Carlo simulation itself: simulating 1000 histories over one month with GRIF-Petri with the input data provided in Tables 34.7 to 34.9 leads to the results related to the marking of the various places given in Table 34.12.

Table 34.12 Single pump closed at night: place marking over 1000 Monte Carlo simulations

Place   Sojourn time (h)   Std dev.   Average token number   Std dev.   Token number at end of history   Std dev.
TC                720.00       0.00                   1.00       0.00                             1.00       0.00
NTd               719.97       0.04                5571.16      48.05                         10912.64      95.53
VC                  0.00       0.00                   0.00       0.00                             0.00       0.00
LSd               718.91       0.67                1983.84      91.33                          3876.44     160.65
NCd               719.97       0.04                3587.31      82.95                          7036.20     136.52
NTn               707.86       0.17                1336.40      29.32                          2704.33      46.69
Qin               350.80       1.48                   3.92       0.05                             0.00       0.00
Day               360.00       0.00                   0.50       0.00                             1.00       0.00
Night             360.00       0.00                   0.50       0.00                             0.00       0.00
PU                708.93      13.90                   0.98       0.02                             0.98       0.14
PW                  0.00       0.00                   0.00       0.00                             0.00       0.00
PR                 11.07      13.90                   0.02       0.02                             0.02       0.14
PB                351.46       6.40                   0.49       0.01                             0.00       0.00
QaP                 0.04       0.03                   0.00       0.00                             0.00       0.00
Qout               58.19       1.34                   0.10       0.00                             0.00       0.00
Wout              719.90       0.14                3582.81      82.95                          7036.20     136.52

Therefore, according to these results and over 1 month (720 h):
• 10,913 cars have tried to refuel at the station during the daytime (place NTd);
• Among them, 3876 have given up because the entrance queue was full (10 cars) or because the station closed (place LSd);
• 7036 cars have joined the queue to refuel their tanks (place NCd);
• All of these 7036 cars have paid at the cash (place Wout);
• Because 720 h corresponds to the beginning of a new day, no cars are waiting in the entrance queue (Qin), refuelling (PB) or waiting in the cash queue (Qout) at the end of the history.

Of course, none of the 2704 cars attempting to refuel at night (NTn) succeeded because the station was closed. Therefore, during the daytime, only 64.5% of the cars wanting to refuel their tanks at this station have succeeded and 35.5% of the potential sales have been lost due to a too long queue at the entrance or because night arrived while they were waiting. Of course, all the potential sales have been lost at night. These results are not very good but could obviously be improved by opening the station at night (see Sect. 34.2.3.14, Exercise 33.14) or by increasing the number of fuel pumps (unfortunately, the corresponding model is too large to be handled by the free demo version of GRIF and is therefore not proposed here).

In addition, the above results show that:
• For about 351 h, there was at least one car waiting in the entrance queue (Qin);
• The average length of the entrance queue has been about 4 cars;
• For about 58 h, there was at least one car waiting in the exit queue (Qout);
• The average length of the exit queue has been about 0.1 car;
• The pump has been available (PU) for about 709 h;
• The pump has been busy (PB) for about 351 h;
• etc.

As the pump is stopped at night, it has been available for 709 − 360 = 349 h during the daytime. This leads to an average availability of 97% over the 360 h of daytime, which is lower than the theoretical value of μ/(λ + μ) = 98.4% obtained when repairs start immediately. This is due to the fact that some repairs wait for the whole night to be started or completed. It has to be noted that the transition results give an average firing number of 1.45 for the pump failure (transition PF). This has to be compared to the 10,913 + 2704 = 13,617 cars (tokens in NTd + tokens in NTn) wanting to refuel their tanks which have been simulated over the same period of 1 month. This highlights the difficulties discussed in Exercise 33.4 when frequent and rare events are mixed within the same Monte Carlo simulation: here, 9391 car arrivals are simulated for each pump failure observed.
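The percentages and ratios quoted for this exercise can be recomputed directly from the end-of-history markings of Table 34.12. The short script below is only a cross-check; the variable names are illustrative and the inputs are the table values and the firing number of PF given above.

```python
# End-of-history markings and sojourn time from Table 34.12.
ntd, lsd, ncd, ntn = 10912.64, 3876.44, 7036.20, 2704.33
pu_sojourn_h = 708.93        # time spent with a token in PU
night_h = day_h = 360.0      # 12 h per day over 30 days
pf_firings = 1.45            # average firing number of transition PF

served_ratio = ncd / ntd                      # share of daytime cars served
lost_ratio = lsd / ntd                        # share of daytime sales lost
day_availability = (pu_sojourn_h - night_h) / day_h
cars_per_failure = (ntd + ntn) / pf_firings   # simulated arrivals per failure

print(f"{served_ratio:.1%} served, {lost_ratio:.1%} lost")
print(f"daytime availability {day_availability:.1%}")
print(f"{cars_per_failure:.0f} car arrivals per pump failure")
```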

34.2.3.13

Exercise 33.13—Evolution of the Entrance and Exit Queues (Station Closed at Night)

The aim is to extend the Monte Carlo simulation developed in Exercise 33.12 to obtain the curves related to the evolution of the number of cars in the queues at the entrance and exit of the service station which is closed at night.

The station being closed at night, the number of cars in the entrance queue (Qin) and in the exit queue (Qout) is equal to 0 at night and, therefore, it is also equal to 0 at the beginning of every day. To perform the exercise, it is then sufficient to consider what happens over the interval of time [0, 12 h] because the evolution will be the same over [24 h, 36 h], [48 h, 60 h], etc. Using the GRIF-Workshop (2020) software package, the variable #Qin (respectively #Qout) gives the number of tokens in place Qin (respectively in place Qout) at any time. The evolution of the queues is then given by the curves of these variables as time elapses. The results obtained after 1000 simulations of the service station over the interval [0, 12 h] by steps of 0.2 h (12 min) are presented in Fig. 34.39.

The evolution of Qin is presented on the left-hand side of the figure: the average number of tokens in this queue rises quickly toward a value between 7 and 8. The average value over the interval [0, 12 h] is equal to 7.3 due to the transient period which starts from 0. As the number of tokens is equal to 0 during the 12 h of the night, this leads to an average value of 7.3/2 = 3.65 over [0, 24 h]. This can be compared to the average result of 3.92 calculated over [0, 720 h].

The evolution of Qout is presented on the right-hand side of the figure: the average number of tokens in this queue rises quickly toward a value between 0.17 and 0.23. The average value over the interval [0, 12 h] is equal to 0.195 due to the transient period which starts from 0 (see Fig. 34.39 on the right-hand side).

Fig. 34.39 Evolution of Qin and Qout during the daytime of the station closed at night (left panel: length of Qin; right panel: length of Qout; horizontal axes: time in hours; each panel also shows the average over [0, 12 h])

Contrary to the number of cars in Qin , the number of cars in Qout does not drop immediately to 0 when night falls because the cars already in the station complete their refuelling before paying at the cash and going out. As the number of tokens is practically equal to 0 during the 12 h of the night, this leads to an average value of 0.195/2 = 0.098 over [0, 24 h]. This can be compared to the average result of 0.1 calculated over [0, 720 h]. Therefore, after one day, the process has already almost converged to a steady state.

34.2.3.14

Exercise 33.14—Monte Carlo Simulation with One Pump Open Night and Day

The aim is to extend Exercise 33.12 to the case where the single pump is open night and day (24/24) and to perform a Monte Carlo simulation in order to calculate how many cars refuel their tanks over 1 month and how many sales are lost due to a too long queue at the service station entrance.

Again, the first step is to draw the PN related to the arrival process for a service station which is open night and day but with different car arrival rates. This has been done in Fig. 34.40 on the basis of the sub-PN presented in Fig. 34.37, extended in order to model the arrival process at night:
• The mechanism emptying the entrance queue when night falls has become useless as the same queue (Qin) is shared between night and day.
• The rest of the mechanisms developed for the day have just been duplicated to model the arrivals at night.

In the same way, the sub-PN in Fig. 34.38 has to be adapted for a pump open night and day. This is done in Fig. 34.41:

Fig. 34.40 Car arrival process with a single pump open 24/24 (left: car arrival process 24/24; right: day/night process)

Fig. 34.41 PN model of a single pump open 24/24 (left: refuelling and cash processes; right: failure/repair process)

• Transition SoRF (start of refuelling) is no longer inhibited at night.
• Transition PF (pump failure) is no longer inhibited at night.

The remaining part of the sub-PN is not modified and, in particular, the repair team is still unavailable to perform repairs at night (transitions SoR and EoR).

As in Exercise 33.12, the next step before launching the Monte Carlo simulation proper is to use, when available, the stepper which is often a feature of Monte Carlo simulation software packages. This provides a good opportunity to verify that the simulated model behaves as expected. GRIF-Petri provides such a stepper, which makes it possible to play with the PN by triggering the transitions by hand and seeing how the marking is modified. This can be used, for example, to verify that the queuing processes (Fig. 34.40) and the refuelling and failure/repair processes (Fig. 34.41) work correctly, in particular with regard to the tuning of the priority of conflicting transitions.

The final step is to proceed to the Monte Carlo simulation itself: simulating 1000 histories over one month with GRIF-Petri with the input data provided in Tables 34.7 to 34.9 leads to the results related to the marking of the various places given in Table 34.13.
Table 34.13 Single pump open 24/24: place marking over 1000 Monte Carlo simulations

Place   Sojourn time (h)   Std dev.   Average token number   Std dev.   Token number at end of history   Std dev.
TC                720.00       0.00                   1.00       0.00                             1.00       0.00
NTd               719.97       0.04                5540.28      51.52                         10895.40      89.72
VCd                 0.00       0.00                   0.00       0.00                             0.00       0.00
LSd               718.91       0.67                1889.52     133.73                          3717.38     227.18
NCd               719.97       0.04                3650.77     120.47                          7178.02     201.02
VCn                 0.00       0.00                   0.00       0.00                             0.00       0.00
NTn               707.86       0.17                1344.40      32.34                          2717.71      57.08
LSn               626.84      83.43                  54.81      50.92                           109.94      87.90
NCn               707.36       3.85                1289.59      60.76                          2607.77     103.62
Qin               430.66      11.16                   4.36       0.19                             0.85       2.46
Day               360.00       0.00                   0.50       0.00                             1.00       0.00
Night2            360.00       0.00                   0.50       0.00                             0.00       0.00
PU                693.63      20.07                   0.96       0.03                             0.93       0.25
PW                  8.57       8.23                   0.01       0.01                             0.00       0.00
PR                 17.81      16.74                   0.02       0.02                             0.07       0.25
PB                486.54      13.77                   0.68       0.02                             0.39       0.49
QaP                 0.05       0.03                   0.00       0.00                             0.00       0.00
Qout               81.16       2.37                   0.13       0.00                             0.05       0.23
Wout              719.90       0.14                4935.19     163.00                          9784.49     271.70

Therefore, according to these results and over 1 month (720 h):
• 10,895 cars have tried to refuel at the station during the daytime (place NTd);
• Among them, 3717 have given up because the entrance queue was full (10 cars) (place LSd);
• 7178 cars have joined the queue to refuel their tanks during the daytime (place NCd);
• 2718 cars have tried to refuel at the station at night (place NTn);
• Among them, 110 have given up because the entrance queue was full (10 cars) (place LSn);
• 2607.8 cars have joined the queue to refuel their tanks at night (place NCn);
• 7178 + 2607.8 = 9785.8 cars (NCd + NCn) have entered the service station and among them 9784.49 have paid at the cash (place Wout).

Therefore, during the daytime, only 65.9% of the cars wanting to refuel their tanks at this station have succeeded and 34.1% of the potential sales have been lost due to a too long queue at the entrance. This is slightly better than when the station is closed at night because the cars waiting when night falls are not lost as in Exercise 33.12.

In addition, the above results show that:
• For about 431 h, there was at least one car waiting in the entrance queue (Qin);
• The average length of the entrance queue has been about 4.4 cars;
• For about 81 h, there was at least one car waiting in the exit queue (Qout);
• The average length of the exit queue has been about 0.13 cars;
• The pump has been available (PU) for about 694 h;
• The pump has been busy (PB) for about 487 h;
• etc.

As the pump is not stopped at night, it has been available 693.63 h during the 720 h of the simulation. This leads to an average availability of 96.3% which, again, is lower than the theoretical value of μ/(λ + μ) = 98.4% when repair starts immediately. This is due to the fact that some repairs wait for the whole night to be started or completed.
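As a worked check of the two availability figures, the steady-state availability of a single repairable item with constant failure and repair rates can be written out explicitly. The numerical values of the mean time to failure (240 h, i.e. the 10 days mentioned in Exercise 33.4) and of the mean time to repair (4 h) are assumptions used here because they reproduce the quoted 98.4%; the actual data are those of Table 34.8.

```latex
A_{\text{theoretical}} = \frac{\mu}{\lambda + \mu}
  = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
  = \frac{240}{240 + 4} \approx 0.984,
\qquad
A_{\text{simulated}} = \frac{693.63}{720} \approx 0.963
```

The difference between the two values corresponds to the time spent in places PW and PR beyond the repair time itself, i.e. to the nights during which repairs cannot start or progress.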


It should be noted that the transition results give an average firing number of 2.86 for the pump failures (transition PF). This has to be compared to the 10,895 + 2718 = 13,613 cars (tokens in NTd + tokens in NTn) wanting to refuel which have been simulated over the same period of 1 month. This highlights the difficulties discussed in Exercise 33.4 when frequent and rare events are mixed within the same Monte Carlo simulation: here, 4761 car simulations are needed to observe 1 pump failure.

34.2.3.15

Exercise 33.15—Evolution of the Entrance and Exit Queues (Station Open 24/24)

The aim is to extend the Monte Carlo simulation developed in Exercise 33.14 to obtain the curves showing the evolution of the number of cars in the queues at the entrance and exit of the service station. As the station is open night and day, the numbers of cars in the entrance queue (Qin) and in the exit queue (Qout) are not equal to 0 at night. To perform the exercise, it is therefore necessary to consider what happens over a whole interval of time [0, 24 h]. As the queues are not necessarily empty at the beginning of a new day, it would not be representative to analyse only what happens during the first day. A day far enough from the origin has to be chosen in order to let the Markovian processes modelled within the PN reach a steady state: leaving 15 days to converge seems long enough, so the calculations are performed over the interval [360, 384 h]. Again, when using the GRIF-Workshop (2020) software package, the length of the queues Qin and Qout is given by the variables #Qin and #Qout, which give the number of tokens in places Qin and Qout at any time. The evolutions of these variables obtained after 1000 simulations of the service station over the interval [360, 384 h] by steps of 0.2 h (12 min) are presented in Fig. 34.42.
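The observation grid and the averages over the day and night windows can be sketched as follows. This is only a minimal illustration: the function names are hypothetical, and averaging the sampled curve points is an assumption about how such averages may be obtained, not a description of what GRIF-Workshop does internally.

```python
STEPS_PER_HOUR = 5  # a step of 0.2 h, i.e. 12 min


def sample_times(start_h, end_h, steps_per_hour=STEPS_PER_HOUR):
    """Observation times from start_h to end_h (integers, both ends included)."""
    n_steps = (end_h - start_h) * steps_per_hour
    return [start_h + k / steps_per_hour for k in range(n_steps + 1)]


def window_average(times, values, lo_h, hi_h):
    """Mean of the sampled values whose observation time lies in [lo_h, hi_h)."""
    selected = [v for t, v in zip(times, values) if lo_h <= t < hi_h]
    return sum(selected) / len(selected)


times = sample_times(360, 384)
print(len(times))  # 121 observation points over [360, 384 h]
```

Given a list of average queue lengths sampled at these times, `window_average(times, values, 360, 372)` would give the daytime average and `window_average(times, values, 372, 384)` the night average.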

Fig. 34.42 Evolution of Qin and Qout during the daytime of the station open 24/24 (left panel: length of Qin; right panel: length of Qout; horizontal axes in hours; each panel shows the averages over day, over night and over day + night)


The evolution of Qin is presented on the left-hand side of the figure: the average number of tokens in this queue rises quickly toward a value around 8. The average value over the interval [360, 372 h] is equal to 7.9. It is slightly higher than the 7.3 found when the station is closed at night because the queue is not empty at the beginning of a new day. When night falls, the size of Qin drops quickly to a minimum value of 0.35 (once the cars that arrived at the end of the day have refuelled) before rising to about 0.8 at the end of the night. The average value over the interval [372, 384 h] is equal to 0.77. Therefore, on average, the queue is 10 times shorter at night than by day. This is normal as the refuelling process lasts at least 3.5 min per car (3 min refuelling + 30 s at the cash desk): this is 1.75 times the mean time between car arrivals during the daytime (every 2 min) and 0.44 times the mean time between car arrivals at night (every 8 min). Consequently, not all arriving cars can be refuelled during the daytime (so the probability of having cars waiting in the entrance queue is very high) while there is enough time at night (so the probability of having cars waiting in the entrance queue is very low). Over the whole day, [360, 384 h], the average value is equal to 4.32, which can be compared to the value of 4.36 found in Table 34.13.

The evolution of Qout is presented on the right-hand side of the figure: the average number of tokens in this queue rises quickly toward a value between 0.1 and 0.3. The average value over the interval [360, 372 h] is equal to 0.185. At night, the number of cars in Qout drops toward a value between about 0.04 and 0.1. The average value over the interval [372, 384 h] is equal to 0.07. Over the whole day, [360, 384 h], the average value is equal to 0.128, which can be compared to the value of 0.13 found in Table 34.13.
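The ratios and whole-day averages quoted above can be cross-checked with a few lines of arithmetic. The sketch below only reuses figures stated in the text; the variable names are illustrative, and taking the plain mean of the day and night averages assumes that both windows last 12 h.

```python
# Minimum service time per car: 3 min refuelling + 30 s at the cash desk.
service_min = 3.0 + 0.5

day_interarrival_min = 2.0    # one car every 2 min during the daytime
night_interarrival_min = 8.0  # one car every 8 min at night

day_ratio = service_min / day_interarrival_min      # 1.75: the pump cannot keep up
night_ratio = service_min / night_interarrival_min  # 0.4375: the pump has spare time

# Day and night windows have the same length (12 h each), so the whole-day
# average is the plain mean of the two window averages.
qin_whole_day = (7.9 + 0.77) / 2     # 4.335, close to the 4.32 read on the curve
qout_whole_day = (0.185 + 0.07) / 2  # 0.1275, close to the 0.128 read on the curve

print(day_ratio, night_ratio, qin_whole_day, qout_whole_day)
```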

Reference

GRIF-Workshop (2020) Markov and Petri2 modules. Funded and developed by Total, http://grif-workshop.fr/. Accessed Aug 2020

Part V

Production Availability and Functional Safety (SIL) Modelling and Calculations

Chapter 35

Production Availability Related Modelling and Calculations

35.1 Characteristics of Production Systems

35.1.1 Size and Complexity of the Systems

Production systems are of any size and of several levels of complexity, as for example:

• Internal reconfiguration of the systems upon failure, with subsequent multiple production levels, or restrictions on the use of some systems (e.g. caused by safety or environmental requirements) creating multiple relationships between the items.
• The limitations on the number and skills of the maintenance teams, restricted availability of spare parts and logistic support, etc. create additional relationships between the items.

For the purpose of this book, the production systems are classified as:

• Basic systems. Full series systems are typical of such systems. Indeed, whatever the number of items, the availability of such systems can be easily modelled (Chaps. 22, 31 and 33). Other basic systems are systems with few redundancies and no constraints on operation policy, maintenance policy and resources.
• Small size and low complexity systems. The main constraints arise from the operation policy, maintenance policy and limitations not due to the size of the systems. Accordingly, these low complexity systems are e.g. systems with a high number of repair teams, no spare parts limitations and no constraints on system use.
• Real industrial production systems: they are usually of large size (more than 100 items) but their main characteristic is the number of constraints on the production policy and maintenance/logistic policy, as well as drastic limitations on the available means. They are qualified as high complexity systems in this chapter.

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_35


35.1.2 Multistate and Multiphase Systems

As explained in Sect. 6.1, real industrial production systems are often multistate systems: their production levels are not only the basic 100% and 0% production levels. According to their level of redundancy, more production levels (66%, 50%, etc.) are to be identified and considered in the production availability model. In addition:

• The capacity of some equipment can be boosted (as long as needed or for some limited duration). If this increase is 10%, the 50% production level becomes 55%, adding to the complexity of the system.
• Sometimes, due to external constraints, the production must be limited for some time (e.g. the end user does not need the maximum production¹).

Real production systems are also multiphase systems:

• Preventive maintenance activities (see Sect. 35.2.3) are not always performed by shutting down the production system: preventive maintenance on redundant equipment (e.g. on two redundant pumps) is carried out on the standby equipment while the other one is running. Thus, this phase of the system is to be modelled.
• The inputs to the production system and the requested outputs may vary with time. These constraints cause the configuration of the system to be modified (equipment switched off-line or equipment switched on-line), which means new phases to be modelled.
• Due to weather conditions, a system can be maintained during some periods and not during other ones (e.g. in spring, summer and autumn but not in winter).

35.1.3 Multiple Product Systems

There are often several streams within a processing plant and, as such, several outputs to consider for the production availability modelling. For an oil and gas producing plant, the typical outputs are:

• Liquid oil flow and hydrocarbon gas flow;
• Amount of water produced and amount of water re-injected within the reservoir.

For some systems, the production of products of degraded quality level is also to be considered (“off-specification product” in the process industry). These products are either sold as they are or re-processed later on. Again, for the purpose of providing a figure as accurate as possible (see Sect. 35.3.1) for the economic calculations, these downgraded products are to be identified and their production level assessed.

¹ ISO 20815 (2018) gives the definition of production performance measures other than production availability. It is considered in this book that “production availability” includes all these definitions.


35.1.4 Multiple Information Sources

The information on how a high complexity production system is operated and maintained comes from several departments:

• Production for information on production management and on any additional protection to be considered for equipment working in harsh environment.
• Design for the definition of possible system reconfigurations and associated outputs, etc.
• Maintenance for the planning and organization of planned and unplanned maintenance activities (number of repair teams, working hours, etc.).
• Logistics for definition of the conditions for use of heavy logistics supports and for the ordering and storage of spare parts, etc.
• Safety for assessment of the effects on production of testing safety systems and for definition of the specific safety measures to be implemented in the case of work in hazardous areas (e.g. H2S gas).
• Marine if the production system is (totally or partially) located offshore.

These flows of information come in different forms (engineering drawings, sketches, spreadsheets, oral descriptions, etc.), not necessarily fully compatible with each other. Therefore, automated model building for production availability assessment seems highly unlikely, unlike automated fault tree building (Chap. 28). Furthermore, several questions are raised in the course of the production availability modelling. Most of the time, the reliability engineer needs the collaboration of the operations engineers, maintenance engineers, logistic engineers, etc. The collection and management of these flows of information is always difficult.

35.2 Classification of Failure and Restoration Events

35.2.1 Failure Events

Failures considered in production availability studies are generally self-revealed failures (see Chap. 4). They can be classified according to their severity (ISO 14224 2006):

• Critical failures: failures causing the shutdown of the equipment, either automatically or upon action of the operator.
• Degraded failures: failures causing a decrease (instantaneously or soon) in the capacity of the equipment (partial plugging of heat and tube exchanger, reduced rotating speed of pump, etc.) or a decrease in safety (e.g. which can oblige to shut down the production system).


• Incipient failures: imperfection in the state or condition of the equipment, without impact on the function of the equipment.

Critical failures can furthermore be classified according to their impact on the production:

• The failed equipment is provided with a full redundancy: no impact on the production. The repair team is mobilised and the repair is initiated when all maintenance means are available.
• The failed equipment is not provided with a redundancy: as the production is decreased (even equal to 0), the maintenance team is mobilised and the repair made as soon as possible.

Critical failures can also be classified according to the type of repair means to be used:

• The failed equipment can be repaired with available standard spare parts.
• The failed equipment is not repaired but replaced by a new one (use of “expendables”).
• The repair of the failed equipment requires the use of costly spare parts. As expensive spare parts (“insurance spare parts”) are often not readily available, the impact of these failures (“breakdown”) on the production or on the capital expenses and operating expenses is high.

Critical failures can also be classified according to the urgency of their repair:

• First priority repairs: failures which could jeopardize human life or the environment and failures causing production losses.
• Second priority repairs: mainly cases where the production system can be re-arranged so as to prevent a production loss or where the failed equipment is provided with a redundancy (see above).

Some other cases can also occur, such as second priority repairs to be stopped if a first priority repair occurs. Then several failure modes are to be considered for a single equipment.

Safety equipment is considered with regard to:

• spurious failures which can cause a production decrease or even shutdown;
• planned testing as some tests require part of the system to be shut down.

In some cases, a failure detected by the test also requires the system to be stopped during the repair of the faulty item.

Common cause failures related to redundant items should not be forgotten, for example:

• Loss of common source of energy (electricity, fuel, fuel gas);
• Loss of command-control (e.g. loss of instrument air);
• Spurious failure of a common safety system;
• Loss of a common lubricating system;
• Bad quality of fuel (e.g. polluted with water, mud);
• Pump cavitation due to insufficient upstream pressure;
• Fire or flooding;
• Human errors (e.g. confusion about which machine to stop).

35.2.2 Restoration Events

35.2.2.1 Repair Times

As explained in Sect. 4.3.3, only the active repair time is intrinsic to the item itself. For onshore items it is, most of the time, the greater part of the restoration time.

35.2.2.2 Restoration Times

Ramp-up and run-down times (see Figs. 1.1 and 1.2 of ISO 20815 2018)

The ramp-up (start-up) period of an item is the time needed to (re)start the unit from 0 to 100% of its nominal capacity. During this ramp-up period, the increase of the capacity can be:

• Linear from 0 to 100%,
• Null for some time, then linear.

Sometimes this increase is even more complex. In case of repair on e.g. cold Liquefied Natural Gas (LNG) equipment, the item must be warmed up before the work can start. After the repair intervention, the item concerned must be cooled down before restarting, which takes more than 20 h. The time to restart the production may also depend on the duration of the shutdown (e.g. xenon poisoning for nuclear power plants). The production (even at a degraded level) during these ramp-up and run-down times is not always negligible and, as such, is to be considered in the modelling of the production availability.

Additional safety times

Before initiating any task on any item containing radioactive, flammable or toxic products, safety measures are to be implemented: item isolation, item purging, etc. In case of not permanently manned systems, the time to detect the origin of the failure also has to be added. These times are often greater than the repair times.

Logistic times

For items very difficult to access (subsea units, nuclear power plants), the logistic delays for mobilising maintenance teams, maintenance means and spare parts quickly become significant. These times² are heavily dependent on the maintenance policy of the production system.

Administrative delays

Administrative delays (e.g. authorization to get an item from the warehouse) can, in some cases, be high.
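To see why the production during a ramp-up is not negligible, the ramp-up profile "null for some time, then linear" can be sketched as a simple function of time. The function names and the durations used in the example call are hypothetical; they are illustrative values, not data from a real item.

```python
def ramp_up_output(t_h, dead_time_h, ramp_h):
    """Fraction of nominal capacity t_h hours after the restart order.

    The capacity is null during dead_time_h, then rises linearly to 100%
    over ramp_h hours (set dead_time_h to 0 for a purely linear ramp-up).
    """
    if t_h <= dead_time_h:
        return 0.0
    return min(1.0, (t_h - dead_time_h) / ramp_h)


def equivalent_lost_hours(dead_time_h, ramp_h):
    """Full-capacity hours lost during the ramp-up.

    The whole dead time is lost, plus half of the linear ramp
    (the area of the triangle above the ramp).
    """
    return dead_time_h + 0.5 * ramp_h


# Hypothetical durations: 4 h without production, then a 10 h linear ramp.
print(ramp_up_output(9.0, 4.0, 10.0))    # 0.5 of the nominal capacity
print(equivalent_lost_hours(4.0, 10.0))  # 9.0 equivalent full-capacity hours
```

In this sketch, a 14 h ramp-up period costs the equivalent of 9 h of full production, so treating the item as either fully down or fully up during that period would misestimate the production.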

35.2.3 Planned Maintenance

Planned maintenance is obviously also to be considered as it is one of the main causes of production losses. The consequences of the planned maintenance activity are of two types:

• Full shutdown of the production system;
• Shutdown of one of the fully redundant production trains (e.g. 2 × 100%) or equipment causing, or not, a production loss. If the duration of this shutdown is significant, the probability of failure of the still running train/equipment is not negligible and is to be included in the calculations.

These planned maintenance tasks are performed at planned dates and for constant durations: they cannot be modelled as random variables. Some of these tasks are often delayed if they may cause a production stop (e.g. the planned preventive maintenance task on a redundant pump is delayed if the other one is failed).
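The probability that the still running train fails while its twin is shut down for planned maintenance can be estimated with a short calculation. The sketch below assumes an exponentially distributed time to failure, which is an assumption of this example; the function name and the numerical values are hypothetical.

```python
import math


def failure_probability(failure_rate_per_h, exposure_h):
    """Probability of at least one failure during exposure_h hours,
    assuming a constant failure rate (exponential law)."""
    # expm1 keeps precision when the product rate * time is very small.
    return -math.expm1(-failure_rate_per_h * exposure_h)


# Illustrative values: failure rate of 5e-6 per hour, three-day shutdown (72 h).
p = failure_probability(5e-6, 72.0)
print(f"{p:.2e}")
```

For such small values the result is close to the product of the failure rate and the shutdown duration; it grows with the duration of the planned maintenance task.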

35.3 Characteristics of Production Availability Studies

35.3.1 Economic Calculations

35.3.1.1 Need for Best Estimates

When probabilistic calculations are related to safety, pessimistic assumptions (on the magnitude of the reliability data, on the value of the common cause failure parameters, etc.) can be made in order to be sure to be on the “safe side” and obtain conservative results. Such an approach has no meaning for production availability studies, as their aim is to determine the economic optimum between revenues and expenditures of the production system:

• The revenues are provided by the production system, as it was designed and as it is maintained and used.
• The expenditures are the investment for the design/building of the system (most of them incurred before the start-up of the system) and the operating expenses (recurrent costs for the running and maintenance of the system). They may also include possible taxes on production depending on the country where the system is operated.

A production availability study provides estimations of e.g. the yearly production of the production system. Then, the balance can be made between revenues and expenses. Sometimes not all the input data needed are available. So, if assumptions are to be made for these input data, they are to be as close as possible to reality, i.e. best estimates, instead of conservative estimates, are to be made whenever necessary. Performing production availability studies requires best estimates.

² Fault detection times are generally nearly negligible for running items of production systems.

35.3.1.2 Need for Accurate Estimates for the Early Years

Coarsely written, the (operational) cash flow (CF) of the production system for year i is:

$$CF_i = Q_i \times SP_i - OPEX_i - CAPEX_i - TX_i$$

where:

• $Q_i$ is the anticipated quantity of items produced on year i (or the system yearly flowrate);
• $SP_i$ is the anticipated sale price of an item (or of e.g. one cubic meter) on year i;
• $OPEX_i$ is the anticipated operating expenditures on year i;
• $CAPEX_i$ is the anticipated capital expenditures on year i;
• $TX_i$ are the taxes on production on year i.

Obviously:

• At t = 0 (before the start-up of the system): $Q_0 = 0$, $OPEX_0 = 0$ and $TX_0 = 0$.
• $CAPEX_0$ is the highest CAPEX as it is the initial investment.

Then, for the purpose of assessing the economic interest of a project at e.g. t = 0, the future cash flows are to be discounted back to their present value. This is done using the discount rate r which is a measure of the preference for the present. It is expressed in %. One of the economic criteria used for selecting a project is the discounted cash flow (DCF). The DCF at t = 0, over n years of production, is the sum of the discounted annual cash flows:

$$DCF = \sum_{i=1}^{n} \frac{CF_i}{(1 + r)^i}$$

This formula shows that the first years of production are given a heavier weight, so it is vital to accurately assess the Qi for the first years (including the early-life period which can extend up to two years). As an example, if the discount rate (including the inflation rate) is taken as equal to 10%, the discounted cash flow for year 1 is around 50% greater than the discounted cash flow for year 5. At the end of the life of the installation, the dismantling operations and the restoration and remediation of degraded soils may have an impact on the economic estimation and should be considered as well.
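The weight given to the early years can be checked numerically. The sketch below implements the cash flow and DCF definitions as stated; the function names are illustrative, and the comparison between year 1 and year 5 assumes equal undiscounted cash flows for both years.

```python
def cash_flow(quantity, sale_price, opex, capex, taxes):
    """Operational cash flow of one year: Q * SP - OPEX - CAPEX - TX."""
    return quantity * sale_price - opex - capex - taxes


def discount_factor(year, rate):
    """Weight applied to the cash flow of a given year (year >= 1)."""
    return 1.0 / (1.0 + rate) ** year


def discounted_cash_flow(cash_flows, rate):
    """Sum of CF_i / (1 + r)**i, where cash_flows[0] is the cash flow of year 1."""
    return sum(cf * discount_factor(i, rate)
               for i, cf in enumerate(cash_flows, start=1))


rate = 0.10
# Ratio between the weights of year 1 and year 5: (1 + r)**4
ratio = discount_factor(1, rate) / discount_factor(5, rate)
print(round(ratio, 4))  # 1.4641
```

With a 10% discount rate, one unit of cash flow in year 1 therefore weighs about 46% more than the same unit in year 5, which is the order of magnitude of the "around 50%" quoted in the text.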

35.3.2 Rare Events

Some events deserve a specific treatment: the so-called rare events (Leroy 2018). These are events with a very low frequency of occurrence and catastrophic consequences on the production. Typically, they result from accidents: fire or explosion, load drop, etc. They need to be considered differently from the frequently occurring events which are part of the daily running of the facilities. The reason is that, because of their very low frequency, these events have almost no effect on the mathematical expectation of the actual output production. However, if they actually occur, they have a big impact on the production of the related system, and the contribution to production shortfalls, safety or environment could be very significant. In the estimation of the global production availability, the contribution of rare/catastrophic events could be averaged out, but this would give a low and unrepresentative contribution.

35.4 Case Study for Comparison of Production Availability Models

The authors of this book have performed dozens of production availability studies. From this database they have selected a production system which exhibits most of the unique characteristics described in the previous paragraphs. For the purpose of allowing the production availability of the system to be modelled within the content of this chapter, the architecture of this system had to be heavily simplified.


35.4.1 Description of the Production System

Plant architecture

The production system is shown in Fig. 35.1. It is an oil and gas plant receiving the flow from near-by wells and exporting gas on one side and a mix of oil and water (named oil hereafter) on the other side. Only the first two years will be considered for the purpose of this case study. The plant is made up of:

• An overpressure protection system, a safety instrumented system (SIS) at the inlet made up of a pressure sensor high high (PSHH) ordering a shutdown valve (SDV) to close if the incoming pressure is too high;
• A separation unit delivering gas to the compression unit on one side and oil to the oil export unit on the other side. This separation unit is made up of:
  – The main separator SEP1 able to handle 90% of the nominal incoming flow. The plant is shut down if SEP1 is down;
  – The test separator SEP2 able to handle 10% of the nominal incoming flow.
  Each separator is provided with:
  – A pressure transmitter (PT) controlling the outgoing gas flow with a pressure control valve (PCV);
  – A level transmitter (LT) controlling the outgoing oil flow with a level control valve (LCV);
• A gas compression unit made up of two gas compressors K1 and K2, each of these compressors being able to handle 80% of the nominal incoming gas flow;
• An oil export pumping system made up of two fully redundant oil pumps P1 and P2 (one in duty, one on standby).

Fig. 35.1 Oil and gas process plant for the case study


Table 35.1 Reliability parameters of the items of the production system

Item | Failure mode | Failure parameter | Repair law | Repair parameter
PSHH | Spurious signal | λ = 2 × 10⁻⁷ h⁻¹ | UNI | 2–4 h
SDV | Spurious closure | λ = 2 × 10⁻⁶ h⁻¹ | EXP | 0.15 h⁻¹
SEP | Structural deficiency | λ = 2 × 10⁻⁶ h⁻¹ | EXP | 0.08 h⁻¹
PT | Failure to detect | λ = 7 × 10⁻⁷ h⁻¹ | UNI | 2–4 h
PCV | Failure to regulate | λ = 5 × 10⁻⁶ h⁻¹ | EXP | 0.2 h⁻¹
LT | Failure to detect | λ = 1 × 10⁻⁶ h⁻¹ | UNI | 3–5 h
LCV | Failure to regulate | λ = 5 × 10⁻⁶ h⁻¹ | EXP | 0.1 h⁻¹
K | Breakdown | λ = 5 × 10⁻⁶ h⁻¹ | UNI | 9–14 h
P | Failure while running | λ = 5 × 10⁻⁵ h⁻¹ | EXP | 0.03 h⁻¹
P | Failure to start | γ = 0.002 | EXP | 0.03 h⁻¹

Neither the electric motors powering the gas compressors and the pumps nor the logic units are considered in the analysis.

Reliability data

The reliability characteristics of the items are given in Table 35.1 (EXP = exponential distribution, UNI = uniform distribution). All times to failure are assumed to be exponentially distributed. The failure modes have been chosen to have an impact on the production of oil, gas or both. It is assumed that the repair rates of the failure modes “failure while running” and “failure to start upon demand” of the pumps are the same, although they differ in actual life.

Repair strategy

A preventive maintenance (PM) task lasting three days is performed on each of the compressors in turn by the end of each year. This PM task can be performed during a repair operation. The gas production drops to 80% only but the oil production is kept at 100% by adjusting the incoming flow from the wells. However, this assumption is valid only during the PM period and not within the period of normal operation. The repair on the separators cannot be initiated before 15 h (safety delay, constant duration) and the repaired unit is back on-line (recommissioning, etc.) 12 h (constant duration) after repair completion. There are:

• One maintenance crew for instrumentation and valves;
• One maintenance crew for static equipment (separators);
• One maintenance crew for rotating machines (compressors and pumps).

However, the PM is performed by a dedicated maintenance team (if the compressor is not already under repair).


The spare part needed to repair the breakdown of a compressor is provisioned in two days (constant duration). These two days come in addition to the repair time given in Table 35.1. A new spare part is ordered immediately for a further repair and is available on site 180 days later. A new spare part is ordered only as soon as the spare part present on site is mobilised to repair one compressor failure.
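The repair strategy above means that the mean restoration time of an item can be much longer than its mean repair time. A minimal sketch for a separator and a compressor follows; the repair laws are those read from Table 35.1 (the reading of the compressor row as UNI 9–14 h is an assumption of this example), the names are hypothetical, and the compressor case assumes that a spare part is present on site.

```python
def mean_repair_h(law, parameter):
    """Mean active repair time in hours for the two laws used in Table 35.1."""
    if law == "EXP":   # parameter is a repair rate in h^-1
        return 1.0 / parameter
    if law == "UNI":   # parameter is a (min, max) pair in hours
        low, high = parameter
        return (low + high) / 2.0
    raise ValueError(f"unknown repair law: {law}")


# Repair laws as read from Table 35.1 (assumed reading for K).
REPAIR_LAWS = {"SEP": ("EXP", 0.08), "K": ("UNI", (9.0, 14.0))}

SEP_SAFETY_DELAY_H = 15.0      # repair cannot start before 15 h
SEP_RECOMMISSIONING_H = 12.0   # back on-line 12 h after repair completion
K_SPARE_PROVISIONING_H = 48.0  # spare part provisioned in two days

sep_restoration_h = (SEP_SAFETY_DELAY_H
                     + mean_repair_h(*REPAIR_LAWS["SEP"])
                     + SEP_RECOMMISSIONING_H)
k_restoration_h = K_SPARE_PROVISIONING_H + mean_repair_h(*REPAIR_LAWS["K"])

print(sep_restoration_h)  # 39.5 h, of which only 12.5 h is active repair
print(k_restoration_h)    # 59.5 h when a spare part is on site
```

This sketch ignores the waiting time for a maintenance crew and the 180-day delay incurred when no spare part is on site, both of which lengthen the restoration further.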

35.4.2 Modelling with Flow Diagrams

In Fig. 35.2, the production system has been split into several functions having the same impact with regards to the oil or gas production:

• Ops: overpressure protection system (Fig. 35.3).
• Sepa1: main separation (Fig. 35.4).
• Sepa2: test separation (Fig. 35.5).
• K1 and K2: gas compressors (Fig. 35.6).
• P1 and P2: export pumps (Fig. 35.7).

The splitting of the system proposed in Fig. 35.2 makes it possible to identify blocks which can be used to build flow diagrams modelling both gas production and oil production.

Fig. 35.2 Oil and gas process plant divided into parts useful to build flow diagrams

Fig. 35.3 Ops composite block


Fig. 35.4 Sepa1 composite block

Fig. 35.5 Sepa2 composite block

Fig. 35.6 K1 and K2 blocks

Fig. 35.7 P1 and P2 blocks

The flow diagram related to gas production is illustrated in Fig. 35.8. The architecture of this flow diagram is close to the architecture of the system itself. The idea is to calculate the flow at the output of one block as the minimum between the flow at its input and the production capacity of the subsystems related to this block (see Sect. 33.9).

Fig. 35.8 Flow diagram related to gas production


Fig. 35.9 Flow diagram related to oil production

For example, if the input of Sepa1 is equal to 100% and Sepa1 is in up state, the output of Sepa1 will be equal to 90%. If Sepa1 is in down state, then its output is equal to 0. In the same way, on the right-hand side, the flow at the output of the compressors has been limited to 100%, which is the maximum possible production.

The flow diagram related to oil production is illustrated in Fig. 35.9. The principle is exactly the same as for gas production. At the output, on the right-hand side, the oil production has been limited by the available gas production because environmental regulations forbid flaring the gas in normal production conditions. It has to be noted that, according to the repair strategy described above, this assumption does not hold during the preventive maintenance of the compressors. In this case, the oil production is limited to 100%.
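The "minimum of input flow and block capacity" principle of the two flow diagrams can be sketched as a small function. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the separators are assumed to pass the same fraction of gas and oil, and the gas path is assumed not to depend on the export pumps.

```python
def production_levels(up, pm_on_compressor=False):
    """Return (gas %, oil %) for a given set of block states.

    up maps each block name of Fig. 35.2 to True (up) or False (down).
    pm_on_compressor is True while a compressor is under preventive maintenance.
    """
    inlet = 100.0 if up["Ops"] else 0.0

    # The plant is shut down if the main separator is down.
    if not up["Sepa1"]:
        return 0.0, 0.0

    # Sepa1 handles 90% of the nominal flow, Sepa2 the remaining 10%.
    separation_capacity = 90.0 + (10.0 if up["Sepa2"] else 0.0)
    separated = min(inlet, separation_capacity)

    # Each compressor handles 80%; the output is capped at 100%.
    compression_capacity = 80.0 * (int(up["K1"]) + int(up["K2"]))
    gas = min(separated, compression_capacity, 100.0)

    # Fully redundant pumps: one available pump is enough.
    pumping_capacity = 100.0 if (up["P1"] or up["P2"]) else 0.0
    oil = min(separated, pumping_capacity)

    # No flaring in normal operation: oil is limited by gas, except during PM.
    if not pm_on_compressor:
        oil = min(oil, gas)
    return gas, oil


ALL_UP = {name: True for name in ("Ops", "Sepa1", "Sepa2", "K1", "K2", "P1", "P2")}
print(production_levels(ALL_UP))                      # (100.0, 100.0)
print(production_levels({**ALL_UP, "Sepa2": False}))  # (90.0, 90.0)
print(production_levels({**ALL_UP, "K2": False}))     # (80.0, 80.0)
```

Calling the function with `pm_on_compressor=True` and one compressor down returns 80% for gas and 100% for oil, which reflects the exception described for the preventive maintenance period.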

35.4.3 Modelling with Reliability Block Diagrams

35.4.3.1 Building the RBD Model

Flow diagrams and reliability block diagrams are very similar models. The difference is that in a flow diagram the flows may have several values whereas in a reliability block diagram, the logic flow has only two values, 0 and 1. In addition, in the case study, each of the items has only two states. Therefore, the RBD approach should be practicable provided that the configurations leading to the various production levels are considered. Figure 35.10 is derived from Fig. 35.9 and models the configuration for an oil production at a level of 100% without regard to gas compression availability. This is a classical reliability block diagram except for the pumps, which are operated in standby redundancy (i.e. one running and one in standby position). To perform the calculation, it is necessary to introduce the composite block shown in dotted line in the figure. The probability of functioning of this composite block could be calculated, for example, by using a Markov graph (see Chap. 31). Therefore, no theoretical problem arises in calculating the average production achieved at a level of 100%.


Fig. 35.10 Simplified reliability block diagram related to oil production at 100% level

Fig. 35.11 Simplified reliability block diagram related to oil production at 90% level

Let us now look at the production at 90% when Sepa2 is failed. This leads to the configuration modelled by the RBD illustrated in Fig. 35.11 where an inverted block (see Chap. 15) has been introduced to model the down state of Sepa2. Then, again there is no theoretical problem except to manage both a composite block and an inverted block. As Figs. 35.10 and 35.11 describe disjoint configurations, the productions related to each of the RBDs can be added, which leads to:

$$\mathrm{Prd}_{\mathrm{Oil}} = 100\% \cdot \bar{A}_{100\%} + 90\% \cdot \bar{A}_{90\%} \qquad (35.1)$$

where $\bar{A}_{100\%}$ is the average availability related to the RBD in Fig. 35.10 and $\bar{A}_{90\%}$ is the average availability related to the RBD in Fig. 35.11. In the above calculations, the impact of the gas compression failure is not considered. Again, the flow diagrams developed above make it possible to identify various configurations taking the gas production into account with regards to the oil production. The RBD in Fig. 35.12 models the system configuration leading to a 100% oil production level. For that, both compressors K1 and K2 have to be in up state.


Fig. 35.12 Completed reliability block diagram related to oil production at 100% level

Fig. 35.13 Completed reliability block diagram related to oil production at 90% level

In the same way, the RBD in Fig. 35.13 models the 90% oil production level. Again, both compressors K1 and K2 have to be in up state. The last case to analyse occurs when one of the compressors is in down state: in this case (and except during the preventive maintenance period), the oil production drops to 80%. This is modelled in Fig. 35.14 for the case where K2 is in down state. Of course, similar configurations must be considered when it is K1 that is in down state.

Fig. 35.14 Completed reliability block diagram related to oil production at 80% level


The above modelling technique requires identifying disjoint configurations (i.e. disjoint tie sets) with their associated production levels and then building as many RBDs as there are configurations. Therefore, it is not possible to model the production availability of a production system with a single RBD. In the case of the plant architecture over the two-year period analysed above, at least 4 different configurations have to be considered (as K1 and K2 are similar).
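Once the disjoint configurations have been identified, the average production is the sum of the production levels weighted by the average availabilities of the corresponding RBDs, as in Eq. (35.1). The sketch below illustrates this sum; the function name is hypothetical and the availability values are placeholders chosen for the example, not results of the case study.

```python
def expected_production(configurations):
    """Average production (in %) over disjoint configurations.

    configurations is a list of (production level in %, average availability).
    Because the configurations are disjoint, their availabilities cannot
    add up to more than 1.
    """
    total_probability = sum(availability for _, availability in configurations)
    if total_probability > 1.0 + 1e-12:
        raise ValueError("availabilities of disjoint configurations exceed 1")
    return sum(level * availability for level, availability in configurations)


# Placeholder availabilities (hypothetical): 100% level, 90% level (Sepa2 down),
# 80% level with K2 down, 80% level with K1 down.
configurations = [(100.0, 0.95), (90.0, 0.01), (80.0, 0.015), (80.0, 0.015)]
print(round(expected_production(configurations), 2))  # 98.3
```

The check on the sum of the availabilities is a cheap way of detecting configurations that were not made disjoint, for example when an inverted block has been forgotten.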

35.4.3.2 RBD with Regards to Repair Strategy

According to the description of the system under study, several assumptions about the repair strategy should be taken into consideration to calculate the production availability:
• Number of repair teams: As explained in Sect. 15.1, an RBD is a static model, implying that repairs of repaired blocks are independent, which means that each of the items has its own repair team. Therefore, the limited number of repair teams cannot be considered in an RBD model.
• Limited number of spare parts: For the same reason as above, the limited number of spare parts (breakdown of a compressor) cannot be considered in an RBD model. Nevertheless, an approximation would be to consider that the repair time of the broken compressor is the sum of the repair time and of the 5 days logistic time.
• Additional safety times: For the same reason as above, the additional safety times cannot be considered in an RBD model. However, the same approximation as above could be made.

35.4.3.3 Conclusion About RBD Modelling

Except in very simple cases, the calculation of production availability by using RBDs is not possible with a single RBD and without approximations related, in particular, to the repair strategy. Calculations are made difficult by the presence of composite blocks (e.g. pumps operated in standby redundancy in the example above) and of inverted blocks (needed to make the configurations disjoint from each other). Identifying the relevant disjoint configurations related to the various production levels becomes more and more difficult as the number of components of the production system increases. Therefore, RBDs should definitely not be used for assessing the production availability of production systems.


35.4.4 Modelling with Markov Graphs

35.4.4.1 Introduction to Markov Modelling

The production system presented in Fig. 35.1 comprises 16 binary components. A quick assessment potentially leads to $2^{16}$ = 65,536 different states and, in addition, several phases due to the preventive maintenance of the compressors have to be taken into consideration. Even if some state aggregation can be done (see Sect. 35.4.4.5) because some components are similar, the number of states related to this example is too large to be presented here. Therefore, the use of Markov graphs (Chap. 31) for modelling the production availability of the production system described in Sect. 35.4.1 is explained step by step in the next subsections, but the Markov graphs are not actually drawn.

35.4.4.2 Preventive Maintenance Considerations

As described in Sect. 35.4.1, preventive maintenance is carried out every year on the compressors and this implies different phases of functioning of the production system. This can easily be modelled by implementing multiphase Markov graphs (Sect. 31.5.4). Taking the PM into account implies 3 phases per year: no PM, PM on K1, PM on K2. These phases being recurring, this leads to a sequence of six Markov graphs (one for each phase) over two years: one for Year 1/no PM, one for Year 1/PM on K1, one for Year 1/PM on K2, one for Year 2/no PM, one for Year 2/PM on K1, one for Year 2/PM on K2.
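The six-phase sequence can be enumerated with a short Python sketch. The function names are hypothetical, and the phase durations (359 days without PM, then 3 days of PM on each compressor) are an assumption taken from the PM schedule used later in the Petri net of Fig. 35.20.

```python
def pm_phase_sequence(years=2, no_pm_days=359, pm_days=3):
    """Return the ordered list of (phase label, duration in days)."""
    phases = []
    for year in range(1, years + 1):
        phases.append((f"Year {year}/no PM", no_pm_days))
        phases.append((f"Year {year}/PM on K1", pm_days))
        phases.append((f"Year {year}/PM on K2", pm_days))
    return phases


def phase_at(t_days, phases):
    """Return the label of the phase active at time t_days (phase start inclusive)."""
    elapsed = 0
    for label, duration in phases:
        elapsed += duration
        if t_days < elapsed:
            return label
    raise ValueError("time is beyond the modelled period")


phases = pm_phase_sequence()
```

In a multiphase Markov calculation, each label would select the Markov graph to be used over the corresponding duration, the state probabilities at the end of one phase being the initial conditions of the next one.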

35.4.4.3 Markov Graph with Regards to Reliability Data

The transition rates in homogeneous Markov processes being constant (Sect. 31.1.2), the non-exponentially distributed reliability parameters in Table 35.1 must be converted to equivalent constant transition rates. As an example, the times to repair distributed according to a uniform distribution [3–5 h], with an average of 4 h, become a constant repair rate with the value μ = 1/(4 h) = 0.25 h⁻¹.
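The conversion used in this example amounts to taking the inverse of the mean duration. A minimal Python sketch follows; the function name is hypothetical.

```python
def equivalent_rate_from_uniform(low_h, high_h):
    """Constant rate (per hour) with the same mean as a uniform law [low_h, high_h]."""
    if not 0 <= low_h < high_h:
        raise ValueError("expected 0 <= low_h < high_h")
    mean_h = (low_h + high_h) / 2.0
    return 1.0 / mean_h


repair_rate = equivalent_rate_from_uniform(3.0, 5.0)
```

This substitution preserves only the mean of the distribution: an exponential law with rate 0.25 h⁻¹ allows repair times shorter than 3 h or longer than 5 h, which the original uniform law excludes.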

35.4.4.4 Markov Graph with Regards to Repair Strategy

Contrary to the RBDs, the repair strategy can be rather easily modelled by using Markov graphs:
• Repair priority: As the number of repair teams is limited, a repair priority can be assigned (see Sect. 31.1.1), then:
– A failed compressor is repaired before a failed pump.
– Any failed item on the separation unit 2 is repaired after any other failed item.
• Other characteristics: All other characteristics given in Sect. 35.4.1 under the “Repair strategy” heading can be input in the Markov graphs (provided all parameters become exponentially distributed).

35.4.4.5 Size of the Markov Graph

Often, the size of Markov graphs can be reduced using the technique of state aggregation (Sect. 31.6.1). It has to be noted that this technique can only be used with great care for the system under study. Let us consider the macro-component Sepa1. Assuming all reliability parameters (see Sect. 35.4.3.2) are exponentially distributed:
• SEP1 is to be considered as a single item as its repair team is the “static equipment” one.
• LT1, LCV1, PT1, PCV1 can be grouped into a single macro-component (with two states: up and down) as their repair team is the “instrumentation and valves” one.
Then, the 10 states of Sepa1 are reduced to 4 states. This technique has been applied to Sepa1, Sepa2, the set of the two compressors and the set of the two pumps by one of the authors, and the size of the Markov graph has dropped to 208 states for each phase. Of course, this is much smaller than 65,536 states but it is still quite a large Markov graph to be built by hand.

35.4.4.6 Conclusion About Markov Graph Modelling

Once the Markov graph is built, formula 31.90 in Chap. 31 can be used to assess the oil production availability and the gas production availability. Therefore, the main problems in using Markov graphs to calculate production availability are not different from the problems encountered for classical availability calculations:
• The number of states to consider and the multiple links between them.
• The need for considering constant transition rates only.
It has to be noted that computation engines are generally able to handle thousands of states, or even millions for MOCA-RP implemented in the GRIF-Markov module (GRIF-Workshop 2020). The real problem is therefore the building of the graph rather than its size. Except in the case of small systems, this is not really manageable without computer-aided Markov graph building, which can be based on PNs (see Chap. 33) or on a formal language (Brameret et al. 2015).

35.4.5 Modelling with Petri Nets

35.4.5.1 Introduction to Petri Net Modelling

There are several methods for building a Petri net. In this section, two methods are described:
• Direct building of the Petri net: Sect. 35.4.5.2.
• Use of flow diagram (FD)-driven Petri nets: Sect. 35.4.5.3.

35.4.5.2 Method of Direct Building of Petri Nets

This method has been successfully used by one of the authors for dozens of production availability studies. It is a direct implementation of the definitions and principles provided in Chap. 33.

Petri Net Modelling

When using the direct building of Petri nets method, the modelling of the production availability of a system (based on Leroy 2018) consists in:
• The identification of the items, or groups of items, causing the system to be down upon failure. A Boolean variable named PLANT (“true” at t = 0) is switched to “false” upon such failures. Variable PLANT is used as an assertion or predicate (see Sect. 33.5.2) in the Petri nets to prevent an item from failing if the system is down (and to prevent endless loops from occurring).
• The identification of the items, or groups of items, having the same impact with regard to the production capacity. The following groups of items are easily identified:
– The overpressure protection system and the oil export pumping system have the same capacity (100% or 0%). They are then named “common”. The real variable PROD_common (with two values: 0% or 100%) is then associated with the Petri nets modelling the failure-to-repair cycles of these systems.
– The main separation unit. The real variable PROD_SEP1 (with two values: 0% and 90%) is then associated with the Petri nets modelling the failure-to-repair cycle of this unit.
– The test separation unit. The real variable PROD_SEP2 (with two values: 0% and 10%) is then associated with the Petri nets modelling the failure-to-repair cycle of this unit.
– The gas compressors. The real variable PROD_K1 (with two values: 0% and 90%) is associated with the Petri net modelling the failure-to-repair cycle of K1 and the real variable PROD_K2 (with two values: 0% and 90%) is associated with the Petri net modelling the failure-to-repair cycle of K2.
• The definition of the Boolean variables used to model the limited number of repair crews (“true” if the repair crew is available):
– REPinst_Av: instrumentation and valve repair team available (“true” at t = 0).
– REPstat_Av: static equipment repair team available (“true” at t = 0).
– REProt_Av: rotating machine repair team available (“true” at t = 0).
• The definition of the Boolean variables used to model the switch of the compressors to their maintenance period (“true” if K1 or K2 is under preventive maintenance):
– PM_K1: K1 is under PM (“false” at t = 0).
– PM_K2: K2 is under PM (“false” at t = 0).
• The definition of the Boolean variables used to model the availability of the compressor spare parts:
– SPARE_K_Mob for the mobilisation of the spare part (“false” at t = 0).
– SPARE_K_Av for the availability of the spare part. This variable is set to “false” at t = 0 as the spare part becomes available only after being mobilised.
• The Petri net modelling of the failure-to-repair cycle of each item of the system:
– Places represent the status of the item: on standby, running, failed (and waiting for repair), under repair, under preventive maintenance.
– Transitions represent events modifying the status of the item: call on duty, failure, start of repair (repair team available), repair completed by repair team.

Such Petri nets are given below. The sub-PN on the left-hand side of Fig. 35.15 is a typical Petri net modelling the behaviour of a single item:
• At t = 0 the PSHH is “on” (one token in place “PSHH on”).
• If the plant is not down (Boolean variable PLANT is “true”), the PSHH can fail (with a failure rate λpshh) and the plant is down (Boolean variable PLANT becomes “false”). Then the production capacity of the “common” drops to 0%: the real variable PROD_common is switched to 0. The token is “moved” (see footnote 3) to place “PSHH failed”.
• If the repair team for the instrumentation and valves is available (REPinst_Av is “true”), then the token is moved to place “PSHH under rep”. As the repair team is mobilised, variable REPinst_Av becomes “false”.
• The PSHH can be repaired according to the uniform law [2 h–4 h]. When the repair is completed:
– The repair team for the instrumentation and valves becomes available (variable REPinst_Av becomes “true”).
– The plant is now running: PLANT becomes “true”.
– The production capacity of the “common” is back to normal, i.e. 100%: variable PROD_common is switched to 100.
– The token is moved to place “PSHH on”.

Footnote 3: The token in “PSHH on” seems to move to “PSHH failed” when the transition is fired but, in fact, one token is removed from “PSHH on” and one token is added into “PSHH failed”. Nevertheless, as this simplifies the writing and helps the understanding of the PN behaviour, the term “is moved” is going to be used as a shortcut with this meaning (remove and add tokens) in the following part of the text.

Fig. 35.15 Petri net for the overpressure protection system

The sub-PN on the right-hand side of Fig. 35.15 is built on the same principle. The sub-PNs on the left-hand side and in the middle of Fig. 35.16 are built on the same principle as the sub-PNs of Fig. 35.15. The sub-PN on the right-hand side is more complex as the repair of the main separator cannot be initiated before 15 h and, upon repair, the separator is back on-line (recommissioning, etc.) only 12 h after repair completion:
• At t = 0 the main separator SEP1 is “on” (one token in place “SEP1 on”).
• If the plant is not down (variable PLANT is “true”), SEP1 can fail (with a failure rate λsep) and the plant is down (variable PLANT becomes “false”). Then the production capacity of the main separation unit drops to 0%: variable PROD_SEP1 is switched to 0. The token is moved to place “SEP1 failed”.
• After 15 h, the main separator can be repaired and the token is moved to place “SEP1 ready rep”.


Fig. 35.16 Petri net for the main separation unit

• If the repair team for the static equipment is available (variable REPstat_Av is “true”), then the token is moved to place “SEP1 under rep”. As the repair team is mobilised, variable REPstat_Av becomes “false”.
• The main separator can be repaired according to an exponential law of parameter µsep. When the repair is completed:
– The repair team for the static equipment becomes available (variable REPstat_Av becomes “true”).
– The token is moved to place “SEP1 ready on”.
• After 12 h, the main separator is back on line:
– The plant is now running: variable PLANT becomes “true”.
– The production capacity of the main separation unit is back to normal, i.e. 90%: variable PROD_SEP1 is switched to 90.
– The token is moved to place “SEP1 on”.

The sub-PNs of Fig. 35.17 are built on the same principle as the sub-PNs of Fig. 35.15 with a slight difference: the plant is not down if the test separator is failed. The items of the test separation unit SEP2 can therefore fail only if both the plant and the test separation unit are on: Boolean variables PLANT and SEP2_Av have to be “true” for any item of SEP2 to fail. Upon failure of any of the items:
• The test separation unit is down (variable SEP2_Av becomes “false”) but the plant keeps running.
• The production capacity of the test separation unit drops to 0%: variable PROD_SEP2 is switched to 0.
Upon repair of the failed item:
• The test separation unit is “on” (variable SEP2_Av becomes “true”).
• The production capacity of the test separation unit is back to normal, i.e. 10%: variable PROD_SEP2 is switched to 10.

Fig. 35.17 Petri net for the test separation unit

Fig. 35.18 Petri net for the oil export pumping system

Although the oil pumping system could have been modelled in a similar way to the previous Petri nets, a more compact model was selected (see Fig. 35.18):


• At t = 0 the two pumps are on (one running, one in standby): one token in place “2Pi on”.
• If the plant is not down (variable PLANT is “true”), the running pump can fail (with a failure rate λp) and the token is moved to place “Pi failed”.
• Immediately, the pump on standby tries to start. This is modelled with a probabilistic switch (see Sect. 33.6.3) with two outputs:
– The start-up may fail with the probability of failure on demand γp and the token is moved to place “P1, P2 trans” (“trans” for transitory as the residence time in this place is 0). The Dirac transition is immediately fired, the plant is now down (PLANT becomes “false”), variable PROD_common is switched to 0 and the token is moved to place “P1, P2 failed”.
– The start-up may succeed with the complementary probability of success on demand (1 − γp) and the token is moved to place “Pi failed, Pj on”.
• There are two possibilities from place “Pi failed, Pj on”:
– The running pump may fail (failure rate λp) while waiting for the rotating machine repair team: the token is moved to place “P1, P2 failed”, the plant is now down (PLANT becomes “false”) and variable PROD_common is switched to 0.
– The rotating machine repair team is available (REProt_Av is “true”) and the token is moved to place “Pi rep, Pj on” (variable REProt_Av does not become “false” as there is still one pump to repair).
• From place “P1, P2 failed”, if the rotating machine repair team is available (REProt_Av is “true”), the token is moved to place “Pi failed, Pj ready rep” and variable REProt_Av becomes “false”.
• From place “Pi failed, Pj ready rep”, the pump ready for repair can be repaired according to an exponential law of parameter μp. When the repair is completed, the token is moved to place “Pi failed, Pj on”, variable REProt_Av becomes “true” and the production capacity of the “common” is back to 100% (variable PROD_common is switched to 100).
• From place “Pi rep, Pj on” there are two possibilities:
– The running pump may fail (with parameter λp) if the plant is not down (variable PLANT is “true”). Then the plant is down (variable PLANT becomes “false”) and the production of the “common” drops to 0 (variable PROD_common is set to 0).
– The failed pump may be repaired (with parameter μp) as long as the rotating machine repair team is available (REProt_Av is “true”). The Petri net shows that the repair is stopped if the repair team is called to repair another item causing a production loss (e.g. if any compressor is failed). The transition is with memory in order not to forget the time already spent on the repair before it was stopped. Upon repair, the token is moved to place “2Pi on”.
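The probabilistic switch on the standby start-up described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any Petri net tool: the function name and the returned dictionary are hypothetical, and only the place names and variables come from the text.

```python
import random


def standby_start(gamma_p, rng):
    """Fire the probabilistic switch once the running pump has failed.

    Returns the place reached and the variables updated by the firing.
    """
    if not 0.0 <= gamma_p <= 1.0:
        raise ValueError("gamma_p must be a probability")
    if rng.random() < gamma_p:
        # Failure on demand: the token passes through the transitory place
        # "P1, P2 trans" (zero residence time) and the plant goes down.
        return {"place": "P1, P2 failed", "PLANT": False, "PROD_common": 0.0}
    # Successful start-up: the standby pump takes over, the plant keeps running.
    return {"place": "Pi failed, Pj on", "PLANT": True, "PROD_common": 100.0}


rng = random.Random(42)  # fixed seed so that a run can be reproduced
outcome = standby_start(0.1, rng)
```

Passing the random generator explicitly keeps the sketch reproducible, which is convenient when a Monte Carlo simulation has to be replayed.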


Fig. 35.19 Petri net for the gas compression unit

The sub-PNs modelling the failure-to-repair cycle of the two compressors are similar, so only the one of K1 is described (sub-PN on the left-hand side of Fig. 35.19):
• At t = 0 K1 is “on”: one token in place “K1 on”.
• If the plant is not down (variable PLANT is “true”), K1 can fail (with a failure rate λK) and the token is moved to place “K1 failed”. The production of K1 drops to 0 (real variable PROD_K1 is set to 0), the spare part is mobilised (Boolean variable SPARE_K_Mob becomes “true”) and K1 is no longer available (Boolean variable K1_Av becomes “false”). The use of variable K1_Av is explained below.
• If the spare part is available (variable SPARE_K_Av is “true”), the token is moved to place “K1 ready rep” and variable SPARE_K_Av becomes “false”.
• After 2 days, K1 is ready to be repaired (as the spare part is on site) and the token is moved to place “K1 wait rep”.
• If the rotating machine repair team is available (variable REProt_Av is “true”), the token is moved to place “K1 rep” and variable REProt_Av becomes “false”.
• The repair is performed according to a uniform law of parameters [9 h–14 h] and the token is moved to place “K1 on” while variables REProt_Av and K1_Av become “true”, variable PROD_K1 being set equal to 80.
• If variable PM_K1 becomes “true” (see Fig. 35.20) and if there is a token in place “K1 on”, the transition is fired and the token is moved to place “K1 PM”, variable K1_Av becoming “false”. When variable PM_K1 becomes “false”, the token is moved to place “K1 on” and variable K1_Av becomes “true”.
It should be noted that this Petri net takes into account the assumption that the PM can be performed (totally or partially) during a repair period.


Fig. 35.20 Petri net for the PM period of the compressors (places: no PM, PM K1, PM K2; delays: δ = 359 days, δ = 3 days, δ = 3 days)

The sub-PN on the right-hand side of Fig. 35.19 models the effect of the failure of both compressors on plant availability:
• At t = 0 at least one compressor is “on”: one token in place “At least 1 K on”.
• If both compressors are failed (variables K1_Av and K2_Av are “false”), the token is moved to place “2 K down” and variable PLANT becomes “false”.
• If any of the compressors becomes available (variable K1_Av or variable K2_Av is “true”), the token is moved to place “At least 1 K on” and variable PLANT becomes “true”.

The Petri net in Fig. 35.20 models the PM schedule of the compressors:
• At t = 0 the token is in place “no PM”.
• After 359 days, variable PM_K1 becomes “true” and the token is moved to place “PM K1”.
• After 3 days (of PM on K1), variable PM_K1 becomes “false”, variable PM_K2 becomes “true” and the token is moved to place “PM K2”.
• After 3 days (of PM on K2), variable PM_K2 becomes “false” and the token is moved to place “no PM”.
• The cycle is repeated for Year 2.

The Petri net in Fig. 35.21 models the mobilisation of the spare parts for the compressors:
• At t = 0 the token is in place “SpareK on”.
• If one token is present in “SpareK on” when one compressor failure occurs (variable SPARE_K_Mob is “true”), then the spare part is immediately mobilised to repair this failure: the token is moved to place “SpareK order” and variable SPARE_K_Av becomes “true”. In addition, it can no longer be mobilised to repair another failure because variable SPARE_K_Mob becomes “false”.
• A new spare part is ordered and after 150 days it is available and the token is moved to place “SpareK on”.

Fig. 35.21 Petri net for the gas compressor spare parts

Calculating System Production Availability

The production availability of the system can then be calculated using the sojourn time in each place of the Petri nets, the values of each variable, etc., obtained by simulating the Petri nets provided above. The real variables defined for the purpose of calculating the production availability are calculated using Boolean variables and real variables defined on the Petri nets. They are as follows:
• Gas production availability:
PROD_gas = min(PROD_common, PROD_SEP, PROD_K_gas)
The gas production cannot be greater than that of any of the three groups of components already identified. With:
PROD_K_gas = min(100, PROD_K1_gas + PROD_K2_gas)
The gas production cannot be higher than 100% or than the sum of the productions of the two gas compressors. Considering that “true” = 1 and “false” = 0 and knowing that:
PROD_K1_gas = PROD_K1 × K1_Av
PROD_K2_gas = PROD_K2 × K2_Av
The gas production of a gas compressor is 0% if this compressor is failed.


• Oil production availability:
PROD_oil = min(PROD_common, PROD_SEP, PROD_K_oil)
The oil production cannot be greater than that of any of the three groups of components already identified. With:
PROD_K_oil = min(100, PROD_K1_oil + PROD_K2_oil)
The oil production cannot be higher than 100% or than the sum of the contributions of the two gas compressors to the oil production. Knowing that:
PROD_K1_oil = [PROD_K1 × (1 − PM_K1) + 100 × PM_K1] × K1_Av
PROD_K2_oil = [PROD_K2 × (1 − PM_K2) + 100 × PM_K2] × K2_Av
The contribution of a gas compressor to the oil production:
– Is 0% if the gas compressor is failed;
– Is PROD_K1 (respectively PROD_K2) outside its PM period and 100% within its PM period.

Conclusion About Direct PN Modelling Method

If the contents of Chap. 33 are well understood, the direct building of the Petri nets modelling the system production availability is quite straightforward. The main difficulty lies in the writing of the variables (equations) used to assess this production availability.
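The production equations above translate directly into code. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, PROD_SEP is taken as an input because the text does not state how it is obtained from PROD_SEP1 and PROD_SEP2, and the compressor capacity of 80 used in the example is the value to which PROD_K1 is set after repair.

```python
def prod_k_gas(prod_k1, k1_av, prod_k2, k2_av):
    """PROD_K_gas: capped sum of the gas productions of the two compressors."""
    return min(100.0, prod_k1 * k1_av + prod_k2 * k2_av)


def prod_k_oil(prod_k1, pm_k1, k1_av, prod_k2, pm_k2, k2_av):
    """PROD_K_oil: capped sum of the compressor contributions to oil production."""
    k1_oil = (prod_k1 * (1 - pm_k1) + 100.0 * pm_k1) * k1_av
    k2_oil = (prod_k2 * (1 - pm_k2) + 100.0 * pm_k2) * k2_av
    return min(100.0, k1_oil + k2_oil)


def prod_gas(prod_common, prod_sep, k_gas):
    """PROD_gas: the most limiting of the three groups."""
    return min(prod_common, prod_sep, k_gas)


def prod_oil(prod_common, prod_sep, k_oil):
    """PROD_oil: the most limiting of the three groups."""
    return min(prod_common, prod_sep, k_oil)


# Booleans act as 1/0, as in the text ("true" = 1, "false" = 0).
# Example: K2 failed, no PM in progress, everything else fully available.
oil_k2_failed = prod_oil(100.0, 100.0, prod_k_oil(80.0, False, True, 80.0, False, False))
```

In a simulation, these functions would be evaluated each time a variable changes, and the production availability would be obtained by averaging PROD_oil and PROD_gas over time and over the simulated histories.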

35.4.5.3 Use of FD-Driven Petri Nets

Principle

The description of the flow diagrams given in Sect. 35.4.2 is a good opportunity to develop an FD-driven Petri net model for calculating both the oil production availability and the gas production availability. The overall basic module proposed in Fig. 35.22 is related to a single item (e.g. a component) Ci and is the building block of this approach. This module is intended to be the container of a sub-PN modelling the behaviour of component Ci according to its failures and repairs and also to the information (predicates) received from outside. In turn, it sends information (assertions) to the outside and this information is used to calculate the production capacity at its output. More precisely:


Fig. 35.22 Basic module for FD-driven PN building

• Ci is a real variable representing the actual capacity of component Ci at a given instant.
• Repi is a Boolean variable representing the state of the repair team (available or not) allocated to Ci.
• Plant is a Boolean variable representing the state (up or down) of the whole plant.
• PMi is a Boolean variable indicating if preventive maintenance has to be undertaken on Ci.
• Sparei is a real variable indicating how many spare parts are available to repair Ci.
The above variables are used to simulate the sub-PN modelling the behaviour of component Ci. In addition, the following variables are used to calculate the oil and gas production at a given instant:
• Ci,In is a real variable representing the flow at the input of Ci.
• Ci,Out is a real variable representing the flow at the output of Ci.
As shown in Fig. 35.22, the production at the output is calculated as the minimum between the flow at the input and the internal capacity of the module. Nevertheless, this very simple basic model can be adapted when needed. It has to be noted that, if needed, the oil and gas capacities and flows could be modelled separately (e.g. Coi, Cgi, Coi,In, Cgi,In, Coi,Out, Cgi,Out).
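The output rule of the basic module, and its propagation through modules placed in series, can be sketched in Python. The function names and the capacity values are hypothetical and serve only to illustrate the rule Ci,Out = min(Ci,In, Ci).

```python
from functools import reduce


def module_output(flow_in, capacity):
    """Output flow of one basic module: minimum of input flow and own capacity."""
    return min(flow_in, capacity)


def series_output(flow_in, capacities):
    """Propagate a flow through modules in series, in the order given."""
    return reduce(module_output, capacities, flow_in)


# Hypothetical capacities (in %) of three modules in series fed with a 100% flow.
out_nominal = series_output(100.0, [100.0, 90.0, 100.0])
out_failed = series_output(100.0, [100.0, 0.0, 100.0])
```

A failed module has a capacity of 0, so that the flow downstream of it is 0 whatever the state of the other modules in the series.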

Basic Petri Net Modules for PSHH, SDV, LT1, LCV1, PT1 and PCV1

The sub-PN in Fig. 35.23 provides the general basis to develop the relevant sub-PN modelling the behaviour of PSHH, SDV, LT1, LCV1, PT1, PCV1.


Fig. 35.23 Sub-PN of type 1

Table 35.2 Link between the components parameters and the sub-PN of type 1 (Fig. 35.23)

Item    Fail     EoR       Ci         Repi      Plant
SDV     λSDV     μSDV      SDV_Av     REPinst   Plant
PSHH    λPSHH    [2–4 h]   PSHH_Av    REPinst   Plant
LT1     λLT      [3–5 h]   LT1_Av     REPinst   Plant
LCV1    λLCV     μLCV      LCV1_Av    REPinst   Plant
PT1     λPT      [2–4 h]   PT1_Av     REPinst   Plant
PCV1    λPCV     μPCV      PCV1_Av    REPinst   Plant

An item modelled with this sub-PN can fail only when the plant is running (?Plant) and the plant stops running (!−Plant) when it fails. Therefore, the failures of all other items sharing this model are inhibited. When the item is repaired, the plant runs again (!Plant) and the failures of the other items sharing this model are validated again. Therefore, the first item which fails inhibits the failure of all the others, which become unable to modify the value of the Boolean variable Plant. In addition, under the assumption that the items cannot fail when the plant is stopped, the failure events of the related items are suspended and this is why the transition Fail is a transition with memory (mem).
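The behaviour of a sub-PN of this type can be illustrated by a small Monte Carlo sketch in Python for one item considered alone. It is a simplified illustration and not the simulation engine of any Petri net tool: the function name, the failure rate and the horizon are hypothetical, and with a single item the repair team is always free, so the waiting time before the start of repair is zero.

```python
import random


def simulate_type1(failure_rate_per_h, repair_low_h, repair_high_h, horizon_h, seed=0):
    """Simulate one type-1 item alone; return its average capacity in percent."""
    rng = random.Random(seed)
    t = 0.0
    up_time = 0.0
    while t < horizon_h:
        # Place UP: Plant is true, so transition Fail is enabled (predicate ?Plant).
        time_to_failure = rng.expovariate(failure_rate_per_h)
        up_time += min(time_to_failure, horizon_h - t)
        t += time_to_failure
        if t >= horizon_h:
            break
        # Fail fired: capacity drops to 0 and Plant becomes false (assertion !-Plant).
        # Start of repair is immediate here because no other item uses the team.
        t += rng.uniform(repair_low_h, repair_high_h)
        # End of repair: capacity back to 100 and Plant becomes true (assertion !Plant).
    return 100.0 * up_time / horizon_h


# Hypothetical failure rate of 1e-3 per hour with a uniform repair law [2 h, 4 h].
average_capacity = simulate_type1(1e-3, 2.0, 4.0, 100_000.0, seed=1)
```

With these illustrative values the result is expected to be close to 100 × 1000/(1000 + 3), i.e. about 99.7%. Modelling several items sharing the variable Plant and the same repair team would require a common event list, which this sketch deliberately leaves out.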


Fig. 35.24 Application of the sub-PN of type 1 to PSHH

Table 35.2 makes the link between the parameters of the generic model drafted in Fig. 35.23 and the parameters of each item (i.e. components of the production system) which can be modelled by using this model. The sub-PN in Fig. 35.24 is the application of the generic model in Fig. 35.23 to the PSHH component:
• Three places: UP, Mob and Rep.
• Three transitions: Fail, SoR (start of repair) and EoR (end of repair).
• It can fail only when the plant is running (?Plant).
• When it fails (exponential distribution), variable PSHH_Av drops to 0, the plant stops running (!−Plant) and the component waits for the availability of the repair team REPinst.
• As soon as this repair team is available (?REPinst), the transition SoR is fired and the repair team becomes unavailable (!−REPinst) for other failed components.
• Then the repair begins (uniform distribution) and, when it is completed, variable PSHH_Av rises to 100, the repair team becomes available again (!REPinst), the plant is running again (!Plant) and the component comes back to the up state.

Basic Petri Net Modules for LT2, LCV2, PT2 and PCV2

The sub-PN in Fig. 35.25 provides the general basis to develop the relevant sub-PN modelling the behaviour of LT2, LCV2, PT2 and PCV2.


Fig. 35.25 Sub-PN of type 2

Table 35.3 Link between the components parameters and the sub-PN of type 2 (Fig. 35.25)

Item    Fail    EoR        Ci        Repi      Plant
LT2     λLT     [3–5 h]    SEP2_Av   REPinst   Plant
LCV2    λLCV    μLCV       SEP2_Av   REPinst   Plant
PT2     λPT     [2–4 h]    SEP2_Av   REPinst   Plant
PCV2    λPCV    μPCV       SEP2_Av   REPinst   Plant

It is very similar to the sub-PN of type 1. Again, an item modelled with this sub-PN can fail only when the plant is running (?Plant) but, when the failure occurs, it does not stop the plant (Fig. 35.25). Table 35.3 makes the link between the parameters of the generic model drafted in Fig. 35.25 and the parameters of each item which can be modelled by using this model.

Basic Petri Net Modules for SEP1 and SEP2

The sub-PN in Fig. 35.26 provides the general basis to develop the relevant sub-PN modelling the behaviour of SEP1 and SEP2. It works in the same way as the sub-PN in Fig. 35.23 except that deterministic delays have been added before mobilising the maintenance team (transition SoM, start of mobilisation) and after the active repair (transition EoAR, end of active repair) has been completed. It can be used to model both SEP1 and SEP2 but, when SEP2 fails, the plant is not stopped and the assertions (!−Plant) and (!Plant) have to be removed to model SEP2.

Fig. 35.26 Sub-PN of type 3

Table 35.4 Link between the components parameters and the sub-PN of type 3 (Fig. 35.26)

Item    Fail    SoM      EoAR    EoR      Coi        Cgi        Repi      Plant
SEP1    λSEP    δ1SEP    μSEP    δ2SEP    SEP1_Av    SEP1_Av    REPstat   Plant
SEP2    λSEP    δ1SEP    μSEP    δ2SEP    SEP2_Av    SEP2_Av    REPstat   Plant

Fig. 35.27 Application of the sub-PN of type 3 to SEP1

Table 35.4 makes the link between the parameters of the generic model drafted in Fig. 35.26 and the parameters of each item which can be modelled by using this model. The sub-PN in Fig. 35.27 is the application of the generic model in Fig. 35.26 to the SEP1 component. The sub-PN related to SEP2 is similar and is obtained by:
• Removing (!−Plant) from the transition Fail and (!Plant) from the transition EoR;
• Replacing !!SEP1_Av = 0 by !!SEP2_Av = 0;
• Replacing !!SEP1_Av = 90 by !!SEP2_Av = 10.

Basic Petri Net Modules for P1 and P2

The sub-PN in Fig. 35.28 is devoted to modelling the behaviour of the pumps operated in standby configuration. This sub-PN is similar to the one in Fig. 35.26 for the transitions Fail, SoR and EoR. The remaining part of the sub-PN works as follows:
• Transition SoSB (start of standby) on the top right-hand side models the standby operations: when the other pump fails (!!Pj_Av = 0), pump Pi is started with a probability 1 − γP and fails to start with a probability γP.
• Transition Aux1 drops the Pi capacity to 0 when it fails to start (note the double arrow).
• Transition Aux2 makes Pi run if Pj is failed.
• Transition Aux3 puts Pi in standby position if Pj is running.
In addition to the main part of the sub-PN, two small sub-PNs have been added on the right-hand side of the figure in order to manage the value of the Boolean variable Plant. It goes to 1 if at least one of the pumps is running and goes to 0 when both pumps are failed.


Fig. 35.28 Sub-PN of type 4 to model P1 and P2

The same sub-PN can be used to model the pump in standby position at t = 0, just by placing the token in place SB instead of place Up.

Basic Petri Net Modules for K1 and K2

The sub-PN in Fig. 35.29 is devoted to modelling the behaviour of the compressors, which are subject to preventive maintenance and which need spare parts to be repaired. This sub-PN is similar to the one in Fig. 35.26 for the transitions Fail, SoR and EoR. The remaining part of the sub-PN works as follows:
• Transition WoSP (wait of spare part) models the delay needed to obtain a spare part. This works in relationship with the sub-PN on the right-hand side of Fig. 35.30.
• The sub-PN in the middle of the figure models the effect of the PM: the capacity drops to 0 when it is performed.
• Transition Aux re-establishes the value of the compressor capacity after a PM has been completed. This works in relationship with the sub-PN on the left-hand side of Fig. 35.30.
Again, in addition to the main part of the sub-PN, two small sub-PNs have been added on the right-hand side of the figure in order to manage the value of the Boolean variable Plant. It goes to 1 if at least one of the compressors is running and goes to 0 when both compressors are failed.

Fig. 35.29 Sub-PN of type 5 to model K1 and K2

Fig. 35.30 Auxiliary sub-PN to model preventive maintenance and spare parts provisioning

The preventive maintenance of K1 and K2 is modelled on the left-hand side of Fig. 35.30. Compressor K1 is maintained first and then compressor K2. This sub-PN communicates with the sub-PN in Fig. 35.29 through assertions !!PM i and !!PM i . Over a two-year period, this sub-PN is simulated twice. The spare part provisioning of K1 and K2 is modelled on the right-hand side of Fig. 35.30. As soon as the spare part is used to repair one of the compressors, a new one is ordered in anticipation of a further failure.

Fig. 35.31 Flow diagram related to oil production (copy of Fig. 35.9)

Fig. 35.32 Oil production capacity at output of the composite block Ops

Calculating the System Production Availability

The main interest of the sub-PNs described above is that they can be used directly to model the blocks of the flow diagrams analysed in Sect. 35.4.2 in order to build an FD-driven PN. The FD related to oil production is repeated in Fig. 35.31 to facilitate the demonstration.
• The calculation of the production capacity at the output O1 of the composite block Ops is illustrated in Fig. 35.32. It is equal to PSHHout, which is calculated as shown in Fig. 35.24.
• In the same way, the production capacity at the output O2 of the composite block Sepa1 is illustrated in Fig. 35.33. It is equal to PCV1out, which is calculated step by step from O1 through SEP1, LT1, LCV1, PT1 and PCV1.
• A similar process from O1 through SEP2, LT2, LCV2, PT2 and PCV2 leads to the value of the production capacity at the output O3 of the composite block Sepa2.
• This makes it possible to calculate the production capacity O4 = O2 + O3.


Fig. 35.33 Oil production capacity at output of the composite block Sepa1

• Then, the same calculation process can be used to obtain the oil production capacity at:
  • the output O5 = min(O4, P1_Av);
  • the output O6 = min(O4, P2_Av).
• Then, the oil production capacity at the output of the flow diagram can be calculated as Prd_Oil = min(100, O5 + O6).
• The oil production capacity being limited by the gas production capacity, it is finally calculated as Prd_Oil = min(Prd_Gas, O5 + O6).

Applying the above formula requires calculating Prd_Gas, which can be done in exactly the same way by using the flow diagram related to gas production in Fig. 35.8.
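The capacity propagation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, each separation train (SEP, LT, LCV, PT, PCV) is collapsed into a single available capacity, a block is assumed to pass the minimum of its input and its own capacity, and the flows of the two trains are assumed to add up before reaching the pumps.

```python
def oil_production(o1, sepa1_av, sepa2_av, p1_av, p2_av, prd_gas):
    """Propagate capacities (in % of nominal) through the oil flow diagram.

    All argument names are hypothetical; each one is the capacity currently
    available at the corresponding block.
    """
    # Assumption: a train passes min(input, own available capacity).
    o2 = min(o1, sepa1_av)  # output of composite block Sepa1
    o3 = min(o1, sepa2_av)  # output of composite block Sepa2
    # Assumption: the two train outputs add up before reaching the pumps.
    o4 = o2 + o3
    o5 = min(o4, p1_av)     # flow through pump P1
    o6 = min(o4, p2_av)     # flow through pump P2 (0 when in standby)
    # Oil production is finally limited by the gas production capacity.
    return min(prd_gas, o5 + o6)


# Nominal case: 90/10 split, P1 running, P2 in standby.
nominal = oil_production(100, 90, 10, 100, 0, 100)
# Train 1 lost: only the 10 % train remains.
degraded = oil_production(100, 0, 10, 100, 0, 100)
```

In a Monte Carlo simulation, such a function would be re-evaluated each time a token move changes one of the capacities, which is what the assertions of the FD-driven PN do implicitly.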

Conclusion About FD-Driven PN Modelling

Using flow diagrams as a guiding principle proves to be a very effective approach to manage rather large and complicated systems. This leads to FD-driven Markov PNs, which are a good way to model dynamic flow diagrams. This opens the way to user-friendly interfaces such as the one implemented in the module PETRO of GRIF-Workshop (2020), which is devoted to production availability calculations.

35.4.5.4 Conclusion About PN Modelling

The PN model has been the only model able to take into account all the constraints of the modelled system (repair strategy, spare part provisioning, preventive maintenance, etc.) with failure and repair distributions that are not necessarily exponential. Therefore, as already written in Sect. 33.12, the simplicity, flexibility and modelling power of Petri nets used in conjunction with Monte Carlo simulation make it possible to model the production availability of any production system (Folleau et al. 2016). This approach also makes it possible to identify the topmost contributors to the production losses (Dutuit and Innal 2011), but this is beyond the scope of this book.

References

Brameret P-A, Rauzy A, Roussel J-M (2015) Automated generation of partial Markov chain from high level descriptions. Reliab Eng Syst Safety (RESS) 139:179–187. 10.1016/j.ress.2015.02.009. Elsevier
Dutuit Y, Innal F (2011) A component importance measure suitable for flow transmission multi-state systems. Int J Performability Eng
Folleau C, Collas S, Vinuesa C (2016) New simulation model for evaluating the production availability of petroleum systems. ESREL (Eur Saf Reliab Conf)
GRIF-workshop (2020) Markov and PETRO modules. Funded and developed by TOTAL, http://grif-workshop.fr/. Accessed Sept 2020
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International organization for standardization (ISO), Geneva, Switzerland
ISO 20815 Ed. 2.0 (2018) Petroleum, petrochemical and natural gas industries. Production assurance and reliability management. International organization for standardization (ISO), Geneva, Switzerland
Leroy A (2018) Production availability and reliability. Use in the Oil and Gas Industry, 1st ed., Wiley-ISTE. London, UK

Chapter 36

Functional Safety Related Modelling and Calculations

36.1 Introduction and Standardization

Safety system analysis has already been introduced in Chap. 6, which provides the definitions of functional safety, safety instrumented system and safety integrity related to this topic. As illustrated in Fig. 36.1, the current trend is to replace conventional safety systems (the relief valve on the left-hand side) by safety instrumented systems (on the right-hand side, the High Integrity Pressure Protection System (HIPPS), which is a particular type of SIS designed to be very reliable or, rather, very available). Moving from conventional safety systems to safety instrumented systems drastically changes the way they are designed:
• the former rely on physical laws (not subject to failure) and on accumulated field feedback and experience (gathered since the nineteenth century for most of them and possibly synthesized into standards);
• the latter rely on measurements (sensors), calculations (logic solver) and actuation (final elements), all subject to failures, and on reliability analyses, which are subject to errors and incompleteness.

Therefore, proving a priori that a safety system is going to perform properly as required is far more difficult for an SIS than for a conventional system. This led to a new engineering concept named functional safety (see Chap. 6) and to the development of a whole family of sectoral standards derived from IEC 61508 (the mother standard): IEC 61511 (process domain), IEC 61513 (nuclear domain), IEC 62061 (machinery domain) and ISO 26262 (automobile domain), for example. These standards represent hundreds of pages and only a quick overview can be given hereafter. They provide, for example, requirements about:
• the specific definitions to be used;

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_36

Fig. 36.1 From conventional to safety instrumented systems (left: conventional safety system, a tank protected by a relief valve discharging to the flare; right: safety instrumented system (SIS), a tank protected by pressure transmitters PT1, PT2 and PT3, a 2/3 logic solver LS and shutdown valves SDV1 and SDV2)

• the determination of the risk reduction and of the corresponding safety integrity level (SIL, see Sect. 36.2) expected from the installation of a given SIS;
• the design and maintenance of the SIS according to its required SIL;
• the deterministic constraints to comply with the required SIL (e.g. safe failure fraction, fault tolerance, software languages, factory acceptance tests, traceability of modifications, documentation) regardless of the reliability/availability analyses;
• the values of specific probabilistic parameters to comply with the required SIL (PFD and PFH, see Sect. 36.2).

It has to be noted that these standards focus on the design of safety instrumented systems only. Ensuring a good balance between safety and production is outside their scope: they do not really care about the production losses due to spurious safety actions. Therefore, an operator wanting to maximize production in safe conditions has to be cautious and should not apply them blindly.

It also has to be noted that, in spite of important requirements linked to probabilistic calculations, no normative reliability data collection requirements (see Chap. 38) were present in the original issues of the IEC 61508 standards. Even if this is now mentioned in the new issue [route called 2H in edition 2.0 (IEC 61508 Ed. 2.0 2010)], this explains why the decision makers applying these standards do not consider the collection of specific reliability data as a really important topic and find it useless to spend money on it. Consequently, in spite of the existence of a few generic databases (e.g. OREDA 2015; EXIDA 2015; PDS 2013; Ostebo and Dammen 2006), more than 20 years after the first issue of IEC 61508, this domain still suffers from an endemic lack of independent input reliability data. Often, in the best case, only vendor data are available but, as the vendor is both judge and jury, they may be optimistic and have to be used cautiously. In the worst case, no data are available at all and only qualitative analyses can be performed through the deterministic constraints required in the standards.

As a side effect, this fosters the insidious idea that it is not possible to obtain accurate input data and that probabilistic calculations are not really important. This in turn leads to the feeling that, as data are not accurate, any simplistic approach is good enough to perform the calculations. This feeling has been aggravated by the fact that only a catalogue of simplified ready-made formulae was provided in the original IEC 61508 issue. To counterbalance this situation, i.e. to avoid adding calculation uncertainties to data uncertainties and to perform conservative calculations (which is of utmost importance when dealing with safety), a set of alternative techniques (including reliability block diagrams, fault trees, Markov graphs and Petri nets) was added in IEC 61508-6 Ed. 2.0 annex B issued in 2010. At the same time, the ISO/TR 12489 (2013) standard was developed to explain in detail how to perform sound probabilistic calculations in line with the IEC 61508 Ed. 2.0 (2010) standard, and elements for handling multiple safety systems were provided in the IEC 61511-3 Ed. 2.0 (2016) standard. The content of this book is in direct relationship with the alternative techniques of IEC 61508 and the content of ISO/TR 12489 (2013) or IEC 61511-3 (2016).

The aim of this chapter is mainly to carry out a critical and constructive analysis of the functional safety approach by identifying its weaknesses and providing solutions to eliminate them. This is done with regard to the safety integrity concepts and the probabilistic calculations. In particular, the simplified probabilistic calculations proposed in IEC 61508 (2010) and not described elsewhere in the book are analysed in detail in this chapter. The aim is to help the users wanting to implement them, in spite of strong encouragements to implement instead the conventional approaches (Boolean or dynamic) described in Parts 3 and 4. Signoret et al. (2014, 2013), Brissaud et al. (2019) and Zang and Rauzy (2017) are interesting documents about modelling and calculation of safety systems.

36.2 Safety Integrity Concepts

36.2.1 Establishing the Safety Integrity Levels (SIL) Requirements

36.2.1.1 Safety Integrity Level Versus the Necessary Risk Reduction

The concept at the core of the standards mentioned above is the safety integrity level (SIL), which is defined as follows in IEC 61508-4 (2010):

Safety integrity level (IEC 61508): discrete level (one out of a possible four), corresponding to a range of safety integrity values, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest.

This definition is not very informative but can be clarified by considering how the four discrete levels are established from the risk reduction expected from the implementation of the safety instrumented system. The risk reduction can be established by considering risk matrices like the one illustrated in Fig. 36.2.

Fig. 36.2 Example of risk matrix (ordinate: frequency classes likely, unlikely, very unlikely, extremely unlikely and remote, with thresholds at 10^-2, 10^-3, 10^-4 and 10^-5 per year; abscissa: consequence classes moderate, serious, major, catastrophic and disastrous; zones: not acceptable, tolerable and acceptable)

In a risk matrix, the failure frequency (or the probability) is plotted on the ordinate and the consequence levels on the abscissa. The scales are generally given in a subjective way: likely, unlikely, …, remote for the frequency and moderate, serious, …, disastrous for the consequences. The names change from one domain to another but the principle is the same. The aim is to define three risk zones on the matrix:
• acceptable zone, where there is nothing to do to reduce the risk;
• not acceptable zone, where it is required to do something to reduce the risk;
• tolerable zone, where it is necessary to analyse whether further improvements are needed, based on criteria like ALARP [as low as reasonably practicable, see IEC 61508-5 (2010); HSE (2020); NOPSEMA (2015); Wikipedia ALARP (2020)].

From an industrial point of view, three kinds of risks are generally considered: safety, environment and economic (assets, production losses) related risks. This leads to three different risk matrices and the decisions are taken by considering the matrix showing the highest risk.

When a risk reduction is needed, the next step is to estimate the level of this risk reduction. This can be done on the basis of engineering judgment, but it is better to associate figures with the failure frequencies or probabilities, as is done on the right-hand side of Fig. 36.2. The principle of risk reduction is illustrated in Fig. 36.3:
• the starting point is the process risk, R1, without any protection layer;
• the second step is the risk, R2, achieved when the command control and the conventional safety systems have been implemented;
• from this point, if R2 is still in the not acceptable zone, the implementation of a safety instrumented system can be considered in order to decrease the risk to R3, which is in the acceptable zone (or at least the tolerable zone).

Therefore, the necessary risk reduction provided by the SIS should be equal to R2/R3 and this ratio is called the risk reduction factor (RRF). It is used to determine four sets of requirements which are called safety integrity levels (SIL i):

Fig. 36.3 Risk reduction by conventional and instrumented safety systems (frequency versus consequence plots of the process risk R1, the achieved risk R2 and the tolerable risk R3; conventional risk reduction by the 1st and 2nd layers: R1/R2; risk reduction wanted from the SIS: R2/R3)

• SIL 1 is required if $RRF \in\ ]10, 10^{2}]$;
• SIL 2 is required if $RRF \in\ ]10^{2}, 10^{3}]$;
• SIL 3 is required if $RRF \in\ ]10^{3}, 10^{4}]$;
• SIL 4 is required if $RRF \in\ ]10^{4}, 10^{5}]$.

Even if SIL 0 is sometimes used, it is not defined in IEC 61508. In addition, risk reductions greater than $10^{5}$ are not considered to be achievable by implementing a single SIS.

The SIL is not a property of a considered SIS (i.e. not an attribute like a probability of failure or a failure rate) but a set of requirements with which the SIS complies. The requirements become more and more stringent as the SIL increases. It is therefore expected that the chances for an item to perform as required and when required (i.e. a dependability attribute) are better when SIL 4 requirements are fulfilled than when SIL 3 requirements are fulfilled, better with SIL 3 than with SIL 2, and better with SIL 2 than with SIL 1. However, when an SIS is claimed to be SIL 3, that means, and only means, that it has been developed according to the SIL 3 requirements of the standards. If a SIL certificate is produced, this only certifies that the compliance with the SIL 3 requirements has been verified. Therefore, the risk reduction actually provided by a SIS does not depend only on its SIL but also strongly on the way it is implemented and operated within the protected installation.

It has to be noted that SIL certificates are often delivered (and widely used) for parts of SIS (e.g. for sensors or logic solvers) but, with regard to the standard, which defines an SIL for an SIS as a whole, this is a kind of abuse, as a part of a SIS does not reduce any risk alone (especially when it is on the shelf!). Again, that just means that the part has been developed in compliance with a given set of requirements of the standards, and that is all. Therefore, considering that designing a SIS simply consists of bringing SIL certified parts together, like a kid playing with Lego® bricks, is illusory! Even if assembling SIL 3 certified parts together is likely to lead to a lower probability of SIS failure than assembling SIL 2 certified parts, there is absolutely no guarantee that the resulting SIS complies with the SIL 3 requirements.

In fact, more than two thousand years ago, Confucius and then Aristotle (Wikipedia Aristotle 2020) would already have said that “the whole is more (or less) than the sum of its parts”. More recently, R. Bellman (Bellman 1957; Wikipedia Bellman 2020) said more or less the same about optimality (see Fig. 2.1 in Chap. 2) and K. Gödel about the incompleteness of formal systems (Gödel 1992; Wikipedia Gödel 2020). Unexpected behaviours can emerge due to systemic dependencies between the parts, and thorough holistic analyses and calculations have to be performed to verify that the safety objectives are actually reached.
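The mapping between the risk reduction factor and the required SIL given in this section can be sketched as follows. This is a minimal illustration, not a normative implementation: the function name `required_sil` is hypothetical, and returning `None` outside the four bands is a choice made here to reflect that SIL 0 is not defined and that risk reductions above 10^5 are not considered achievable with a single SIS.

```python
def required_sil(rrf):
    """Return the SIL (1 to 4) required for a risk reduction factor.

    Returns None when the RRF falls outside the four bands
    (RRF <= 10, or RRF > 1e5 which a single SIS is not credited for).
    """
    # Bands are open on the left and closed on the right: ]low, high].
    bands = [(1, 1e1, 1e2), (2, 1e2, 1e3), (3, 1e3, 1e4), (4, 1e4, 1e5)]
    for sil, low, high in bands:
        if low < rrf <= high:
            return sil
    return None


# Risk reduction factor R2/R3 needed from the SIS (illustrative figures).
r2, r3 = 5e-3, 1e-5          # achieved and target frequencies, per year
sil = required_sil(r2 / r3)  # RRF = 500
```

The returned value only selects the set of requirements to apply; as stressed above, it says nothing about the risk reduction actually delivered once the SIS is installed and operated.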

36.2.1.2 Basic Risk Reduction Principle

The principle of risk reduction provided by an individual single SIS is illustrated in Fig. 36.4:
• in the absence of the SIS, the frequency F of demand occurring from the process is also the frequency of occurrence of an unsafe (hazardous) situation, $F_U = F$;
• in the presence of the SIS providing a risk reduction equal to r, this frequency is reduced to $F_U = F/r$;
• therefore, the frequency to reach a safe situation is equal to $F_S = F - F_U$ and then $F_S = F \cdot (1 - 1/r)$.

For example, if the risk reduction is equal to 500 (i.e. SIL 2), $F_U = 0.002 \times F$ and $F_S = 0.998 \times F$. Therefore, the risk reduction factor acts in the formulae through its inverse value 1/r which, like a probability, lies between 0 and 1. This is why it is easy to assimilate the probability of failure of the SIS to the inverse of the risk reduction brought by this SIS, as analysed in Sect. 36.2.2.

Fig. 36.4 Principle of risk reduction (a demand of frequency F from the protected system leads to a safe situation if the SIS succeeds and to an unsafe situation if the SIS fails)

It has to be noted that this basic analysis is only valid for a single protection layer protecting against a single threat. When multiple safety systems are implemented to contain the same threat, the link between the risk reduction provided by an individual safety system and its failure probability is not straightforward: it strongly depends on how it is implemented and on its relationships with the other barriers (see Sect. 36.3.4.6). More generally, when several protection layers act in sequence (i.e. layer 2 is demanded if layer 1 fails, layer 3 is demanded if layer 2 fails, etc.), the risk reduction provided by layer 2 is lower than if it were used alone or in first position; the risk reduction provided by layer 3 is lower than if it were used in second position, this risk reduction itself being lower than if layer 3 were used alone or in first position, etc. This is due to systemic dependencies (e.g. common parts, synchronous tests) between the protection layers (see Sect. 36.3.2.4 hereafter). Therefore, the overall risk reduction $r_{mult}$ is not equal to but lower than the product, $r_1 \cdot r_2 \cdot r_3 \cdots$, of the risk reductions provided by the protection layers used alone:

$$r_{mult} = r_1 \cdot r'_2 \cdot r'_3 \cdots \quad \text{where } r'_2 < r_2,\ r'_3 < r_3, \ldots \qquad (36.1)$$

This is illustrated in Fig. 36.5, which represents the LOPA (Layer of Protection Analysis) principle applied to the safety systems represented in Fig. 36.3. It has to be noted that non-conservative results should be forbidden when performing safety analyses. Therefore, the estimation of the risk reduction provided by multiple protection layers should be done by implementing systemic approaches considering all of them as a whole [see Chaps. 13–33 and ISO/TR 12489 (2013) or Innal (2008)].

Fig. 36.5 Layer of protection analysis (LOPA) example (event tree: an initiating event of frequency w successively demands the 1st layer, the 2nd layer and the SIS, with risk reductions r1, r'2 and r'3; the success of any layer leads to a safe situation and the failure of all of them to the hazardous event)
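The basic risk reduction principle and the effect described by Eq. (36.1) can be illustrated numerically. The sketch below is only indicative: the function names are hypothetical, and the degraded values r'2 and r'3 are invented inputs chosen for the example, since in practice they have to come from a systemic analysis of the dependencies between the layers.

```python
import math


def split_demand_frequency(f, r):
    """Return (F_U, F_S) for a demand frequency f and a single SIS
    providing a risk reduction r: F_U = f / r and F_S = f - F_U."""
    f_unsafe = f / r
    return f_unsafe, f - f_unsafe


def overall_risk_reduction(reductions):
    """Product of the risk reductions actually provided by layers
    acting in sequence (r1, r'2, r'3, ... in Eq. 36.1)."""
    return math.prod(reductions)


# Single SIS with r = 500 (SIL 2 band), demand frequency normalised to 1.
f_u, f_s = split_demand_frequency(1.0, 500)

# Three layers which would each provide r = 100 if used alone.
naive = overall_risk_reduction([100, 100, 100])    # non-conservative product
# Hypothetical degraded values r'2 = 40 and r'3 = 15 due to dependencies.
realistic = overall_risk_reduction([100, 40, 15])
```

With these invented figures the naive product overestimates the overall risk reduction by more than one order of magnitude, which is exactly the kind of non-conservative result the text warns against.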

36.2.2 Low Demand Versus High Demand Mode of Operation

Safety systems can be split according to two modes of operation: continuous mode for those acting permanently to control (regulate) the process (e.g. Watt's centrifugal speed governor, level or pressure regulators) and demand mode for those acting on demand when some physical parameters cross given thresholds (e.g. relief valves, shutdown systems, HIPS). The IEC 61508 standard takes up this idea in more detail for the demand mode systems and considers three cases of operation for a safety system:
• low demand mode of operation: no more than one demand per year;
• high demand mode of operation: more than one demand per year;
• continuous mode of operation: permanent demand as a part of normal operation.

The split between less and more than one demand per year is questionable because considering the proof test interval (to detect hidden failures) would have been more relevant. Perhaps this has been chosen from a pragmatic and simplifying point of view because one year is often a reasonable and widely used proof test interval. It has to be noted that the high demand mode of operation is generally assimilated to the continuous mode and that this is a legitimate and conservative assumption.

36.2.3 Probabilistic Requirements: PFDavg and PFH

IEC 61508 distinguishes between the low demand and the high demand/continuous modes of operation with regard to probabilistic requirements by introducing the two following key parameters:

Average probability of dangerous failure on demand (PFDavg): average unavailability of a SIS to perform as required when required.

Average probability of dangerous failure per hour (PFH): average failure frequency of the SIS (i.e. its average unconditional failure intensity, see Chap. 4).

The discrepancy between the names and the definitions is a legacy from the early version of the IEC 61508 standard. The acronyms PFDavg and PFH quickly became very popular after the first issue of IEC 61508, so it was decided to keep them in the new issue, where the definitions have been improved in order to be consistent with those in use in the dependability field (IEV 192 2015). Therefore, at the present time, the definitions of these concepts are in line with the usual unavailability and frequency calculations as described in this book.

According to IEC 61508, the PFDavg has to be used for the low demand mode of operation and the PFH for the high demand/continuous mode of operation. The reason for associating PFDavg with the low demand mode and PFH with the high demand/continuous mode is unclear (perhaps pragmatic reasons for simplified calculations) because both are relevant in any mode of operation. Furthermore, it has to be noted that some sectoral standards (e.g. IEC 61511 2016) are more flexible with regard to the use of PFDavg or PFH for the high demand mode of operation.

The reason for associating the PFH with the continuous mode of operation is also unclear, as the unreliability is the natural relevant parameter for systems having to operate continuously without failures. Perhaps this is to encompass low reliability SIS (e.g. SIL 1) which may have several failures during the life of the protected system. Therefore, beyond the PFH required to comply with the standard, it could be wise for reliability engineers to calculate, in addition, the unreliability of the safety systems operated in continuous mode.

The hazardous event frequency, HEFavg, is then equal to:
• HEFavg = w × PFDavg for the low demand mode of operation, if w is the demand frequency;
• HEFavg = PFH for the high demand or continuous mode of operation.

As w × PFDavg is not directly linked to PFH, it is likely that the use of PFDavg or PFH for SIS in between the low and high demand modes leads to different SIL results.

The PFDavg and PFH requirements are summarized in Table 36.1 (see IEC 61508 part 1) and they depend on the level of safety integrity to be achieved: the higher the SIL, the lower the required PFDavg or PFH. In addition, this table also shows that:
• the required PFDavg has the inverse value of the risk reduction factor introduced in Sect. 36.2.1.2;
• the required PFH is equal to the required PFDavg divided by $10^{4}$.

The simple relationship PFDavg = 1/r has already been mentioned in Sect. 36.2.1.2. It holds for a single SIS operating alone but should not be used when multiple safety systems are involved (see Sect. 36.3.4.6).

The rationale of the division by $10^{4}$ for defining the PFH is unclear but seems simply to come from a rough approximation of the duration of one year (10,000 h instead of 8760 h), which makes quick calculations easier. Anyway, for a single component and under the assumption that $\lambda\tau \ll 1$, $PFD_{avg} \approx \lambda\tau/2$ and $PFH \approx \lambda$. Then, when $\tau$ = 10,000 h, $PFD_{avg} \approx 5000 \times PFH$, so the consistency between PFDavg and PFH is not really ensured. This is illustrated in Fig. 36.6 for a single non-repaired item over a period of 10,000 h. This figure shows clearly that the PFDavg and PFH criteria do not systematically lead to the same SIL requirements. Even if this does not pretend to be a demonstration in the general case, it seems that, when the SILs are different, the PFH approach

Table 36.1 PFDavg and PFH requirements in relationship with the SIL

SIL    PFDavg                            PFH [h^-1]
4      ≥ $10^{-5}$ to < $10^{-4}$        ≥ $10^{-9}$ to < $10^{-8}$
3      ≥ $10^{-4}$ to < $10^{-3}$        ≥ $10^{-8}$ to < $10^{-7}$
2      ≥ $10^{-3}$ to < $10^{-2}$        ≥ $10^{-7}$ to < $10^{-6}$
1      ≥ $10^{-2}$ to < $10^{-1}$        ≥ $10^{-6}$ to < $10^{-5}$
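The possible disagreement between the two criteria can be checked with the single-component approximations quoted in this section, PFDavg ≈ λτ/2 and PFH ≈ λ. The following sketch is illustrative only: the function names and the failure rate are hypothetical, and the bands are those of Table 36.1.

```python
# SIL bands of Table 36.1: value in [low, high[ for each SIL.
PFD_BANDS = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
PFH_BANDS = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}


def sil_from(value, bands):
    """Return the SIL whose band contains value, or None if outside all bands."""
    for sil, (low, high) in bands.items():
        if low <= value < high:
            return sil
    return None


def hazardous_event_frequency(w, pfd_avg):
    """HEFavg = w * PFDavg (low demand mode, w = demand frequency)."""
    return w * pfd_avg


lam = 1.5e-6        # hypothetical dangerous undetected failure rate, per hour
tau = 10_000.0      # proof test interval, hours

pfd_avg = lam * tau / 2     # single component, lam * tau << 1
pfh = lam                   # single component

sil_by_pfd = sil_from(pfd_avg, PFD_BANDS)
sil_by_pfh = sil_from(pfh, PFH_BANDS)
```

With these figures the PFDavg criterion places the item in the SIL 2 band whereas the PFH criterion places it in the SIL 1 band, which is one instance of the inconsistency discussed above.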

0 for p ∈ ]0.5, ∞[

The unavailability curves in Fig. 36.13 are drawn for $p(t) = 1 - \exp(-\lambda_i t)$. This implies that $1 - p(t) = \exp(-\lambda_i t)$ and $\Delta P = 0$ when $\exp(-\lambda_i t_{divR}) = 0.5$. Then, finally, $\lambda_i t_{divR} = -\ln(0.5)$. This leads to:

$$t_{divR} = \frac{-\ln(0.5)}{\lambda_i} = 0.693 \times MTTF \qquad (36.5)$$


This result is similar to the one observed in Fig. 36.13. The instant $t_{divR}$ divides the time space into two intervals where the reliability/availability of the 2oo3 is:
• better than that of the 1oo1 over $t \in [0, 0.693 \times MTTF[$;
• equivalent to that of the 1oo1 for $t_{divR} = 0.693 \times MTTF$;
• worse than that of the 1oo1 over $t \in\ ]0.693 \times MTTF, \infty[$.

Comparison with regard to the failure rate point of view

According to Sect. 4.7.6.2, the failure rate of a system can be calculated from its reliability with the formula $\Lambda(t) = -\dfrac{dR(t)/dt}{R(t)}$. According to Formula 36.3, and with $p(t) = 1 - \exp(-\lambda_i t)$, the reliability of the 2oo3 is given by the following formula:

$$R_{2oo3}(t) = 1 - U_{2oo3}(t) = 3\exp(-2\lambda_i t) - 2\exp(-3\lambda_i t) \qquad (36.6)$$

Then:

$$\Lambda_{2oo3} = \frac{6\lambda_i\left[\exp(-2\lambda_i t) - \exp(-3\lambda_i t)\right]}{3\exp(-2\lambda_i t) - 2\exp(-3\lambda_i t)} = \frac{6\lambda_i\left[1 - \exp(-\lambda_i t)\right]}{3 - 2\exp(-\lambda_i t)} \qquad (36.7)$$

This can be compared to the failure rate of a single component (i.e. a 1oo1):

$$\Delta\Lambda = \Lambda_{1oo1} - \Lambda_{2oo3} = \lambda_i - \frac{6\lambda_i\left[1 - \exp(-\lambda_i t)\right]}{3 - 2\exp(-\lambda_i t)} \qquad (36.8)$$

$\Delta\Lambda = 0$ when $\lambda_i - \dfrac{6\lambda_i\left[1 - \exp(-\lambda_i t)\right]}{3 - 2\exp(-\lambda_i t)} = 0 \Leftrightarrow 1 - \dfrac{6\left[1 - \exp(-\lambda_i t)\right]}{3 - 2\exp(-\lambda_i t)} = 0$.

The solution of this equation leads to $\exp(-\lambda_i t_{div}) = 3/4$ and to:

$$t_{div} = \frac{-\ln(0.75)}{\lambda_i} = 0.288 \times MTTF \qquad (36.9)$$

Again, this result is similar to the one observed in Fig. 36.13. The instant $t_{div}$ divides the time space into two intervals where the failure rate of the 2oo3 is:
• better than that of the 1oo1 over $t \in [0, 0.288 \times MTTF[$;
• equivalent to that of the 1oo1 for $t_{div} = 0.288 \times MTTF$;
• worse than that of the 1oo1 over $t \in\ ]0.288 \times MTTF, \infty[$.

Therefore, the implementation of majority vote systems is interesting compared with a non-redundant system (1oo1) only if the periodic proof test interval is small enough:
• at least for keeping the unavailability lower than that of the non-redundant system;
• or better, for keeping the failure rate lower than that of the non-redundant system.
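The two crossover instants derived above can be verified numerically. The sketch below is a simple check under stated assumptions: the function names and the value of λ are hypothetical, the items are non-repaired with exponential lifetimes, and the 2oo3 failure probability is taken as the standard expression 3p² − 2p³ (at least two items failed out of three), which is assumed here to correspond to Formula 36.3.

```python
import math


def u_1oo1(lam, t):
    """Failure probability of a single item: p(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def u_2oo3(lam, t):
    """Failure probability of a 2oo3: at least two items failed out of three."""
    p = u_1oo1(lam, t)
    return 3 * p**2 - 2 * p**3


def rate_2oo3(lam, t):
    """Failure rate of the 2oo3 (Eq. 36.7)."""
    x = math.exp(-lam * t)
    return 6 * lam * (1 - x) / (3 - 2 * x)


lam = 1e-4                          # hypothetical failure rate, per hour
mttf = 1 / lam
t_div_r = -math.log(0.5) / lam      # Eq. 36.5, unavailability crossover
t_div = -math.log(0.75) / lam       # Eq. 36.9, failure rate crossover

ratio_r = t_div_r / mttf            # expected close to 0.693
ratio = t_div / mttf                # expected close to 0.288
```

The failure rate of the 2oo3 exceeds that of the single item well before its unavailability does, which is why keeping the failure rate lower is the more demanding of the two conditions on the proof test interval.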


It has to be noted that, if the majority vote logic is rather easy to implement with the sensor or logic solver parts of the SIS, it is generally difficult to implement with final elements (e.g. valves), for which the simple logics 1oo1, 1oo2 or 1oo3 are often the only tractable ones.

36.3 Probabilistic Calculations

36.3.1 Input Data Needs and Conservativeness

The problem of input data accuracy has already been mentioned in the introduction (Sect. 36.1) and this section is an opportunity to recall that:
• although the input data are a weak point, it is not acceptable to use this fact as an alibi to perform probabilistic calculations with oversimplified models, because this adds calculation uncertainties to data uncertainties;
• when dealing with safety systems, the calculations should be made with conservative data and assumptions.

It can be added that using data coming from field proven (proven in use, proved by prior use) components is certainly a better and stronger approach than a purely qualitative approach. In the same way, it can also be mentioned that the systematic use of a set of pre-established input data is essential to perform consistent probabilistic calculations allowing relevant comparisons between different designs and different safety systems. This is also useful to perform sensitivity analyses. Gathering such a preferred data set therefore appears to be a very important task to undertake.

Coming back to the standard, except when implementing the route 2H, data coming from field feedback are not required to comply with the IEC 61508 standard. However, when they are used, “any failure rate data used should have a confidence level of at least 70%” (IEC 61508-2). This is clearly a conservative requirement, which can be interpreted as using the 70th percentile of the statistical estimation (e.g. given by the chi-square distribution) instead of the mean value obtained by the maximum likelihood estimate. When route 2H is undertaken, it is required to use data:
• coming from the feedback of similar components used in a similar way;
• collected according to international standards [e.g. IEC 60300-3-2 (2004) or ISO 14224 (2016)];
• evaluated according to the amount of field feedback collected, the exercise of expert judgment and the performance of specific tests when needed in order “to estimate the average and the uncertainty level (e.g., the 90% confidence interval or the probability distribution of each reliability parameter (e.g., failure rate) used in the calculations)”.


The last sentence about the 90% confidence is not very clear: it perhaps means that the 90th percentile of the statistical estimation must be used instead of the 70th percentile or of the mean value. This is likely intended to counterbalance the abandonment of the safe failure fraction. With regard to PFDavg and PFH, it is required that “the system shall be improved until there is a confidence greater than 90% that the target failure measure is achieved”. Again, this is not very clear, but it seems to mean that the 90th percentile of the PFDavg or PFH obtained through a Monte Carlo simulation has to be considered instead of its average value. And again, this is a conservative requirement aiming to counterbalance the abandonment of the safe failure fraction when route 2H is implemented. This is going to be analysed in the following chapters. It has to be noted that end-users are “encouraged to organize relevant component reliability data collection”: this is very wise advice which, if followed, is likely to help to perform more accurate analyses in the future.
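One common way to read the 70% confidence requirement discussed in this section is to replace the maximum likelihood estimate n/T by a one-sided upper bound based on the chi-square distribution. The sketch below is an illustration under stated assumptions, not the procedure of the standard: the function names are hypothetical, a constant failure rate and a time-terminated observation are assumed, the bound is taken as the chi-square quantile with 2n + 2 degrees of freedom divided by 2T, and the quantile is obtained by bisection on the closed-form CDF available for an even number of degrees of freedom.

```python
import math


def chi2_cdf_even(x, dof):
    """CDF of a chi-square distribution with an even number of degrees of freedom:
    1 - exp(-x/2) * sum_{j<k} (x/2)^j / j!  with k = dof / 2."""
    k = dof // 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2) / j
        total += term
    return 1.0 - math.exp(-x / 2) * total


def upper_failure_rate(n_failures, cumulated_time, confidence=0.70):
    """One-sided upper bound of a constant failure rate observed over
    cumulated_time with n_failures failures (time-terminated observation)."""
    dof = 2 * n_failures + 2
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < confidence:   # bracket the quantile
        hi *= 2
    for _ in range(200):                         # bisection on the CDF
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, dof) < confidence:
            lo = mid
        else:
            hi = mid
    return hi / (2 * cumulated_time)


# Two failures observed over 1e6 cumulated hours (illustrative figures).
mle = 2 / 1e6
lam_70 = upper_failure_rate(2, 1e6, 0.70)
lam_90 = upper_failure_rate(2, 1e6, 0.90)
```

The bound stays meaningful when no failure has been observed, a case in which the maximum likelihood estimate is zero and therefore of no use for a conservative calculation.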

36.3.2 Simplified Analytical Approach

36.3.2.1 Introduction to Simplified Analytical Calculations

Although it is recommended to perform PFD and PFH calculations by implementing the systemic approaches described in Part 3 (Boolean approaches) and Part 4 (Markov processes and Petri nets), the use of simplified analytical formulae is widespread because it was brought forward in the first issue of IEC 61508. Even if this has been improved in IEC 61508-6 annex B, issued in 2010, by the introduction of alternative techniques (RBDs, FTs, Markov and PNs), the catalogue of ready-made formulae proposed in this standard is still very popular because it allows fast probabilistic calculations. However, as no explanation is given about the way these formulae have been established, the risk of misuse is high. This is why, rather than relying on IEC 61508 alone, it is wise to also use the ISO 12489 standard, which aims to help the reliability engineer to implement the simplified approach with full knowledge of the underlying assumptions and limitations. It also provides the missing explanations and the information needed to adapt existing formulae or even develop new ones to cover specific problems. As it is a rather thick document, only a snapshot of this approach is given hereafter: to this end, the dangerous undetected (DU) failures, which raise the most specific calculation difficulties due to the periodic proof tests, have been selected. It has to be noted that the underlying model behind the analytical formulae is a mix between the Markovian approach (Chap. 31) and the minimal cut set approach (Chaps. 17 and 19). It can be seen as a very simplified version of the RBD/FT driven Markov processes (Chap. 27). Explanations are also provided in (Innal 2008).

36.3.2.2 Dangerous Failure Analysis of Non-redundant Items

Figure 36.14 represents a simple item, A, whose dangerous undetected (DU) failures are tested with a proof test interval τ. This is the simplest element for establishing the analytical formulae. To do so, it is traditionally considered that the tests and the repairs are performed instantaneously and that the test interval, τ, is small with regard to 1/λ (i.e. λ.τ ≪ 1). Then the approximation F(t) = 1 − exp(−λ.t) ≈ λ.t is valid all over the test interval [0, τ]. According to the above assumptions, the item instantaneously becomes as good as new again after a test and therefore U(t) = F(δ) ≈ λ.δ with δ = t mod τ (i.e. the remainder of the division of t by τ). This leads to the typical saw-tooth curve illustrated in Fig. 36.15, which is the simplest model encountered when dealing with functional safety related calculations.

Fig. 36.14 Individual item whose failures are periodically tested

Fig. 36.15 Saw-tooth curve of the unavailability of SIS presented in Fig. 36.14

Impact of Hidden Failures

The curve being replicated from test interval to test interval, the calculation of the average unavailability (PFDavg) can be done by considering only one of these intervals:

$$\mathrm{PFD}_{avg} \approx \frac{1}{\tau}\int_0^{\tau} \lambda\,\delta\,d\delta = \frac{1}{\tau}\cdot\frac{\lambda\,\tau^{2}}{2} = \frac{\lambda\,\tau}{2} \tag{36.10}$$
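As a quick check of Formula 36.10, the sketch below compares the approximation λ.τ/2 with the exact time-average of 1 − exp(−λ.δ) over one test interval, under the same assumptions of instantaneous tests and repairs. The failure rate and the test intervals are illustrative values chosen here, not values from the text.

```python
import math


def pfd_avg_exact(lam: float, tau: float) -> float:
    """Exact time-average of F(d) = 1 - exp(-lam*d) over one test interval."""
    x = lam * tau
    # (1/tau) * integral of (1 - exp(-lam*d)) = 1 - (1 - exp(-x)) / x
    return 1.0 + math.expm1(-x) / x


def pfd_avg_approx(lam: float, tau: float) -> float:
    """Formula 36.10: first-order approximation lam*tau/2."""
    return lam * tau / 2.0


lam = 1.0e-5                              # DU failure rate per hour (illustrative)
for tau in (730.0, 8760.0, 43800.0):      # monthly, yearly and five-yearly tests
    exact = pfd_avg_exact(lam, tau)
    approx = pfd_avg_approx(lam, tau)
    print(f"tau={tau:8.0f} h  exact={exact:.4e}  approx={approx:.4e}  "
          f"ratio={approx / exact:.4f}")
```

The approximation is always slightly above the exact value, hence conservative, and the gap grows when λ.τ is no longer small compared to 1.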

Let ϑh be the average duration of a hidden failure given that the item is as good as new just after a test and failed at the next test. In this case:

• F(τ) ≈ λ.τ is the probability of revealing a failure when performing the test;
• λ.τ × ϑh is the expected unavailability duration of the item over a test interval;
• PFDavg = λ.τ × ϑh/τ = λ.ϑh is another way to calculate the PFDavg.

So, by comparison with Formula 36.10:

$$\vartheta_h = \frac{\tau}{2} \tag{36.11}$$

Therefore, when a dangerous hidden failure of a single item is revealed by a proof test, it has been present, on average, for half the test interval (e.g. 6 months for a yearly test or 2.5 years for a test performed every five years), as illustrated in Fig. 36.16. There is no doubt that Formula 36.11 is the most famous and the most widely used formula to perform simplified analytical PFDavg calculations. However, users should be aware of the impact of the assumptions described above, which imply that this formula is valid if and only if:

• the test duration is included at the end of the proof test interval and has no detrimental impact on the SIS availability (in particular, the item remains available during the test);
• the protected system is immediately shut down when a hidden dangerous failure is revealed by the proof test.

In this case, the hidden failures are detected instantaneously at the end of a test interval and the exposure to danger disappears as soon as the dangerous failure is revealed. The repair duration then does not matter with regard to the SIS performance. It is the job of the analyst to ensure that these strong assumptions are realistic. This is analysed hereafter.

Fig. 36.16 Unavailability duration of a hidden failure when it is revealed by proof test

It is not realistic to consider that a protected system is systematically shut down during the repair of a part of an SIS because this is likely to lead to too many shutdowns. In addition, in case of redundancy, the probability of success of the SIS is only reduced (i.e. the safety action is not inhibited) and compensatory measures can be undertaken to mitigate the impact of the failure. Then, provided that it does not last too long, the items are often repaired without shutting down the protected system: this means that the repair time has to be considered when calculating the PFDavg.

Impact of Repair Times

Two categories of failures can be detected when a proof test is performed:

• the hidden dangerous failure properly speaking: probability F(τ) ≈ λ.τ;
• the failure provoked by the performance of the test itself: probability γ.

Therefore, with:

• λ.τ + γ being the probability of having a failure to repair after a proof test;
• 1/μ being the mean overall time to repair these failures;

(λ.τ + γ)/μ is the expected maintenance time for the failures discovered by a proof test. The impact of the repair duration on the PFDavg is illustrated in Fig. 36.17. The contribution of the repairs to the PFDavg is obtained by dividing this expected value by τ, and the PFDavg formula becomes:

$$\mathrm{PFD}_{avg} = \frac{\lambda\,\tau}{2} + \frac{\lambda}{\mu} + \frac{\gamma}{\mu\,\tau} \tag{36.12}$$

Optimum Test Interval and Minimum Achievable PFDavg

In the functional safety standards, the term γ/(μ.τ) is omitted, and yet it is an important term as it prevents the PFDavg from being reduced to zero by increasing the proof test frequency indefinitely. Setting the derivative of Formula 36.12 with respect to τ to zero leads to the optimum test interval:

$$\tau_{opt} = \sqrt{\frac{2\,\gamma}{\lambda\,\mu}} \tag{36.13}$$

When increasing the test interval from zero to τopt, the PFDavg decreases and reaches a minimum value, which is the lowest value achievable just by reducing the proof test interval. Reducing the test interval below τopt increases the PFDavg instead of decreasing it as is generally expected.

Fig. 36.17 Impact on unavailability duration of the repair of a revealed dangerous failure

Parameter γ may be difficult to estimate but it can be kept in mind as a safeguard: if not performing enough tests is not good for safety, performing too many tests is not good for safety either. The PFDavg obtained for τopt is the minimum achievable just by decreasing the test interval (see Fig. 36.18). It is therefore impossible to reach a PFDavg target lower than this value just by reducing the proof test interval, e.g. a value lower than 5.47 × 10⁻³ by reducing the test interval below 447 h for the example on the left-hand side of Fig. 36.18. Note that the PFDavg decreases quickly before the optimum value and increases more slowly after it. A test interval that is too small is thus more detrimental to the PFDavg than one that is too large.

The contributions of the various terms of Formula 36.12 are illustrated on the right-hand side of Fig. 36.18. With the parameters used, the contribution of the term λ.τ/2 increases quickly and the contribution of the term γ/(μ.τ) decreases quickly. The optimum is reached when their contributions are identical. The impact of the third term, λ/μ, is more limited.

Fig. 36.18 Evolution of the PFDavg as a function of the test interval and contribution (in %) of the three terms of Formula 36.12

The evolution of τopt as a function of γ is illustrated on the left-hand side of Fig. 36.19. With a log-log scale, this gives a straight line: τopt(γ) increases very quickly when γ increases. The right-hand side of Fig. 36.19 shows the minimum achievable PFDavg as a function of γ. This is also drawn with a log-log scale: when γ is small, the term λ/μ dominates, and when γ is large, the terms γ/(μ.τ) and λ.τ/2 (which are equal for τopt) dominate.

Fig. 36.19 Evolution of the τopt and of the minimum achievable PFDavg as a function of γ
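Formulae 36.12 and 36.13 can be evaluated with a few lines of code. The sketch below is only an illustration: the values λ = 10⁻⁵ h⁻¹, μ = 10⁻² h⁻¹ and γ = 10⁻² are an assumption made here because they reproduce the 447 h and 5.47 × 10⁻³ quoted in the text, and the function names are hypothetical.

```python
import math


def pfd_avg(tau, lam, mu, gamma):
    """Formula 36.12: hidden failures + repairs + test-induced failures."""
    return lam * tau / 2.0 + lam / mu + gamma / (mu * tau)


def tau_opt(lam, mu, gamma):
    """Formula 36.13: test interval minimising Formula 36.12."""
    return math.sqrt(2.0 * gamma / (lam * mu))


# Assumed parameters (chosen because they reproduce the values quoted in the text).
LAM, MU, GAMMA = 1.0e-5, 1.0e-2, 1.0e-2

t_opt = tau_opt(LAM, MU, GAMMA)
pfd_min = pfd_avg(t_opt, LAM, MU, GAMMA)
print(f"tau_opt = {t_opt:.0f} h, minimum achievable PFDavg = {pfd_min:.2e}")

# Testing more often or less often than tau_opt both degrade the PFDavg.
for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    tau = factor * t_opt
    print(f"tau = {tau:7.0f} h -> PFDavg = {pfd_avg(tau, LAM, MU, GAMMA):.2e}")
```

At the optimum the terms λ.τ/2 and γ/(μ.τ) are equal, so the minimum can also be written λ.τopt + λ/μ.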

Unavailability During Proof Test

When the item is put offline during tests (e.g. a sensor which is disconnected to be tested), the test duration π brings a contribution equal to π/τ to the PFDavg and may be a top contributor to this parameter. The related unavailability duration is illustrated in Fig. 36.20, where the proof test has been located at the end of the test interval. Note that π ≪ τ, but it has been enlarged on the figure for clarity. Adding this contribution gives the following formula:

$$\mathrm{PFD}_{avg} = \frac{\lambda\,\tau}{2} + \frac{\lambda}{\mu} + \frac{\gamma}{\mu\,\tau} + \frac{\pi}{\tau} \tag{36.14}$$

Improved Optimum Test Interval and Minimum Achievable PFDavg

Contrary to γ, the probability of provoking a failure due to the proof test, the test duration π is a readily available parameter. Taking π into account leads to recomputing the optimum test interval as:

$$\tau'_{opt} = \sqrt{\frac{2\,(\gamma + \mu\,\pi)}{\lambda\,\mu}} \tag{36.15}$$

This is illustrated in Fig. 36.21, where a test duration of 5 h has been considered. This seems a small duration but the impact is important: the minimum achievable PFDavg (i.e. 1.2 × 10⁻²) is multiplied by two and the optimum test interval (i.e. 1095 h) by 2.4 compared to the case where π = 0. The shape of the PFDavg(τ) curve is similar to that of the previous case and leads to the same remarks. The contributions of the various terms of Formula 36.14 are illustrated on the right-hand side of Fig. 36.21. Again, the contribution of the term λ.τ/2 increases quickly, but it is now balanced by the term π/τ, which decreases quickly. The contribution of the term γ/(μ.τ) is more limited than in the previous case and similar to that of the term λ/μ.

Fig. 36.20 Impact on test duration (item offline during proof test)

Fig. 36.21 Evolution of the PFDavg as a function of the test interval and contribution (in %) of the terms of Formula 36.14

Fig. 36.22 Evolution of the τopt and of the minimum achievable PFDavg as a function of π

The evolution of τopt as a function of the proof test duration π is illustrated on the left-hand side of Fig. 36.22. In the example, the value of τopt(π) increases from 547 to 5814 h when π increases from half an hour to one week. The right-hand side of Fig. 36.22 shows the minimum achievable PFDavg as a function of π. A log scale is used for the ordinates: for the same example, the minimum achievable PFDavg increases from 6.5 × 10⁻³ to 6.0 × 10⁻² when π increases from half an hour to one week. The terms λ.τ/2 and π/τ dominate and have almost the same value for τopt. The contribution of the term λ/μ is constant and that of γ/(μ.τ) continuously decreases. It has to be noted that neglecting the impact of the maintenance, μ, and the impact of the failures due to the test, γ, leads to:

$$\tau'_{opt} = \sqrt{\frac{2\,\pi}{\lambda}} \tag{36.16}$$

With the same values as above (λ = 10⁻⁵ h⁻¹; π = 5 h), this leads to an optimum proof test interval of 1000 h and a minimum achievable PFDavg of 1.1 × 10⁻².
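The effect of the test duration can be reproduced numerically from Formulae 36.14 to 36.16. As before, this is a sketch under assumptions: μ = 10⁻² h⁻¹ and γ = 10⁻² are values assumed here because, together with λ = 10⁻⁵ h⁻¹, they reproduce the 1095 h, 547 h and 5814 h quoted in the text; the function names are hypothetical.

```python
import math


def pfd_avg(tau, lam, mu, gamma, pi):
    """Formula 36.14: Formula 36.12 plus the off-line test contribution pi/tau."""
    return lam * tau / 2.0 + lam / mu + gamma / (mu * tau) + pi / tau


def tau_opt(lam, mu, gamma, pi):
    """Formula 36.15: optimum test interval when each test lasts pi hours."""
    return math.sqrt(2.0 * (gamma + mu * pi) / (lam * mu))


def tau_opt_simplified(lam, pi):
    """Formula 36.16: optimum when repairs and test-induced failures are neglected."""
    return math.sqrt(2.0 * pi / lam)


# Assumed parameters (chosen because they reproduce the values quoted in the text).
LAM, MU, GAMMA = 1.0e-5, 1.0e-2, 1.0e-2

for pi in (0.5, 5.0, 168.0):        # half an hour, 5 h, one week
    t = tau_opt(LAM, MU, GAMMA, pi)
    print(f"pi = {pi:6.1f} h -> tau_opt = {t:7.1f} h, "
          f"minimum PFDavg = {pfd_avg(t, LAM, MU, GAMMA, pi):.2e}")

print(f"Formula 36.16 with pi = 5 h: tau_opt = {tau_opt_simplified(LAM, 5.0):.0f} h")
```

The simplified Formula 36.16 gives an optimum (1000 h) close to that of Formula 36.15 (about 1095 h) for these parameters, which illustrates that the test duration is the dominant driver here.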

Impact of Human Failures

Human failures are another source of unavailability for a SIS. Figure 36.23 illustrates two cases where dangerous failures are present after a test due to human failures:

1. At the top of the figure, the item disconnected for test purposes is not put back online when the test has been completed (probability ζ1); provided that the overall repair time is negligible compared to the test interval, this also encompasses the case where the item is not put back online after a repair.
2. At the bottom of the figure, a failure normally covered by the test (probability λ.τ + γ) is not revealed because the test is badly handled (probability ζ2); this encompasses the forgotten tests which are not performed at all.

It has to be noted that the first case is precisely the cause of the accident at the Three Mile Island nuclear power plant in 1979 (Rogovin and Frampton 1979), where the four valves of the pressurizer were left in a wrong position after a test. In both cases, the item remains unavailable during the whole test interval but the failure can be detected when the next test is performed. Therefore, the contribution to the unavailable time is equal to [ζ1 + ζ2.(λ.τ + γ)] · τ and the contribution to the PFDavg is equal to [ζ1 + ζ2.(λ.τ + γ)]. Finally, the whole PFDavg is obtained as:

$$\mathrm{PFD}_{avg} = \frac{\lambda\,\tau}{2} + \frac{\lambda}{\mu} + \frac{\gamma}{\mu\,\tau} + \frac{\pi}{\tau} + \zeta_1 + \zeta_2\,(\lambda\,\tau + \gamma) \tag{36.17}$$

Fig. 36.23 Example of human failure related to proof test performance

No doubt other cases of unavailability could be identified (e.g. the dangerous failures covered neither by diagnostics nor by proof tests), but the aim is mainly to demonstrate that simplifying the PFDavg calculations to PFDavg = λ.τ/2 + λ/μ or, worse, to PFDavg = λ.τ/2 completely sets aside the fact that, if there is a benefit in performing proof tests (i.e. detecting the hidden failures), there is also a loss (unavailability due to the tests themselves). Therefore, a good balance between not enough tests and too many tests has to be pursued.
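To make the weight of each term visible, the following sketch evaluates the individual contributions of Formula 36.17 and compares their sum with the simplification PFDavg = λ.τ/2. All numerical values are illustrative; ζ1 and ζ2 in particular are purely hypothetical, and the function name is not taken from any standard.

```python
def pfd_terms(tau, lam, mu, gamma, pi, zeta1, zeta2):
    """Individual terms of Formula 36.17, keyed by a short label."""
    return {
        "lam*tau/2 (hidden failures)": lam * tau / 2.0,
        "lam/mu (repair of hidden failures)": lam / mu,
        "gamma/(mu*tau) (repair of test-induced failures)": gamma / (mu * tau),
        "pi/tau (item offline during test)": pi / tau,
        "zeta1 (reconfiguration omitted)": zeta1,
        "zeta2*(lam*tau+gamma) (failure not revealed)": zeta2 * (lam * tau + gamma),
    }


# Illustrative values only; zeta1 and zeta2 in particular are hypothetical.
PARAMS = dict(tau=1095.0, lam=1.0e-5, mu=1.0e-2, gamma=1.0e-2, pi=5.0,
              zeta1=1.0e-3, zeta2=1.0e-2)

terms = pfd_terms(**PARAMS)
total = sum(terms.values())
simplified = PARAMS["lam"] * PARAMS["tau"] / 2.0   # PFDavg = lam*tau/2 only

for label, value in terms.items():
    print(f"{label:50s} {value:.2e}  ({100 * value / total:4.1f} %)")
print(f"{'total (Formula 36.17)':50s} {total:.2e}")
print(f"{'simplified lam*tau/2':50s} {simplified:.2e}")
```

With these values the simplified formula accounts for less than half of the total, which is the point made in the paragraph above.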

SIS Made of Items in Series

Figure 36.24 represents the simplest SIS architecture: a channel made of one sensor, one logic solver and one final element whose dangerous undetected (DU) failures are tested at the same time. The components being in series and tested at the same time, this system is equivalent to a single component C whose DU failure rate, λ, is equal to the sum of the individual DU failure rates of the three components. Therefore, this SIS can be considered as a single component and the formulae established above are still valid and can be used.

Fig. 36.24 Simplest SIS made of one sensor, one logic solver and one final element organized in series

Unfortunately, if the three parts of this SIS are not tested at the same time, the formulae established above cannot be used directly. This is the opportunity to recall that, over an interval [0, T], the average value, X̄(T), of a random variable X(t) = F[A(t), B(t), C(t)] which is a function of other random variables cannot be calculated directly from the average values Ā(T), B̄(T), C̄(T) of these random variables:

$$\overline{X}(T) \neq F\left[\overline{A}(T), \overline{B}(T), \overline{C}(T)\right] \tag{36.18}$$

The following correct formula must be used instead:

$$\overline{X}(T) = \frac{1}{T}\int_0^{T} X(t)\,dt \tag{36.19}$$

Concerning the example and as illustrated in Fig. 36.25, when several items are failed at the same time, the corresponding unavailability durations are counted several times, and this is why the sum of the PFDavg values of S, LS and V overestimates the PFDavg of C. Noting $\mathrm{PFD}^{C}_{avg}$ the PFDavg of C, this leads to the following inequality:

$$\mathrm{PFD}^{C}_{avg} = \overline{U}_{C}(0,T) \le \overline{U}_{S}(0,T) + \overline{U}_{LS}(0,T) + \overline{U}_{V}(0,T) \tag{36.20}$$

However, this sum provides a conservative estimation of the PFDavg of C, which can be used cautiously when the average unavailabilities of S, LS and V over [0, T] are small. Therefore, even for the extremely simple SIS of Fig. 36.24, the PFDavg values of individual items cannot easily be combined to estimate the PFDavg of a system comprising them. The difficulty is going to increase when considering redundant items hereafter.

Fig. 36.25 Example of unavailability durations when the components are not tested at the same time
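The inequality 36.20 can be checked numerically by applying Formula 36.19 to a series channel whose items are tested at different times. The sketch below uses a simple midpoint-rule integration; the failure rates and test intervals are hypothetical, instantaneous tests and repairs are assumed, and the items are assumed independent.

```python
import math


def unavailability(lam, tau, t):
    """Saw-tooth unavailability of one item tested every tau hours."""
    return 1.0 - math.exp(-lam * (t % tau))


def time_average(func, horizon, steps=100_000):
    """Midpoint-rule average of func over [0, horizon] (Formula 36.19)."""
    h = horizon / steps
    return sum(func((i + 0.5) * h) for i in range(steps)) / steps


# Hypothetical items: (DU failure rate per hour, proof test interval in hours).
ITEMS = {"S": (2.0e-6, 4380.0), "LS": (5.0e-7, 8760.0), "V": (5.0e-6, 2190.0)}
HORIZON = 8760.0   # common multiple of the three test intervals


def channel_unavailability(t):
    """Series channel: unavailable as soon as one item is unavailable."""
    up = 1.0
    for lam, tau in ITEMS.values():
        up *= 1.0 - unavailability(lam, tau, t)
    return 1.0 - up


pfd_items = {
    name: time_average(lambda t, l=lam, T=tau: unavailability(l, T, t), HORIZON)
    for name, (lam, tau) in ITEMS.items()
}
pfd_channel = time_average(channel_unavailability, HORIZON)
pfd_sum = sum(pfd_items.values())

print(f"PFDavg of channel C    = {pfd_channel:.4e}")
print(f"sum of item PFDavg     = {pfd_sum:.4e}")
print(f"overestimation by sum  = {100 * (pfd_sum / pfd_channel - 1):.2f} %")
```

The sum of the individual values is slightly larger than the integrated value, and the gap stays small as long as the individual unavailabilities are small.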

36.3.2.3 Dangerous Failure Analysis for Redundant Items

System Made of Two Redundant Components

Let us consider a system made of two redundant components A and B whose undetected dangerous failures are proof tested at the same time, under the assumption of instantaneous tests and repairs. Then, the instantaneous unavailabilities of each of them are saw-tooth curves similar to that illustrated in Fig. 36.15. This system is unavailable when both A and B are unavailable at the same time, and then U(t) = FA(δ) × FB(δ) ≈ λa.δ × λb.δ with δ = t mod τ. The components being tested at the same time, the PFDavg of the system can be calculated over a single test interval:

$$\mathrm{PFD}^{AB}_{avg} = \overline{U}_{AB} = \frac{1}{\tau}\int_0^{\tau} \lambda_a\,\lambda_b\,\delta^{2}\,d\delta = \frac{1}{\tau}\cdot\frac{\lambda_a\,\lambda_b\,\tau^{3}}{3} = \frac{\lambda_a\,\lambda_b\,\tau^{2}}{3} \tag{36.21}$$

Let ϑh be the average duration of a hidden failure given that the redundant system is as good as new just after a test and failed at the next test. In this case:

• F(τ) ≈ λa.λb.τ² is the probability of revealing a double failure when performing the test;
• λa.λb.τ² × ϑh is the expected unavailability duration of the system over a test interval;
• PFDavg = λa.λb.τ² × ϑh/τ = λa.λb.τ.ϑh is another way to calculate the PFDavg.

By comparison with Formula 36.21, ϑh is calculated as:

$$\vartheta_h = \frac{\tau}{3} \tag{36.22}$$

Fig. 36.26 Hidden failure of a system made of two redundant components (synchronous tests)

Therefore, when a dangerous hidden failure of a system made of two redundant components is revealed by a proof test, it has been present, on average, for one-third of the test interval (e.g. 4 months for a yearly test or 1.7 years for a test performed every five years), as illustrated in Fig. 36.26.

The first remark is that the $\mathrm{PFD}^{AB}_{avg}$ of the system, λa.λb.τ²/3, is not the product of the PFDavg values of components A and B, which is (λa.τ/2) · (λb.τ/2) = λa.λb.τ²/4:

$$\mathrm{PFD}^{AB}_{avg} = \frac{4}{3}\,\mathrm{PFD}^{A}_{avg}\cdot\mathrm{PFD}^{B}_{avg} \tag{36.23}$$

Therefore, the PFDavg of the system is 33.3% higher than expected by multiplying the individual PFDavg values of A and B. This is the result of a systemic dependency introduced by the synchronicity of the tests: A and B are good (just after a test) or bad (just before a test) at the same time. The above formula can be written as follows:

$$\mathrm{PFD}^{AB}_{avg} = \frac{4}{3}\cdot\frac{\lambda_a\,\tau}{2}\cdot\frac{\lambda_b\,\tau}{2} = \frac{4}{3}\,\mathrm{PFD}^{A}_{avg}\cdot\mathrm{PFD}^{B}_{avg} \tag{36.24}$$

Or, when considering the risk reduction factors as the inverse of the PFDavg values:

$$\mathrm{RRF}_{AB} = \frac{3}{4}\,\mathrm{RRF}_{A}\cdot\mathrm{RRF}_{B} \tag{36.25}$$

This can be interpreted as follows:

• the introduction of the first item provides, as expected, a risk reduction equal to $\mathrm{RRF}_{A} = 1/\mathrm{PFD}^{A}_{avg}$;
• the introduction of the redundant item provides a risk reduction equal to 75% of the risk reduction $\mathrm{RRF}_{B} = 1/\mathrm{PFD}^{B}_{avg}$ provided by B when it is used alone.

Therefore, when redundancy is introduced, the risk reduction is less than what can be expected just by considering the PFDavg values required to comply with the functional safety standards.
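The factor 4/3 of Formula 36.23 can be checked by integrating the product of the two saw-tooth curves instead of multiplying their averages. The sketch below is a minimal numerical illustration with hypothetical failure rates and a yearly synchronous test; it uses the linear approximation λ.δ of the text.

```python
def interval_average(func, tau, steps=100_000):
    """Midpoint-rule average of func(delta) for delta in [0, tau]."""
    h = tau / steps
    return sum(func((i + 0.5) * h) for i in range(steps)) / steps


# Hypothetical DU failure rates (per hour) and a yearly synchronous proof test.
LAM_A, LAM_B, TAU = 2.0e-6, 5.0e-6, 8760.0

pfd_a = interval_average(lambda d: LAM_A * d, TAU)                 # lam_a*tau/2
pfd_b = interval_average(lambda d: LAM_B * d, TAU)                 # lam_b*tau/2
pfd_ab = interval_average(lambda d: (LAM_A * d) * (LAM_B * d), TAU)  # Formula 36.21

ratio = pfd_ab / (pfd_a * pfd_b)               # expected to be close to 4/3
rrf_ratio = (1.0 / pfd_ab) / ((1.0 / pfd_a) * (1.0 / pfd_b))   # close to 3/4

print(f"PFDavg(AB) = {pfd_ab:.3e}, product of PFDavg = {pfd_a * pfd_b:.3e}")
print(f"ratio = {ratio:.4f}, RRF ratio = {rrf_ratio:.4f}")
```

The average of a product differs from the product of the averages, which is a direct instance of the warning expressed by Formulae 36.18 and 36.19.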

Fig. 36.27 Test staggering of the two redundant components of a redundant system

Staggering the Tests of a System Made of Two Redundant Components

The staggering of the tests has already been mentioned above (see Fig. 36.7) as a way to improve the PFDavg. This can be demonstrated by considering the system made of two redundant components analysed above. Staggering the tests of components A and B leads to the test pattern presented in Fig. 36.27. Four cases have to be considered:

1. interval 1: A and B fail during θ; probability ≈ λa.λb · θ²;
2. interval 1: A fails during θ and B has failed during τ − θ of the previous test interval; probability ≈ λa.λb · θ · (τ − θ);
3. interval 2: A and B fail during τ − θ; probability ≈ λa.λb · (τ − θ)²;
4. interval 2: A has previously failed during θ and B fails during τ − θ; probability ≈ λa.λb · θ · (τ − θ).

In cases 1 and 3, both components are in the up state at the beginning of the interval and in the down state at the end of the interval. Then, thanks to the memoryless property of the exponential distribution, Formula 36.21 can be used: the duration of unavailability is, on average, equal to one third of the interval. In cases 2 and 4, only one component is in the up state at the beginning of the interval and it is in the down state at the end of this interval. Then, thanks to the memoryless property of the exponential distribution, Formula 36.11 can be used: the duration of unavailability is, on average, equal to half of the interval. This leads to:

• unavailability in interval 1:

$$\overline{U}^{(1)}_{AB} \approx \frac{\lambda_a\,\lambda_b}{\tau}\left[\frac{\theta^{3}}{3} + (\tau-\theta)\,\frac{\theta^{2}}{2}\right]$$

• unavailability in interval 2:

$$\overline{U}^{(2)}_{AB} \approx \frac{\lambda_a\,\lambda_b}{\tau}\left[\frac{(\tau-\theta)^{3}}{3} + \theta\,\frac{(\tau-\theta)^{2}}{2}\right]$$

Developing the above formulae leads to:

$$\overline{U}_{AB} = \overline{U}^{(1)}_{AB} + \overline{U}^{(2)}_{AB} \approx \frac{\lambda_a\,\lambda_b}{6}\left(2\,\tau^{2} - 3\,\tau\,\theta + 3\,\theta^{2}\right) \tag{36.26}$$

Finally, the average unavailability of a system made of two redundant components is:

• maximum for θ = 0 or θ = τ, which gives $\overline{U}^{max}_{AB} \approx \frac{1}{3}\,\lambda_a\,\lambda_b\,\tau^{2}$;
• minimum for θ = τ/2, which gives $\overline{U}^{min}_{AB} \approx \frac{5}{24}\,\lambda_a\,\lambda_b\,\tau^{2}$.

Then, with regard to the hidden dangerous failures, staggering the tests of the redundant components A and B by half of the test interval is the most effective strategy. This reduces the PFDavg linked to hidden failures to about (5/24) × 3 = 5/8 = 62.5% of the value obtained with synchronous tests. This is far from being negligible. This kind of optimization is analysed in (Rouvroye and Wiegerinick 2006). As a side effect, and provided that a relevant analysis of the failure cause is undertaken, staggering the tests doubles the frequency of detection of the common cause failures: once with the test of A and once with the test of B. Therefore, staggering the tests has the double benefit of decreasing the part of the PFDavg due to hidden dangerous failures and the part due to common cause failures.
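The benefit of staggering can be verified by integrating the product of the two shifted saw-tooth curves over one period and comparing the result with the closed form (λa.λb/6)(2τ² − 3τ.θ + 3θ²) derived above. The sketch below uses hypothetical failure rates, the linear approximation λ.δ, and instantaneous tests and repairs; the function names are illustrative.

```python
def u_ab_formula(theta, tau, lam_a, lam_b):
    """Closed form of Formula 36.26 for a stagger theta between the tests."""
    return lam_a * lam_b * (2 * tau**2 - 3 * tau * theta + 3 * theta**2) / 6.0


def u_ab_numeric(theta, tau, lam_a, lam_b, steps=100_000):
    """Average of lam_a*age_A(t) * lam_b*age_B(t) over one period.

    A is tested at 0, tau, 2*tau, ...; B is tested theta hours after A.
    """
    h = tau / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        age_a = t % tau              # time since the last test of A
        age_b = (t - theta) % tau    # time since the last test of B
        total += lam_a * age_a * lam_b * age_b
    return total / steps


LAM_A, LAM_B, TAU = 2.0e-6, 5.0e-6, 8760.0   # hypothetical values

for theta in (0.0, TAU / 4, TAU / 2, 3 * TAU / 4):
    f = u_ab_formula(theta, TAU, LAM_A, LAM_B)
    n = u_ab_numeric(theta, TAU, LAM_A, LAM_B)
    print(f"theta = {theta:6.0f} h  formula = {f:.4e}  numeric = {n:.4e}")

gain = u_ab_formula(TAU / 2, TAU, LAM_A, LAM_B) / u_ab_formula(0.0, TAU, LAM_A, LAM_B)
print(f"staggered / synchronous = {gain:.3f}")
```

The ratio between the half-interval stagger and the synchronous case is 5/8, and the curve is symmetric around θ = τ/2.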

System Made of Three Redundant Components

Let us consider the system illustrated in Fig. 36.28, which is made of three redundant components tested at the same time (instantaneous tests and repairs). Calculations similar to the above lead to:

$$\mathrm{PFD}^{ABC}_{avg} = \frac{1}{\tau}\int_0^{\tau} \lambda_a\,\lambda_b\,\lambda_c\,\delta^{3}\,d\delta = \frac{\lambda_a\,\lambda_b\,\lambda_c\,\tau^{3}}{4} \tag{36.27}$$

Fig. 36.28 Hidden failure of a system made of three redundant components (synchronous tests)

This is twice the product of the three PFDavg values of A, B and C:

$$\mathrm{PFD}^{ABC}_{avg} = 2\cdot\left(\mathrm{PFD}^{A}_{avg}\cdot\mathrm{PFD}^{B}_{avg}\cdot\mathrm{PFD}^{C}_{avg}\right) \tag{36.28}$$

And the risk reduction is then divided by two with regard to the product of the individual risk reductions related to A, B and C used individually:

$$\mathrm{RRF}_{ABC} = \frac{1}{2}\left(\mathrm{RRF}_{A}\cdot\mathrm{RRF}_{B}\cdot\mathrm{RRF}_{C}\right) \tag{36.29}$$

The above formula can be written as:

$$\mathrm{RRF}_{ABC} = \mathrm{RRF}_{A}\cdot\frac{3}{4}\,\mathrm{RRF}_{B}\cdot\frac{8}{12}\,\mathrm{RRF}_{C} \tag{36.30}$$

The risk reduction brought by the third redundant item is now only 66% of the risk reduction expected from the PFDavg of the component used alone ($1/\mathrm{PFD}^{C}_{avg}$). The risk reduction provided by extra redundant components thus decreases more and more when the redundancy (i.e. the level of fault tolerance) increases. This is due to systemic dependencies and, thus, relying on the claimed SIL only (e.g. from SIL certificates) is obviously not sufficient to properly evaluate the risk reduction provided by redundant safety systems. This has to be completed by performing relevant safety analyses and calculations.
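The factors 3/4 and 8/12 follow a simple pattern that can be tabulated. The sketch below extends the same integral to k synchronously tested redundant items, which gives a PFDavg of (product of the λ) · τ^k/(k + 1) against (product of the λ) · (τ/2)^k for the naive product; this generalisation to k > 3 is an extrapolation made here under the same assumptions (instantaneous tests and repairs, λ.τ ≪ 1) and is not stated in the text.

```python
from fractions import Fraction


def pfd_ratio(k: int) -> Fraction:
    """PFDavg of k synchronously tested redundant items divided by the product
    of their individual PFDavg values: [tau**k/(k+1)] / [(tau/2)**k]."""
    return Fraction(2**k, k + 1)


def marginal_rrf_factor(k: int) -> Fraction:
    """Share of its stand-alone RRF actually brought by the k-th redundant item."""
    return pfd_ratio(k - 1) / pfd_ratio(k)


for k in range(1, 6):
    print(f"k = {k}: PFDavg ratio = {pfd_ratio(k)}, "
          f"RRF ratio = {1 / pfd_ratio(k)}, "
          f"marginal factor of item {k} = {marginal_rrf_factor(k)}")
```

For k = 2 and k = 3 this reproduces the ratios 4/3 and 2 of Formulae 36.23 and 36.28 and the marginal factors 3/4 and 8/12 of Formula 36.30.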

Staggering the Tests of a System Made of Several Redundant Components

The benefit of test staggering has been brought to light for the system made of two redundant components. It is exactly the same when the redundancy is greater than two. As shown above for two redundant components, the calculations are tedious, but it can be demonstrated that the minimum PFDavg related to hidden failures is obtained by staggering the tests of the components by θ = τ/k for a system made of k redundant components. Then, for a system made of three redundant components A, B and C, the minimum PFDavg related to hidden failures is obtained by staggering the tests of B by τ/3 with regard to A and staggering the tests of C by τ/3 with regard to B (i.e. by 2 × τ/3 with regard to A). This strategy homogenizes the distribution of the good (after tests) and bad (before tests) states of the components of the redundant system.

Other Cases of Unavailability of Redundant Systems

In Fig. 36.29, the two components of a redundant system are revealed to be failed after a test. The probability of this situation is (λa.τ + γa)(λb.τ + γb). These components are then repaired, and the repair duration ρ depends on the number of available repair teams:

• two repair teams: ρ = 1/(μa + μb);
• single repair team: ρ = 1/μ where μ = max(μa, μb), because it is expected that the component with the shorter MORT (see Chap. 4) is repaired first.

The contribution of this case to $\mathrm{PFD}^{AB}_{avg}$ is equal to (λa.τ + γa)(λb.τ + γb) · ρ/τ.

Fig. 36.29 Unavailability due to repairs of a system made of two redundant components (synchronous tests)

In Fig. 36.30, the two components of a redundant system are unavailable (i.e. disconnected) during the performance of the proof tests. The system is unavailable during a test but, when the test has been performed, the system is exactly in the same state as just before the beginning of the test operation, except that the failures are now revealed. This situation occurs each time a test is performed, its duration is π, and the contribution of this case to $\mathrm{PFD}^{AB}_{avg}$ is equal to π/τ.

When the components are unavailable during the tests, it is certainly wise to test them sequentially, one after the other. This is illustrated in Fig. 36.31. This case is more complicated than that of the synchronous tests, as several situations have to be considered:

• B is failed before the test of A: probability λb.τ, duration πa;
• B fails during the test of A: probability λb.πa, duration πa/2;
• A is failed (including the failure due to γa) before the test of B: probability (λa.τ + γa), duration πb;
• A fails during the test of B: probability λa.πb, duration πb/2.

Gathering the above unavailabilities gives a contribution to $\mathrm{PFD}^{AB}_{avg}$ due to the test performance equal to:

$$\lambda_b\,\pi_a + \lambda_a\,\pi_b + \frac{\lambda_b\,\pi_a^{2}}{2\,\tau} + \frac{\lambda_a\,\pi_b^{2}}{2\,\tau} + \frac{\gamma_a\,\pi_b}{\tau}$$

Fig. 36.30 Unavailability due to synchronous tests of a system made of two redundant components

Fig. 36.31 Sequential proof tests of a system made of two redundant components

Figure 36.32 illustrates the unavailability due to a reconfiguration mistake when a test has been completed. As A and B are tested at the same time, such a mistake is likely to be a common cause failure. The probability of this situation is then η1 and the duration is equal to τ. So, the contribution to $\mathrm{PFD}^{AB}_{avg}$ is simply η1.

Fig. 36.32 Unavailability of a system made of two redundant components due to the omission of reconfiguration after synchronous test (human failure)

Figure 36.33 illustrates the unavailability due to a hidden failure of B when A has been in hidden failure since the beginning of the test interval (reconfiguration mistake, or non-detection of the failure). This can occur when sequential tests are performed; the reconfiguration error is then not necessarily a common cause failure. The probability of such a situation is [η2.(λa.τ + γa) + η1] · λb.τ and its duration is τ/2. Symmetrically, the unavailability due to a hidden failure of A when B has been in hidden failure since the beginning of the test interval has a probability of [η2.(λb.τ + γb) + η1] · λa.τ and a duration of τ/2.

Fig. 36.33 Unavailability due to the non-detected down state of A after a test (human failure)

Gathering the two situations gives a contribution to $\mathrm{PFD}^{AB}_{avg}$ equal to:

$$\eta_1\,(\lambda_a + \lambda_b)\,\frac{\tau}{2} + \eta_2\left[\lambda_a\,\lambda_b\,\tau^{2} + \frac{(\gamma_a\,\lambda_b + \gamma_b\,\lambda_a)\,\tau}{2}\right]$$
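The bookkeeping used throughout this subsection (each situation contributes probability × duration / τ to the PFDavg) can be written as a small helper, which makes it easy to check the gathered formulae. The sketch below applies it to the sequential-test situations and to the two human-failure situations listed above; all parameter values are hypothetical and the helper name is illustrative.

```python
import math


def contribution(cases, tau):
    """Sum of probability x duration over the cases, divided by the test interval."""
    return sum(p * d for p, d in cases) / tau


# Hypothetical parameters (rates per hour, durations in hours).
LAM_A, LAM_B = 2.0e-6, 5.0e-6
GAMMA_A, GAMMA_B = 1.0e-3, 2.0e-3
PI_A, PI_B = 4.0, 6.0
ETA1, ETA2 = 1.0e-3, 1.0e-2
TAU = 8760.0

# Sequential proof tests: (probability, duration) of each situation.
test_cases = [
    (LAM_B * TAU, PI_A),               # B already failed before the test of A
    (LAM_B * PI_A, PI_A / 2),          # B fails during the test of A
    (LAM_A * TAU + GAMMA_A, PI_B),     # A failed before the test of B
    (LAM_A * PI_B, PI_B / 2),          # A fails during the test of B
]
test_closed_form = (LAM_B * PI_A + LAM_A * PI_B
                    + LAM_B * PI_A**2 / (2 * TAU)
                    + LAM_A * PI_B**2 / (2 * TAU)
                    + GAMMA_A * PI_B / TAU)

# Human failures: one component in hidden failure since the last test.
human_cases = [
    ((ETA2 * (LAM_A * TAU + GAMMA_A) + ETA1) * LAM_B * TAU, TAU / 2),
    ((ETA2 * (LAM_B * TAU + GAMMA_B) + ETA1) * LAM_A * TAU, TAU / 2),
]
human_closed_form = (ETA1 * (LAM_A + LAM_B) * TAU / 2
                     + ETA2 * (LAM_A * LAM_B * TAU**2
                               + (GAMMA_A * LAM_B + GAMMA_B * LAM_A) * TAU / 2))

print(f"tests : {contribution(test_cases, TAU):.4e} vs {test_closed_form:.4e}")
print(f"human : {contribution(human_cases, TAU):.4e} vs {human_closed_form:.4e}")
```

Adding a further situation then only requires appending one (probability, duration) pair, which is less error-prone than expanding the closed form by hand.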

And so on! Several other situations could be identified: e.g. failure of A during the repair of B and vice versa, A failed but not revealed by test (human failure) during repair of B and vice versa. The analytical development is more and more tedious and would be even more difficult if staggered tests or/and a mix of DD and DU failures were considered.

36.3.2.4 Multiple Safety Systems

Figure 36.34 gives an example of a multiple safety instrumented system made of two SIS operating in sequence: when a demand occurs, SIS1 is requested to trigger the safety action first; if SIS1 fails to perform as required, the demand is transmitted to SIS2, and if SIS2 fails to perform as required, a hazardous situation occurs.

Fig. 36.34 Example of a multiple safety system made of two SIS operating in sequence

In order to simplify the calculations, the demand occurrence, the architecture of the SIS and the test procedure are chosen to be as simple as possible:

• demand frequency: w;
• SIS1: two redundant similar channels A and B with the same dangerous undetected failure rate λ1;
• SIS2: two redundant similar channels C and D with the same dangerous undetected failure rate λ2;
• proof tests: performed synchronously with a test interval τ.

In addition, only the dangerous undetected failures are considered in the following development. With the above assumptions, a hazardous event occurs at an instant δ of the test interval if a demand occurs (frequency w) and:

• SIS1 fails to trigger the safety action: probability (λ1.δ)²;
• SIS2 fails to trigger the safety action: probability (λ2.δ)².

Therefore, the hazardous event frequency at δ is equal to w.λ1².λ2².δ⁴ and the average hazardous event frequency, HEFavg, can be calculated as:

$$\mathrm{HEF}_{avg} = \frac{w}{\tau}\int_0^{\tau} \lambda_1^{2}\,\lambda_2^{2}\,\delta^{4}\,d\delta = w\,\lambda_1^{2}\,\lambda_2^{2}\,\frac{\tau^{4}}{5} \tag{36.31}$$

This can be written:

$$\mathrm{HEF}_{avg} = \frac{9}{5}\cdot w\cdot\frac{\lambda_1^{2}\,\tau^{2}}{3}\cdot\frac{\lambda_2^{2}\,\tau^{2}}{3} = \frac{9}{5}\cdot w\cdot\left[\mathrm{PFD}^{AB}_{avg}\cdot\mathrm{PFD}^{CD}_{avg}\right] \tag{36.32}$$

Therefore, HEFavg is 9/5 = 1.8 times higher than the result given just by multiplying the PFDavg values (in brackets in Formula 36.32). The actual risk reduction is then only 5/9 ≈ 55% of what could be expected from a simplistic calculation. Formula 36.32 can be rewritten as follows:

$$\mathrm{HEF}_{avg} = w\cdot\mathrm{PFD}^{AB}_{avg}\times\frac{9}{5}\,\mathrm{PFD}^{CD}_{avg} \tag{36.33}$$

792

36 Functional Safety Related Modelling and Calculations

According to this formula:
• SIS1 reduces the HEF from w to $w \cdot \mathrm{PFD}_{avg}^{AB}$ and it provides a risk reduction equal to $1/\mathrm{PFD}_{avg}^{AB}$;
• SIS2 reduces the HEF from $w \cdot \mathrm{PFD}_{avg}^{AB}$ to $w \cdot \mathrm{PFD}_{avg}^{AB} \times \frac{9}{5} \cdot \mathrm{PFD}_{avg}^{CD}$ and it provides a risk reduction equal to $\frac{5}{9} \times 1/\mathrm{PFD}_{avg}^{CD} \approx 0.55/\mathrm{PFD}_{avg}^{CD}$.

Let us consider that, according to Fig. 36.3, the risk reduction factor required from SIS2 (e.g. from ALARP analysis) is equal to $\mathrm{RRF}_{req}^{CD}$. That means that SIS2 has to be designed so that $\mathrm{RRF}_{req}^{CD} = 0.55/\mathrm{PFD}_{avg}^{CD}$, i.e. such that $\mathrm{PFD}_{avg}^{CD} = 0.55/\mathrm{RRF}_{req}^{CD}$. That is to say, with a PFDavg of about half of what would be required if SIS2 were implemented alone.

For example, if the risk reduction required for SIS2 (e.g. established from ALARP analysis) is $\mathrm{RRF}_{req}^{CD} = 7000$ (i.e. SIL 3 requirements), the above result leads to require $\mathrm{PFD}_{avg}^{CD} = 7.86 \times 10^{-5}$ (i.e. SIL 4 requirements). Therefore, a risk reduction falling into the SIL 3 requirements implies a PFDavg falling into the SIL 4 requirements. Of course, the SIL 4 requirements should be retained to design SIS2. Although this problem seems to be ignored in IEC 61508, it can be handled by performing the calculations proposed in (IEC 61511-3 2016) Annex J (informative) and (ISO/TR 12489 2013).
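The factor 9/5 of Formula 36.32 can be checked numerically. The Python sketch below is only an illustration under the assumptions of this section (synchronous proof tests, dangerous undetected failures only, λ·τ small). The function names and all numerical values (w, lam1, lam2, tau) are hypothetical and are not taken from the figures of this book.

```python
# Sketch: two 1oo2 SIS operating in sequence, synchronous proof tests,
# dangerous undetected failures only. All numerical values are hypothetical.

def hef_avg_numeric(w, lam1, lam2, tau, steps=100_000):
    """Midpoint-rule average of w*(lam1*d)^2*(lam2*d)^2 over one test interval."""
    h = tau / steps
    total = 0.0
    for k in range(steps):
        d = (k + 0.5) * h  # instant within the test interval
        total += w * (lam1 * d) ** 2 * (lam2 * d) ** 2
    return total * h / tau


def pfd_avg_1oo2(lam, tau):
    """Simplified PFDavg of two similar channels tested together: (lam*tau)^2 / 3."""
    return (lam * tau) ** 2 / 3.0


w = 0.1 / 8760.0   # hypothetical demand frequency (per hour)
lam1 = 2.0e-6      # hypothetical DU failure rate of channels A and B (per hour)
lam2 = 5.0e-6      # hypothetical DU failure rate of channels C and D (per hour)
tau = 8760.0       # hypothetical proof test interval (hours)

hef_integral = hef_avg_numeric(w, lam1, lam2, tau)
hef_naive = w * pfd_avg_1oo2(lam1, tau) * pfd_avg_1oo2(lam2, tau)
ratio = hef_integral / hef_naive  # expected to be close to 9/5

# Required PFDavg of SIS2 for a required risk reduction factor of 7000,
# using the rounded factor 0.55 (about 5/9) of the text.
rrf_required = 7000.0
pfd_cd_required = 0.55 / rrf_required

print(f"HEF ratio (integral / product of PFDavg): {ratio:.4f}")
print(f"Required PFDavg for SIS2: {pfd_cd_required:.3e}")
```

The ratio does not depend on the chosen values of w, lam1, lam2 and tau, which is why the penalty of 1.8 applies whatever the parameters, as long as the simplified assumptions hold.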

36.3.2.5 Conclusion About the Simplified Analytical Formulae

The above development seems sufficient to demonstrate that, except in the very simple case of an individual component, establishing simplified analytical formulae is not easily tractable. For a redundant system, the number of unavailability cases depends on the testing philosophy and increases with the level of redundancy. Establishing the formulae related to all of them is tedious and time-consuming and, in the end, this results in a catalogue of ready-made PFDavg or PFH formulae which are based on simplified assumptions and are difficult to understand. Improving them is often beyond the normal mathematical knowledge of engineers, who have no choice but to use them as they are. Combining these PFDavg or PFH formulae through addition (series systems) appears to be conservative, while combining them by multiplication (parallel systems), when the average of the product should be calculated through an integral, appears to be very non-conservative with regard to the risk reduction actually provided by the installation of a new SIS. This becomes more and more non-conservative when the internal redundancy (individual SIS) or the number of SIS used in sequence (multiple SIS) increases, that is, when the probability of the hazardous event decreases. This is a very unusual situation, as probabilistic approximations are generally more and more accurate when the probability decreases.


Beyond the systemic dependencies introduced by the performance of synchronous periodic proof tests, common cause failures can also contribute to the non-conservativeness in the following situations:
• common cause failures between the components of an individual SIS;
• common components (e.g. shared sensors or final elements);
• common cause failures between components belonging to different individual safety systems of a multiple safety system (e.g. between components of SIS1 and SIS2 in the example above).

Desynchronizing the tests (i.e. staggering the tests) between redundant components or individual SIS belonging to multiple safety systems decreases the detrimental impacts of both the systemic dependencies and the CCFs. However, when complex patterns of proof tests are implemented, the result of the competition between detrimental and beneficial effects is difficult to anticipate just by using the simplified analytical approach. This is particularly the case for the safety of industrial systems implementing several protection layers (see Fig. 6.1 in Chap. 6) acting in sequence and implying the use of multiple safety systems. The ultimate protection layer just before an accident occurs is often provided by a SIS (e.g. HIPS) which, according to Sect. 36.3.2.4, offers the lowest risk reduction compared to the PFDavg required by functional safety standards. Therefore, when the simplified analytical approach is implemented to fulfil the functional safety requirements, the actual risk reduction should be checked to verify that the targeted tolerable hazardous event frequency is actually achieved.

Except in very simple cases, this cannot be done by using ready-made formulae and the simplified analytical approach should be superseded by the systemic approaches (Boolean and dynamic) described in this book: they make it possible to perform relevant and accurate probabilistic calculations without approximation, taking into account the proof test patterns, the common cause failures and, when needed, non-exponential failure distributions.

In this part devoted to the simplified approach, only the dangerous undetected failures and the safety systems operating in low demand mode have been analysed. Similar formulae could be developed for dangerous detected failures, safe failures, or high demand or continuous modes of operation (PFH calculations). The conclusion would be the same: it is better to use the systemic approaches which, based on sound mathematics, allow sound probabilistic calculations to be performed.

36.3.3 Markovian Approach

The Markovian approach is described in Chap. 31 of this book. The basis for dealing with periodically tested items has been described in Sect. 31.5.4.4, where the modelling of the main parameters is illustrated in Figs. 31.41–31.45. Due to the exponential explosion of the number of states when the number of components increases, this approach can only be used to model SIS with few components.


In fact, it is mainly used in combination with fault trees to provide the unavailabilities related to the primary events or in combination with reliability block diagrams to provide the availabilities related to the blocks (see Chap. 27).

36.3.3.1 Modelling the Test Duration

Figure 36.35 illustrates the multiphase Markov model of a periodically tested item which is unavailable during a test of duration π. This example complements those proposed in Chap. 31 and leads to two recurring phases, 1 and 2, which are linked together. In this figure, the down states are highlighted in grey. The Markov graph in phase 1 (the test interval properly speaking) is similar to the one described in Fig. 31.40, i.e. 3 states: available (A), hidden fault (DU) and repair (R). During phase 2 (test duration) a fourth state (Tst) has to be added, in which the item is actually tested but not failed. The linking matrices are indicated at the bottom of Fig. 36.35: when a test starts, the item can fail due to the test itself (γ) and the dangerous undetected failures are revealed. When the test is completed, the repair of the revealed failures starts.

Fig. 36.35 Modelling of periodically tested item with a test duration π

If the item is disconnected to perform the test, state Tst is a state of unavailability (this is why it has to be distinguished from state A) and, as presented in Fig. 36.36, the unavailability U(t) jumps to 1 when the test begins and remains at 1 until the test is completed. What happens during the test is magnified on the right-hand side of this figure. If the item is not available for the safety action during the test, it can also fail (failure rate λ′) during the test duration but, with the above example, this is negligible compared to the probability of failure due to the test itself (γ).

Fig. 36.36 Unavailability of a periodically tested item inhibited during the test duration π

Applying the mathematical development proposed in Sect. 31.2 to the above multiphase Markov model makes it possible to calculate the failure frequency (unconditional failure intensity or Vesely failure rate), w(t), and the equivalent failure rate (conditional failure intensity), Λ(t), related to the modelled component. This is illustrated in Fig. 36.37, which shows that the shapes of w(t) and Λ(t) are different because w(t) = A(t) · Λ(t):
• Λ(t) remains constant (equal to λ) except during the tests;
• w(t) decreases during a test interval because the component availability A(t) decreases;
• both Λ(t) and w(t) are equal to zero during the tests because the component is unavailable and cannot become more unavailable than it already is.

Fig. 36.37 Failure frequency and equivalent failure rate of a periodically tested item inhibited during the test duration π

The PFH (i.e. the average failure frequency) is lower than the equivalent failure rate. In the example, the PFH is equal to $1.77 \times 10^{-4}\ \mathrm{h}^{-1}$ when $\lambda = 2.0 \times 10^{-4}\ \mathrm{h}^{-1}$ (and the average failure rate is equal to $1.99 \times 10^{-4}\ \mathrm{h}^{-1}$ due to the impact of the test durations). The failure rate can then be used as a conservative estimate of the PFH.
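The relation w(t) = A(t) · Λ(t) and the fact that the PFH stays below the failure rate can be illustrated with a much cruder model than the multiphase Markov graph. The Python sketch below assumes an instantaneous test and repair at each proof test (no test duration π, no failure γ due to the test), so it is not expected to reproduce the value of 1.77 × 10⁻⁴ h⁻¹ quoted above. The test interval `tau` and the function names are hypothetical.

```python
import math


def failure_frequency(t, lam, tau):
    """w(t) = A(t) * Lambda(t), with A(t) = exp(-lam * time since the last test)
    and a constant equivalent failure rate Lambda(t) = lam."""
    time_since_test = t % tau
    availability = math.exp(-lam * time_since_test)
    return availability * lam


def average(func, horizon, steps=20_000):
    """Midpoint-rule average of func over [0, horizon]."""
    h = horizon / steps
    return sum(func((k + 0.5) * h) for k in range(steps)) * h / horizon


lam = 2.0e-4   # failure rate quoted in the text (per hour)
tau = 1000.0   # hypothetical proof test interval (hours)

# PFH = average failure frequency, here over five test intervals.
pfh = average(lambda t: failure_frequency(t, lam, tau), horizon=5 * tau)

# Closed form of the same average over one test interval.
pfh_closed_form = (1.0 - math.exp(-lam * tau)) / tau

print(f"PFH (numerical)   : {pfh:.4e} per hour")
print(f"PFH (closed form) : {pfh_closed_form:.4e} per hour")
print(f"Failure rate      : {lam:.4e} per hour")
```

Even with these simplifications the average failure frequency remains lower than λ, which is why the failure rate can serve as a conservative estimate of the PFH.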

36.3.3.2 Modelling DU, DD and Non-covered Failures

Figure 36.38 gives the multiphase Markov model of an item with dangerous detected failures (DD), dangerous undetected failures (DU) and failures covered by neither the diagnostic tests nor the proof tests (Nc). Again, the down states are highlighted in grey in this figure.

Fig. 36.38 Modelling DD, DU and not covered failures with a multiphase Markov graph

The repair of a DD failure starts as soon as it occurs; the repair of a DU failure starts after it has been discovered by a proof test; and the non-covered failures are never repaired because they are never revealed. Therefore, the multiphase Markov model is made of one single recurrent phase and one single linking matrix, which links the probabilities of the states at the end of a test interval to those at the beginning of the following one. The unavailability of the above item is illustrated on the left-hand side of Fig. 36.39: the unavailability due to DD failures converges quickly towards an asymptotic value, $\lambda_{DD}/(\lambda_{DD} + \mu_{DD})$, which gives the jump at the beginning of the curve and translates upwards the saw-tooth curve due to the DU failures. The minima of this curve increase because of the failures covered by neither the diagnostic tests nor the proof tests.

Fig. 36.39 Unavailability of periodically tested item with DU, DD and not covered failures


The equivalent failure rate and the failure frequency are presented on the right-hand side of Fig. 36.39. Again, the failure rate is constant (equal to $\lambda_{DD} + \lambda_{DU} + \lambda_{DNc}$) whereas the failure frequency decreases inside the test interval. The jump due to the DD failures can be observed at the beginning of the curve and the maxima decrease because of the non-covered failures. The PFH also decreases due to the non-covered dangerous failures and goes to zero when the observation time t goes to infinity. The difference with the failure rate therefore becomes more and more important when the observation time increases. This highlights the fact that the PFH is not really relevant when dealing with non-repaired failures.

Anyway, the models proposed in Chap. 31 as well as the models above highlight the flexibility of the multiphase Markovian approach for modelling the various parameters encountered when dealing with a single item. Such models can effectively be used as inputs of RBD or FT driven Markov processes (see hereafter and Chap. 27), which have proven very effective to accurately model and calculate the PFDavg of safety instrumented systems.

36.3.3.3 Modelling a System Made of Two Redundant Items

The multiphase Markov model of a system made of two redundant periodically tested components is illustrated in Fig. 36.40. For the sake of simplicity, the same test interval, τ, is considered for both components but the model could be used in the case of different test intervals. The tests are staggered by θ and this leads to identify two recurring phases, "phase 1" and "phase 2", which are repeated ad libitum. As in Fig. 36.35, each individual component has three states, so the Markov matrix of the redundant system comprises 3² = 9 states and it is valid both in phase 1 and in phase 2. The difference lies in the linking matrices: when A is tested, only the DU failures of A are revealed (and then repaired) and, vice versa, when B is tested, only the DU failures of B are revealed (and then repaired). Again, the down states are highlighted in grey in this figure.

Fig. 36.40 Redundant system with staggered tests

This model makes it possible to evaluate the impact of the proof test staggering on the PFDavg of the redundant system, as illustrated in Fig. 36.41. It has to be noted that when θ = 0 the two linking matrices have to be merged into a single one because the DU failures of both A and B are revealed at the same time. On the left-hand side of Fig. 36.41, the unavailability of the redundant system when θ = 0 is compared to the unavailability of the same system when θ = τ/2. As already mentioned several times, staggering the tests decreases the range of the instantaneous unavailability and this is useful when a permanent SIL is required (see Sect. 36.2.3). The test staggering also decreases the PFDavg. In the presented example, it is reduced by a factor equal to 1.74 (i.e. a reduction of about 42.4%), which is a little higher than the approximated factor of 8/5 = 1.6 (i.e. a PFDavg reduced to 62.5% of its value, a reduction of 37.5%) found above in Section "System Made of Two Redundant Components" with the simplified calculations.

Fig. 36.41 Unavailability and average unavailability (PFDavg) as a function of the test staggering

It has to be noted that the above model implicitly considers that A and B each have their own repair team. This assumption is reasonable when the tests are staggered, because the probability of having to repair A and B simultaneously is normally very low. This is not the case when A and B are revealed faulty at the same time: it is likely that A would have to wait for the end of the repair of B (or vice versa). The PFDavg would then be higher than calculated above in the case θ = 0. However, in this case, the failure of the first component (A or B) occurs, on average, at τ/2 and the failure of the second one (B or A) at 2τ/3. When the MORT (mean overall repair time) is small compared to 2τ/3, the assumption of as many repair teams as components would have only a limited impact. Nevertheless, this highlights another advantage of staggering the tests: avoiding having to repair both components simultaneously.
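The approximated factor of 8/5 can be recovered by a direct numerical averaging of the unavailability of the redundant system. The Python sketch below is a simplified stand-in for the multiphase Markov model: it assumes that repairs are instantaneous and that each component is as good as new after its own test, so it is expected to give a factor close to 1.6 rather than the 1.74 obtained with the full model. The failure rate, the test interval and the function names are hypothetical.

```python
import math


def unavailability_1oo2(t, lam, tau, theta):
    """U(t) of two redundant items; item B is tested theta after item A.
    Each item is assumed as good as new just after its own proof test."""
    age_a = t % tau
    age_b = (t - theta) % tau  # Python's % keeps the result in [0, tau)
    return (1.0 - math.exp(-lam * age_a)) * (1.0 - math.exp(-lam * age_b))


def pfd_avg(lam, tau, theta, steps=20_000):
    """Midpoint-rule average of U(t) over one test interval."""
    h = tau / steps
    total = sum(unavailability_1oo2((k + 0.5) * h, lam, tau, theta)
                for k in range(steps))
    return total * h / tau


lam = 1.0e-6   # hypothetical DU failure rate (per hour)
tau = 8760.0   # hypothetical proof test interval (hours)

pfd_sync = pfd_avg(lam, tau, theta=0.0)
pfd_staggered = pfd_avg(lam, tau, theta=tau / 2.0)
staggering_factor = pfd_sync / pfd_staggered

print(f"PFDavg, synchronous tests : {pfd_sync:.3e}")
print(f"PFDavg, staggered by tau/2: {pfd_staggered:.3e}")
print(f"Reduction factor          : {staggering_factor:.3f}")
```

The same function can be called with other values of `theta` to explore intermediate staggering patterns.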


Fig. 36.42 Failure frequency and PFH as functions of the test staggering (high availability)

Figure 36.42 illustrates the failure frequency related to the same cases as those analysed in Fig. 36.41, for which the unavailabilities are small compared to 1 (i.e. A(t) ≈ 1). The failure frequency and the equivalent failure rate are then practically superimposed and cannot be distinguished in the figure. Due to the redundancy, the equivalent failure rate, and therefore the failure frequency, starts from zero and this leads to a saw-tooth curve which increases between tests. As shown in the figure, the effect of staggering the tests from zero to τ/2 is to decrease the PFH but, in this example, the impact is only about 8%, i.e. much less than for the PFDavg (reduced by a factor of 1.74).

In order to see the impact of higher unavailabilities, the failure rates λa and λb have been multiplied by 100 and this gives the results presented in Fig. 36.43. This shows that, at the beginning of a test interval, when the availability is high, w(t) and Λ(t) follow the same evolution. When the availability decreases, Λ(t) still increases whereas w(t) reaches a maximum and then decreases; w(t) is always lower than Λ(t). As shown in the figure, the difference between the PFH and the average equivalent failure rate decreases when the tests are staggered. This figure also shows that the PFH when the tests are staggered is about 20% higher than when the tests are not staggered. Thus, when the unavailability is high (i.e. when the proof test interval is too large), the effect of test staggering is the opposite of what is expected.

Fig. 36.43 Failure frequency and PFH as functions of the test staggering (low availability)

This is a side effect of the fact that, when the test interval increases, the probability of being already failed at a given time t increases and, when the system is failed at t, it cannot fail any more, so the failure frequency decreases. In the end, if the proof test interval became infinite, w(t) and then the PFH would converge to zero. As increasing the proof test interval to decrease the PFH is obviously a perverse solution which should be prohibited, this means that the PFH indicator is valid and should be used only when U(t) ≪ 1. It therefore has to be considered cautiously when the failure rates and/or the proof test interval increase.

Even if the redundant system illustrated in Fig. 36.40 is very simple, the multiphase Markov model is already rather complicated. Therefore, modelling actual industrial safety instrumented systems with numerous components by hand does not seem realistic, and a computerized tool automatically generating the Markov graph from higher level descriptions (e.g. a formal language such as AltaRica Data Flow; see Brameret et al. 2015 or Boiteau et al. 2006) should be used (see Chap. 31).
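The convergence of the PFH to zero when the proof test interval becomes very large can be seen on a simplified closed-form model. The Python sketch below assumes two identical components tested synchronously, with instantaneous repair at each test, so that w(t) = 2λ·exp(−λt)·(1 − exp(−λt)) within a test interval. This is a simplification of the Markov model of Fig. 36.40, and the failure rate, test intervals and function names are hypothetical.

```python
import math


def pfh_1oo2_sync(lam, tau):
    """Average over [0, tau] of w(t) = 2*lam*exp(-lam*t)*(1 - exp(-lam*t))."""
    x = lam * tau
    return 2.0 * lam * ((1.0 - math.exp(-x)) / x
                        - (1.0 - math.exp(-2.0 * x)) / (2.0 * x))


def mean_unavailability(lam, tau):
    """Average over [0, tau] of U(t) = (1 - exp(-lam*t))**2."""
    x = lam * tau
    return (1.0 - 2.0 * (1.0 - math.exp(-x)) / x
            + (1.0 - math.exp(-2.0 * x)) / (2.0 * x))


lam = 1.0e-4  # hypothetical DU failure rate (per hour)
intervals = [1.0e3, 1.0e4, 1.0e5, 1.0e6]  # hypothetical test intervals (hours)

results = [(tau, pfh_1oo2_sync(lam, tau), mean_unavailability(lam, tau))
           for tau in intervals]

for tau, pfh, u_avg in results:
    print(f"tau = {tau:9.0f} h   PFH = {pfh:.3e} /h   average U = {u_avg:.3f}")
```

With these values the PFH first increases with the test interval and then decreases, while the average unavailability keeps increasing towards 1: a low PFH obtained in this way reflects a system that is almost always failed, not a safe one.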

36.3.3.4 Modelling the Accident Occurrence

In the redundant system example described in Fig. 36.40, four states are related to a hazardous situation (see Fig. 36.11): $D_a D_b$, $R_a D_b$, $D_a R_b$ and $R_a R_b$. With regard to the hazardous situation, three cases can be identified:
• $D_a D_b$: the hazardous situation is unknown;
• $R_a D_b$, $D_a R_b$: the hazardous situation is also unknown if no further procedure is implemented to verify that the system is not completely inhibited (e.g. immediately testing the other component);
• $R_a R_b$: the hazardous situation is obviously known and it is likely that compensatory measures are going to be taken in order to prevent the occurrence of an accident (e.g. shutdown of the protected system).

It has to be noted that it is the job of the analyst to identify which procedure is actually undertaken when a component failure is revealed. If the SIS is the ultimate safety layer, an accident occurs from states $D_a D_b$, $R_a D_b$ and $D_a R_b$ as soon as a demand occurs. If the SIS is not the ultimate safety layer, this only triggers a demand on a further safety layer. The Markov model presented in Fig. 36.44 is built under these assumptions. The restoration has been drawn with a dotted line on this graph because the situation is very different from what actually happens:
• ordinary repair in case of a demand on the further safety layer;
• perhaps no repair at all in case of an accident, if the installation is destroyed.

It has to be noted that the concept of accident frequency is relevant only if the consequences of the accident can be repaired. If not, the accident frequency goes to zero when time goes to infinity and the relevant concept is the probability of accident over a given period of observation (i.e. a probability which, like an unreliability, goes to 1 when time goes to infinity).


Fig. 36.44 Modelling the demand for the safety action and the accident occurrence

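[Figure 36.44 (diagram not reproduced): transitions at rate λdem from states Da Db, Ra Db and Da Rb of the Markov graph to a state "Accident (or demand on another SIS)", with a dotted "Restoration" transition.]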

36.3.4 Boolean Approach

36.3.4.1 Introduction and Principle of Calculation

The principle of calculation is based on the use of RBD driven Markov processes or of FT driven Markov processes, which have already been described in Chap. 22 and recalled in Chap. 36. In these approaches, small Markov graphs are used for calculating the input availabilities of the blocks of RBDs (see Fig. 22.8) or the input unavailabilities of the primary events of FTs (see Fig. 22.9). Then, the RBDs or FTs provide the logic to combine these availabilities or unavailabilities. Figures 22.11–22.25 in Chap. 22 provide a lot of information on these approaches for both dangerous detected failures and dangerous undetected failures. Therefore, only examples related to dangerous undetected failures and not yet presented in Chap. 22 are considered hereafter.

36.3.4.2 Series System

The fault tree illustrated in Fig. 36.45 is related to three components organized in series. It may be used, for example, to model the simple SIS made of one sensor, one logic solver and one final element analysed in Section "SIS Made of Items in Series". In order to compare the $\mathrm{PFD}_{avg}^{S}$ of the system to the sum of the PFDavg values of the individual components in series, the failure rates and test intervals have been chosen in order to have the same PFDavg values, $\lambda_a \tau_a/2 = \lambda_b \tau_b/2 = \lambda_c \tau_c/2 = 5.0 \times 10^{-3}$, according to Formula 36.11. The unavailabilities and PFDavg values of the components are drawn on the left-hand side of the figure. This leads to $\mathrm{PFD}_{avg}^{A} = \mathrm{PFD}_{avg}^{B} = \mathrm{PFD}_{avg}^{C} = 4.9834 \times 10^{-3}$, which is very close to the approximated value of $5.0 \times 10^{-3}$. This also leads to $\mathrm{PFD}_{avg}^{S} = 1.490 \times 10^{-2}$, which is very close to $3 \times 4.9834 \times 10^{-3} = 1.495 \times 10^{-2}$.

Fig. 36.45 Unavailability and PFDavg calculation of an SIS made of three components in series

Then, when the PFDavg values of the components are small compared to 1:
• the approximation $\mathrm{PFD}_{avg}^{i} = \lambda_i \tau_i/2$ for an individual component is pretty good and conservative;
• the approximation $\mathrm{PFD}_{avg}^{S} \approx \sum_i \lambda_i \tau_i/2$ for the series system is also pretty good and conservative.

These approximations get better and better when the $\lambda_i \tau_i/2$ decrease and worse and worse when they increase. It has to be noted that, in the example, the components comply with the SIL 2 requirements whereas the overall system complies only with the SIL 1 requirements (see Table 36.1). This highlights that a system made of SIL i components does not necessarily fulfil the SIL i requirements. When components in series are involved, it is likely to comply only with the SIL i−1 requirements.

If, for example, the failure rates of the above example are multiplied by 10, $\mathrm{PFD}_{avg}^{i} = 4.84 \times 10^{-2}$ instead of $5.0 \times 10^{-2}$ and $\mathrm{PFD}_{avg}^{S} = 0.138$ instead of 0.15. The approximation still holds for the individual components but, for the overall system, the impact of overlapping failures (see Fig. 36.25) is now perceptible. The components now comply with SIL 1 and the system with SIL 0.

The failure frequency, w(t), and the PFH of the series system are illustrated in Fig. 36.46. The saw-tooth curve of w(t) is inverted compared to the unavailability U(t) illustrated in Fig. 36.45.

Fig. 36.46 Failure frequency and PFH calculation of an SIS made of three components in series

It has to be noted that, with the parameters used, $\mathrm{PFD}_{avg}^{S}$ is equal to $1.49 \times 10^{-2}$, which complies with the SIL 1 requirements, while $\mathrm{PFH}^{S}$ is equal to $1.58 \times 10^{-5}$, which complies only with SIL 0.
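The component values quoted above can be checked with the exact average of U(t) = 1 − exp(−λt) over a test interval. The Python sketch below assumes a single periodically tested component with instantaneous test and repair. The test interval `tau` is hypothetical (only the product λ·τ matters here) and the function names are illustrative.

```python
import math


def pfd_avg_exact(lam, tau):
    """Exact average of U(t) = 1 - exp(-lam*t) over [0, tau]."""
    x = lam * tau
    return 1.0 - (1.0 - math.exp(-x)) / x


def pfd_avg_approx(lam, tau):
    """Simplified formula lam * tau / 2."""
    return lam * tau / 2.0


tau = 8760.0               # hypothetical proof test interval (hours)
lam = 2.0 * 5.0e-3 / tau   # chosen so that lam * tau / 2 = 5.0e-3

exact = pfd_avg_exact(lam, tau)     # about 4.9834e-3
approx = pfd_avg_approx(lam, tau)   # 5.0e-3, slightly conservative

# Failure rate multiplied by 10: the approximation 5.0e-2 is less accurate.
exact_x10 = pfd_avg_exact(10.0 * lam, tau)   # about 4.84e-2

# Sum of the three identical component values, as used for the series system.
series_sum = 3.0 * exact                     # about 1.495e-2

print(f"exact = {exact:.4e}, approx = {approx:.4e}")
print(f"exact (lam x 10) = {exact_x10:.3e}")
print(f"sum of three components = {series_sum:.4e}")
```

The system value of 1.490 × 10⁻² quoted in the text is not recomputed here because it depends on the individual test intervals used for the figure.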

36.3.4.3 Parallel System

The unavailability, U(t), of the parallel system illustrated in Fig. 36.47 has been calculated by multiplying by 10 the failure rates of the components used in Fig. 36.45. This leads to $\mathrm{PFD}_{avg}^{A} = \mathrm{PFD}_{avg}^{B} = \mathrm{PFD}_{avg}^{C} = 4.84 \times 10^{-2}$ and to $\mathrm{PFD}_{avg}^{S} = 1.424 \times 10^{-4}$. The approximation of the PFDavg value of the components is now $5.0 \times 10^{-2}$, which is 3.2% higher than the calculated PFDavg. It is still conservative and acceptable. The approximation of $\mathrm{PFD}_{avg}^{S}$ by $(5.0 \times 10^{-2})^3 = 1.25 \times 10^{-4}$ is now optimistic by about 14% and this is not really acceptable.

In addition, the components now comply with the SIL 1 requirements whereas the system complies with the SIL 3 requirements. Therefore, again, from a probabilistic point of view, the link between the SILs of the components and the SIL of the system is far from straightforward. In the case above, using components complying with SIL 3 (e.g. with a PFDavg of $5.0 \times 10^{-4}$) in order to design a SIL 3 system would lead to a $\mathrm{PFD}_{avg}^{S}$ in the range of $10^{-10}$, which is more than one million times lower than what is needed for a SIL 3 system!

The failure frequency, w(t), and the PFH of the parallel system are illustrated in Fig. 36.48. The saw-tooth curve of w(t) is oriented in the same direction as the unavailability U(t) illustrated in Fig. 36.47. It has to be noted that, with the parameters used, $\mathrm{PFD}_{avg}^{S}$ is equal to $1.42 \times 10^{-4}$, which complies with the SIL 3 requirements, while $\mathrm{PFH}^{S}$ is equal to $3.74 \times 10^{-7}$, which complies only with SIL 2. As for the series system, the PFDavg and the PFH requirements are not consistent and the PFH requirements are more stringent than the PFDavg requirements.

Fig. 36.47 Unavailability and PFDavg calculation of an SIS made of three components in parallel


Fig. 36.48 Failure frequency and PFH calculation of an SIS made of three components in parallel

36.3.4.4 Series–Parallel System and Common Cause Failures

An example of a series–parallel system is given by the three sensors operating in 2 out of 3 of the typical SIS analysed in Chap. 16 (see Fig. 16.17). The sensors being similar, they are subject to common cause failures, as illustrated in Fig. 36.49. In order to simplify the example, the dangerous detected failures have been considered to be negligible: $\lambda_{DD} = 0$. Then, the independent dangerous undetected failures are modelled by a failure rate $\lambda_{DU}$ and the common cause failure between the sensors by a failure rate equal to $\lambda_{ccf} = \beta \cdot (\lambda_{ccf} + \lambda_{DU})$ (see Chap. 5 about the beta-factor model). The calculations have been performed with β = 0.99% in order to have $\lambda_{ccf} = 0.01 \times \lambda_{DU}$. The three sensors are tested at the same time (synchronous tests) and the scales are indicated on the figure because they are rather different from one curve to another due to the redundancy of the sensors. From the bottom to the top of the figure are drawn:
• the unavailability of the sensors (the same curve for the three sensors);
• the unavailability due to the independent failures of the three sensors organized in 2oo3 and the unavailability due to the common cause failure (in grey);
• the unavailability of the overall system.

Fig. 36.49 Unavailability and PFDavg of 3 sensors organized in 2oo3 (non-staggered tests)

This results in $\mathrm{PFD}_{avg}^{S} = 4.79 \times 10^{-4}$ and $\mathrm{PFH}^{S} = 3.03 \times 10^{-7}$. The SIS complies with the SIL 3 requirements according to the PFDavg and with the SIL 2 requirements according to the PFH.

The parameters used in Fig. 36.49 have been kept in Fig. 36.50 but the tests of the sensors have been staggered by one third of the test interval:
• the unavailability of the sensors is now represented by three different curves;
• the unavailabilities of the 2oo3, of the CCF and of the overall system now have a test frequency multiplied by three.

Fig. 36.50 Unavailability and PFDavg of 3 sensors organized in 2oo3 with staggered tests

This results in $\mathrm{PFD}_{avg}^{S} = 2.57 \times 10^{-4}$, i.e. a reduction of 46.4% compared to the PFDavg without test staggering. According to the PFDavg criterion, the SIS complies with the SIL 3 requirements.

The failure frequency with and without staggering the tests is illustrated on the left-hand side of Fig. 36.51. At t = 0 it is equal to $w_{ccf}(0) = \lambda_{ccf} \cdot A_{ccf}(0) = \lambda_{ccf}$. The expected number of failures, Nbf(t), as a function of time is illustrated on the right-hand side of the same figure. It is calculated as follows (see Chap. 4):

$$N_{bf}(t) = \bar{w}(t) \times t = \left[\frac{1}{t}\int_0^t w(\delta)\, d\delta\right] \times t = \int_0^t w(\delta)\, d\delta \tag{36.34}$$


Fig. 36.51 Frequency and number of failures of 3 sensors organized in 2oo3 with staggered tests

Over the interval [0, T], $N_{bf}(T) = \bar{w}(T) \times T = \mathrm{PFH} \times T$. The PFH drops from $3.03 \times 10^{-7}$ to $2.80 \times 10^{-7}$ (i.e. a reduction of 7.6%) when the tests are staggered. The integral of the failure frequency provides the expected number of failures, Nbf(t), which is represented on the right-hand side of the figure. Over the interval [0, T], the number of failures is also reduced by 7.6% but, more interestingly, the curve is more linear when the tests are staggered: the risk of failure is more homogeneous than when the tests are not staggered. As $\mathrm{PFH}^{S} = 2.80 \times 10^{-7}$, the SIS complies with the SIL 2 requirements according to the PFH criterion.
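Two of the relations used in this section lend themselves to a quick numerical check: the beta-factor relation giving the common cause failure rate and Formula 36.34 giving the expected number of failures. The Python sketch below is only illustrative. The value of `lam_du`, the observation period `horizon`, the constant failure frequency used for the check and the function names are hypothetical; only β = 0.99% and the PFH value of 3.03 × 10⁻⁷ come from the text.

```python
def ccf_rate(beta, lam_du):
    """Solve lam_ccf = beta * (lam_ccf + lam_du) for lam_ccf."""
    return beta * lam_du / (1.0 - beta)


def expected_failures(w, t_end, steps=10_000):
    """Nbf(t_end): midpoint-rule integral of the failure frequency w over [0, t_end]."""
    h = t_end / steps
    return sum(w((k + 0.5) * h) for k in range(steps)) * h


lam_du = 1.0e-6                      # hypothetical independent DU failure rate (per hour)
lam_ccf = ccf_rate(0.0099, lam_du)   # beta = 0.99% as in the text
ratio = lam_ccf / lam_du             # expected to be close to 0.01

# Hypothetical constant failure frequency, used only to check Nbf(T) = PFH * T.
pfh = 3.03e-7      # PFH quoted in the text for non-staggered tests (per hour)
horizon = 8760.0   # hypothetical observation period T (hours)
nbf = expected_failures(lambda t: pfh, horizon)

print(f"lam_ccf / lam_du = {ratio:.5f}")
print(f"Nbf(T) = {nbf:.4e}, PFH * T = {pfh * horizon:.4e}")
```

In an actual study, the constant function passed to `expected_failures` would be replaced by the failure frequency curve w(t) produced by the fault tree calculation.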

36.3.4.5 Accident Frequency Calculation

For an ultimate demand mode SIS, an accident (hazardous event) occurs when the SIS is failed (hazardous situation) and a demand for the safety action occurs. The hazardous event frequency is then simply the product of the SIS unavailability by the demand frequency. This leads to distinguishing the initiating event (demand for the safety action) from the other primary events, which participate only in the probability of failure (unavailability) of the SIS. This has been modelled in Fig. 36.52 by linking the sub-tree made of one AND gate and of the initiating event (boxed in dotted line) to the FT analysed in Figs. 36.49 and 36.50. The initiating event "Demand frequency" is a primary event characterized only by its frequency of occurrence, $w_{dem}$. Then, the accident frequency (or the demand frequency on a further protection layer) is simply the multiplication by $w_{dem}$ of the overall system unavailabilities calculated in the previous section (i.e. drawn at the top of Figs. 36.49 and 36.50). Then:

$$\mathrm{HEF}(t) = w_{dem} \cdot U_S(t) \tag{36.35}$$

and:

$$\mathrm{HEF}_{avg} = w_{dem} \times \mathrm{PFD}_{avg}^{S} \tag{36.36}$$

Fig. 36.52 Principle of the calculation of the accident frequency (constant demand frequency)

The corresponding curves are drafted on the right-hand side of the figure for both the staggered and non-staggered test cases. The benefit of staggering the tests is the same as for the unavailabilities (i.e. a reduction of 46.4%). However, the demand frequency, $w_{dem}(t)$, is not necessarily constant. In this case, provided that the sub-FTs are independent, the principle is the following (see Innal et al. 2014):
• build a sub-FT to model and calculate the demand frequency, $w_{dem}(t)$;
• build another sub-FT to model the SIS unavailability, $U_S(t)$;
• combine both sub-FTs with an AND gate, $\mathrm{HEF}(t) = w_{dem}(t) \cdot U_S(t)$;
• calculate $\mathrm{HEF}_{avg}(T) = \frac{1}{T}\int_0^T \mathrm{HEF}(t)\,dt$ or, more practically, average the HEF(t) curve numerically.

The modelling of the hazardous event related to a multiple safety system made of two protection layers is illustrated in Fig. 36.53. What is important in this modelling is to clearly identify the subtree modelling the initiating event and the subtree modelling the multiple safety system failure. This is why a special symbol has been used to identify the subtree modelling the demand frequency. In fact, as explained in Chap. 22, it is better to use a priority AND gate (PAND) to model this situation. This is done on the right-hand side of the figure, where the hazardous situation must be present before the initiating event occurs and triggers the hazardous sequence. It has to be noted that:
• when the two types of sub-FTs are combined in this way, only hazardous event frequencies can be calculated and the classical probabilistic calculations (unavailability or unreliability) are no longer possible;


Fig. 36.53 Principle of the calculation of the accident frequency (non-constant demand frequency)

• when several safety layers are modelled, they have to be gathered in a single fault tree. Proceeding this way makes it possible to take the dependencies between the protection layers into consideration, but the fact that S2 reacts only if S1 fails cannot be modelled by a FT which is, basically, a static model (see Chap. 13).

The independence between the protected system (generating the demand) and the protection system (triggering the safety action when the demand occurs) is required, but this seems realistic in most cases.
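The numerical averaging of HEF(t) mentioned in the procedure above can be sketched as follows. This is only a minimal illustration under stated assumptions: the saw-tooth unavailability of a single periodically tested item and the yearly-cycle demand frequency are hypothetical stand-ins for the results of the two sub-FTs, and all numerical values (`lam`, `tau`, `w0`) are invented for the example.

```python
import math

def sis_unavailability(t, lam=1.0e-5, tau=4380.0):
    """Hypothetical saw-tooth U_S(t): one item, proof-tested every tau hours."""
    return 1.0 - math.exp(-lam * (t % tau))

def demand_frequency(t, w0=1.0e-4, period=8760.0):
    """Hypothetical non-constant demand frequency (per hour), yearly cycle."""
    return w0 * (1.0 + 0.5 * math.sin(2.0 * math.pi * t / period))

def hef_average(w_dem, u_s, horizon, steps=20000):
    """Midpoint-rule estimate of (1/T) * integral over [0, T] of w_dem(t) * U_S(t)."""
    dt = horizon / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += w_dem(t) * u_s(t)
    return total * dt / horizon

T = 2 * 8760.0  # two years, in hours
hef_avg = hef_average(demand_frequency, sis_unavailability, T)

# Constant-demand check: the average must reduce to w_dem * PFDavg (Eq. 36.36).
w_const = 1.0e-4
hef_const = hef_average(lambda t: w_const, sis_unavailability, T)
pfd_avg = hef_average(lambda t: 1.0, sis_unavailability, T)

print(f"HEF_avg, variable demand: {hef_avg:.3e} per hour")
print(f"HEF_avg, constant demand: {hef_const:.3e} per hour")
print(f"w_dem * PFDavg          : {w_const * pfd_avg:.3e} per hour")
```

With a constant demand frequency the numerical average coincides with the product of Eq. 36.36; with a time-dependent demand the two generally differ, which is why the averaging has to be done on the HEF(t) curve itself.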

36.3.4.6 Multiple Safety Systems

An example of RBD related to a multiple safety system is represented in Fig. 36.54. It is made of two SIS: in case of a demand, SIS1 is expected to trigger the safety action first; if it does not perform as required, then SIS2 is demanded and a hazardous event occurs if it fails.

Fig. 36.54 RBD of a multiple safety system made of two SIS


The blocks corresponding to the common cause failures between sensors, logic solvers and valves have been modelled in this RBD. Dotted lines have been used to clearly identify them. The fault tree represented in Fig. 36.55 is the dual of the RBD presented in Fig. 36.54. Again, the common cause failures are drafted with dotted lines to clearly identify them. The subtrees corresponding to SIS1 (OR gate G2) and SIS2 (OR gate G3) have been built separately, as could be done (e.g. by different contractors) for actual, larger industrial safety systems. They are linked through the AND gate (G1) to model the multiple safety system failure.

Fig. 36.55 Fault tree related to the multiple safety system presented in Fig. 36.54

The unavailabilities and PFDavg values have been calculated with the following assumptions:
• the logic solvers have only dangerous detected failures (i.e. their unavailability converges quickly toward an asymptotic value);
• the valves are tested every year;
• the sensor of SIS1 is tested every 6 months;
• the sensors of SIS2 are tested every year;
• the failure rates of common cause failures are equal to 1% of the independent failure rates.

According to the above assumptions, the common cause failure of the sensors is tested every 6 months.

Unavailability and PFDavg Calculations

The unavailability and PFDavg results are illustrated in Fig. 36.56 for SIS1, SIS2 and the overall multiple safety system. The impact of the test of the sensor CCF can be observed in the saw-tooth curve related to SIS2 on the right-hand side of the figure.

Fig. 36.56 Unavailabilities and PFDavg values of SIS1, SIS2 and of the multiple safety system

The PFDavg values are the following:
• $\mathrm{PFD}^{SIS1}_{avg} = 2.53 \times 10^{-2}$
• $\mathrm{PFD}^{SIS2}_{avg} = 2.07 \times 10^{-3}$
• $\mathrm{PFD}^{S}_{avg} = 3.17 \times 10^{-4}$

Then SIS1 complies with the SIL 1 requirements, SIS2 with the SIL 2 requirements and the overall multiple system with the SIL 3 requirements. It has to be noted that the simple approximation consisting in multiplying the PFDavg values leads to $\mathrm{PFD}^{SIS1}_{avg} \times \mathrm{PFD}^{SIS2}_{avg} = 5.24 \times 10^{-5}$, which underestimates $\mathrm{PFD}^{S}_{avg}$ by 83% (i.e. by a factor 6) and leads to SIL 4 instead of SIL 3 for the multiple safety system. This is due to the common cause failures between the SIS and to the systemic dependencies due to synchronous proof tests. It proves, again, that the multiple safety system has to be modelled and calculated as a whole to find the risk reduction which is actually provided. According to the result $\mathrm{PFD}^{S}_{avg} = 3.17 \times 10^{-4}$, the SIS complies with the SIL 3 requirements.

Failure Frequency and PFH Calculations

The failure frequency and PFH results are illustrated in Fig. 36.57 for SIS1, SIS2 and the overall multiple safety system. The saw-tooth curves of SIS1 and SIS2 are typical curves related to non-redundant and redundant safety systems, as analysed previously. Each of these curves starts with a jump:
• SIS1: the jump is equal to the sum of the failure rates of the primary events of the related subtree;
• SIS2: the jump is equal to the sum of the failure rates of the logic solver and of its common cause failure;
• overall multiple safety system: the jump is equal to the failure rate of the CCF of the logic solver.


Fig. 36.57 Failure frequencies and PFHs of SIS1 , SIS2 and of the multiple safety system

The values of the PFHs are the following:
• $\mathrm{PFH}^{SIS1} = 1.40 \times 10^{-5}$
• $\mathrm{PFH}^{SIS2} = 6.98 \times 10^{-6}$
• $\mathrm{PFH}^{S} = 3.24 \times 10^{-7}$

Then SIS1 complies with the SIL 0 requirements, SIS2 with the SIL 1 requirements and the overall multiple system with the SIL 2 requirements. Therefore, in this example, and as already shown in Fig. 36.6, the PFH requirements are more stringent than the PFDavg requirements.

The simple approximation $\mathrm{PFH}^{SIS1} \times \mathrm{PFD}^{SIS1}_{avg} + \mathrm{PFH}^{SIS2} \times \mathrm{PFD}^{SIS2}_{avg}$ can be used to calculate $\mathrm{PFH}^{S}$ when SIS1 and SIS2 are independent. Due to the common cause failures, this is obviously not the case here. Nevertheless, it leads to $\mathrm{PFH}^{S} \approx 3.69 \times 10^{-7}$, which is a conservative value overestimating $\mathrm{PFH}^{S}$ by only 8% (i.e. by a factor 0.93). A more in-depth analysis would be needed to draw a general conclusion about this result.

According to the result $\mathrm{PFH}^{S} = 3.24 \times 10^{-7}$, the SIS complies with the SIL 2 requirements. Again, the PFDavg and the PFH requirements are not consistent and the PFH requirements are more stringent than the PFDavg requirements.
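The SIL classifications quoted in this section, and the effect of the product approximation on PFDavg, can be checked with a few lines of code. The sketch below assumes the usual IEC 61508 decade bands for PFDavg (low demand) and PFH (high demand or continuous mode); the function names are illustrative only, and the numerical inputs are the values given in the text.

```python
def sil_from_pfd_avg(pfd_avg):
    """SIL band for a PFDavg value; 0 means that SIL 1 is not reached."""
    for sil, lower in ((0, 1e-1), (1, 1e-2), (2, 1e-3), (3, 1e-4)):
        if pfd_avg >= lower:
            return sil
    return 4  # capped at SIL 4

def sil_from_pfh(pfh):
    """SIL band for a PFH value (per hour); 0 means that SIL 1 is not reached."""
    for sil, lower in ((0, 1e-5), (1, 1e-6), (2, 1e-7), (3, 1e-8)):
        if pfh >= lower:
            return sil
    return 4  # capped at SIL 4

# Values quoted in the text for SIS1, SIS2 and the overall system S.
pfd = {"SIS1": 2.53e-2, "SIS2": 2.07e-3, "S": 3.17e-4}
pfh = {"SIS1": 1.40e-5, "SIS2": 6.98e-6, "S": 3.24e-7}

for name in pfd:
    print(f"{name}: SIL {sil_from_pfd_avg(pfd[name])} (PFDavg), "
          f"SIL {sil_from_pfh(pfh[name])} (PFH)")

# Product approximation, which ignores CCFs and synchronous proof tests.
product = pfd["SIS1"] * pfd["SIS2"]
underestimation = 1.0 - product / pfd["S"]
print(f"product = {product:.2e}, SIL {sil_from_pfd_avg(product)}, "
      f"underestimation = {underestimation:.0%}")
```

The same system is therefore rated one SIL lower by the PFH criterion than by the PFDavg criterion, and the product of the individual PFDavg values lands one decade below the value calculated on the whole model.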

36.3.5 Petri Net Approach

The SIS being binary systems, the dynamic RBDs in general (Chap. 27) and the RBD driven Petri nets (Chap. 33) in particular are very good candidates to model SIS and perform SIL calculations (see Chap. 33, Figs. 33.42 and 33.43). For doing that, a library of sub-PN modules has to be developed first. Several such modules have already been introduced and explained in Chap. 33 (see Figs. 33.38–33.41) and they have been used to develop the mini sub-PN module library illustrated in Fig. 36.58:


Fig. 36.58 Mini library of PN modules for building RBD driven PNs

• M-1 to M-4: four modules describing several component behaviours and intended to be used within RBD driven Petri nets:
– M-1: component with dangerous detected failures (i.e. the repair starts at once when it fails);
– M-2: component with dangerous undetected failures and an unlimited number of maintenance teams (i.e. failures are detected by proof tests and the repairs start at once when the failure is revealed);
– M-3: same as M-2 but with only a single maintenance team (MT);
– M-4: same as M-3 but with a mobilisation procedure of the maintenance team (or of the maintenance support). This module works in combination with the module Aux-2 via the variable NbR, which counts how many items are failed (and have to be repaired) at the same time.
• Aux-1 and Aux-2:
– Aux-1: this is a simplified version of the sub-PN module proposed in Fig. 33.30, reduced for availability/unavailability calculations only. The place NbF makes it possible to count the SIS failures occurring during the simulation of one history (and to calculate the PFH).
– Aux-2: mobilisation procedure. It works in coordination with module M-4. The mobilisation procedure starts as soon as one component fails (i.e. when NbR > 0) and the maintenance team is mobilised after a delay $\tau_{Mb}$. It is demobilised as soon as all the components have been repaired (i.e. NbR = 0).
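The behaviour described for module M-2 (failure, detection at the next proof test, then repair) can be mimicked outside a Petri net tool by a small event-driven Monte Carlo simulation. The sketch below is not the Petri net module itself: it is a simplified stand-in for a single component, with hypothetical parameter values, a fixed repair duration `mrt` and an illustrative function name.

```python
import math
import random

def simulate_history(lam, tau, mrt, horizon, rng):
    """One history of a component with dangerous undetected failures.

    A failure stays hidden until the next proof test (every tau hours) and
    the repair then lasts mrt hours. Returns (downtime, number of failures).
    """
    t = 0.0            # date at which the component is (again) up
    downtime = 0.0
    failures = 0
    while t < horizon:
        t_fail = t + rng.expovariate(lam)
        if t_fail >= horizon:
            break
        failures += 1
        t_test = (math.floor(t_fail / tau) + 1) * tau   # next proof test
        t_up = t_test + mrt                              # end of the repair
        downtime += min(t_up, horizon) - t_fail
        t = t_up
    return downtime, failures

# Hypothetical data: failure rate per hour, 6-month tests, 8 h repair, 2 years.
LAM, TAU, MRT, HORIZON = 1.0e-5, 4380.0, 8.0, 17520.0
N_HISTORIES = 20000

rng = random.Random(12345)
total_down = 0.0
total_failures = 0
for _ in range(N_HISTORIES):
    down, nb_f = simulate_history(LAM, TAU, MRT, HORIZON, rng)
    total_down += down
    total_failures += nb_f

pfd_avg = total_down / (N_HISTORIES * HORIZON)   # average unavailability
pfh = total_failures / (N_HISTORIES * HORIZON)   # average failure frequency
print(f"simulated PFDavg = {pfd_avg:.3e} (lam * tau / 2 = {LAM * TAU / 2:.3e})")
print(f"simulated PFH    = {pfh:.3e} per hour")
```

The failure counter plays the role described for the place NbF: dividing the accumulated number of failures by the total simulated time gives the average failure frequency.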


Modules M-1 and M-2 have been used to build the RBD driven Petri net presented in Fig. 36.59. This figure models an SIS made of three sensors with dangerous undetected failures organized in 2 out of 3, one logic solver with dangerous detected failures and two redundant safety valves with dangerous undetected failures. The links between the nodes of the virtual RBD are achieved through logic equations using the state variables of the components. These state variables are updated by the corresponding failure and repair transitions of the related modules. The output, S, of the RBD driven Petri net is then used to trigger the transitions of a module Aux-1 and to calculate the availability/unavailability and the failure frequency of the modelled SIS. This model is equivalent to SIS2 modelled in Fig. 36.55. However, in order to simplify the model, the common cause failures have not been taken into consideration. This could easily be done by implementing what has already been described in Chap. 33.

The SIS unavailability obtained with 100,000 simulated histories is illustrated on the left-hand side of Fig. 36.60. The curve is less smooth than, but similar to, the curve obtained by fault tree calculations. In this figure, the 90% confidence interval of the simulation is drafted in grey dotted lines around the average simulated value. The $\mathrm{PFD}^{S}_{avg}$ results have been enlarged on the right-hand side of the figure: the value obtained by FT is drafted with a solid black line, the simulated average with a black dotted line, and the 5% lower and 95% upper bounds are represented with grey dotted lines.


Fig. 36.59 Typical SIS modelled by using modules M-1, M-2 and Aux-1


Fig. 36.60 Unavailabilities and PFDavg of the SIS modelled in Fig. 36.59 (100,000 histories)


The evolution of the confidence interval as a function of the number of simulated histories (from 500 to 1,000,000) is illustrated in Fig. 36.61. The reference average values (calculated by FT) are drafted in solid black line, the simulated averages in black dotted lines and the upper and lower bounds of the 90% confidence interval in grey dotted lines. As expected (see Chap. 32), the 90% confidence interval narrows when the number n of histories increases (it varies according to $1/\sqrt{n}$). The reference value of $\mathrm{PFD}^{S}_{avg}$ remains within the simulated confidence interval, whereas the reference value of $\mathrm{PFH}^{S}$ is slightly outside for a large number of histories, but the difference is about 0.7% and this is negligible.

Fig. 36.61 Impact of the number of simulated histories on the accuracy of $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$

As already observed, $\mathrm{PFD}^{S}_{avg}$ ($1.87 \times 10^{-3}$) complies with the SIL 2 requirements whereas $\mathrm{PFH}^{S}$ ($6.83 \times 10^{-6}$) complies only with the SIL 1 requirements.

Another important result provided by Monte Carlo simulation is the histogram of $\mathrm{PFD}^{S}_{avg}$ obtained when simulating the histories. This is represented in Table 36.3 for 10,000 histories over two and twenty years of utilization of the related SIS:
• Over two years, the simulated $\mathrm{PFD}^{S}_{avg}$ is equal to $1.87 \times 10^{-3}$ and the SIS complies with the SIL 2 requirements. However, only 1.37% of the histories lie between the SIL 2 bounds: 97.51% are better than SIL 2 (i.e. SIL 3 or SIL 4) and 1.12% are worse (i.e. SIL 1 or even SIL 0). Then, over two years of operation, the SIS has about a 98.9% chance of being SIL 2 or better and a 1.1% chance of being worse.
• Over twenty years, the simulated $\mathrm{PFD}^{S}_{avg}$ changes slightly, to $1.84 \times 10^{-3}$, and the SIS again complies with the SIL 2 requirements. Again, only 3% of the histories lie between the SIL 2 bounds, 90.4% are better and 6.7% are worse. Therefore, over twenty years, the corresponding SIS has about a 93.4% chance of being SIL 2 or better and a 6.6% chance of being worse.

Table 36.3 Histogram of $\mathrm{PFD}^{S}_{avg}$ over two and twenty years of operating time

        2 years                                    20 years
SIL     Nb of histories   %       Synthesis (%)    Nb of histories   %       Synthesis (%)
4       9018              90.18   97.51            7371              73.71   90.36
3       733               7.33                     1665              16.65
2       137               1.37    1.37             299               2.99    2.99
1       42                0.42    1.12             665               6.65    6.65
0       70                0.70                     0                 0.00

Therefore, due to the distributions of the times to failure and of the times to repair, $\mathrm{PFD}^{S}_{avg}$ is not deterministic: it is a random variable which ranges between SIL 0 and SIL 4 with given probabilities. When SIL i is claimed, the corresponding SIS has a probability of being better than that but also a probability of being worse, and it may be useful to estimate the probability of being worse in order to appreciate the margin with regard to the tolerable risk.

The model presented in Fig. 36.62 is similar to the previous one but module M-3 has been implemented instead of module M-2. Module M-3' is not illustrated in Fig. 36.58 due to lack of space, but it is similar to M-3 adapted to dangerous detected failures (place DU and the proof test transition have been removed). The various modules are linked together by the variable RT, which is used in predicates and assertions. If the initial value of RT is 1, this models a single maintenance team; if it is 2, two maintenance teams; and if it is 6, as many maintenance teams as components to be repaired.

The simulation of this model over two years, as above, shows very small differences between an unlimited number of repair teams and a single one. Therefore, the calculations have been performed over four years, where the impact is more perceptible. As shown in Fig. 36.63, the impact, with the MORT used (8 h for the sensors and the logic solver, 24 h for the valves), is very light on $\mathrm{PFD}^{S}_{avg}$ and almost imperceptible on $\mathrm{PFH}^{S}$. This is because the overall repair times are negligible with regard to the times needed to detect the failures (6 months on average): in this case, this legitimates the assumption of an unlimited number of repair teams made when implementing fault trees.
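Returning to the observation that PFDavg is a random variable rather than a deterministic value, the kind of histogram shown in Table 36.3 and the $1/\sqrt{n}$ behaviour of the confidence interval can be reproduced on a toy case. The sketch below is a simplified illustration, not the model of Fig. 36.59: it assumes a single periodically tested component with a negligible repair time and hypothetical parameter values, and histories with a PFDavg below $10^{-4}$ are all counted as SIL 4.

```python
import math
import random

def history_pfd(lam, tau, n_tests, rng):
    """Average unavailability of one history; the item is renewed at each test."""
    down = 0.0
    for _ in range(n_tests):
        # Time spent failed in one test interval (zero if no failure occurs).
        down += max(0.0, tau - rng.expovariate(lam))
    return down / (n_tests * tau)

def sil_band(pfd):
    """SIL band of a PFDavg value (0 = worse than SIL 1, 4 = SIL 4 or better)."""
    for sil, lower in ((0, 1e-1), (1, 1e-2), (2, 1e-3), (3, 1e-4)):
        if pfd >= lower:
            return sil
    return 4

def run(n_histories, lam=1.0e-5, tau=4380.0, n_tests=4, seed=1):
    rng = random.Random(seed)
    samples = [history_pfd(lam, tau, n_tests, rng) for _ in range(n_histories)]
    mean = sum(samples) / n_histories
    var = sum((x - mean) ** 2 for x in samples) / (n_histories - 1)
    half_width = 1.645 * math.sqrt(var / n_histories)   # 90% interval
    counts = {sil: 0 for sil in range(5)}
    for x in samples:
        counts[sil_band(x)] += 1
    shares = {sil: counts[sil] / n_histories for sil in counts}
    return mean, half_width, shares

results = {n: run(n) for n in (1_000, 10_000, 100_000)}
for n, (mean, half_width, shares) in results.items():
    print(f"n = {n:>7}: PFDavg = {mean:.3e} +/- {half_width:.1e}")
print("share of histories per SIL band:", results[100_000][2])
```

Each tenfold increase in the number of histories narrows the interval by roughly a factor $\sqrt{10}$, while the shares per SIL band show how far individual histories can lie from the average value on which the SIL claim is based.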
The situation can change when a special maintenance support has to be mobilised to perform the maintenance operations, as modelled in the RBD driven PN illustrated in Fig. 36.64, where modules M-4, M-4' (similar to M-4 without the place DU and the proof test transition) and module Aux-2 have been implemented.

Fig. 36.62 Typical SIS modelled by using modules M-3, M-3' and Aux-1

Fig. 36.63 Impact on $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$ of the number of maintenance teams (100,000 histories over 4 years)

Fig. 36.64 Typical SIS modelled by using modules M-4, M-4', Aux-1 and Aux-2

In this model, each component failure increments the variable NbR by one and each repair decrements it by one, so that NbR counts the number of items to repair at any time. As soon as a repair is needed, the mobilisation starts and the maintenance team (and/or support) becomes available when the delay $\tau_{Mb}$ has elapsed. Then the faulty components are repaired one after the other and, when no more repair is needed (i.e. NbR = 0), the maintenance team (or support) is demobilised. After that, any further failure requires a new mobilisation of the maintenance team before it can be repaired.

The calculations have been performed with a mobilisation delay of 2 weeks and the results are illustrated in Fig. 36.65. A jump can be observed at the beginning of the saw-tooth curve: it is due to the dangerous detected failures of the logic solver. The simulated $\mathrm{PFD}^{S}_{avg}$ is now estimated at $3.70 \times 10^{-3}$, which is about twice the value obtained without mobilisation. However, the SIS still complies with the SIL 2 requirements.

Fig. 36.65 Unavailabilities and PFDavg of the SIS modelled in Fig. 36.64 (100,000 histories)

The simulated $\mathrm{PFH}^{S}$ is illustrated in Fig. 36.66 and the simulated value ($6.86 \times 10^{-6}$) still complies with the SIL 1 requirements. Contrary to $\mathrm{PFD}^{S}_{avg}$, the impact is very light: the simulated average of $\mathrm{PFH}^{S}$ increases from $6.83 \times 10^{-6}$ (without mobilisation) to $6.86 \times 10^{-6}$ (with mobilisation) and, compared to the FT calculation without mobilisation, $\mathrm{PFH}^{S}$ even decreases from $6.91 \times 10^{-6}$ to $6.86 \times 10^{-6}$. This seems surprising, but the mobilisation delay increases the time during which the components remain faulty, and this decreases the failure frequency as they cannot fail again in the meantime. As increasing the overall repair time decreases the failure frequency, $\mathrm{PFH}^{S}$ should be used cautiously when the overall repair times are not negligible with regard to the time of observation of the SIS, or when compensatory measures are not undertaken to decrease the risk when failures have been revealed.

Fig. 36.66 PFH of the SIS modelled in Fig. 36.64 (100,000 histories)
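The mechanism described above, where a longer downtime lowers the failure frequency while raising the unavailability, can be illustrated on a single component treated as an alternating renewal process. This is a rough sketch and not the simulated model: the failure rate is hypothetical, the mean downtime is taken as tau/2 + mobilisation delay + repair time (a usual approximation when lam * tau is small), and only the 2-week mobilisation delay comes from the text.

```python
def frequency_and_unavailability(lam, tau, mrt, mobilisation=0.0):
    """Average failure frequency and unavailability of one component.

    Alternating renewal view: one failure per cycle of mean length
    MTTF + MDT, with MDT = tau/2 + mobilisation + mrt (approximation).
    """
    mttf = 1.0 / lam
    mdt = tau / 2.0 + mobilisation + mrt
    w = 1.0 / (mttf + mdt)       # failures per hour
    u = mdt / (mttf + mdt)       # fraction of time spent failed
    return w, u

LAM = 7.0e-6         # hypothetical failure rate, per hour
TAU = 4380.0         # proof test interval (6 months), hours
MRT = 8.0            # repair time, hours
MOBILISATION = 336.0 # 2 weeks, hours

w_no, u_no = frequency_and_unavailability(LAM, TAU, MRT)
w_mob, u_mob = frequency_and_unavailability(LAM, TAU, MRT, MOBILISATION)

print(f"without mobilisation: w = {w_no:.4e} /h, U = {u_no:.3e}")
print(f"with mobilisation   : w = {w_mob:.4e} /h, U = {u_mob:.3e}")
print(f"relative change in w: {(w_mob / w_no - 1.0):+.2%}")
```

The failure frequency decreases only marginally while the unavailability increases noticeably, which is the reason why a PFH figure alone says little about the risk when the downtimes are long.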

36.3.6 Uncertainty Handling in SIL Calculations

36.3.6.1 Standard Requirements

When route 2H (see Sects. 36.2 and 36.3) is implemented, two ways are proposed, which require performing the SIL calculations by:
• using the 70% upper bounds of the input data instead of the average values of the parameters;
• using the full distributions of the input parameters and retaining the 90th percentile of the resulting PFDavg or PFH distributions.

Then, the first step is to establish the 70% upper bounds of the input data or their full distributions. This can be done by using the chi-square ($\chi^2$) distribution (see Chap. 25), which has been implemented to obtain the curves on the right-hand side of Fig. 36.67. The average value (black dotted line) has been estimated from the maximum likelihood estimator ($k/T_c$); the number of observations, k, and the accumulated time of observation, $T_c$, have been chosen to obtain an average failure rate $\lambda_{DU} = 4.5 \times 10^{-6}$. The number of observations increases from 1 to 50 in order to encompass both inaccurate and rather accurate estimations. The 70% upper bound is drafted in black solid line: it is higher than the average value and converges toward $4.5 \times 10^{-6}$. The 90% confidence interval is represented in grey lines and it narrows around the average value. The fewer the observations, the more conservative the estimation with the 70% upper bound and the larger the 90% confidence interval.

Fig. 36.67 Illustration of input data (failure rate) uncertainty

The simple fault tree illustrated on the left-hand side of Fig. 36.67 is used to explain how to proceed with both approaches. It has been chosen because it is a series-parallel FT which is likely to provide a good general example.
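The 70% upper bound of a failure rate estimated from k failures over a cumulated time $T_c$ can be computed without statistical libraries. The sketch below assumes the usual chi-square bound with 2k + 2 degrees of freedom, $\lambda_{70\%} = \chi^2_{0.70}(2k+2) / (2T_c)$, written in its equivalent Poisson form; this choice is an assumption, although it is consistent with the value of about $5.6 \times 10^{-6}$ quoted in this section for k = 10. The function names are illustrative.

```python
import math

def poisson_cdf(k, mu):
    """P(N <= k) for N following a Poisson distribution of mean mu."""
    term = math.exp(-mu)
    total = term
    for j in range(1, k + 1):
        term *= mu / j
        total += term
    return total

def failure_rate_upper_bound(k, t_c, confidence):
    """One-sided upper bound: the rate for which P(N <= k) = 1 - confidence.

    Equivalent to chi2 quantile(confidence, 2k + 2) / (2 * t_c).
    """
    lo, hi = 0.0, 10.0 * (k + 10)     # bracket on mu = rate * t_c
    for _ in range(200):              # bisection, P(N <= k) decreases with mu
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / t_c

LAMBDA_AVG = 4.5e-6                   # k / Tc, as in the text
lam_70 = failure_rate_upper_bound(10, 2_222_222.0, 0.70)
print(f"k = 10: 70% upper bound = {lam_70:.2e}")

ratios = []
for k in (1, 3, 5, 10, 25, 50):
    t_c = k / LAMBDA_AVG              # keeps the average estimate at 4.5e-6
    ratio = failure_rate_upper_bound(k, t_c, 0.70) / LAMBDA_AVG
    ratios.append(ratio)
    print(f"k = {k:>2}: upper bound / average = {ratio:.3f}")
```

The ratio between the upper bound and the average estimate shrinks as k grows, which is the convergence visible on the right-hand side of Fig. 36.67.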

36.3.6.2 70% Confidence Level of Input Data Approach

The simplest calculation complying with the IEC 61508 standard is to use the 70% upper bounds of the input data instead of the average values. This has been done for the fault tree in Fig. 36.67, and Fig. 36.68 illustrates the saw-tooth curves obtained for $\lambda_{DU} = 4.5 \times 10^{-6}$ estimated from 10 failures observed over 2,222,222 h.


Fig. 36.68 Comparison between basic and 70% input data upper bounds calculations


Fig. 36.69 PFDavg and PFH calculated with the 70% upper bounds of input data

The usual $U_S(t)$ and $w_S(t)$ calculations performed with $\lambda_{DU} = 4.5 \times 10^{-6}$ (called deterministic calculations) are drafted in dotted lines in Fig. 36.68, and the curves obtained by replacing this value by the 70% upper bound, $\lambda^{70\%}_{DU} = 5.6 \times 10^{-6}$, are drafted in black solid lines. The resulting average values, $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$, are also represented in this figure.

The same calculations have been performed for a number of observations ranging from 1 to 50 failures and the resulting $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$ are illustrated in Fig. 36.69 in black solid lines. This provides conservative results compared to the $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$ obtained by using the average values (deterministic calculations) in dotted lines. The figures are similar in both cases: the smaller the number of observations (i.e. the less certain the data), the more conservative the calculations with the 70% upper bound. The conservativeness drops from 477% to 18% when the number of observations increases from 1 to 50. Therefore, it is easier to reach PFDavg or PFH requirements when accurate data are available, and this argues for the organization of an effective reliability data collection (see Chap. 38).
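The effect of replacing the average failure rate by its 70% upper bound can be reproduced on a 2-out-of-3 group of sensors. The sketch below is an illustration under stated assumptions: the three sensors are taken as independent, tested simultaneously and renewed at each proof test, and the test interval `TAU` of 4380 h is hypothetical since it is not given in this section. The two failure rates are the values quoted in the text.

```python
import math

def mean_exp(a, tau):
    """Average of exp(-a * t) over [0, tau]."""
    x = a * tau
    return -math.expm1(-x) / x

def pfd_avg_2oo3(l1, l2, l3, tau):
    """Average unavailability of a 2-out-of-3 group over one test interval.

    With q_i(t) = 1 - exp(-l_i * t), the group is failed with probability
    q1*q2 + q1*q3 + q2*q3 - 2*q1*q2*q3, which expands to
    1 - sum over pairs of exp(-(l_i + l_j) t) + 2 * exp(-(l1 + l2 + l3) t).
    """
    pairs = (mean_exp(l1 + l2, tau) + mean_exp(l1 + l3, tau)
             + mean_exp(l2 + l3, tau))
    return 1.0 - pairs + 2.0 * mean_exp(l1 + l2 + l3, tau)

TAU = 4380.0        # hypothetical proof test interval, hours
LAM_AVG = 4.5e-6    # average estimate quoted in the text
LAM_70 = 5.6e-6     # 70% upper bound quoted in the text (k = 10)

pfd_det = pfd_avg_2oo3(LAM_AVG, LAM_AVG, LAM_AVG, TAU)
pfd_70 = pfd_avg_2oo3(LAM_70, LAM_70, LAM_70, TAU)
conservativeness = pfd_70 / pfd_det - 1.0

print(f"deterministic PFDavg      = {pfd_det:.3e}")
print(f"PFDavg with 70% bounds    = {pfd_70:.3e}")
print(f"conservativeness          = {conservativeness:.0%}")
```

Because the unavailability of a redundant group grows roughly with the square of the failure rate, a 24% increase of the input translates into a markedly larger increase of the output.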

36.3.6.3 Full Distribution of Input Data Approach

Using the full distributions of the input parameters is far more difficult than the point calculation performed above with the 70% upper bounds. This requires establishing the distribution of the input data and undertaking Monte Carlo simulations as described in Chap. 32. With regard to the example, the gamma distribution, $f_G(\lambda; k, T_c)$, is very practical because it directly accepts the number of observations, k, and the cumulated observation time, $T_c$, as input parameters (see Sect. 38.5). In order to compare with the previous results obtained with the 70% upper bounds, the calculations have been performed with k = 10 and $T_c$ = 2,222,222 h, and the results, obtained from a Monte Carlo simulation comprising 10,000 histories, are presented in Fig. 36.70. The 90% upper bound is drafted in grey solid line and the simulated average value in grey dotted line. This average value should not be confused with the deterministic value calculated by using $\lambda_{DU} = 4.5 \times 10^{-6}$ (see Fig. 36.72).
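The principle of the full-distribution approach can be sketched as follows for a 2-out-of-3 group of sensors. Several points are assumptions made for the illustration rather than facts taken from the text: the failure rate of each sensor is drawn independently from a gamma distribution of shape k and scale $1/T_c$, the sensors are tested simultaneously every `TAU` = 4380 h (hypothetical) and renewed at each test, and the seed is arbitrary.

```python
import math
import random

def mean_exp(a, tau):
    """Average of exp(-a * t) over [0, tau]."""
    x = a * tau
    return -math.expm1(-x) / x

def pfd_avg_2oo3(l1, l2, l3, tau):
    """Average unavailability of a 2-out-of-3 group over one test interval."""
    pairs = (mean_exp(l1 + l2, tau) + mean_exp(l1 + l3, tau)
             + mean_exp(l2 + l3, tau))
    return 1.0 - pairs + 2.0 * mean_exp(l1 + l2 + l3, tau)

K = 10                  # number of observed failures (from the text)
T_C = 2_222_222.0       # cumulated observation time in hours (from the text)
TAU = 4380.0            # hypothetical proof test interval, hours
N_HISTORIES = 10_000

rng = random.Random(2021)
samples = []
for _ in range(N_HISTORIES):
    # random.gammavariate(alpha, beta): alpha is the shape, beta the scale.
    lams = [rng.gammavariate(K, 1.0 / T_C) for _ in range(3)]
    samples.append(pfd_avg_2oo3(lams[0], lams[1], lams[2], TAU))

samples.sort()
p90 = samples[int(0.9 * N_HISTORIES)]          # 90th percentile
mean = sum(samples) / N_HISTORIES              # simulated average
det = pfd_avg_2oo3(K / T_C, K / T_C, K / T_C, TAU)

print(f"deterministic PFDavg = {det:.3e}")
print(f"simulated average    = {mean:.3e}")
print(f"90th percentile      = {p90:.3e}")
```

Retaining the 90th percentile rather than the simulated average is what makes the result conservative, and the margin depends directly on how wide the input distributions are.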


Fig. 36.70 90th percentile and average values from 10,000 simulated histories


Fig. 36.71 PFDavg and PFH calculated as the 90th percentile of the output distribution

The same calculations have been performed for a number of observations ranging from 1 to 50 failures and the resulting $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$ are illustrated in Fig. 36.71 in grey solid lines. Again, this provides conservative results compared to the $\mathrm{PFD}^{S}_{avg}$ and $\mathrm{PFH}^{S}$ obtained by using the average values (deterministic calculations) in dotted line. Again, the figures are similar in both cases: the smaller the number of observations (i.e. the less certain the data), the more conservative the calculations with the 90th percentile of the resulting system failure distribution. The conservativeness drops from 137% to 22% when the number of observations increases from 1 to 50. This consolidates the observation that it is easier to reach PFDavg or PFH requirements when accurate data are available and, again, this argues for the organization of an effective reliability data collection (see Chap. 38).

These calculations are also the opportunity to compare the simulated average results with the deterministic results obtained with the average estimation of $\lambda_{DU}$. This is illustrated in Fig. 36.72 and leads to the following observations:
• The simulated average values are lower than the deterministic values.
• The difference is greater when the input data are inaccurate (it decreases when k increases from 1 to 50).
• The curve converges toward the deterministic results when the accuracy (i.e. the number of observations and/or the cumulated time of observation) increases (in the example, for 50 observations, the curves have converged).

Fig. 36.72 Comparison of average simulated values with deterministic values

In the example, the simulated average value underestimates the deterministic value by 1.4% for 1 observation and by only 0.03% for 50 observations. This seems negligible but the difference can be greater than that (see Chap. 38). Anyway, as the 90th percentile is retained in this approach, the risk of non-conservativeness of the results disappears.

36.3.6.4 Comparison Between the Approaches

Figure 36.73 synthesizes the results:
• When the input data are inaccurate, the approach using the 70% upper bounds of the input data is more conservative than the approach using the 90th percentile of the system failure distribution.
• When the input data accuracy increases, both approaches give similar results.

Table 36.4 gives the figures related to Fig. 36.73. It shows that, when the accuracy increases, the approaches provide similar results, even if the 70% input data upper bound approach becomes slightly less pessimistic than the 90th percentile of the full distribution of system failure.


Fig. 36.73 Comparison between deterministic calculations and the 70 and 90% approaches


Table 36.4 Comparison between deterministic calculations and the 70 and 90% approaches

Nb Obs.   PFDavg (S)          PFH (S)
k         70%      90%        70%      90%
1         477%     136%       470%     134%
3         148%     87%        147%     86%
5         95%      67%        94%      67%
10        54%      47%        54%      47%
25        29%      30%        28%      30%
50        18%      21%        18%      21%

36.4 Conclusions

Functional safety standards are increasingly used to design safety instrumented systems. This success is explained by the useful requirements contained in these standards and by the proposed approach which, at first glance, seems rather simple. However, when this approach is analysed in more depth, some weaknesses can be identified, as has been done in this chapter: loose definitions, questionable concept of safe failure fraction, loose links between PFDavg/PFH and risk reduction, shortcomings of the simplified analytical approach, etc. Fortunately, they can easily be overcome by implementing the systemic (holistic) approaches developed in the classical dependability field. In particular, the RBD/FT driven Markov processes and the RBD driven Petri nets prove to be very useful for this purpose. To conclude this chapter, a few points can be brought to light:
• It is not really possible to undertake sound SIL calculations without using a software package: the free demo version of the GRIF-workshop software package (GRIF-Workshop 2020) has been used throughout this chapter to perform the probabilistic calculations and draw the various curves.
• The SIL approach is based on average values which do not really capture the dynamic aspect of what happens in operation. This should be complemented by the analysis of the saw-tooth curves and by forbidding too many excursions outside the required SIL zone (permanent SIL requirement).
• Staggering the proof tests is an excellent way to decrease the peaks of the saw-tooth curves, increase the common cause failure detection frequency and decrease the uncertainties by suppressing the correlation between the tested items.
• The functional safety approach is based on a SIS design from scratch, whereas it would be more realistic to rely on the past accumulated experience on similar systems (i.e. by using field-proven items or designs). This is the GALE (Globally At Least Equivalent, GAME in French) approach, which seems to be more realistic and could be fruitfully used as an alternative to the ALARP approach. This principle is used to ensure the safety of French public transportation (STRMTG 2011) and is also the basis for the NORSOK standard (OLF 070 2018), which provides minimum SIL allocation tables for the oil and gas industry.
• The calculation of PFDavg/PFH based on the full distribution of input data and the 90th percentile of the overall result has not been much investigated so far. However, taking uncertainties into consideration is of utmost importance when dealing with safety. A short overview has been provided above, but there is no doubt that an important effort is needed on this topic.

36.5 Associated Exercises

The exercises related to this chapter are shared with those developed for the Boolean family in Chap. 29:
• Exercise 16.1: build the FT related to an overpressure protection system.
• Exercise 16.2: identify the tie sets related to the above system.
• Exercise 20.1: semi-quantitative analysis of the minimal cut sets of an overpressure protection system.
• Exercise 20.2: extension of exercise 20.1 to calculate the impact of partial stroking tests of the safety valves of the above system.
• Exercise 20.3: rank the items belonging to the above system according to their Vesely-Fussell importance factors.
• Exercise 20.4: identify the minimal cut sets related to an overpressure protection system which are subject to CCFs and calculate the impact of CCFs by using a beta factor of 5%.
• Exercise 22.1: calculate the PFDavg (average unavailability), the PFH (average failure frequency) and the unreliability (probability of failure) of an overpressure protection system.
• Exercise 22.2: same as exercise 22.1 with partial and full stroking tests of safety valves.
• Exercise 22.3: extend exercise 22.1 to model common cause failures on sensors, valves and logic solvers.
• Exercise 22.4: extend exercise 22.1 to model the test staggering of the safety valves.
• Exercise 24.1: calculate the various importance factors related to the items belonging to an overpressure protection system.
• Exercise 25.1: calculate the impact of uncertainties on the PFDavg (average unavailability) and the PFH (average failure frequency) of an overpressure protection system.


References

Bellman R (1957) Dynamic programming. Princeton University Press, Princeton, USA
Boiteau M, Dutuit Y, Rauzy A, Signoret J-P (2006) The AltaRica data-flow language in use: modelling of production availability of a multi-state system. Reliab Eng Syst Saf 91:747–755 (Elsevier)
Brameret P-A, Rauzy A, Roussel J-M (2015) Automated generation of partial Markov chain from high level descriptions. Reliab Eng Syst Saf (RESS) 139:179–187. https://doi.org/10.1016/j.ress.2015.02.009 (Elsevier)
Brissaud F, Vinuessa C, Folleau C (2019) Optimizing proof test policy for redundant safety-related systems. In: Proceedings of ESREL 2019, Hannover, Germany
Ciliberti V, Ostebo R, Selvik J, Alhanati F (2019) Optimize safety and profitability by use of the ISO 14224 standard and big data analytics. OTC-19634-MS. Houston, USA
EXIDA Ed 4 (2015) Safety equipment reliability handbook: 3 volumes: sensors, logic solvers and interface modules, final elements. EXIDA, Sellersville, USA
Gödel K (1992) On formally undecidable propositions of Principia Mathematica and related systems. Paperback. Dover Books on Mathematics
GRIF-Workshop (2020) Funded and developed by TOTAL. https://grif-workshop.fr/. Accessed Sept 2020
HSE (2020) ALARP at a glance. https://www.hse.gov.uk/risk/theory/alarpglance.htm. Accessed Sept 2020
IEC 60300-3-2 Ed. 2.0 (2004) Dependability management, Part 3-2: application guide, collection of dependability data from the field. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60050-192 (IEV192) (2015) International electrotechnical vocabulary, Part 192: dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61511 Ed. 2.0 (2016) Functional safety. Safety instrumented systems for the process safety sector (3 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
Innal F (2008) Contribution to modelling safety instrumented systems and to assessing their performance: critical analysis of IEC 61508 standard. PhD thesis, University of Bordeaux, France
Innal F et al (2014) Probability and frequency calculations related to protection layers revisited. J Loss Prev Proc Ind 31:56–69 (Elsevier)
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
NOPSEMA (2015) ALARP guidance note. National Offshore Petroleum Safety and Environment Management Authority
OLF 070 (2018) Guidelines for the application of IEC 61508 and IEC 61511 in the petroleum activities on the continental shelf (Recommended SIL requirements). The Norwegian Oil Industry Association, Norway
OREDA Handbook (2015) Ed. 6.0 Offshore and onshore reliability data. Prepared by SINTEF and NTNU. Hovik, Norway
Ostebo R, Dammen T (2006) Use of reliability data for Safety Instrumented Safety Systems. 30th ESReDA seminar, Trondheim, Norway
PDS data handbook (2013) Reliability data for safety instrumented system. SINTEF, Trondheim
Rogovin M, Frampton GF (1979) Three Mile Island: a report to the commissioners and to the public, Vol 1 to 3. NUREG/CR-1250. USNRC, USA
Rouvroye JL, Wiegerinick JAM (2006) Minimizing costs while meeting safety requirements: modelling deterministic (imperfect) staggered tests using standard Markov models for SIL calculations. ISA Trans 45:611–621 (Elsevier)
Selvik JT, Signoret JP (2020) How to interpret safety critical failures in risk and reliability assessment. Reliab Eng Syst Saf (RESS) 161:61–68 (Elsevier)
Signoret J-P, Dutuit Y, Collas S, Cacheux P-J, Folleau C, Thomas P (2013) Assessment of the expected number and frequency of failures of periodically tested systems. Reliab Eng Syst Saf (RESS) 118:61–70 (Elsevier)
Signoret J-P, Collas, Ostebo R (2014) Reliability modelling and calculation of safety systems: ISO/TR 12489. Presentation and application in TOTAL. 11th TÜV Rheinland International Symposium, Functional Safety in Industrial Applications. Cologne, Germany
STRMTG (2011) Systèmes de transport public guidés urbains de personnes. Principe «GAME» (Globalement Au Moins Équivalent) Méthodologie de démonstration. Service technique des remontées mécaniques et des transports guidés. République Française, Ministère de la transition écologique et solidaire. France
Wikipedia ALARP (2020) https://en.wikipedia.org/wiki/ALARP. Accessed Sept 2020
Wikipedia Aristotle (2020) https://en.wikipedia.org/wiki/Aristotle. Accessed Sept 2020
Wikipedia Bellman (2020) https://en.wikipedia.org/wiki/Richard_E._Bellman. Accessed Sept 2020
Wikipedia Gödel (2020) https://en.wikipedia.org/wiki/Gödel’s_incompleteness_theorems. Accessed Sept 2020
Zhang Y, Rauzy A (2017) Modelling of a HIPPS with AltaRica. RAMS Seminar, NTNU, Trondheim, Norway

Part VI

Standardization, Data Collection and Uncertainties

Chapter 37

Standardization

37.1 Introduction to Standardization

To say the least, standardization is not among the first concerns of engineers in general and of reliability engineers in particular! And yet it is omnipresent and governs most human activities. It ranges from languages (e.g. semantics, vocabulary, grammar), writing (e.g. ideograms, alphabets, mathematical symbols), everyday objects (e.g. chairs, spoons, stairs, pens …) and measurement systems (e.g. the metric system) to, of course, industrial products and practices. Standardization is therefore the key to the common understanding between people which is of utmost importance to effectively design, build and exchange useful products (e.g. safe, available, ergonomic, able to work together, environment-friendly and cost-effective) (Ostebo et al. 2018). So it seems justified to establish the links between standardization and the content of this book.

The beginnings of standardization are lost in the mists of time as shown, for example, by the discovery all over Europe of thousands of standardized axes dating from the Bronze Age. Obviously, cost cutting was already a concern at that time! Measuring instruments (length, volume, weight) are another example of important standards. The introduction of the metric system (France) and then of the international system of units (metre, kilogram, second, kelvin, ampere, mole and candela) has been a major step forward in this domain, even if some industries and countries remain stuck with units from the Middle Ages (e.g. yard, pound, barrel, gallon), which leads to conversion problems and confusion [e.g. the loss of a space mission where newton-seconds were confused with pound-force seconds (Wikipedia Mars Climate Orbiter 1998; NASA 1999)!].
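The size of the error caused by such a confusion of units is easy to quantify. The short sketch below converts an impulse expressed in pound-force seconds into newton-seconds; the conversion constant follows from the definition of the pound-force, whereas the impulse value and the function name are purely hypothetical and chosen for illustration only.

```python
# 1 lbf = 4.4482216152605 N (exact by definition of the pound-force)
LBF_TO_NEWTON = 4.4482216152605


def lbf_s_to_newton_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTON


# Hypothetical impulse produced in lbf.s but read as if it were in N.s
reported_value = 10.0
true_value_si = lbf_s_to_newton_s(reported_value)

# Reading the raw number as N.s underestimates the impulse by this factor
underestimation_factor = true_value_si / reported_value
print(f"{reported_value} lbf.s = {true_value_si:.3f} N.s "
      f"(factor {underestimation_factor:.2f})")
```

Whatever the value considered, a number produced in pound-force seconds and interpreted as newton-seconds is too small by a factor of about 4.45.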
Nowadays, the value of standardization can be highlighted by examples where it has not been properly achieved: the threads of nuts and bolts, for which at least two standards exist (the ISO metric system, used worldwide, and the unified thread standard, used in the USA and Canada), and electrical plugs and sockets, for which 14 different types can be identified. And the funny thing is that one of the main standardization bodies is named IEC (International Electrotechnical Commission)! Of course, such shortcomings are likely to cause availability, maintenance and even safety problems (e.g. with power sockets improperly used).

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_37

With regard to industry, the standardization process consists in:
1. establishing a common agreement;
2. ensuring desirable characteristics of products and services;
3. at an economical cost.

The first item implies developing common terminologies and definitions with regard to terms and principles, engineering criteria and practices, materials, equipment parts and components. This is achieved through a consensus between the different parties (firms, users, interest groups, lobbies, standards organizations and governments) participating in the standardization work.

The second item covers safety, environmental impact, efficiency, dependability (reliability, availability, production availability), quality, repeatability, compatibility, interchangeability and user-friendliness. Again, it has to be noted that all these features are closely linked to the main topics of this book, safety and dependability.

The third item implies that the implementation of a given standard provides an economic advantage over parties not using it. Increased benefits are therefore expected, and participating in the development of a standard can be considered as an economic weapon against those who do not participate (due to lack of interest, lack of money and, above all, a misunderstanding of the potential benefits of standardization).

37.2 Standardization Versus Regulation and Certification

The three topics mentioned in the title of this section are often confused and clarifications are needed:
• international standards: they are implemented on a voluntary basis;
• regulations: it is mandatory to apply them;
• certification: it is not mandatory but may be (strongly) recommended for critical items.

The confusion arises from the fact that some international standards can be transformed into regional or national rules. This is the case of the harmonized standards in the European Union, whose application is no longer voluntary but mandatory (e.g. the CE marking for the free circulation of items in Europe). Another source of confusion is that some standards can be used as a basis for certification by certification bodies. The ISO 9000 certification is now inescapable in the quality domain, and the same holds for the SIL certificates based on the IEC 61508 standard for the components used within safety instrumented systems. Such standards remain, theoretically, of voluntary application but in practice it becomes very risky to develop products or safety systems without reference to them.


37.3 Standardization Organization Overview

37.3.1 Standardization Bodies

Standardization is organized at four main levels:
• International level: ISO (International Organization for Standardization), IEC (International Electrotechnical Commission) and ITU (International Telecommunication Union).
• Regional level: CEN (European Committee for Standardization), CENELEC (European Committee for Electrotechnical Standardization) and ETSI (European Telecommunications Standards Institute) for Europe, or PASC (Pacific Area Standards Congress) for Asia.
• National level: AFNOR (France), SN (Standards Norway), BSI (British Standards Institution), ANSI (American National Standards Institute) or SAC (Standardization Administration of the People’s Republic of China).
• Sectoral level: e.g. BNPE (French national standardization bureau for the petroleum industry), API (American Petroleum Institute) for the oil and gas industry, IEEE (Institute of Electrical and Electronics Engineers) or the US Department of Defense (military standards).

Each level comprises many technical committees (TCs) with specific scopes. Technical committees comprise working groups (WGs), which themselves comprise several project teams (PTs). Many interrelationships exist between them and between the levels, and this leads to a rather complicated organization. As a result, a standard developed at a given level can ultimately be adopted at another one: for example, many international standards are also adopted as regional, national or sectoral standards and vice versa. It is mandatory to issue the international standards in English and French, but many standards are translated into other languages by the various national standardization bodies.

37.3.2 Development of a Standard

The development of a standard is itself a standardized process in six main steps (see ISO/IEC directive part 1 Ed.6.0 2020):
1. NWIP: new work item proposal;
2. WD: working draft;
3. CD: committee draft;
4. CDV or DIS: committee draft for vote (IEC) or draft international standard (ISO);
5. FDIS: final draft international standard;
6. IS: international standard.

When the NWIP has been accepted, a project team with a project leader is set up to develop a WD on the basis of the consensus between the participants. Once this is done, the WD becomes a CD and it is circulated for technical and editorial comments. This is the most important stage of the development process with regard to comments, and it is the task of the project team to analyse them and to accept or reject them. This leads to the CDV (IEC) or the DIS (ISO), which is circulated again for vote and editorial comments. When these have been taken into account, the FDIS is issued and circulated for the final vote. Once it is approved, the international standard (IS) is issued by IEC or ISO (or both in the case of joint projects) and becomes available to the users. The whole process takes about 36 months but can be shortened when the WD is based on an already existing document (e.g. when the maintenance of an existing standard is undertaken).

Throughout the development process, the content of the document is based on the consensus between the members of the project team and, at each stage, the document is circulated through the national committees for votes and comments. A majority of two-thirds positive votes and less than 25% negative votes is required to proceed from one stage to the next. This seems to be a rather democratic process but, participation being rather costly, only motivated stakeholders are able to fund an active participation (e.g. in face-to-face meetings) from the early stages. Therefore, the content of the standard more or less reflects their points of view, and this obviously gives them an advantage over those producing only comments who, in turn, have an advantage over those who do not participate at all.
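The six stages and the voting rule described above can be sketched as follows. This is only an illustration, not an implementation of the ISO/IEC directives: the function names are hypothetical, and the sketch assumes (this split is not stated in the text above) that the two-thirds majority is counted among the votes cast by participating members, that the share of negative votes is counted among all the votes cast, and that abstentions are simply not counted.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six main stages of the development of a standard."""
    NWIP = 1        # new work item proposal
    WD = 2          # working draft
    CD = 3          # committee draft
    CDV_OR_DIS = 4  # committee draft for vote (IEC) / draft international standard (ISO)
    FDIS = 5        # final draft international standard
    IS = 6          # international standard


def vote_passes(p_positive: int, p_negative: int,
                other_positive: int = 0, other_negative: int = 0) -> bool:
    """Check the 'two-thirds positive, less than 25% negative' rule.

    Assumption of this sketch: p_* are the votes of participating members,
    other_* the votes of the remaining voters; abstentions are not counted.
    Integer arithmetic is used to avoid rounding issues at the thresholds.
    """
    p_cast = p_positive + p_negative
    total_cast = p_cast + other_positive + other_negative
    if p_cast == 0:
        return False
    two_thirds_in_favour = 3 * p_positive >= 2 * p_cast
    few_negative_votes = 4 * (p_negative + other_negative) < total_cast
    return two_thirds_in_favour and few_negative_votes


def next_stage(current: Stage, approved: bool) -> Stage:
    """Move to the next stage when the vote is positive, otherwise stay."""
    if approved and current < Stage.IS:
        return Stage(current + 1)
    return current


# Hypothetical ballot on a committee draft: 8 votes in favour, 2 against
stage = next_stage(Stage.CD, vote_passes(8, 2))
print(stage.name)
```

In this sketch a ballot with exactly 25% of negative votes fails, which follows the wording "less than 25%" used above; the exact treatment of the boundary should be checked against the directives.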

37.3.3 Type and Content of Standards

The main types of standardization documents are:
• the guides, which provide practical guidance to draft standards,
• the standards properly speaking (IS),
• the technical reports (TR), which provide information in relation to the normative content of given standards.

For example, the ISO/IEC guide 51 Ed. 3.0 (2014) is the basis to develop safety standards, the IEC 61508 Ed. 2.0 (2010) is the mother standard related to functional safety and the ISO/TR 12489 (2013) is the technical report explaining how to perform the probabilistic calculations required in IEC 61508.

A given standard contains two types of clauses:
• normative clauses, which are mandatory to claim that this standard is actually implemented; they are indicated by the key word “shall”;
• informative clauses, which provide recommendations (key word “should”), permissions (key word “may”) and possibilities (key word “can”) for implementing the standard.


It has to be noted that the key word “must” is reserved for requirements coming from outside the normative document itself (e.g. administrative or physical constraints). As it can be confused with the key word “shall”, its use is often avoided. See ISO/IEC directive part 2 Ed. 8.0 (2018) for explanations about the drafting of a standard. The structure of the documents is also standardized and, before the core of the document, every standard contains, at its beginning, the scope (purpose of the document), the normative references (other standards used as reference), the terms and definitions and the symbols and acronyms used within the document. Therefore, every standard related to safety and dependability may contain definitions related to these topics. The standard users should be aware that the same term is not necessarily defined in the same way in all the standards.
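The key words discussed in this section lend themselves to a simple automatic check of a draft clause. The sketch below is a deliberately naive illustration: the function names and the example sentences are hypothetical, and a plain word search cannot distinguish, for instance, the verb “may” from the month of May.

```python
import re

# Meaning of the key words as summarized in this section
KEYWORD_MEANING = {
    "shall": "requirement (normative)",
    "should": "recommendation (informative)",
    "may": "permission (informative)",
    "can": "possibility (informative)",
    "must": "external constraint (outside the document)",
}

# Whole words only, so that e.g. "cannot" is not reported as "can"
_KEYWORD_PATTERN = re.compile(r"\b(shall|should|may|can|must)\b", re.IGNORECASE)


def classify_clause(text: str) -> list[tuple[str, str]]:
    """Return the key words found in a clause, in order, with their meaning."""
    found = []
    for match in _KEYWORD_PATTERN.finditer(text):
        keyword = match.group(1).lower()
        found.append((keyword, KEYWORD_MEANING[keyword]))
    return found


def is_normative(text: str) -> bool:
    """A clause is treated as normative when it contains the key word 'shall'."""
    return any(keyword == "shall" for keyword, _ in classify_clause(text))


# Hypothetical clauses, written for this example only
print(classify_clause("The proof test interval shall be documented."))
print(is_normative("The analyst should justify the data source."))
```

Such a check can help reviewers to spot clauses where “must” has been used instead of “shall”, which is one of the confusions mentioned above.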

37.4 Safety and Dependability Related Standardization

According to the purpose of standardization (i.e. ensuring desirable characteristics of products and services), most standards are more or less directly linked to safety or dependability. Standardization being a very active field, this includes thousands of documents and it is almost impossible to collate all of them! It has to be noted that the early standards related to safety and dependability were developed for military purposes [e.g. military standards or military handbooks from the US Department of Defense (DoD)]. Even if (MIL-HDBK 217F 1995) is still very popular, civil standards have progressively emerged and have now completely superseded the military standards for civil applications. In any case, some committees are more fundamental than others with regard to the purpose of this book and they are identified hereafter:
• IEC/TC56: Dependability. This is the main source of dependability standards because, thanks to an agreement between ISO and IEC (Lugano agreement), this committee is in charge of the development of the dependability standards for every domain.
• IEC/TC65A: Industrial-process measurement, control and automation: System aspects. Contrary to dependability standards, the development of safety standards is spread across all the technical domains. Among them, the IEC/TC65A committee is very important as it develops the functional safety standards devoted to the design of safety instrumented systems. The IEC 61508 Ed. 2.0 (2010), the mother standard on this topic, is the basis of the SIL calculations presented in Chap. 36.
• ISO/TC67/WG4: Materials, equipment and offshore structures for petroleum, petrochemical and natural gas industries: Reliability engineering and technology. This working group, which belongs to a committee devoted to oil and gas standards, develops standards which are very interesting with regard to our purpose and which can be used in other domains:


– ISO/TR 12489 Ed. 1.0 (2013), which explains how to undertake the SIL calculations required by IEC 61508 and which is the basis for the SIL calculations presented in Chap. 36.
– ISO 20815 Ed. 2.0 (2018), which is devoted to production assurance and is practically unique on this subject.
– ISO 14224 Ed. 3.0 (2016), which is devoted to reliability data collection and exchange and is certainly the best on this subject. It is used within Chap. 38.
• IEC/TC1: Terminology. This committee is in charge of the International Electrotechnical Vocabulary (IEV), which is available on www.electropedia.org. The IEC/TC56/WG1 working group develops, in cooperation with the IEC/TC1 committee, the (IEC 60050-192 2015) standard (or IEV 192), which is devoted to the dependability terminology directly linked to Chap. 4 of this book.

Other technical committees have obvious links with safety and dependability and, for example, the following can be mentioned:
• ISO/TC 176: Quality management and quality assurance. This committee develops the ISO 9000 series about quality management.
• ISO COPOLCO and IEC ACOS: Committee on consumer policy and Advisory Committee on Safety. These committees develop, through a joint ISO/IEC project, the ISO/IEC guide 51 Ed. 3.0 (2014) (Safety aspects: Guidelines for their inclusion in standards). This guide provides the definition of risk used by reliability engineers.
• ISO/TC 262: Risk management. This committee develops the ISO guide 73 Ed. 1.0 (2009) (Risk management: Vocabulary) and the ISO 31000 series of standards about risk management (ISO 31000 Ed. 2.0 2018). The guide 73 provides the definition of risk used by risk managers.
• ISO/TC 251: Asset management. This committee develops the ISO 55000 series of standards about asset management (ISO 55000 2014).
• ISO/TC 69: Application of statistical methods.

Among the technical committees mentioned above, it is the IEC/TC 56 committee which develops the greatest number of documents of interest with regard to the content of this book. They can be divided into the following categories:
• Terminology: IEV 192 (Dependability terminology) or (IEC 61703 2016) (Mathematical expressions for reliability, availability, maintainability and maintenance support terms).
• Methods: (IEC 60812 Ed. 3.0 2019) (FMEA), (IEC 61882 Ed. 2.0 2016) (HAZOP), (IEC 61078 Ed. 3.0 2016) (Reliability block diagrams), (IEC 61025 Ed. 3.0 in progress) (Fault tree analysis), (IEC 62502 2010) (Event tree analysis), (IEC 62740 2015) (Root cause analysis), (IEC 61165 Ed. 2.0 2006) (Markovian technique), (IEC 62551 2012) (Petri net techniques), …


• Data: (IEC 60300-3-2 Ed. 2.0 2004) (Collection of dependability data), (IEC 63142 in progress) (Reliability prediction for electronic components) …
• Management: 60300 series (Dependability management) (IEC 60300-1 2014; IEC 60300-3-12 2011), (IEC 62402 2019) (Obsolescence management), (ISO/IEC 31010 2019) (Risk management), (ISO 15663 Ed. 1.0 2021) (Life cycle costing) …
• Guidelines: (IEC 60300-3-1 2003) (Analysis techniques for dependability), (IEC 60300-3-3 2017) (Life cycle costing), (IEC 60300-3-10 2001) (Maintainability), (IEC 62508 2010) (Human aspects), (IEC 62347 2008) (Dependability specifications) …
• Maintainability: (IEC 60300-3-11 2017) (Reliability centred maintenance), (IEC 62550 2017) (Spare parts provisioning) …
• Tests: (IEC 62506 2013) (Accelerated testing), (IEC 61163-2 2020) (Reliability test screening), (IEC 60605-6 2007) (Test for constant failure rate) …
• Reliability growth: (IEC 61014 2003) (Programmes for reliability growth) …
• Software: (IEC 62628 2012) (Guidance on software dependability) …
• Systems: (IEC 62853 2018) (Open systems).
• Communication networks: (IEC 62673 2013) (Methodology for communication network dependability assessment and assurance) …

More than eighty standards are developed and maintained by the IEC/TC 56 committee and the above list is only a sample. The ones related to terminology and methods are the most directly linked to the content of this book, but all the others should be considered in the context of dependability related activities.

At sectoral level, some standards can be mentioned as, for example:
• Reliability data: (MIL-HDBK 217F 1995) (Electronic reliability prediction).
• Reliability growth: (MIL-HDBK-189C 2011) (Reliability growth management).
• Reliability program: (IEEE 933 2013) (Guide for nuclear power stations).
• Recommended practices: (IEEE 493 2007) (Design of reliable power systems).

37.5 Concluding Remarks About Standardization

As said above, international standards can be translated into many languages, but the two official languages used in international standardization are English and French. Therefore, any international standard is issued in both languages and both versions are considered to be equivalent. However, slight differences can be observed due to translation and it may be better to use the original version rather than its translation, especially when standards are used in contracts.


With regard to safety and dependability, a specific problem can be mentioned about the translation into French of the term “dependability”:
• “dependability” covers only the economic aspects whereas,
• in France, “sûreté de fonctionnement” covers both safety and economic aspects.

In fact, “sûreté de fonctionnement” means “functioning sureness” (in the sense of “assurance”) and there is a discrepancy between the meaning used in standardization and the one used by French reliability engineers. This is why, in this book, the terms “safety” and “dependability” are often associated, to make sure that the reader understands that both aspects are actually considered.

Beyond the above remarks, the following observations can be made to conclude this chapter:
• Many standards are issued or in progress at any time and it is very difficult to be aware of all of them.
• Every technical committee can adopt its own terminology and it is very difficult to control the drift of definitions.
• Only an active participation is likely to have an in-depth impact on the content of a standard and this needs motivation and … funds, while standardization is not necessarily among the top priorities of managers. Some countries are very active whereas others suffer from a lack of funds and support.

This leads us to reflect on the following advice:

Standardize yourself for your own needs … or others will do that for you … and not necessarily for your profit!

References

IEEE 493 (2007) Recommended practice for the design of reliable industrial and commercial power systems. Institute of Electrical and Electronics Engineers (IEEE), USA
IEEE 933 (2013) Guide for the definition of reliability program plans for nuclear generating stations and other nuclear facilities. Institute of Electrical and Electronics Engineers (IEEE), USA
IEC 60050-192 (IEV192) (2015) International electrotechnical vocabulary, Part 192: dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-1 (2014) Dependability management: guidance for management and application. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-1 (2003) Dependability management: application guide, analysis techniques for dependability, guide on methodology. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-2 Ed. 2.0 (2004) Dependability management, Part 3-2: application guide, collection of dependability data from the field. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-3 Ed. 3.0 (2017) Dependability management, Part 3-3: application guide, life cycle costing. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-10 (2001) Dependability management: application guide, maintainability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-11 (2017) Dependability management: application guide, reliability centred maintenance. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60300-3-12 (2011) Dependability management: application guide, integrated logistic support. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60605-6 (2007) Equipment reliability testing: test for the validity and estimation of the constant failure rate and constant failure intensity. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60812 Ed. 3.0 (2019) Failure modes and effects analysis (FMEA and FMECA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61014 (2003) Programmes for reliability growth. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61025 Ed. 2.3 (in progress) Fault tree analysis (FTA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61078 Ed. 3.0 (2016) Reliability block diagrams. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61508 Ed. 2.0 (2010) Functional safety. Safety of electrical/electronic/programmable electronic safety-related systems (7 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61511 Ed. 2.0 (2016) Functional safety. Safety instrumented systems for the process safety sector (3 parts). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61703 (2016) Mathematical expressions for reliability, availability, maintainability and maintenance support terms. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61882 Ed. 2.0 (2016) Hazard and operability studies (HAZOP studies): application guide. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61165 Ed. 2.0 (2006) Application of Markov techniques. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61163-2 (2020) Reliability test screening: components. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62347 (2008) Guidance on system dependability specifications. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62402 (2019) Obsolescence management. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62502 Ed. 1.0 (2010) Analysis techniques for dependability. Event tree analysis (ETA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62506 (2013) Method for product accelerated testing. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62508 (2010) Guidance on human aspects of dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62550 (2017) Spare parts provisioning. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62551 Ed. 1.0 (2012) Analysis techniques for dependability. Petri net techniques. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62628 (2012) Guidance on software aspects of dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62673 (2013) Methodology for communication network dependability assessment and assurance. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62740 Ed. 1.0 (2015) Root cause analysis (RCA). International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 62853 (2018) Open system dependability. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 63142 (in progress) A global methodology for reliability data prediction of electronic components. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/IEC directive part 1 Ed. 6.0 (2020) Procedure for the technical work. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/IEC directive part 2 Ed. 8.0 (2018) Principles and rules for the structure and drafting of ISO and IEC documents. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO/IEC Guide 51 Ed. 3.0 (2014) Safety aspects. Guidelines for their inclusion in standards. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO guide 73 Ed. 1.0 (2009) Risk management: vocabulary. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 9000 (2015) Quality management. International Organization for Standardization (ISO), Geneva, Switzerland
ISO/TR 12489 Ed. 1.0 (2013) Petroleum, petrochemical and natural gas industries. Reliability modelling and calculation of safety systems. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 20815 Ed. 2.0 (2018) Petroleum, petrochemical and natural gas industries. Production assurance and reliability management. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 31000 Ed. 2.0 (2018) Risk management. Guidelines. International Organization for Standardization (ISO), Geneva, Switzerland
ISO/IEC 31010 (2019) Risk management: risk assessment techniques. International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 55000 (2014) Asset management: overview, principles and terminology. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 15663 Ed. 1.0 (2021) Petroleum, petrochemical and natural gas industries. Life cycle costing. International Organization for Standardization and International Electrotechnical Commission, Geneva, Switzerland
MIL-HDBK-189C (2011) Reliability growth management. US Department of Defense
MIL-HDBK 217 F notice 2 (1995) Military handbook: reliability prediction of electronic equipment. Department of Defense, Washington DC, USA
NASA (1999) Mars Climate Orbiter Mishap Investigation Board Phase 1 Report
Østebø R, Selvik JT, Naegeli G, Ciliberti T (2018) ISO standards to enable reliable, safe and cost-effective technology development, project execution and operational excellence. OTC 28705, Houston, USA
Wikipedia Mars Climate Orbiter (1998) https://en.wikipedia.org/wiki/Mars_Climate_Orbiter. Accessed Sept 2020

Chapter 38

Data Collection and Uncertainties

38.1 Introduction

No safety and dependability study can be undertaken without relevant input data to feed the models properly, and this is the heart of the problem for quantitative safety and dependability studies. Obvious as this may seem, the first aim of this chapter is to recall it. Once it is accepted that input data are of utmost importance, the next step is to gather them from field feedback and to process and store them. This is a vast subject which would require a whole book for an exhaustive presentation; therefore, the second part of this chapter is limited to a quick survey of reliability data collection standardization and databases. Because they result from statistical calculations and/or expert judgment, the collected data are never perfectly known but only known with a certain degree of confidence. Therefore, it is essential to take these input data uncertainties into account through uncertainty analyses (Hjorteland et al. 2007) and to estimate their impact on the overall results (see Chap. 32). This leads to the third part of this chapter, devoted to the modelling of data uncertainties.

38.2 The Bare Necessity of Input Data

Input data can be split into technical and operational data on the one hand and data coming from past experience on the other hand. Technical and operational data describe the purpose of the system under study, how it works, how it is operated and how it is maintained. They are specific to the system under study and normally readily available; otherwise it would simply not be possible to perform the study. They include information like:

© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7_38


– design purpose;
– operating philosophy;
– maintenance philosophy.

Data coming from past experience provide guidelines for designing the system under study, give information about what could happen and make it possible to estimate the numerical data needed for calculations. They are linked to the family to which the system belongs and they include information like:

– applicable standards, rules and regulations;
– knowledge of incidents/accidents that occurred on similar systems;
– reliability parameters;
– engineering judgment.

The use of past experience is the core of the philosophy underlying the safety and dependability approaches: make use of what happened in the past to forecast what could happen in the future and improve it as far as possible (see Chap. 1). This is part of the gigantic trial and error process which has been at work since the dawn of time and which governs biological evolution (e.g. plants, animals and human beings), the learning process of humankind (e.g. philosophies, sciences and technologies) as well as the expertise of individuals (e.g. experts). Standards, rules, regulations and the state of the art come from the learning process mentioned above and constitute, in a way, the ready-made field feedback available for safety and dependability studies: they cover the basic needs of designers, provide guidelines to avoid the biggest mistakes and, incidentally, make it possible to meet legal obligations. In a less formal way, using the engineering judgment of experts proceeds from the same philosophy.

Concerning incidents/accidents or reliability data, the situation is different, as deliberate, specific actions are needed to extract them from the field feedback and to process them. This is time-consuming, costly and really effective only in the long run. In addition, the quality of the collected data depends on the motivation of the data collectors, who are generally not the data users themselves. This is why data collection is often subject to procrastination and delayed, when it is not abandoned altogether. As a result, only a few databases are available (see Sect. 38.3.2) and reliability engineers often have to cope with scarce, incomplete reliability data coming from different items operated under different conditions. Selecting relevant input data for a given study is challenging and ranges from:

1. taking the data from a specific relevant database and using them directly;
2. taking the data from a generic database or a manufacturer database and tuning them through expert judgment, e.g. by implementing a Bayesian approach (Procaccia 2008; Piepszownik and Procaccia 1992; Clarotti 1989);
3. launching, when no data are available and when possible, a specific data collection campaign to fill the gap;


4. to, in the worst case, estimating the missing data purely on the basis of expert judgment, e.g. with the Delphi approach (Harold 2002; Wikipedia Delphi 2020).

Up to now, the effort devoted to modelling and calculation approaches has been far greater than the effort devoted to reliability data collection which is, certainly, the weak point of probabilistic studies. Thinking that the sophistication of the mathematics can compensate for the lack of input data is a pure illusion and, frankly speaking, any quantitative study performed without relevant input data is pure baloney! This highlights the importance of developing effective data collection systems providing data accurate enough to allow the various approaches to be used at their best (Lannoy and Procaccia 1994; Lannoy 1996). Nowadays, items become obsolete faster and faster, which obviously impedes reliability data collection but, fortunately, at the same time, the universal and fast spread of so-called “smart” items gives hope of automating the data collection process, at least partly, and thus opens the way to improvement in the (near) future.

38.3 Data Collection Standards and Databases

38.3.1 IEC and ISO Data Collection Related Standards

The relevance of a data collection system is closely linked to the quality of the collected information which, in turn, requires clear definitions of, for example:

• the aim of the data collection (e.g. collecting accidents/incidents or events occurring on equipment);
• who collects the data (e.g. manufacturers, operators, contractors);
• who the potential users are (e.g. reliability engineers, safety managers, safety authorities);
• which outcomes are expected (e.g. accident/incident frequency, failure rate or repair rate databases), which field data are needed and which parameters have to be collected (e.g. incidents/accidents, times to failure and failure modes, times to repair);
• the vocabulary to be used to remove any ambiguity in the meaning of the collected data (e.g. univocal equipment names, failure rate coding);
• the processing means used to produce end-user data (e.g. statistical treatments);
• the means of making the results available to end-users (e.g. handbook or computerized database).

Some standardization is obviously needed to satisfy the above requirements; otherwise the collected data would not really be usable in a relevant way. This is why IEC and ISO have developed several standards to help satisfy them. Some are for general purpose (IEC 60300-3-2 Ed. 2.0 2004), others are oriented to specific domains [e.g. ISO 14224 (2016) for the oil and gas industry, ISO 6527 Ed. 1.0 (1982) and ISO 7385 Ed. 1.0 (1983) for the nuclear industry], to specific operations [e.g. IEC 60706-3 (2006) for maintenance] or to specific types of components [e.g. IEC 63142 (in progress) for electronic components, IEC 61709 (2017) or IEC/TR 63162 (in progress) for electrical components]. It is beyond the scope of this book to detail all of them, but short descriptions can be found on the ISO or IEC web sites. Nevertheless, a glance at the ISO 14224 (2016) standard is useful to explain how a common understanding between data collectors, data processors and data users can be implemented (Kortner et al. 2005).

The ISO 14224 standard provides a standardized taxonomy allowing an accurate identification of the items on which the data collection is performed. This taxonomy is based on several levels for identifying the location of the item of interest (e.g. related industry, plant unit or system) and several hierarchical levels for identifying the equipment subdivisions (e.g. equipment class, subunits or maintainable items). This ensures the traceability of the collected data. Even if this standard is devoted to the oil and gas industry, the proposed taxonomy (i.e. systematic classification) remains valid for any other process industry. In particular, the battery limits of the equipment units are clearly defined in order to avoid mixing up what belongs to the unit and what is outside. For example, the driver (diesel engine or motor) is outside the battery limit of pumps, so that data including driver failures are not mixed with data excluding them.

With the above taxonomy, the maintainable item level is the main level for collecting data. This makes it possible to collect both failure and maintenance data related to the same items. In addition, ISO 14224 provides standardized lists of failure modes for each equipment unit, which are used as guidelines for the data collection and limit the risk of discrepancy between data collectors. Specific codes are used to identify these failure modes (e.g. FTC for fail to close on demand or SPO for spurious operation of a valve) in order to facilitate the data processing. Accurate timeline definitions are also provided in relation to the up and down states of the item under consideration. This makes it possible to collect the calendar times, operating times and maintenance times which are of utmost importance for estimating the reliability parameters (e.g. failure rate, active repair time or repair man-hours). Thanks to the taxonomy, the failure mode codes and the timeline definitions, the risk of confusion and error is minimized. This standard is based on the experience of the OREDA database (OREDA 2015; Ostebo 2006; Ostebo et al. 2000), which has been in use since the early eighties and has proven very effective in producing accurate failure and maintenance data.
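To make the idea of a coded, traceable failure record more concrete, the following Python sketch shows one possible in-house record layout. It is only an illustration: the class `FailureRecord`, its field names, the item names and the numerical values are hypothetical and are not taken from ISO 14224; only the two failure mode codes FTC and SPO come from the text above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    """One failure event; all field names are hypothetical."""
    equipment_class: str        # e.g. "valve"
    maintainable_item: str      # level at which the data are collected
    failure_mode: str           # coded failure mode, e.g. "FTC" or "SPO"
    active_repair_hours: float  # repair time attached to the event


def count_by_failure_mode(records):
    """Number of recorded failures for each failure mode code."""
    return Counter(r.failure_mode for r in records)


def failure_rate_per_mode(records, mode, accumulated_hours):
    """Failures of one mode divided by the accumulated observation time."""
    return count_by_failure_mode(records)[mode] / accumulated_hours


# Hypothetical field feedback for a population of similar valves.
records = [
    FailureRecord("valve", "actuator", "FTC", 6.0),
    FailureRecord("valve", "actuator", "FTC", 4.5),
    FailureRecord("valve", "pilot valve", "SPO", 2.0),
]
```

Because every record carries the same coded failure mode field, counting events per mode and relating them to an accumulated observation time becomes a simple aggregation.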

38.3.2 Databases

Before performing a safety and dependability analysis (e.g. HAZOP or event tree analyses, or even identifying the unwanted events of fault trees), it is strongly recommended to inquire about what has already happened on similar installations by consulting incident/accident databases. When they exist, such databases are specific to the industrial domain of interest, for example the world offshore accident database (WOAD 2020) for the oil and gas domain, the incident reporting system (IRSNI 2020; Aupied and Procaccia 1984) for nuclear installations, or the aviation safety reporting system (ASRS 2020) for aeronautics. It should be noted that the Internet which is, in itself, a big database keeping track of almost everything, should not be neglected for this purpose.

At the numerical calculation stage of the analysis, numerical parameters (e.g. failure rates, repair rates) are needed as input. As already mentioned in Sect. 38.2, in the most favourable case they can be provided by generic databases or by manufacturers. Otherwise, they can be obtained in-house, by a specific survey of failure or maintenance records related to similar equipment, or by pure expert judgment when nothing is available (Leroy 2018). In the latter case, an approach based on questionnaires, interactive groups or individual interviews [e.g. Delphi approach (Harold 2002; IEEE 500 1984)] can be implemented to gather and process the expertise of several experts in order to estimate reliability data. This was done for the IEEE 500 standard, which was a typical example of a database proposing failure rates established on pure expert judgment. When in-house data collection campaigns are launched, it is wise to implement a standard like ISO 14224 or IEC 60300-3-2.

Manufacturer data (e.g. in SIL certificates of safety-related items), which may be somewhat optimistic, and generic data, which may relate to slightly different items operated in different conditions, are readily available, but it is often necessary to adapt them to the items actually under study. This can be done by combining these data with engineering judgment through Bayesian analyses (Procaccia 2008; Piepszownik and Procaccia 1992; Clarotti 1989).
Following this approach, it is recommended to build, from the various data sources available, a set of preferred reliability data to be used for any study, in order to produce consistent results from one study to the other. Among the generic databases, the following can be mentioned:

• MIL-HDBK 217: Reliability prediction of electronic equipment (MIL-HDBK 217 F 1995; MIL-HDBK 217+ 2015). It is one of the oldest and most famous databases for electronic components. It has been widely used but, as its last issue dates back to 1995, it is somewhat obsolete even if it is still used.
• FIDES: a methodology for component reliability (FIDES 2020). Like MIL-HDBK 217, it is devoted to electronic components. It has been standardized in France (UTEC80-811 2011) and its standardization is in progress at the international level (IEC 63142 in progress).
• NPRD: Nonelectronic parts reliability data publication (NPRD 2016). The last issue dates from 2016 and compiles data from the early 1970s to the end of 2014.
• OREDA: Offshore and onshore reliability data (OREDA 2015; Leroy 2018; Ostebo 2006; Ostebo et al. 2000). Initiated in 1981, the project has been steadily and successfully pursued until now. It is issued as a computerized database (for the project members) and as handbooks (for everyone). It has been the basis for developing the ISO 14224 standard presented in Sect. 38.3.1.


• PDS: Data handbook (PDS 2013). Developed from data coming from OREDA and vendors, it provides reliability data for components used in safety instrumented systems.
• EXIDA: Safety equipment reliability handbook (EXIDA Ed. 4.0 2015). This is a specific database for components used in safety instrumented systems.

It should be noted that, if reliability data are often scarce because only a few are collected, this is also because failure events are, fortunately, not frequent when good-quality, reliable items are used. Therefore, even when convinced of the usefulness of data collection, a single operator may not be able to collect enough failures on reliable items to build representative samples and perform accurate statistical estimations. Fortunately, in this domain as in many others, unity is strength, and it is better to pool the data coming from several operators of similar installations than to rely on the data coming from a single one: this yields bigger statistical samples faster. The idea seems obvious and trivial but, beyond the standardization of the data collection itself, a new challenge has to be taken up: making the data sources anonymous to ensure that an operator cannot be informed of the reliability/dependability performance of the others. This may be the biggest problem encountered when developing joint databases, because nobody wants to share collected data without a guarantee of confidentiality. Therefore, the information related to the location of the items has to be hidden from the users. As this information is of utmost importance for the data collection, it cannot actually be removed. The involvement of a trusted third party in charge of keeping the data anonymous is therefore needed. Since the eighties, this principle has proven very effective in carrying out the OREDA project between major operators of the oil and gas domain.

38.4 Reliability Data Estimation

As explained in the previous sections, most of the input data used for probabilistic calculations come from statistics on the field feedback (gathered on similar items actually in operation), from questionnaires and expert estimations when the field feedback is not available (Delphi approach, see Harold 2002 or Wikipedia Delphi 2020), or from a mix of both. Talking about statistical estimation or expert judgment means talking about data which are not perfectly known but only known with a certain degree of confidence, and which are thus prone to uncertainty. For example, if k failures have been collected for similar items over an accumulated observation time $T_c$, the failure rate of these items can be statistically estimated by the maximum likelihood estimator (MLE) as:

$$\hat{\lambda} = \frac{k}{T_c} \qquad (38.1)$$


This MLE is of no use when k is equal to zero, because it then gives a failure rate equal to zero. In this case the median value, which is another statistical estimator, can be used instead (see Georgin and Signoret 1981):

$$\lambda_{50\%} = \frac{k + 0.7}{T_c} \qquad (38.2)$$

The advantage of the median value is that its confidence level remains the same whatever the number of observed events, and that it converges toward the MLE when k increases. In the same way, and for the same set of items, if n repairs (n = k or n = k − 1, depending on whether the last repair has been completed or not) have been observed over an accumulated repair time $T_r$, the repair rate can be estimated by the MLE or the median value as:

$$\hat{\mu} = \frac{n}{T_r} \quad \text{and} \quad \mu_{50\%} = \frac{n + 0.7}{T_r} \qquad (38.3)$$

These values can be used for calculating point estimates, for example:

• the item reliability as $\hat{R}(t) = e^{-\hat{\lambda} t}$;
• the item unavailability as $\hat{U}(t) = \frac{\hat{\lambda}}{\hat{\lambda} + \hat{\mu}} \left[ 1 - e^{-(\hat{\lambda} + \hat{\mu}) t} \right]$ (see Chap. 31).
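The estimators of Formulae 38.1 to 38.3 and the two point estimates above can be sketched in a few lines of Python. The function names and the numerical field feedback (4 failures over 200,000 h, 4 repairs over 48 h) are hypothetical and only serve as an illustration.

```python
import math


def failure_rate_mle(k, t_c):
    """Formula 38.1: k failures over the accumulated observation time t_c."""
    return k / t_c


def failure_rate_median(k, t_c):
    """Formula 38.2: median-type estimate, still usable when k = 0."""
    return (k + 0.7) / t_c


def repair_rate_mle(n, t_r):
    """Formula 38.3 (MLE part): n repairs over the accumulated repair time t_r."""
    return n / t_r


def reliability(lam, t):
    """Point estimate of the reliability at time t for a constant failure rate."""
    return math.exp(-lam * t)


def unavailability(lam, mu, t):
    """Point estimate of the unavailability at time t of a repairable item."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))


# Hypothetical field feedback (times in hours).
k, t_c = 4, 2.0e5
n, t_r = 4, 48.0

lam_hat = failure_rate_mle(k, t_c)
mu_hat = repair_rate_mle(n, t_r)
print(lam_hat, failure_rate_median(k, t_c), mu_hat)
print(reliability(lam_hat, 8760.0), unavailability(lam_hat, mu_hat, 8760.0))
```

With no observed failure (k = 0), `failure_rate_mle` returns zero whereas `failure_rate_median` still returns a usable, non-zero value.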

Using median values instead of MLE values also leads to approximations of $\hat{R}(t)$ and $\hat{U}(t)$, but these approximations have nothing to do with the median values of R(t) and U(t). Regardless of the way λ and μ are estimated (expert judgment or statistics), the calculations of any probabilistic characteristic [e.g. $\hat{R}(t)$ or $\hat{U}(t)$] using these parameters are also estimations and not exact results. Therefore, to get an idea of the degree of confidence which can be placed in the above calculations, it is necessary to:

• evaluate the uncertainty on the input data according to the field feedback or expert judgment available;
• evaluate the impact of those uncertainties on the probabilistic results obtained from dependability models like RBDs or the FTs considered in this chapter.

38.5 Data Uncertainty Modelling

38.5.1 Data Accuracy Versus Field Feedback

As indicated by their name, the reliability data estimations obtained from the field feedback are only … estimations. This means that they depend on the representativeness of the collected statistical samples and that they are not perfectly known. Therefore, they can be considered as random variables and, like any random variable, be modelled by probability distributions. This is illustrated in Fig. 38.1 in the case of failure rates considered as random variables and for three kinds of field feedback.

Fig. 38.1 Distribution of a random failure rate according to the field feedback (probability density f for large, medium and scarce field feedback, all sharing the same MLE/average value)

The shape of the probability distribution changes according to the quantity of information available: it is very flat when the field feedback is poor and becomes more and more peaked (the standard deviation decreases) when the field feedback increases, i.e. when the number of observed failures and/or the accumulated operating time increases. In the limit, should the quantity of information go to infinity, the standard deviation would tend to zero and the exact value of the failure rate would be obtained. Even if the three distributions presented in Fig. 38.1 have the same average value (maximum likelihood estimation), it is obvious that the level of confidence in the probabilistic calculations undertaken with this average value is not the same when it comes from accurate estimations (i.e. large field feedback) as when it comes from inaccurate estimations (i.e. scarce field feedback). Therefore, the uncertainties attached to the average values of the reliability parameters are also needed to perform relevant probabilistic calculations. The following subsections deal with uncertainty estimation and modelling. The handling of uncertainties by Monte Carlo simulation is described in Sect. 32.5.

38.5.2 Uniform and Triangular Distributions: Expert Judgment

Expert judgment cannot provide accurate figures but is very effective for ranking the probabilities of failure or failure rates of various items: even if an expert is not able to give the exact value of the failure rate of an item A, he or she is generally able to say whether this item is more or less reliable than another item B. This is the basis of the Delphi approach for ranking the probabilities of failure/failure rates of several items A, B, C, … Then, if the probabilities of failure/failure rates of some of them are already known in another way (e.g. from statistics on field feedback), they can be used to frame those of the others between lower and upper bounds: for example, if the probabilities of failure/failure rates of B and C are known and if the expert judgment indicates that the probability of failure/failure rate of A is greater than that of B and lower than that of C, the probability of failure/failure rate of A lies within the bounds provided by B and C.

Then, the probability of failure/failure rate of A is no longer a point value but a random variable taking its values within the interval $[B_{inf}, B_{sup}]$. Without more information, the simplest assumption is that any value within these bounds has the same probability of being the exact value. This corresponds to a random variable uniformly distributed between these bounds, as illustrated in Fig. 38.2 for the failure rate. Due to its shape, the uniform distribution is also often called the rectangular distribution.

Nevertheless, the experts are sometimes able to estimate the most probable value between the two bounds, and this leads to a triangular distribution as illustrated in Fig. 38.3. This distribution is more informative than the uniform distribution as it contains more knowledge about the most probable value of the random variable. As for the uniform distribution, the support of the triangular distribution is the interval $[B_{inf}, B_{sup}]$. Expert judgment can also be modelled by using gamma distributions and the Bayesian approach: this is described in Sect. 38.5.4.

Fig. 38.2 Uniform (or rectangular) distribution of a random failure rate

Fig. 38.3 Triangular distribution of a random failure rate
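As an illustration of how the uniform and triangular models can feed an uncertainty propagation by Monte Carlo simulation, the following sketch draws random failure rates with the Python standard library. The bounds `b_inf` and `b_sup`, the most probable value `mode` and the number of draws are hypothetical figures chosen for the example.

```python
import random
import statistics


def sample_uniform_rate(b_inf, b_sup, rng):
    """Failure rate uniformly distributed between the two expert bounds."""
    return rng.uniform(b_inf, b_sup)


def sample_triangular_rate(b_inf, b_sup, mode, rng):
    """Failure rate with a triangular distribution peaking at the expert's
    most probable value; random.triangular takes (low, high, mode)."""
    return rng.triangular(b_inf, b_sup, mode)


rng = random.Random(0)  # fixed seed so that the draws are reproducible

# Hypothetical expert bounds and most probable value (per hour).
b_inf, b_sup, mode = 1.0e-6, 1.0e-5, 3.0e-6

uniform_draws = [sample_uniform_rate(b_inf, b_sup, rng) for _ in range(20000)]
triangular_draws = [sample_triangular_rate(b_inf, b_sup, mode, rng)
                    for _ in range(20000)]

print(statistics.mean(uniform_draws), statistics.mean(triangular_draws))
```

The sample means should approach (b_inf + b_sup)/2 for the uniform distribution and (b_inf + b_sup + mode)/3 for the triangular one, which shows how the extra knowledge about the most probable value shifts the result.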


38.5.3 Chi-Square Distribution: Statistics from Field Feedback

For data obtained through statistics from field feedback, the situation is somewhat different from that of the uniform and triangular distributions because the values of the random variable range from zero to infinity, i.e. the support is [0, ∞[, as illustrated in Fig. 38.4. In this case, the uncertainty can be evaluated by a confidence interval in which the true value has a given probability of being found. More generally, a centered interval $[\lambda_{\alpha}, \lambda_{1-\alpha}]$, such that the unknown true failure rate has a probability α of being lower than $\lambda_{\alpha}$ and a probability α of being higher than $\lambda_{1-\alpha}$, has a confidence level equal to (1 − 2α). A 90% confidence interval (1 − 2α = 90%) is commonly used for this purpose, as illustrated in Fig. 38.4: the interval $[\lambda_{5\%}, \lambda_{95\%}]$ has a 90% chance of containing the true value of the failure rate. Of course, the narrower the interval, the more accurate the estimation.

When the failure rate is estimated from statistics, the chi-square distribution (see Sachs 1984; IEC 60605-4 1986; Lannoy and Procaccia 1994; Lannoy 1996; CPR 12E 1997) is commonly used for determining the bounds of this confidence interval. This distribution is widely used within the statistical framework and, mathematically speaking, it is the distribution of the sum of the squares of nf independent standard normal variables (nf is called the number of “degrees of freedom”) and a particular case of the gamma distribution. It is difficult to handle as no general analytical formula is available for its percentiles, but these are tabulated in statistical books and in spreadsheet tools, and algorithms are available to calculate them on a computer. The links between the chi-square distribution and the bounds of a confidence interval depend on the type of the statistical samples (e.g. complete or not, biased or unbiased, censored or not).
Establishing them is rather complicated and beyond the scope of this book. However, if k failures have been collected for similar items over an accumulated observation time $T_c$, it is generally accepted to calculate the bounds of the 90% confidence interval as follows:

$$[\lambda_{5\%}, \lambda_{95\%}] = \left[ \frac{1}{2T_c}\,\chi^2_{95\%,\,2k},\ \frac{1}{2T_c}\,\chi^2_{5\%,\,2(k+1)} \right] \qquad (38.4)$$

Fig. 38.4 General distribution of a random failure rate with a 90% confidence level
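Formula 38.4 can be evaluated without statistical tables. The sketch below relies only on the Python standard library and on two assumptions that are not stated in the text: (i) the subscript percentage of $\chi^2$ in Formula 38.4 is read as a probability of exceedance, so that $\chi^2_{95\%,2k}$ is the value below which 5% of the distribution lies; (ii) since both numbers of degrees of freedom, 2k and 2(k + 1), are even, the cumulative distribution function can be written as a finite Poisson-type sum and inverted by bisection. The function names are hypothetical.

```python
import math


def chi2_cdf_even(x, dof):
    """CDF of a chi-square distribution with an even number of degrees of
    freedom: 1 - exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(dof // 2):
        total += term
        term *= half / (i + 1)
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(p, dof):
    """Value x such that chi2_cdf_even(x, dof) = p, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:   # enlarge the bracket if needed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def failure_rate_interval(k, t_c, alpha=0.05):
    """Bounds [lambda_alpha, lambda_(1-alpha)] in the sense of Formula 38.4."""
    # With k = 0 there is no chi-square with 0 degrees of freedom:
    # the lower bound is simply zero.
    lower = 0.0 if k == 0 else chi2_quantile_even(alpha, 2 * k) / (2.0 * t_c)
    upper = chi2_quantile_even(1.0 - alpha, 2 * (k + 1)) / (2.0 * t_c)
    return lower, upper


# Hypothetical field feedback: 3 failures over 100,000 h.
print(failure_rate_interval(3, 1.0e5))
```

When a statistical library is available, its chi-square percentile function can replace `chi2_quantile_even`; the bisection is used here only to keep the example self-contained.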

It has to be noted that the percentages related to the bounds and the percentiles of the chi-square distribution have to be inverted to perform the calculations (e.g. 5% vs. 95% and vice versa). According to Formula 38.4, the upper and lower bounds cannot be calculated with a single chi-square distribution. Therefore, the failure rate, as a random variable, cannot be modelled by a simple chi-square distribution. This is a pity because it would be the natural way to handle data uncertainties for uncertainty propagation purposes (Monte Carlo simulation, see Chap. 32), just by considering objective data (the accumulated observation time and the number of observed events) without adding any other assumption.

Figure 38.5 illustrates the evaluation of such a 90% confidence interval. It can be observed that $\chi^2_{1-\alpha,2k}$ provides the correct lower bound and a non-conservative upper bound, and that $\chi^2_{1-\alpha,2k+2}$ provides the correct upper bound and a conservative lower bound. Therefore, $\chi^2_{1-\alpha,2k+2}$ is a conservative estimation of the probability distribution of the failure rate and can be used for this purpose. Nevertheless, when the number of observed failures is small or equal to zero, this may be a little pessimistic, and considering $\chi^2_{1-\alpha,2k+1}$, which lies between $\chi^2_{1-\alpha,2k}$ and $\chi^2_{1-\alpha,2k+2}$, can be a good compromise.

Fig. 38.5 90% confidence interval when using chi-square distributions

According to Formula 38.4, both upper and lower bounds depend on the desired confidence level (1 − 2α), on the accumulated observation time (e.g. operating time $T_c$) and on the number of observed events (e.g. k failures). Intuitively, it seems that the width of the confidence interval increases when the confidence level increases and decreases when the accumulated observation time and/or the number k of observed failures increases. But as, at the same time, the MLE decreases when the accumulated observation time increases and increases when the number k of observed failures increases, the impact on the confidence interval is not so obvious to assess. The problem is illustrated in Fig. 38.6, where the accumulated observation time $T_c$ is constant and the number of observed failures increases from 0 to 20.

Fig. 38.6 Impact of k over the 90% confidence interval when Tc is constant

The plots on the left-hand and right-hand sides are equivalent because only the scale of the ordinate has been changed from linear to logarithmic. On the left-hand side, the width of the confidence interval increases (in absolute value), whereas it decreases on the right-hand side (in relative value). Therefore, the width of the confidence interval is not really a good measure of the accuracy of the estimation and something else has to be used. This can be done by considering the ratio of the upper bound to the lower bound, which provides a relative measure of the width of the confidence interval. Doing so is similar to calculating the error factor defined for the log-normal distribution analysed in Sect. 38.5.4. More precisely, it is the squared value of the error factor which is obtained in this way. As the result is not a genuine error factor, it will be named pseudo error factor and defined as follows:

Pseudo error factor $q_\alpha$: square root of the ratio of the upper percentile at (1 − α) to the lower percentile at α.

$$q_\alpha = \sqrt{\frac{\lambda_{1-\alpha}}{\lambda_\alpha}} \qquad (38.5)$$

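Considering α = 5% and the corresponding lower and upper bounds provided by the chi-square distribution leads to:

$$q_{5\%} = \sqrt{\frac{\lambda_{95\%}}{\lambda_{5\%}}} = \sqrt{\frac{\frac{1}{2T_c}\,\chi^2_{5\%,\,2(k+1)}}{\frac{1}{2T_c}\,\chi^2_{95\%,\,2k}}} = \sqrt{\frac{\chi^2_{5\%,\,2(k+1)}}{\chi^2_{95\%,\,2k}}} \qquad (38.6)$$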

Therefore, this pseudo error factor $q_\alpha$ depends only on the number of observed failures, and the accumulated observation time does not matter. This is illustrated in Fig. 38.7: the pseudo error factor $q_{5\%}$ is not defined (infinite) when no failure has been observed, decreases quickly from 1 to 6 failures and then decreases more and more slowly toward an asymptotic value equal to 1 (where the upper and lower bounds reach the same value and there is no more uncertainty).

Fig. 38.7 Pseudo error factor q90% when the number of observed failures increases

The impact of the confidence level on the confidence interval width and on the corresponding pseudo error factor is illustrated in Fig. 38.8, which has been drawn for k = 5. The probability α attached to the bounds has been limited to 50% to prevent the lower bound from becoming higher than the upper bound. Due to the different degrees of freedom of the chi-square distributions, $\lambda_{50\%}$ estimated as a lower bound is lower than $\lambda_{50\%}$ estimated as an upper bound. On the left-hand side, the width of the confidence interval decreases when α increases, i.e. when the confidence level (1 − 2α) decreases. On the right-hand side, the corresponding pseudo error factor also decreases when the confidence level decreases. It is slightly higher than 1 for α = 50% due to the difference in degrees of freedom between the two chi-square distributions.

Fig. 38.8 Impact of the confidence level on the confidence interval and the pseudo error factor
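The behaviour of the pseudo error factor described above can be checked numerically. The following sketch evaluates Formula 38.6 for several values of k; as the accumulated observation time cancels out, it does not appear in the code. It is a minimal illustration under the same assumptions as for Formula 38.4: the subscript percentages of $\chi^2$ are read as probabilities of exceedance, and the percentiles are obtained by bisection on the finite-sum expression of the chi-square CDF, which holds for even degrees of freedom. The function names are hypothetical.

```python
import math


def chi2_quantile_even(p, dof):
    """Percentile of a chi-square distribution with an even number of
    degrees of freedom, by bisection on its finite-sum CDF."""
    def cdf(x):
        half, term, total = x / 2.0, 1.0, 0.0
        for i in range(dof // 2):
            total += term
            term *= half / (i + 1)
        return 1.0 - math.exp(-half) * total

    lo, hi = 0.0, 1.0
    while cdf(hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)


def pseudo_error_factor(k, alpha=0.05):
    """Formula 38.6: square root of the upper to lower bound ratio."""
    if k == 0:
        return math.inf  # the lower bound is zero, the ratio is not defined
    upper = chi2_quantile_even(1.0 - alpha, 2 * (k + 1))
    lower = chi2_quantile_even(alpha, 2 * k)
    return math.sqrt(upper / lower)


for k in (0, 1, 2, 5, 10, 20):
    print(k, pseudo_error_factor(k))
```

The printed values decrease quickly for the first few failures and then more and more slowly, whatever the accumulated observation time.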


38.5.4 Bayesian Approach and Gamma Distribution

38.5.4.1 Bayesian Approach Principle

It is not possible to write a chapter about input data without mentioning the Bayesian approach, which is an effective way to elaborate these data. As this is a specific topic with a rather difficult mathematical background, only an overview of its principle is given hereafter. Detailed explanations can be found in the references (Procaccia 2008; Piepszownik and Procaccia 1992; Clarotti 1989).

For a long time, the classical approach to probabilistic calculation (renamed the frequentist approach) was presented as antagonistic to the Bayesian approach (renamed the subjective approach). Fortunately, this internal war within the safety and dependability domain is now over and both approaches are considered to be complementary. In fact, reliability engineers looking for input data never start from scratch, as various more or less relevant data sources are always available: they range from databases (e.g. specific, manufacturer or generic databases) to expert judgment. In most cases, their problem is not to select a given data source but rather to make the best use of the information provided by all of them. Therefore, they may be led to mix data from different sources and, according to their engineering judgment, to use different weightings to take into account the relevance of each source with regard to the study undertaken. This approach is said to be subjective because it involves expert judgment but, on closer inspection, any approach involves some subjectivity and none can be completely objective.

The input data elaborated in this way are based on the prior information available and can be used a priori for performing the forecasting probabilistic calculations.
Then, when the system (or similar systems) has actually been brought into use, further data can be gathered from the field feedback, and it is there that the Bayesian approach can be implemented to elaborate more accurate data (called posterior data) incorporating both the prior information and the new information from the field feedback. To do that, it is necessary to:
• select a prior distribution for the data input;
• gather further data from field feedback;
• calculate the likelihood of these data with regard to the prior distribution;
• use Bayes' theorem to update the prior distribution with the likelihood of these data in order to obtain a more accurate distribution called posterior distribution.

It can be demonstrated (Dupuis 2007) that the impact of the prior distribution decreases when the amount of collected data increases. Therefore, when plenty of data are collected, the prior choice does not matter; however, the better the prior choice, the faster the convergence of the data toward asymptotic values. Presented as above, this seems rather simple, but in fact it requires the evaluation of the following formula (see Pagès and Gondran 1986; Procaccia 2008):

38.5 Data Uncertainty Modelling

$$f''(\theta \mid x) = \frac{L(x \mid \theta)\, f'(\theta)}{\int L(x \mid \theta)\, f'(\theta)\, d\theta} \qquad (38.7)$$

where:
– f′(θ) is the prior distribution,
– L(x|θ) is the likelihood of the collected data x with regard to the prior value θ,
– f″(θ|x) is the posterior distribution.

In the general case, the above calculation is rather complicated as no analytical solution exists. Fortunately, when the prior distribution and the posterior distribution are chosen to be conjugate distributions, the calculations are simplified. This is in particular the case when gamma distributions are considered, and this is why this distribution is often preferred when the Bayesian approach is implemented.
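When no conjugate pair is available, Formula 38.7 can still be evaluated numerically. The following is only a minimal sketch, assuming an evenly spaced grid and a simple rectangle rule for the integral; the function names and the numerical inputs (a gamma prior with 4 equivalent failures over 4000 h, then 1 failure over 6000 h, i.e. the figures of the example in Sect. 38.5.4.3) are illustrative choices, not part of the original text.

```python
import math

def posterior_on_grid(prior_pdf, likelihood, thetas):
    """Evaluate Formula 38.7 on an evenly spaced grid of theta values.

    Assumption of this sketch: the integral in the denominator is
    approximated by a rectangle rule (any quadrature scheme would do).
    """
    step = thetas[1] - thetas[0]
    unnormalised = [likelihood(t) * prior_pdf(t) for t in thetas]
    evidence = sum(unnormalised) * step          # denominator of Formula 38.7
    return [u / evidence for u in unnormalised]

def gamma_pdf(lam, alpha, beta):
    """Gamma density with shape alpha and inverse scale (rate) beta."""
    return beta**alpha / math.gamma(alpha) * lam**(alpha - 1) * math.exp(-beta * lam)

# Illustrative inputs: prior equivalent to 4 failures over 4000 h,
# new observation of k = 1 failure over Tc = 6000 h.
K, TC = 1, 6000.0
STEP = 1e-6
grid = [i * STEP for i in range(1, 5001)]        # 1e-6 ... 5e-3 per hour

posterior = posterior_on_grid(
    lambda lam: gamma_pdf(lam, 4, 4000.0),       # prior density
    lambda lam: lam**K * math.exp(-lam * TC),    # likelihood, up to a constant
    grid,
)

# Posterior mean obtained by numerical integration over the grid.
posterior_mean = sum(lam * p for lam, p in zip(grid, posterior)) * STEP
print(f"posterior mean = {posterior_mean:.3e} per hour")
```

With these inputs the numerical posterior mean should agree closely with the closed-form value 5/10⁴ h⁻¹ given by the conjugate gamma update, which is a convenient sanity check for the grid bounds and step.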

38.5.4.2 Gamma Distribution (Principle)

Modelling a failure rate, λ, as a random variable Λ by a gamma distribution can be done by using the following probability density function (PDF):

$$f_G(\lambda; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1}\, e^{-\beta \lambda} \qquad (38.8)$$

It has to be noted that the exponential, Erlang and chi-square distributions are particular cases of the gamma distribution (Piepszownik and Procaccia 1992). In this formula:
• λ is the value of Λ, and f_G(λ; α, β)·dλ gives the probability that it lies between λ and λ + dλ;
• β is the inverse scale parameter; in terms of statistical sample, β is equivalent to the accumulated observation time;
• α is the shape parameter; in terms of statistical sample, α is equivalent to the number of failures observed over β;
• Γ(.) is a function such that Γ(x + 1) = x·Γ(x); when n is an integer, Γ(n) = (n − 1)!

The average value of the failure rate is equal to E(λ) = α/β and the variance is equal to σ²(λ) = α/β². The inverse cumulative gamma distribution is needed to establish the confidence intervals related to the random variable Λ. This function is not available in a simple analytic form but, fortunately, most spreadsheet tools provide a function LAW.GAMMA.INVERSE(q%, α, 1/β) which calculates the qth percentile as a function of the number of observed failures, α, and of the inverse of the accumulated observation time (i.e. the scale parameter 1/β).
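Outside a spreadsheet, the same percentile can be obtained numerically. The sketch below is a hedged illustration restricted to integer values of α (the Erlang case), for which the cumulative distribution has a finite-sum form that can be inverted by bisection; the function names are hypothetical, and for non-integer α a statistical library routine (for instance `scipy.stats.gamma.ppf`) would normally be used instead. The numerical inputs are those of the example in Sect. 38.5.4.3.

```python
import math

def gamma_cdf_integer_shape(lam, alpha, beta):
    """CDF of a gamma law with INTEGER shape alpha and rate beta (Erlang case)."""
    x = beta * lam
    term, total = 1.0, 1.0
    for i in range(1, alpha):
        term *= x / i              # term now holds x**i / i!
        total += term
    return 1.0 - math.exp(-x) * total

def lgi(q, alpha, scale):
    """q-quantile found by bisection; plays the role of LGI(q, alpha, 1/beta).

    Assumption of this sketch: alpha is a positive integer.
    """
    beta = 1.0 / scale
    low, high = 0.0, alpha / beta
    while gamma_cdf_integer_shape(high, alpha, beta) < q:
        high *= 2.0                # widen the bracket until it contains the quantile
    for _ in range(200):           # far more halvings than double precision needs
        mid = 0.5 * (low + high)
        if gamma_cdf_integer_shape(mid, alpha, beta) < q:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

# Illustrative use: 4 failures over 4000 h, i.e. scale 1/beta = 2.5e-4 per hour.
lam_05 = lgi(0.05, 4, 2.5e-4)
lam_95 = lgi(0.95, 4, 2.5e-4)
print(f"5th percentile  = {lam_05:.3e} per hour")
print(f"95th percentile = {lam_95:.3e} per hour")
```

Bisection is chosen here only because it is simple and robust; it needs nothing beyond a monotonic cumulative distribution function.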


Noting LGI(.) the above function in short, the bounds of the 90% centred confidence interval are given by:
• λ5% = LGI(5%, α, 1/β)
• λ95% = LGI(95%, α, 1/β)

Therefore, [LGI(95%, α, 1/β) − LGI(5%, α, 1/β)] is the width of this 90% confidence interval and the corresponding pseudo error factor (see Sect. 38.5.3) is equal to:

$$\sqrt{\frac{LGI(95\%, \alpha, 1/\beta)}{LGI(5\%, \alpha, 1/\beta)}}$$

It has to be noted that the link of the gamma distribution with the number of observed failures, α, and with the accumulated observation time, β, provides a natural way to model a failure rate as a random variable directly from the field feedback data. The problem is that the cumulative distribution function has no simple analytical form. Fortunately, algorithms exist which can be used directly for Monte Carlo simulation, which makes it possible to use the gamma distribution for uncertainty propagation as described in Sect. 32.5.

Within the framework of the Bayesian approach and for the estimation of constant failure rates, the use of the gamma distribution according to the principles developed in Sect. 38.5.4.1 is very simple, as shown hereafter.

Prior information

Based on available databases and/or expert judgment, the first step for a reliability engineer consists in selecting a plausible average value, λ̃, and a plausible standard deviation, σ̃, for the item of interest. These selected values can then be used to define the prior distribution:
• E(λ) = α′/β′ = λ̃
• σ²(λ) = α′/β′² = σ̃²

The prior gamma distribution parameters can be calculated from this prior information:
• β′ = (α′/β′)·(β′²/α′) = λ̃/σ̃²
• α′ = λ̃·β′ = λ̃²/σ̃²

Finally, f_G(λ; α′, β′) is the prior gamma distribution equivalent to the prior information selected by the reliability engineer. This shows that this prior information is equivalent to a statistical sample gathering α′ failures over an observation time equal to β′ hours. The inverse cumulative gamma distribution function gives the 5th and 95th percentiles as a function of the number of observed failures, α′, and of the inverse of the accumulated observation time, 1/β′, as follows:
• λ5% = LGI(5%, α′, 1/β′)
• λ95% = LGI(95%, α′, 1/β′)
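The conversion of the engineer's prior estimate into gamma parameters can be sketched in a few lines. This is only an illustration of the moment-matching relations above; the function names are hypothetical and the numerical values (average 10⁻³ h⁻¹, standard deviation 5 × 10⁻⁴ h⁻¹) are those used in the example of Sect. 38.5.4.3.

```python
import math

def gamma_prior_from_moments(mean, std_dev):
    """Return (alpha, beta) of the gamma law having the given mean and standard deviation."""
    if mean <= 0 or std_dev <= 0:
        raise ValueError("mean and std_dev must be strictly positive")
    beta = mean / std_dev**2       # equivalent accumulated observation time
    alpha = mean * beta            # equivalent number of observed failures
    return alpha, beta

def gamma_moments(alpha, beta):
    """Inverse relation: return (mean, standard deviation) of a gamma law."""
    return alpha / beta, math.sqrt(alpha) / beta

# Illustrative prior estimate: mean 1e-3 per hour, standard deviation 5e-4 per hour.
alpha_prior, beta_prior = gamma_prior_from_moments(1e-3, 5e-4)
print(f"alpha' = {alpha_prior:.3f}, beta' = {beta_prior:.1f} h")
```

Reading the result as "α′ failures over β′ hours" gives a quick plausibility check: a very small β′ means the prior carries little weight against even modest field feedback.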


New information

The failures observed on similar items in operation or, better, on the related item itself provide new information which is the basis for updating the prior distribution. Under the assumption of a constant failure rate, this information can be simply summarized by:
• k: number of observed failures;
• Tc: accumulated observation time.

With regard to the prior value λ̃, the likelihood of these observations is given by:

$$L(\tilde{\lambda} \mid T_c) = \tilde{\lambda}^{k}\, e^{-\tilde{\lambda} \cdot T_c} \qquad (38.9)$$

Posterior distribution

In the particular case of gamma distributions and constant failure rate, combining the prior distribution with the likelihood function (Formula 38.7) simply leads to another gamma distribution. Indeed, the product of the prior density, proportional to λ^(α′−1)·e^(−β′·λ), and of the likelihood, proportional to λ^k·e^(−λ·Tc), is proportional to λ^(α′+k−1)·e^(−(β′+Tc)·λ). The posterior parameters are therefore:
• α″ = α′ + k;
• β″ = β′ + Tc.

Then, the posterior distribution combining the prior information and the information provided by the field feedback is given by the gamma distribution f_G(λ; α″, β″), which now represents a statistical sample gathering α″ failures over an observation time equal to β″ hours. As above, the inverse cumulative gamma distribution function gives the 5th and 95th percentiles as a function of the number of observed failures, α″, and of the inverse of the accumulated observation time, 1/β″, as follows:
• λ5% = LGI(5%, α″, 1/β″)
• λ95% = LGI(95%, α″, 1/β″)

38.5.4.3 Gamma Distribution (Example of Application)

Let us now consider a simple example (Pagès and Gondran 1986) to illustrate the above development.

Prior information
• Reliability engineer estimation: average failure rate λ̃ = 10⁻³ h⁻¹ and standard deviation σ̃ = 5 × 10⁻⁴ h⁻¹
• Equivalent gamma distribution parameters: β′ = 4 × 10³ h and α′ = 4

Then f_G(λ; 4, 4 × 10³) is the prior gamma distribution and the prior information is equivalent to a statistical sample gathering 4 failures over an observation time equal to 4000 h.


Fig. 38.9 Example of prior and posterior gamma distributions

As α′ = 4 and 1/β′ = 1/4000 = 2.5 × 10⁻⁴ h⁻¹, the 90% confidence interval of the modelled failure rate can be calculated as:
• λ5% = LGI(5%, 4, 2.5 × 10⁻⁴) = 3.42 × 10⁻⁴ h⁻¹
• λ95% = LGI(95%, 4, 2.5 × 10⁻⁴) = 1.94 × 10⁻³ h⁻¹

The width of this confidence interval is equal to 1.60 × 10⁻³ h⁻¹ and the corresponding pseudo error factor (see Sect. 38.5.3) is equal to 2.38. This prior distribution with the corresponding confidence interval is illustrated on the left-hand side of Fig. 38.9.

New information

Let us consider that 1 failure has been observed after an observation duration of 1000 h of 6 similar items. The new information is summarized by k = 1 and Tc = 6000 h. Then, the corresponding likelihood function is L(λ̃|Tc) = λ̃·e^(−λ̃ × 6 × 10³).

Posterior distribution

Combining the prior distribution with the likelihood function gives the parameters of the posterior gamma distribution:
• α″ = α′ + k = 4 + 1 = 5;
• β″ = β′ + Tc = 4 × 10³ h + 6 × 10³ h = 10⁴ h.

Then f_G(λ; 5, 10⁴) is the posterior distribution and the posterior information is now equivalent to a statistical sample gathering 5 failures over an observation time equal to 10,000 h. As α″ = 5 and 1/β″ = 1/10,000 = 1.0 × 10⁻⁴ h⁻¹, the average value and the 90% confidence interval of the modelled failure rate can be calculated as:
• λ̃ = 5/10⁴ = 5.0 × 10⁻⁴ h⁻¹
• λ5% = LGI(5%, 5, 1.0 × 10⁻⁴) = 1.970 × 10⁻⁴ h⁻¹
• λ95% = LGI(95%, 5, 1.0 × 10⁻⁴) = 9.153 × 10⁻⁴ h⁻¹

The width of this confidence interval is equal to 7.18 × 10⁻⁴ h⁻¹ and the corresponding pseudo error factor (see Sect. 38.5.3) is equal to 2.16.
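The update step of this example reduces to two additions, which the following sketch makes explicit. The class and method names are hypothetical and chosen only for illustration; the sketch covers the conjugate gamma update and the posterior mean, not the percentile calculation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GammaFailureRate:
    """Gamma model of a constant failure rate (illustrative helper, not from the source)."""
    alpha: float   # equivalent number of observed failures
    beta: float    # equivalent accumulated observation time, in hours

    @property
    def mean(self) -> float:
        return self.alpha / self.beta

    @property
    def std_dev(self) -> float:
        return self.alpha ** 0.5 / self.beta

    def updated(self, failures: int, observation_time: float) -> "GammaFailureRate":
        """Conjugate update: add the observed failures and the observation time."""
        return GammaFailureRate(self.alpha + failures, self.beta + observation_time)

# Prior of the example: 4 equivalent failures over 4000 h.
prior = GammaFailureRate(alpha=4, beta=4000.0)

# Field feedback of the example: 1 failure, 6 items observed during 1000 h each.
posterior = prior.updated(failures=1, observation_time=6 * 1000.0)

print(f"prior mean     = {prior.mean:.2e} per hour")
print(f"posterior mean = {posterior.mean:.2e} per hour")
```

Making the object immutable is a design choice of the sketch: each update returns a new distribution, so the prior remains available for comparison with the posterior.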


This posterior distribution with the corresponding confidence interval is illustrated on the right-hand side of Fig. 38.9. The accuracy of the posterior distribution is obviously better than that of the prior distribution: the width of the confidence interval has been divided by 2.2 and the pseudo error factor has been reduced by 9.5%. If new information becomes available again, the same principle can be applied by combining f_G(λ; 5, 10⁴), taken as the prior distribution, with the likelihood of the new information in order to obtain a new posterior gamma distribution f_G(λ; α″, β″). And so on: each time new information is gathered, the distribution of the failure rate can be updated and becomes more and more accurate and less and less dependent on the prior estimation. In the end, the estimation does not depend on the prior estimation, but the convergence is faster when the prior estimation is good.
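The fading influence of the prior can be illustrated with a small experiment. In the sketch below, two different priors are updated with the same sequence of observations; the second prior (1 failure over 100 h) and the repeated batches of 1 failure per 6000 h are hypothetical values invented for the illustration, not data from the source.

```python
def update(alpha, beta, failures, observation_time):
    """One conjugate gamma update: add failures and observation time."""
    return alpha + failures, beta + observation_time

def posterior_means(alpha, beta, batches):
    """Posterior mean alpha/beta after each successive batch of field feedback."""
    means = []
    for failures, observation_time in batches:
        alpha, beta = update(alpha, beta, failures, observation_time)
        means.append(alpha / beta)
    return means

# Hypothetical field feedback: 50 successive batches of 1 failure over 6000 h.
batches = [(1, 6000.0)] * 50

good_prior_means = posterior_means(4, 4000.0, batches)   # prior mean 1e-3 per hour
poor_prior_means = posterior_means(1, 100.0, batches)    # prior mean 1e-2 per hour

# Gap between the two posterior means after each batch.
gaps = [abs(g - p) for g, p in zip(good_prior_means, poor_prior_means)]
print(f"gap after  1 batch  : {gaps[0]:.2e} per hour")
print(f"gap after 50 batches: {gaps[-1]:.2e} per hour")
```

Both sequences drift toward the observed rate of one failure per 6000 h, and the gap between them shrinks as the accumulated observation time comes to dominate the equivalent observation time of either prior.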

38.5.5 Log-Normal Distribution: Practical Approach

The log-normal distribution (also called Galton law) is a continuous distribution of a random variable whose logarithm is distributed according to a normal law. Its density is given by:

$$f(\delta) = LnN(\mu, \sigma) = \frac{1}{\delta\, \sigma \sqrt{2\pi}}\; e^{-\frac{[\ln(\delta) - \mu]^2}{2\sigma^2}} \qquad (38.10)$$

In this formula, σ is the standard deviation (then σ² is the variance) and μ the expected value (i.e. the average) of the corresponding normal law N(μ, σ²). The expected value m, the median value δ50% and the variance Var(δ) of the log-normal law are given by:

$$m = e^{\mu + \frac{\sigma^2}{2}}, \qquad \delta_{50\%} = e^{\mu}, \qquad \operatorname{Var}(\delta) = \left(e^{\sigma^2} - 1\right) \cdot e^{2\mu + \sigma^2} \qquad (38.11)$$

An important property of this distribution is the possibility to define an error factor qα such that δ50%/qα gives the bound of the percentile δα and δ50% × qα gives the bound of the percentile δ(1−α). This property makes it easy to define confidence intervals [δα, δ(1−α)] at the probability level (1 − 2α). The error factor qα is linked to the standard deviation by the following formula:

$$q_\alpha = e^{N_\alpha \cdot \sigma} \qquad (38.12)$$

In this formula, Nα is a tabulated coefficient coming from the normal distribution tables. The values of Nα for confidence intervals at 90, 95 and 99% confidence levels are given in Table 38.1. Then, when α is equal to 5%, Nα = 1.645 and:


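$$q_{5\%} = e^{1.645 \cdot \sigma} \qquad (38.13)$$

Table 38.1 Tabulated values of Nα for 90, 95 and 99% confidence levels

| Confidence interval (%) | Lower percentile (%) | Upper percentile (%) | Nα |
|---|---|---|---|
| 90 | 5 | 95 | 1.645 |
| 95 | 2.5 | 97.5 | 1.960 |
| 99 | 0.5 | 99.5 | 2.576 |

The ratio of the upper bound to the lower bound gives:

$$\frac{\delta_{(1-\alpha)}}{\delta_\alpha} = \frac{\delta_{50\%} \cdot q_\alpha}{\delta_{50\%} / q_\alpha} = (q_\alpha)^2$$

And, finally, the error factor is obtained as follows:

$$q_\alpha = \sqrt{\frac{\delta_{(1-\alpha)}}{\delta_\alpha}} \qquad (38.14)$$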

This leads to the definition of the pseudo error factor qα announced in Sect. 38.5.3, Formula 38.5. The higher qα, the larger the corresponding confidence interval and the lower the certainty associated with the corresponding data. This property has been used since the seventies in the WASH 1400 report (Rasmussen 1975) to characterize the data uncertainties through a kind of expert judgment using three values of q5% to rank the uncertainties of input data and results of probabilistic calculations:
• q5% = 3 indicates a good knowledge about the related data (ratio roughly equal to 10 between the bounds of the 90% confidence interval);
• q5% = 10 indicates a medium knowledge about the related data (ratio roughly equal to 100 between the bounds);
• q5% = 30 indicates a poor knowledge about the related data (ratio roughly equal to 1000 between the bounds).

Coming back to the definition of the log-normal distribution, it is completely defined as soon as the expected value m and the error factor qα are known (CPR 12E 1997):

$$\sigma = \frac{\ln(q_\alpha)}{N_\alpha} \quad \text{and} \quad \mu = \ln(m) - \frac{\sigma^2}{2} \qquad (38.15)$$
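Formulae 38.12 and 38.15 translate directly into code. The sketch below is a hedged illustration using only the Python standard library; the function names are hypothetical, and the inputs (expected value 10⁻³ h⁻¹, error factor 3) are arbitrary illustrative values rather than data from the source. The coefficient Nα is obtained from the standard normal quantile function instead of a table.

```python
import math
from statistics import NormalDist

def n_alpha(alpha=0.05):
    """Standard normal quantile at probability (1 - alpha), e.g. about 1.645 for alpha = 5%."""
    return NormalDist().inv_cdf(1.0 - alpha)

def lognormal_from_mean_and_error_factor(m, q, alpha=0.05):
    """Formula 38.15: return (mu, sigma) from the expected value m and the error factor q."""
    sigma = math.log(q) / n_alpha(alpha)
    mu = math.log(m) - sigma**2 / 2.0
    return mu, sigma

def lognormal_bounds(mu, sigma, alpha=0.05):
    """Return (lower bound, median, upper bound) of the (1 - 2*alpha) interval."""
    q = math.exp(n_alpha(alpha) * sigma)      # Formula 38.12
    median = math.exp(mu)
    return median / q, median, median * q

# Illustrative inputs: expected value 1e-3 per hour, error factor 3.
mu, sigma = lognormal_from_mean_and_error_factor(1e-3, 3.0)
lower, median, upper = lognormal_bounds(mu, sigma)
print(f"mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"90% interval: [{lower:.3e}, {upper:.3e}], median {median:.3e}")
```

Note that the median is lower than the expected value m, and that the ratio of the two bounds equals the square of the error factor, in line with the WASH 1400 ranking above (a factor of 3 gives a ratio of 9 between the bounds).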

Therefore, LnN(μ, σ) ≡ LnN(m, qα). Assimilating the maximum likelihood estimation, λ̂, to the average value and the pseudo error factor to a real error factor makes it possible to fit a log-normal law to the estimation coming from the field feedback analysed in the previous chapter. For estimating a failure rate uncertainty this leads, for example, to:

$$LnN(\hat{\lambda}, q_\alpha) \equiv LnN\left(\hat{\lambda}, \sqrt{\frac{\chi^2_{\alpha,2(k+1)}}{\chi^2_{(1-\alpha),2k}}}\right) \qquad (38.16)$$

Fig. 38.10 Impact of the error factor of the failure rate on the unavailability of a repaired item

Then, when using RBD/ADD software packages which do not directly implement the chi-square or gamma distributions, these can be replaced by a log-normal distribution fitted according to Formula 38.16 to model the data uncertainties. It has to be noted that the triangular law, which seems simpler than the log-normal law, is in fact more difficult to use for performing similar fittings. It is even impossible when the number of observed events is low, because trying to fit a triangular law having the same confidence bounds and the same expected value leads to a mode outside the support of the triangular law. Figure 38.10 illustrates the impact of the uncertainty related to the failure rate of a repaired item on its unavailability U(t). Calculations are performed for three values of the error factor: 1.5, 3 and 9. As expected, the width of the 90% confidence interval increases when the error factor increases. As a side effect, it can be observed that the expected value E[U(t)|Λ] of the unavailability at time t given the random variable Λ decreases when the error factor increases. Therefore, caution is necessary, as using too uncertain data may lead to non-conservative results.

References

ASRS (2020) https://asrs.arc.nasa.gov/. Accessed Sept 2020
Aupied JR, Procaccia H (1984) SRDF: a system for collecting reliability data from French PWR power plants. Method of failure analysis. Application to the processing of valves data. Nuclear Eng Des 81(1):127–137 (Elsevier)
Clarotti CA (1989) The Bayes predictive approach in reliability theory. IEEE Trans Reliab 38(3):379–382 (IEEE, USA)
CPR 12E (1997) Methods for determining and processing probabilities, second edition. Committee for Prevention of Disasters. ISBN 90 12 08543 8. Gevaarlijke Stoffen, The Netherlands
Dupuis J (2007) Statistique bayésienne et algorithme MCMC. IMAT (Master 1). University of Mathematics of Toulouse, France
EXIDA Ed 4 (2015) Safety equipment reliability handbook: 3 volumes: sensors, logic solvers and interface modules, final elements. EXIDA, Sellersville, USA
FIDES (2020) A methodology for components reliability. IMDR project. https://www.fides-reliability.org/. Accessed June 2020
Georgin J-P, Signoret J-P (1981) The maximum likelihood estimate from confidence level point of view: proposition for an improved one. Reliab Eng Int J 2 (Applied Sciences publisher, Elsevier)
Harold A (2002) The Delphi method: techniques and applications. Linstone and Murray Turoff, Editors
Hjorteland A, Aven T, Østebø R (2007) Uncertainty treatment in production assurance analyses throughout the various phases of a project. Reliab Eng Syst Saf (RESS) 92:1315–1320 (Elsevier)
IEEE 500 (1984) Guide to the collection and presentation of electrical, electronic, sensing component, and mechanical equipment reliability data for nuclear-power generating stations. IEEE, USA
IEC 60300-3-2 Ed. 2.0 (2004) Dependability management, Part 3-2: application guide - collection of dependability data from the field. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60605-4 (1986) Equipment reliability testing - part 4: procedures for determining point estimates and confidence limits from equipment reliability determination tests. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 60706-3 (2006) Maintainability of equipment - part 3: verification and collection, analysis and presentation of data. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 61709 (2017) Electric components - reliability - reference conditions for failure rates and stress models for conversion. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 63142 (in progress) A global methodology for reliability data prediction of electronic components. International Electrotechnical Commission (IEC), Geneva, Switzerland
IEC 63162 (in progress) Electric components - reliability - reference failure rates at reference conditions. International Electrotechnical Commission (IEC), Geneva, Switzerland
ISO 6527 Ed. 1.0 (1982) Nuclear power plants. Reliability data exchange. General guidelines. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 7385 Ed. 1.0 (1983) Nuclear power plants. Guidelines to ensure quality of collected data on reliability. International Organization for Standardization (ISO), Geneva, Switzerland
ISO 14224 Ed. 3.0 (2016) Petroleum, petrochemical and natural gas industries. Collection and exchange of reliability and maintenance data for equipment. International Organization for Standardization (ISO), Geneva, Switzerland
IRSNI (2020) https://www.iaea.org/resources/databases/irsni. Accessed Sept 2020
Kortner H, Østebø R, Sandtorv H (2005) Collection of reliability and maintenance data - development of an international standard. ESREL'2005. Tri City, Poland
Lannoy A, Procaccia H (1994) Méthodes avancées d'analyse des bases de données du retour d'expérience industriel. Eyrolles
Lannoy A (1996) Analyse quantitative et utilité du retour d'expérience pour la maintenance des matériels et la sécurité. Eyrolles
Leroy A (2018) Production availability and reliability. Use in the oil and gas industry, 1st edn. Wiley-ISTE, London, UK
MIL-HDBK 217 F notice 2 (1995) Military handbook: reliability prediction of electronic equipment. Department of Defense, Washington DC, USA
MIL-HDBK 217Plus (2015) Handbook of 217Plus. Reliability prediction models. Quanterion Solutions Inc., Utica NY, USA
NPRD (2016) Non electronic parts reliability data. Quanterion Solutions Inc., Utica, USA
OREDA Handbook (2015) Ed. 6.0 offshore and onshore reliability data. Prepared by SINTEF and NTNU. Hovik, Norway
Ostebo R (2006) Use of OREDA data in Statoil, OREDA 25 years anniversary seminar, Petroleum Safety Authority (PSA), Stavanger, Norway. https://www.oreda.com/history/. Accessed Sept 2020
Ostebo R, Bjerkhaug K A, Hasselknippe B (2000) Networked, industry-wide co-operation with OREDA-24 Databank. Regularity Management Conference 2000, Statoil, Stavanger
Pagès A, Gondran M (1986) System reliability: evaluation and prediction in engineering. Springer
PDS data handbook (2013) Reliability data for safety instrumented system. SINTEF, Trondheim
Piepszownik L, Procaccia H (1992) Fiabilité des équipements et théorie de la décision statistique fréquentielle et bayésienne. Éditions Eyrolles, Paris
Procaccia H (2008) Les fondements des approches fréquentielles et Bayésiennes, SRD collection. Edition TEC & Doc. Lavoisier, France
Rasmussen C (1975) Reactor safety study. An assessment of accidents risks in U.S. commercial power plants; WASH 1400 (NUREG 75/014) pp 42–44 and pp 152–157. U.S. Nuclear Regulatory Commission, Washington, USA
Sachs L (1984) Applied statistics. Springer-Verlag, New York, USA
UTE C80-811A (2011) Reliability methodology for electronic systems, FIDES guide, Issue A. AFNOR éditions, France
Wikipedia Delphi (2020) https://en.wikipedia.org/wiki/Delphi_method. Accessed Sept 2020
WOAD (2020) https://www.dnvgl.com/services/world-offshore-accident-database-woad-1747. Accessed Sept 2020


© Springer Nature Switzerland AG 2021 J.-P. Signoret and A. Leroy, Reliability Assessment of Safety and Production Systems, Springer Series in Reliability Engineering, https://doi.org/10.1007/978-3-030-64708-7

863

864 Average availability, 126, 480 Average failure frequency, 97 Average item efficiency, 126 Average (mean) availability, 87 Average (mean) number of failure calculation (Boolean family), 297 Average (mean) number of failure calculation (Markov graph), 488 Average (mean) number of failure estimation (Monte carlo), 552 Average (mean) unavailability, 87 Average production availability, 126 Average production availability estimation (Monte Carlo), 552 Average production unavailability estimation (Monte carlo), 552 Average productivity, 514 Average productivity. See Average production availability, 126 Average repair frequency, 491 Average unavailability, 480 Average unavailability estimation (Monte Carlo), 552 Average unavailability (PFDavg), 31 Average value, 373 B Barlow-Proschan importance factor (BPIF), 365 Basic parameter CCF model, 119 Basic process command-control system (BPCS), 127, 402 Bathtub curve, 94 Bayes’ theorem, 191, 418 Bayesian belief network (BBN), 418 Child node, 419 Directed arc / arrow/ edge, 419 Node (vertice), 419 Parent node, 419 Root node, 419 Behavioural approach, 467 Behavioural models, 135, 547, 589 Belief network (BN) approach. See Bayesian belief network (BBN), 183, 191, 417 Bellman’s theorem, 15, 754 Best estimates, 21, 714 Beta-factor (exercise), 437 Beta-factor model (CCF), 115, 233, 258, 327, 532, 610, 804 Bhopal (India, 1984), 4 Biform variable, 236, 334 Binary decision diagram (BDD), 216, 249, 271

Index Binary decision diagram (BDD) building (exercise), 439 Binary item, 121, 128, 209, 459, 473 Binary systems modelling (Petri nets), 641 Birnbaum importance factor, 40, 299, 355 Block, 288 Boeing 737 Max 8 (Ethiopia, 2019), 4 Boolean Algebra, 185 Absorption (inclusion), 187 Associativity, 187 Commutativity, 187 Complementarity, 188 Conjunction (or intersection) (AND), 185 De Morgan’s Laws, 188 Disjunction (or union) (OR), 185 Distributivity, 187 Exclusive disjunction (or exclusive union) operator. See Exclusive OR gate (XOR), 188 Idempotence, 187 Negation (NOT), 185 Neutral element, 187 Boolean approach, 134, 319, 457 Boolean approach family, 183 Boolean model, 13 Boolean set, 185 Boolean variable, 185 Bottom-up (inductive) approach, 9, 196, 209 Bowtie approach, 183, 400 Breaking Model (CCF), 610 Buffon’s needles, 548

C Canvey Island (1978), 11 Capital expenditures (CAPEX), 715 Carl Adam Petri, 588 Cascade failures, 106 Catalectic failure, 52 Cause analysis. See Fault tree analysis (FTA), 388 Cause-consequence diagram (CCD), 183, 385, 423 AND gate, 388 Branching operator symbol (YES / NO gate), 387 Consequence symbol, 387 Event condition symbol, 388 Event symbol, 387 OR gate, 388 Time delay symbol, 387 Cause diagram, 386

Index Causes of deviation, 158 Cause tree approach, 417 CCF Analysis (exercise), 437 Certain event, 185 Challenger (USA, 1986), 4 Checklist approach, 140, 173, 178 Chernobyl (Russia, 1986), 3 Chronogram, 464 Cindynics, 17 Coherent system, 235, 272, 341 Cold redundancy, 539 Cold standby redundancy, 458 Coloured Petri Nets, 656 Combination of several approaches, 409 Bowtie approach, 412, 414 Combination FTs with Petri Nets (Dynamic fault trees). See FT-driven Petri nets, 416 Combination of FTs with CCD, 412 Combination of FTs with event trees, 412 Combination of FTs with FMEA/FMECA, 409 Combination RBD/FT and vice versa, 410 Combination of RBDs with FMEA/FMECA, 410 Combination of RBDs with Petri nets (dynamic RBDs). See RBD-driven Petri nets, 416 Combination with Markov processes. See FT-driven Markov process, 416. See also RBD-driven Markov process, 416 Link root cause analysis (RCA) and FTs, 416 Combinations of events, 39 Combinatorial/combinatory explosion of the number of states, 522 Combinatorial/combinatory explosion of the number of terms, 269 Common Cause Data Exchange (ICDE) Project, 111 Common cause failure candidate, 229, 331 Common cause failure (CCF), 58, 105, 229, 712 Common cause failure (CCF) analysis (minimal cut sets), 231 Common cause failure (CCF) modelling (Boolean family), 258, 319 Common cause failure (CCF) modelling (Markov), 532 Common cause failure (CCF) modelling (Petri nets), 610

865 Common cause failure data collection, 112 Common mode failures (CMF), 58, 106 Compensatory / compensating measure, 130, 758 Completely failed state, 460, 465 Component fault tree (CFT), 423 Composite block, 201 Composite primary event, 211, 292, 293, 314 Conclusion register, 175 Concorde (France, 2000), 4 Concorde program, 11 Conditional failure intensity. See Vesely failure rate, 299, 487 Conditional function, 335 Conditional probability, 104, 279, 418 Conditional probability calculation, 277 Conditional probability of failure on nonlethal shock, 323 Conditional repair intensity, 491 Confidence interval, 375, 560 Confucius, 754 Congruential pseudo random number generators, 555 Consequence diagram, 385 Consequences of deviation, 159 Conservativeness, 24, 114, 211, 233, 253, 259, 375, 490, 755 Conservativeness (exercise), 435 Contingency, 21 Continuous mode of operation, 129 Contradictory targets of safety and production, 770 Conventional safety system, 749 Correcting / mitigating measure. See Compensatory / compensating measure, 42 Corrective maintenance, 61, 461 Correlated events, 379 Costa Concordia (Italy, 2012), 4 Counter modelling (Petri nets), 606 Coupling factor, 106 Critical dangerous failure, 57 Critical down state, 48 Critical failure, 60, 487, 711 Critical importance factor (CIF), 359 Critical repair/restoration, 60 Critical safe failure, 56 Critical state, 47, 338, 487, 488 Critical unsafe failure, 57 Critical up state, 48 Criticality analysis, 169 Criticality matrix, 169 Cumulated distribution function (CDF), 553

866 Cut and Tie Sets (BDD calculation), 281 Cut and tie sets identification for coherent RBDs and FTs, 277 Cut set, 222 Cut set identification (Exercise), 434

D Danger, 18 Dangerous failure. See Unsafe failure, 56 Dangerous state, 465, 466 Data uncertainty handling, 632 Data uncertainty modelling, 845 Bayesian approach, 852 Posterior distribution, 853 Prior distribution, 853 Chi-square distribution, 848 Data accuracy versus field feedback, 845 Gamma distribution, 852 New information, 855 Posterior distribution, 855 Prior information, 854 Log-Normal distribution, 857 Statistics from field feedback, 848 Triangular distribution, 846 Uniform distribution, 846 Deductive (top-down) approach, 209 Degraded failure, 711 Degraded quality level (production systems), 710 Degraded state, 47, 124, 465 Demand frequency, 757 Demand mode of operation, 129 De Morgan’s laws, 236 Dependability, 3, 13 Dependency classifications, 106 Dependent events, 191 Dependent failures, 103, 564 Depth-first left-most (DFLM) heuristics, 276 Deterministic delay (Petri net), 595 Diagnostic analysis (backward analysis, bottom-up reasoning), 420 Diagnostic importance factor (DIF), 361 Diagnostic of the failure, 463 Diagnostic test, 54 Differential importance measure (DIM), 364 Dirac distribution, 559 Directed acyclic graph (DAG), 196, 210, 270, 419 Disabling model (CCF), 611 Discounted cash flow (DCF), 715 Disjoint paths calculation (exercise), 444

Index Dormant failure, 235 Dormant fault, 235 Down state, 44, 196, 209, 237, 285 Down state class, 121, 464, 475 Dual RBD and FT models, 265 Dynamic aspects, 547 Dynamic aspect linked to maintenance, 461 Dynamic aspects linked to system operation, 457 Dynamic dependency, 107, 319, 323 Dynamic fault tree, 646 Dynamic methods and tools, 467 Dynamic models, 34, 285 Dynamic Event Tree (DET), 397 Dynamic fault tree, 314 Priority AND gate (PAND), 315 Sequence gate (SEQ), 315 Dynamic flow diagram, 649 Convergent node, 651 Divergent node, 651 Dynamic RBD, 458 Markov model, 34 Petri nets, 587 Sequential model, 34 Dynamic systems, 13, 457 Dynamic transition (Monte Carlo), 564 Dysfunctional analysis, 29, 30, 32, 41

E Early failure (youth) period, 94 Effects of failure modes, 165 Efficiency concept, 121 Electrical analogy, 205 Electric blackouts. See Common cause failure (CCF), 110 Elgin (North Sea, Norway, 2012), 4 Endless loop prevention (Petri nets), 624 Engineering judgment, 146, 752 Equipment capacity, 710 Equivalent failure rate (approximation), 506 Equivalent production time (Teq), 126, 514 Equivalent repair rate (approximation), 510 Escalating events, 385 Eschede (Germany, 1998), 4 Estonia (Baltic Sea, 1994), 4 Event tree analysis (ETA), 35, 183, 393 Event tree graphical symbols, 394 Exact probabilistic calculation, 277 Exclusive cofactor, 336 Exclusive disjunction (exclusive union), 188 Exclusive OR gate (XOR), 301

Index Exon Valdez (Alaska, 1979), 4 Expenditures, 715 Expert judgment, 846 Explicit cause, 232 Explicit dependency. See Tangible dependency, 108 Exponential distribution, 52 Exponential of matrix, 478 Extrinsic dependency, 106

F Fail-safe system, 56 Failure. See Failure classification, 49 Failure cause, 57, 166 Failure classification, 760 Dangerous failure, 761 Critical dangerous failure, 762 Dangerous detected failure (DD), 761 Dangerous undetected failure(DU), 761 Failure detection capacity, 762 Failures detected by online diagnostic, 763 Failures detected by proof tests, 763 Failures not covered by online diagnostic nor proof tests, 763 Self-revealed failure, 763 Human failures, 782 No-part and no-effect failure, 761 Random hardware failure, 760 Safe failure, 761 Critical safe failure. See spurious failure, 762 Safe detected failure (SD), 761 Safe undetected failure (SU), 761 Spurious failure, 762 Systematic failure, 760 Failure density, 86, 310, 488 Failure detection, 167 Failure frequency, 312, 461 Failure frequency calculations (Boolean family), 299 Average failure frequency, 297 Instantaneous failure frequency. See Unconditional failure intensity, 299 Failure frequency calculations (exercise), 446 Failure frequency calculations (Markov graph), 484 Average failure frequency. See Probability of dangerous failure per hour (PFH), 488

Failure frequency. See Unconditional failure intensity, 487 Instantaneous failure frequency. See Unconditional failure intensity, 487 Failure frequency calculations (Petri net), 628 Failure mode, 57, 165 Failure mode checklist, 167 Failure mode effect, 140, 166 Failure mode ranking, 140 Failure modes and effects analysis (FMEA), 141, 165, 178 Failure modes, effects and criticality analysis (FMECA), 9, 140, 169, 178 Failure on demand modelling, 500 Failure propagation, 423 Failure rate, 35, 40, 52, 91, 310, 488 Failure revealed by diagnostic test, 54 Failure revealed by periodic test, 54 Fault, 50 Fault detection time, 62 Fault tolerance, 58 Fault tolerance (architectural constraints), 768 Hardware fault tolerance (HFT), 768 Route 1H, 768 Route 2H, 768 Fault tree analysis (FTA), 35, 183, 196, 209, 265, 373, 386 FD-driven Petri nets (Dynamic flow diagrams), 649 FIABEX (European Union EUREKA project), 424 FIDES (IEC63142), 13 Field feedback, 373, 562 Finite state automaton, 587 Boundedness, 590 Conflict-freedom, 590 Liveness, 590 Reachability, 590 Safeness, 590 Fired transition (Petri nets), 621 First complete failure, 464 First in, first out (FIFO), 523 First priority repairs, 712 First system failure, 526 Flixborough (United Kingdom, 1974), 4 Flow diagram (FD) modelling, 459 Composite block, 719 Production systems, 719 Formal language, 423, 591 Frequency calculations, 211

Frequency calculations (Boolean family), 303 FT building (exercise), 432 FT calculations, 285 FT-driven Markov process, 35, 135, 290, 467, 532 FT-driven Petri net. See Dynamic fault tree, 416, 646 AND gate, 647 Majority (m out of n) vote gate, 647 OR gate, 647 Priority AND gate (PAND), 647 Sequential gate (SEQ), 647 FT Symbols, 211 AND gate, 213 Basic event, 212 Condition event, 213 Direct copy of event, 215 Dormant (or latent) fault, 213 Elementary event, 212 Event to be developed, 212 IF gate, 215 Inverted copy of event, 215 Majority vote (m out of n) logic, 215 NAND gate, 216 NOT gate, 216 OR gate, 213 Repeated primary event, 215 Transfer gate, 215 XOR gate. See Exclusive disjunction (exclusive union), 216 Fukushima (Japan, 2011), 3 Full redundancy, 712 Functional analysis, 29, 31 Functional block diagram, 166 Functional dependency, 106 Functional safety, 134, 471, 749 Functioning sureness, 18

G Gamma distribution, 819 Gas compression unit, 717 Generation of probabilistic laws (Monte Carlo), 553 Generic data bases, 750, 842 EXIDA: Safety equipment reliability handbook, 750, 844 FIDES: Methodology for component reliability, 843 IRSNI: Incident reporting system, 843 MIL-HDBK 217 F: Reliability prediction of electronic equipment, 843

NPRD: Nonelectronic parts reliability data publication, 843 OREDA: Offshore and onshore reliability data, 750, 843 PDS: Data handbook, 750, 844 WOAD: World offshore accident data base, 843 Globally at least equivalent (GALE) principle, 822 Gödel, 754 GRAFCET, 588 GRIF-Workshop software package, 136, 199, 259, 434, 472, 561, 589 Gross hazard analysis (GHA), 150 Guide word, 157

H Harmonized standard, 830 Hazard, 18 Hazard analysis and critical control points (HACCP), 162 Hazard and operability study (HAZOP), 9, 32, 141, 157, 178 Hazard checklist, 173 Hazard identification (HAZID), 140, 175, 179 Hazardous element, 141, 146 Hazardous element checklist, 148 Hazardous event, 152 Hazardous event frequency (HEF), 757 Hazardous situation, 141, 146, 174, 762 Hazardous situations checklist, 149 Herald of Free Enterprise (Belgium 1987), 4 Heuristics, 276 Hidden failure, 54, 72, 130 Hierarchically performed hazard origin and propagation study (HiP-HOPS), 424 High Integrity Pressure Protection System (HIPPS), 749 Histogram, 375 History of a system, 375, 464 History from Monte Carlo simulation, 375 History (trajectory) of a random process, 549 Homogeneous Markov process, 474 Hot redundancy, 539 Human related dependency, 108 Human reliability assessment (HRA), 10

I IEC 31010 (2019), 140

IEC 60050-903 (2013), 139 IEC 60300-3-11 (2017), 13 IEC 60300-3-12 (2011), 13 IEC 60300-3-2 (2004), 14, 774 IEC 60300-3-3 (2017), 13 IEC 60812 (2019), 9 IEC 61025 Ed. 3.0 (in progress), 209 IEC 61025 (in progress), 9, 229, 373 IEC 61078 Ed. 3.0 (2016), 199, 229 IEC 61165 (2006), 11, 471 IEC 61508 (2010), 13, 373, 749 Alternative techniques. See ISO/TR 12489 (2013), 751 IEC 61511 (2016), 13, 129, 749 IEC 61511-3 (2016), 402 IEC 61513, 749 IEC 61882 (2016), 9 IEC 62061, 749 IEC 62502 (2010), 394 IEC 62740 (2015), 416 If-then-else logic operator, 239, 627 Implicit cause, 233 Importance factor calculations (exercise), 451 Importance factors, 350 Impossible event, 187 Incipient failure, 712 Inclusive cofactor, 338 Independent block, 319 Independent events, 190 Independent failure, 230 Independent items, 469 Independent primary event, 319 Independent probability of failure, 327 Independent protection layer (IPL), 402 Inductive (bottom-up) approach, 139 Inherent protection, 402 Initiating event, 385, 806 Input data accuracy, 774 Accumulated time of observation, 817 Field feedback, 774 Amount of field feedback, 774 Data collection quality, 774 Field proven (proven in use, proved by prior use), 774 Pre-established input data, 774 Reliability data collection, 750 Instantaneous active repair rate, 100 Instantaneous availability, A(t), 82, 125 Instantaneous conditional failure intensity (Vesely failure rate), 95 Instantaneous efficiency, 125 Instantaneous production availability, 125

Instantaneous unavailability estimation (Monte Carlo), 552 Instantaneous unavailability, U(t), 84 Instrumented system, 133 Integrated logistic support (ILS), 13 Intrinsic dependency, 106 Involution, 188 ISO/IEC 31010 (2019), 13 ISO/IEC guide 51 (2014), 152 ISO/TR 12489 (2013), 13, 373, 471, 751 ISO 14224 (2016), 774 ISO 26262, 749

K KB3 workbench, 424

L Lac-Mégantic (Canada, 2013), 4 Lambert importance factor. See Critical importance factor (CIF), 359 Laplace transform, 494 Layer of protection analysis (LOPA), 183, 402, 755 Lethal shock, 116, 319 Level control valve (LCV), 717 Level transmitter (LT), 717 Life cycle cost (LCC), 13 Lineage CCF, 330 Lineage dependency, 107 Linking matrix, 515 Logic dependency, 107 Logic formula, 268, 285 Logic function, 188, 195, 209 Logic links, 210 Logic solver, 133, 427 Logic structures, 196 Logic symbols, 210 Logic variable, 185, 198, 268 Logistic support (production systems), 709 Log-normal distribution, 374 Lusser’s theorem, 8

M Macondo (Gulf of Mexico, 2010), 4 Macro components, 527 Maintainability, 98 Maintenance, 61 Maintenance failure, 55 Maintenance load estimation (Monte Carlo), 552 Maintenance modelling (Markov), 535

Maintenance philosophy, 30, 461 Maintenance support mobilisation, 463, 536 Maintenance teams (production systems), 709 Maintenance tools mobilisation (Petri nets), 614 Majority vote (m out of n) logic, 250, 329, 770 Managing conflicts (Petri nets), 593 Marginal importance factor (MIF). See Birnbaum importance factor, 355 Marking of a Petri net, 592 Markov Andreï Andreïovitch (1856–1922), 471 Markov graph, 471, 549 Markov graph generation, 589 Markov Graph modelling (production systems), 725 Markov graph symbols, 473 State symbol, 473 Transition symbol, 473 Zero-duration (transient, transparent, non-permanent) state, 500 Markovian approach, 11, 135, 458, 467, 471 Markovian approach exercises (Pumping system), 661 Absorbing state, 664 Availability Markov graph, 668 Aggregated states, 670 Hot redundancy, 668 Probability of failure on demand, 673 Repair priority, 672 Single repair team, 670 Unavailability and failure frequency, 675 Production availability Markov graph, 677 Reliability Markov graph, 663 Alternated standby redundancy, 666 Hot redundancy, 663 Hot redundancy + CCF, 665 Probability of failure on demand, 668 Standby redundancy, 666 Unreliability and failure rate, 674 Markovian asymptotic calculations, 479 Markovian matrix, 477 Markov process, 35, 290, 471, 478 Mars Climate Orbiter 1998, 829 Mathematical foundations (Markovian approach), 475 Matrix exponentiation, 497 Matrix inversion, 496 Maximum capacity, 460

Maximum likelihood estimation (MLE), 373 McDonnell Douglas DC-10 airplane crash. See Common cause failure (CCF), 109 MDT related to operating failures (MDTO), 65 Mean accumulated down time (Petri nets), 629 Mean accumulated up time (Petri nets), 629 Mean active repair time (MART), 79 Mean administrative delay (MAD), 79 Mean down time (MDT), 40, 64 MDT calculation (Markov Graph), 491 MDT estimation (Monte Carlo), 552 MDT estimation (Petri nets), 629 Mean failure frequency estimation (Monte Carlo), 552 Mean fault detection time (MFDT), 79 Mean logistic delay (MLD), 79 Mean non-operating time to first non-operating failure (MTTFFnO), 71 Mean non-operating time to non-operating failure (MTTFnO), 69 Mean number of repairs, 491 Mean operating time between failures (MTBFIEV), 75 Mean operating time to failure (MTTFIEV), 71 Mean operating time to operating failure (MTTFO), 69 Mean operating time to the first operating failure (MTTFFO), 71 Mean overall repair time (MORT), 79 Mean sojourn time into a state (MST), 479 Mean standby time to hidden failures (MTTFH), 73 Mean time between failures (MTBF), 40, 74 Mean time between failures (MTBF) (Boolean family), 297 Mean time between failures (MTBF) estimation (Petri nets), 629 Mean time between failures (MTBF) (Markov graph), 493 Mean time between restorations (MTTRes), 75 Mean time to failure (MTTF), 66 Mean time to failure (MTTF) estimation (Petri nets), 630 Mean time to failure (MTTF) (Markov graph), 493 Mean time to first failure (MTTFF), 68 Mean time to restoration (MTTRes), 79 Mean up time (MUT), 40, 64

Mean up time (MUT) estimation (Monte Carlo), 552 Mean up time (MUT) estimation (Petri nets), 629 Mean up time (MUT) (Markov graph), 491 Mean visit number of a state (MVN), 479 Memoryless property, 471 Meteorological conditions, 463 MIL-Std-1629A (1977), 166 Minimal cut sets calculations (exercise), 444 Minimal cut sets (MCS) analysis, 39, 112, 205, 235, 388 Minimal cut set (CCF analysis), 330 Minimal cut set definition, 222 Minimal cut set order, 227 Minimal cut set ranking by order, 227 Minimal cut set ranking by probabilities, 257 Minimal tie sets (MTS) analysis, 205, 222 Minimal tie sets (probabilistic calculation), 260 Minimum capacity, 460 Minterm, 333 Minuteman project of the US Air Force, 209 Mitigating factor, 394 Mitigating measure, 140, 167 Mitigation layer, 402 Mixed redundancy, 539 Model-based systems engineering (MBSE), 424 Mode of operation (Safety instrumented systems) Continuous mode of operation, 756 High demand mode of operation, 756 Low demand mode of operation, 756 Modularization of large Petri nets, 638 Monotone / monotonic Boolean function, 334 Monotone / monotonic logic function, 235 Monte Carlo simulation, 11, 36, 290, 330, 373, 398, 468, 547, 587, 589 MTBF related to non-operating failures (MTBFnO), 78 MTBF related to operating failures (MTBFO), 77 Multidisciplinary study team, 158, 385 Multiphase modelling, 514 Multiphase system, 37, 472, 514, 710 Multiple Greek letter CCF model, 119 Multiple production levels, 709 Multiple product systems, 710 Multiple protection layers, 755

Multiple safety system, 427, 790 Multistate modelling, 512 Multistate system, 37, 123, 461, 472, 710 Multistate system modelling (Petri nets), 649 MUT related to operating failures (MUTO), 65 Mutually exclusive combinations of events, 267

N NAND gate (probabilistic calculation), 249 Nominal capacity, 460 Non-acceptable risk, 22 Non-coherent FT, 237 Non-coherent RBD, 236, 237 Non-coherent system, 235, 272, 342 Non-conservativeness, 104, 375, 793 Non-correlated events, 374 Non-critical down state, 48 Non-critical safe failure, 56 Non-critical state, 343 Non-critical unsafe failure, 57 Non-critical up state, 48 Non-explicit dependency. See Non-tangible dependency, 108 Non-lethal shock, 116, 319 Non-monotone / non-monotonic Boolean function, 334 Non-monotone / non-monotonic logic function, 235 Non-operating failure, 55 Non-operating state, 45 Non-repairable item, 62 Non-repaired item, 63, 308, 469 Non-repaired non-lethal shock, 323 Non-tangible CCF, 326 Non-tangible dependency, 108 NOR gate (probabilistic calculation), 249 NOT gate, 237, 391 NOT gate (probabilistic calculation), 249 NOT operator, 199, 214 Not revealed failure. See Hidden failure, 54 Number of failures, 64 Number of observed failures, 64, 373 Number of operating failures, 65 Numerical calculations of Markov processes, 497

O Occurrence rate of a non-lethal shock, 323 Oil export pumping system, 717 On-demand failure, 56

Online diagnostics, 761 Operating expenditures (OPEX), 715 Operating failure, 55 Operating state, 45 Operation philosophy, 30, 458 Opportunity, 21 OR gate (probabilistic calculation), 245 Overall repair time, 62 Overpressure protection system (OPPS), 427, 717

P Parallel structure, 329 Parallel structure (probabilistic calculation), 246 Parameters changing when conditions change (Monte Carlo), 564 Partial and full stroking tests, 435 Perfect state, 48, 124, 465 Peril, 18 Periodically tested item, 472, 518 Periodic proof test. See Periodic tests, 463 Periodic tests, 54 Petri net approach, 34, 127, 285, 587, 588 Petri net approach exercises (service station), 679 Closed at night, 692 Evolution of the entrance and exit queues, 701 Failure and repair process, 684 Limited queue at the entrance, 689 Limited queue at the exit, 690 Link between queuing and failure/repair processes, 687 Mobilisation and spare part modelling, 695 Monte Carlo simulation with one pump, 697 Monte Carlo simulation with one pump open night and day, 702 Night and day model, 692 Overall queuing and refuelling model, 690 Queuing and refuelling processes, 682 Repair team unavailable at night, 693 Petri net as support for Monte Carlo simulation, 547, 589 Petri net model, 591 Arc, 592 Downstream arc, 592 Downstream weighted arc, 601 Inhibitor arc, 601

Reset arc, 602 Upstream arc, 592 Upstream weighted arc, 601 Assertion, 602 General assertion, 602 Local assertion, 602 Firing rules, 592, 603 Marking of a place, 592 Place, 591 Downstream place, 592 Repeated place, 602 Upstream place, 592 Predicate, 602 Probabilistic switches, 607 Token, 592 Transition, 591 Arbitrary distribution, 596 Constant delay equal to zero, 596 Constant delay not equal to zero, 596 Dynamic transitions, 609 Exponential law, 596 Priority of the Transitions, 604 Transition with memory, 604 Validation of transitions, 592, 603 Petri net modelling (Production systems), 727 Direct building of Petri nets, 727 Use of FD-Driven Petri nets, 736 Physical delay (Petri nets), 595 Piper Alpha (North Sea, Scotland, 1988), 4 Piping and instrumentation diagram (P&ID), 158 Planned maintenance, 463, 714 Full shutdown, 714 Partial shutdown, 714 Planned testing, 712 Point value, 373 Positive risk, 20, 21 Potential accident, 145 Predictive analysis (forward analysis, top-down reasoning), 420 Preliminary hazard analysis (PHA), 32, 141, 145, 177 Pre-processing of Petri nets, 622 Pressure control valve (PCV), 717 Pressure sensor high high (PSHH), 717 Pressure sensor high (PSH), 427 Pressure transmitter (PT), 717 Prestige (France, 2002), 4 Prevention / preventive measure, 42, 147 Preventive maintenance (PM), 61, 461, 710, 718 Primary event (FT leaf), 211, 289

Prime implicant (non-coherent RBDs and FTs), 277 Prime implicant (non-coherent RBDs and FTs), 241 Priority AND gate (PAND), 304, 391, 415, 807 Probabilistic calculation of basic logic structures, 245 Probabilistic calculations (Functional safety) Boolean Approach, 801 Accident frequency calculation (functional safety), 806 Accident occurrence modelling, 800 Common cause failure modelling, 804 Failure frequency and PFH calculations, 810 FT-driven Markov processes, 801 Multiple Safety Systems, 808 Parallel system, 803 RBD-driven Markov processes, 801 Series system, 801 Series–parallel system, 804 Unavailability and PFDavg calculations, 809 Markovian approach, 793 Multiphase Markov model, 794 Petri Net approach, 811 RBD driven Petri nets, 811 Simplified analytical approach (functional safety), 775 Systemic approaches, 755 Probabilistic calculations (functional safety), 774 Probability density function (PDF), 562 Probability of dangerous failure on demand average (PFDavg). See Average (mean) unavailability, 295, 377, 519, 756 Probability of dangerous failure per hour (PFH). See Average failure frequency, 97, 298, 488, 756 Probability of failure, 40 Probability of the conjunction (intersection) of events, 190 Probability of the disjunction (union) of events, 189 Process flow diagram (PFD), 30 Process safety time (PST), 131, 762 Production availability study, 29, 461, 715 Production level, 710 Production system, 123, 709

Proof tests, 758 Optimum test interval, 778 Proof test interval, 756 Staggered proof tests, 758 Synchronous proof tests, 758 Propagating failures. See Cascade failures, 106 Protection layer, 127, 402 Pseudo error factor, 376, 382, 561, 850

Q Qualitative analysis, 385 Quantitative analysis, 385

R Random failure, 51 Random history, 548 Random number generation (Monte Carlo), 554, 587 Random number generator, 547 Random process, 80, 471 Random variable, 80, 373, 553 Rare event, 80, 716 Rasmussen N (Wash 1400), 10 RBD Building (exercise), 430 RBD calculations, 285 RBD-driven Markov process, 35, 135, 202, 290, 467, 532 RBD-driven Petri nets, 136, 416, 641 Blocks, 641 Majority vote (m out of n) node, 642 Parallel node, 642 Serial node, 641 Reducing the size of the Markov models, 522 Redundancy. See Fault tolerance, 58, 59, 227 Active redundancy, 59 Diverse redundancy, 59 Mixed redundancy, 59 Standby redundancy, 59 Reliability R(t), 40, 82, 286 Reliability, availability, maintainability and safety (RAMS), 18 Reliability-based inspection (RBI), 13 Reliability Block Diagram modelling (production systems), 721 Additional safety times, 724 Limited number of spare parts, 724 Number of repair teams, 724 Limited number of repair teams, 724 Reliability block diagram (RBD), 35, 183, 195, 210, 265, 307, 373 Block, 195

Inverted block, 199 Repeated block, 199 Majority vote (m out of n) logic, 199 Parallel structure, 198 Series structure, 198 Sub-RBD, 201 Transfer gate, 201 Reliability calculations (Boolean family), 303, 307 Non-repaired items, 309 Repaired items, 313 Reliability calculations (Markov graph), 484 Reliability calculations (Petri nets). See Reliability estimation (Petri nets) Reliability centred maintenance (RCM), 13 Reliability chronogram, 287 Reliability data collection, 839 Battery limits, 842 Field feedback, 839 Reliability data estimation, 844 Accumulated observation period of time, 844 Accumulated operating time, 846 Accumulated repair time, 845 Bayesian approach Likelihood of data, 852 Chi-square distribution Degree of freedom, 848 Confidence interval, 848 Confidence lower bound, 849 Confidence upper bound, 849 Confidence level, 848 Delphi approach, 844 Expert judgment, 844 Frequentist approach, 852 Gamma distribution Chi-square distribution, 853 Erlang distribution, 853 Log-Normal distribution Error factor, 850 Maximum likelihood estimator (MLE), 844 Median value estimator, 845 Reliability engineering, 9, 197, 393, 471 Reliability estimation (Petri nets), 630 Reliability Markov graph, 484 Reliability parameter, 373 Reliability theory, 8 Repair, 62 Repairable item, 62 Repaired item, 63, 311, 469 Repaired non-lethal shock, 324 Repair frequency, 491

Repair priority, 462, 617 Non-urgent repair, 617 Opportunistic repair, 617 Urgent repair, 617 Repair time, 62 Repeated block, 249, 321 Repeated events, 272 Repeated primary event, 249, 321 Response surface, 398, 597 Restoration, 61 Restoration events, 713 Active repair time, 713 Administrative delays, 714 Logistic times, 713 Ramp-up time, 713 Run-down time, 713 Safety times, 713 Restoration state, 46 Retro-feedback impact, 458 Revealed failure, 54 Revenues, 714 Risk, 18, 21 Economic related risk, 752 Environment related risk, 752 Safety related risk, 752 Risk achievement worth (RAW), 362 Risk matrix, 22, 162, 169, 404 Acceptable zone, 752 Not acceptable zone, 752 Tolerable zone, 752 Risk priority number (RPN), 153, 169 Risk-reducing measure. See Correcting / mitigating measure, 153 Risk reduction principle, 752, 754 Necessary risk reduction, 751 Risk reduction factor (RRF), 752 Risk reduction worth (RRW), 362 RMS Titanic (North Atlantic Ocean, 1912), 4 RND, RAND, RANDOM functions, 555 Root cause. See Underlying root cause, 104

S Safe failure, 56, 132 Safe failure fraction (SFF), 765 Safeguard, 159, 385 Safe state, 465, 466 Safety and dependability, 18 Core concepts, 43 Model overview, 33 Related standardization, 833 Safety instrumented function (SIF), 402

Safety instrumented system (SIS), 133, 377, 717, 749 Safety integrity, 134, 749 Safety integrity level (SIL), 402, 751 Permanent SIL, 759 SIL 0, 753 SIL 1, 753 SIL 2, 753 SIL 3, 753 SIL 4, 753 SIL certificates, 753 Safety principles, 764 Emission of power principle, 764 Loss of power principle, 764 Safety system, 127 Safety valve, 427 Santiago de Compostela (Spain, 2013), 4 Statistical estimation, 547 Second priority repairs, 712 Self-approximation, 548 Self-revealed failure, 54, 130 Semi-catastrophic model (CCF), 534, 564, 613 Semi-Markov process, 474 Semi-quantitative analysis, 257 Semi-quantitative analysis (exercise), 434, 437 Sensor, 132 Separator (SEP), 717 Sequence modelling (Markov), 501 Sequence of events, 147, 385 Sequence of events, 502 Sequential analysis methods, 385 Sequential events, 197, 211 Series expansion of the exponential of a Markovian matrix, 497 Series structure, 329 Series structure (probabilistic calculation), 245 Severity of the consequences, 386 Severity ranking, 140, 149, 167, 404 Seveso accident (Italy, 1976), 4, 12 Shannon decomposition, 268 Shift crew rotation, 463 Shock model (CCF), 116 Markov graphs, 533 Petri nets, 612 RBDs and FTs, 233, 328 Shutdown valve (SDV), 717 Simulation of constant delays, 558 Constant delays, 558 Delays for periodically tested components, 558

Simulation of random delays, 556 Erlang law, 557 Exponential law, 556 Lognormal law, 557 Uniform law, 556 Weibull law, 556 Single cause, 33 Single-point / single failure, 60, 227 Single-point / single failure criterion, 227 Single protection layer, 754 Software failure, 50 Software related dependency, 108 Space state, 185 Spare part management, 463 Spare part modelling (Petri nets), 616 Spare part provisioning, 547, 745 Spurious failure, 56, 712 Standard content, 832 Guidelines, 832 Standards properly speaking (IS), 832 Informative clauses, 832 Normative clauses, 832 Technical reports (TR), 832 Standard development, 831 Committee draft for vote (CDV) (IEC standards), 831 Committee draft (CD), 831 Draft international standard (DIS), 831 Final draft international standard (FDIS), 831 International standard (IS), 832 New work item proposal (NWIP), 831 Working draft (WD), 831 Standard deviation, 373 Standardization about data, 835, 841 IEC/TR 63162, 842 IEC 60300-3-2: Collection of dependability data, 835 IEC 60706-3, 842 IEC 61709, 842 IEC 63142, 842 IEC 63142: Reliability prediction for electronic components, 835 ISO 14224: Reliability data collection and exchange, 834, 841 ISO 6527, 841 ISO 7385, 841 Standardization about maintainability, 835 IEC 60300-3-10: Maintainability, 835 IEC 60300-3-11: Reliability Centered maintenance, 835 IEC 62550: Spare part provisioning, 835 Standardization bodies, 831

International level, 831 International Electrotechnical Commission (IEC), 831 International Standard Organization (ISO), 831 International Telecommunication Union (ITU), 831 National level, 831 American National Standards Institute (ANSI), 831 British Standards Institution (BSI), 831 French normalisation body (AFNOR), 831 Standard Norway (SN), 831 Standardization administration of the People’s Republic of China (SAC), 831 Project teams (PT), 831 Regional level, 831 CENELEC (European Committee for Electrotechnical Standardization), 831 European committee for standardization (CEN), 831 European Telecommunication Standard Institute (ETSI), 831 Pacific Area Standard Congress (PASC), 831 Sectoral level, 831 American Petroleum Institute (API), 831 French National standardization bureau for petroleum industry (BNPE), 831 Institute of Electrical and Electronic Engineers (IEEE), 831 US Department Of Defense (DOD) (military standards), 831 Technical committees (TCs), 831 IEC ACOS: Advisory Committee On Safety, 834 IEC/TC 1: dependability terminology (IEV 192), 834 IEC/TC 56: Dependability, 833 IEC/TC 65 A: Industrial process measurement, control and automation - system aspects, 833 ISO/TC 176: Quality management, 834 ISO/TC 251: Asset management, 834 ISO/TC 262: Risk management, 834 ISO/TC 69: Application of statistical methods, 834

ISO/TC 67/WG4: Materials, equipment and offshore structures for petroleum, petrochemical and natural gas industries - Reliability engineering and technology, 833 Working groups (WG), 831 Standardization of management, 835 IEC/ISO 31010: Risk management, 835 IEC 60300 series: Dependability management, 835 IEC 62402: Obsolescence management, 835 ISO 15663 Ed.1.0 (2021): Life cycle costing, 835 Standardization of methods, 834 IEC 60300-3-1: Analysis techniques for dependability, 835 IEC 60300-3-3: Life cycle costing, 835 IEC 60605-6: Test for constant failure rate, 835 IEC 60812: FMEA, 834 IEC 61014: Program for reliability growth, 835 IEC 61025: Fault tree analysis, 834 IEC 61078: Reliability block diagrams, 834 IEC 61163-2: Reliability test screening, 835 IEC 61165: Markovian technique, 834 IEC 61882: HAZOP, 834 IEC 62347: Dependability specifications, 835 IEC 62502: Event tree analysis, 834 IEC 62506: Accelerated testing, 835 IEC 62508: Human aspects, 835 IEC 62551: Petri net technique, 834 IEC 62628: Software dependability, 835 IEC 62673: Communication networks, 835 IEC 62740: Root cause analysis, 834 IEC 62853: Open systems, 835 ISO 20815: Production assurance, 834 Standardization process, 830 Standardization versus regulation and certification, 830 Standardized guidelines, 832 ISO/IEC guide 51: Safety standards, 832 ISO guide 73: Risk management, 834 Standby failure, 55 State efficiency, 122, 124, 512 State-machine, 424 State probability vector, 477 State-transition approach, 471

State-transition graph, 473 State-transition model, 289 State-transition process, 473 Static models, 34, 285 Statistical data sample, 373, 549 Statistic estimations, 587 Steady state, 469 Steady state availability. See Asymptotic (steady state) availability, 89 Steady state unavailability. See Asymptotic (steady state) unavailability, 89 Stepper (Petri net animation), 599 Stochastic delay (Petri nets), 595 Stochastic Petri net, 468, 596 Stochastic process. See Random process, 80 Stochastic (random) processes, 464 Structured What if? technique (SWIFT), 176 Sub-fault tree (sub-FT), 214 Sub-PN module (Petri nets), 641 Sûreté de fonctionnement (SdF), 18 Suspended event, 463, 582, 604 Sylvester-Poincaré formula, 252, 259, 288 Sylvester-Poincaré formula shortcomings, 265 Synchronous automata, 588 Systematic failure, 51 Systematic review, 173 Systemic approach / point of view, 15 Systemic dependency, 211, 458, 473, 485, 547 System modelling language (SysML), 424 System reconfiguration (production systems), 709 Systems typology, 468

T Table of impacted transitions (Petri nets), 622 Tangible common cause failure, 320 Tangible dependency, 108 Team in charge of the study. See Multidisciplinary study team, 173 Team leader, 158, 174 Technique for human error prediction (THERP), 10 Temporized (timed) Petri net, 594 Tenerife (Spain, 1977), 4 Tested repaired items (exercise), 446 Test staggering, 450 Three Mile Island (USA, 1979). See Common cause failure (CCF), 3, 16, 110

Tie set, 222 Tie set identification (exercise), 431 Time-dependent calculations (Markov), 475 Time-dependent calculations (RBDs and FTs), 285 Time-dependent failure, 55 Time-independent calculations (RBDs and FTs), 189, 192 Time-independent CCF, 327 Time-independent failure, 55 Timetable (Petri nets), 621 Tolerable risk, 22 Top-down (deductive) approach, 9, 196 Top event, 211, 229 Torrey Canyon (Scilly Islands, United Kingdom, 1967), 4 Trajectory of the underlying random (stochastic) process, 464 Transfer gate, 391 Transition rate, 474 Transportation of maintenance people, 463 Triggering event, 141, 147 Triggering event (Petri nets), 608

U Ultimate CCF, 59 Unavailability, 40, 211 Unavailability calculations (Boolean family), 287, 303 Asymptotic unavailability (steady state unavailability), 295 Average unavailability calculations, 293 Instantaneous unavailability, 288 Unavailability calculations (exercise), 446 Unavailability calculations (Markov graph), 482 Asymptotic (steady state) unavailability, 480 Average unavailability, 483 Instantaneous unavailability, 482 Unavailability calculations (Petri nets), 627 Average unavailability, 627 Instantaneous unavailability, 627 Unavailability calculation with CCF (exercise), 449 Unavailability calculation with test staggering (exercise), 450 Uncertainty, 19 Uncertainty handling in SIL calculations, 817 70% upper bound approach, 817 90th percentile of the full result distribution, 817

Uncertainty propagation (exercise), 453 Uncertainty propagation (Monte Carlo), 330, 373, 382, 562 Unconditional failure intensity, 96, 487 Underlying root cause, 104, 319 Unified modelling language (UML), 424 Uniform distribution (UNI), 553 Unreliability calculations (Boolean family), 211, 446 Unreliability calculations (Markov graph), 485 Unreliability estimation (Monte Carlo), 552 Unreliability estimation (Petri net), 630 Unreliability, F(t), 40, 84, 86 Unreliability function (CDF of the time to failure), 553 Unsafe failure, 56, 132 Unwanted consequence, 174 Unwanted effect, 218 Unwanted event. See Top event, 229, 388 Updating occurrence dates (Monte Carlo), 564 Aging modelling, 569 Application to Weibull distributions, 577 Failure rate continuity, 569 Probabilistic continuity, 567 Temporal continuity, 568 Updating failure dates on the fly, 570 Up state, 44, 195, 210, 235, 285 Up state class, 121, 464, 475 Useful life period, 94

V Valid transition (Petri nets), 592 Variable order (BDD), 270

Venn diagram, 185 Vesely failure rate. See Unconditional failure intensity, 299, 312, 487 Vesely-Fussell importance factor, 260, 300, 351 Vesely-Fussell importance factor (exercise), 436 Viking Sky (Norway 2019). See Common cause failure (CCF), 21, 110 Von Neumann, 548

W Walk throughout Boolean models (BDD building), 282 Weak point, 227, 257 Wear out failure, 53 Wear out period, 94 Weather conditions, 710 What-if? method, 140, 174, 179 Work order management, 463

X XOR gate. See Exclusive disjunction (exclusive union), 236

Y Yancheng (China 2019), 21

Z Zero-duration state modelling, 500 Zero risk, 4 Zone analysis, 112