English Pages 432 [439] Year 2021
Safety and Reliability Modeling and Its Applications
Advances in Reliability Science covers traditional topics in reliability engineering (degradation models, dynamic network and product reliability, maintenance and reliability statistics) as well as important emerging topics such as multi-state systems reliability and reliability decision-making. All of these areas have developed considerably in recent years, with the rate of reliability research output climbing steeply. Books in this series showcase the latest original research & development in reliability engineering science from industry and academia, while exploring innovative research ideas for researchers considering new projects and exploring the real-world utility of these concepts for practitioners.

Series Editor: Mangey Ram, Professor at Graphic Era University, Dehradun, India

Safety and Reliability Modeling and Its Applications. Mangey Ram, Hoang Pham. 978-0-12-823323-8
Reliability and Maintenance Optimization in Multi-indenture Systems. Won Young Yun. 978-0-323-85054-4
Reliability Analysis and Asset Management of Engineering Systems. Gilberto Francisco Martha de Souza et al. 978-0-12-823521-8
Engineering Reliability and Risk Assessment. Harish Garg, Mangey Ram. 978-0-323-91943-2
Reliable and Resilient Logistics Systems. Agnieszka Tubis, Sylwia Werbińska-Wojciechowska. 978-0-323-91752-0
Safety and Reliability Modeling and Its Applications
Edited by
Mangey Ram Graphic Era (Deemed to be University), Dehradun, India
Hoang Pham Rutgers University, New Jersey, United States
Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States Copyright © 2021 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. 
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress ISBN: 978-0-12-823323-8 For Information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals Publisher: Matthew Deans Acquisitions Editor: Brian Guerin Editorial Project Manager: Emily Thomson Production Project Manager: Kamesh Ramajogi Cover Designer: Mark Rogers Typeset by Aptara, New Delhi, India
Contents

Preface
Acknowledgement
About the Editors
List of Contributors

1 Reliability analysis of asphalt pavements: concepts and applications
Abhishek Mittal
1.1 Preamble
1.2 Concepts of reliability
1.3 Literature regarding the application of reliability concepts for asphalt pavements
1.4 Issues with estimation of pavement reliability
1.5 Conclusions
Acknowledgements
Disclosure statement
References

2 Markov modeling of multi-state systems with simultaneous component failures/repairs, using an extended concept of component importance
Jacek Malinowski
2.1 Introduction
2.2 Basic assumptions, notation and definitions
2.3 Theoretical background
2.4 The illustrative model of an example system
2.5 Intensities of transitions between the system states
2.6 Obtaining useful reliability parameters from transition intensities
2.7 Conclusion and future work
References

3 Reliability analysis of solar array drive assembly by dynamic fault tree
Tudi Huang, Hong-Zhong Huang, Yan-Feng Li, Lei Shi and Hua-Ming Qian
3.1 Introduction
3.2 DFT method
3.3 DFT Modeling for SADA
3.4 Reliability analysis of SADA
3.5 Conclusion
Acknowledgements
References

4 Reliability and maintainability of safety instrumented system
Rajesh S. Prabhu Gaonkar and Mahadev V. Verlekar
Introduction
Literature review
Problem formulation solution methodology
Reliability and maintainability
Case study on reliability and maintainability of SIS
Fault analysis
Conclusion
References

5 Application of Markovian models in reliability and availability analysis: advanced topics
Danilo Colombo, Danilo T.M.P. Abreu and Marcelo Ramos Martins
5.1 Introduction
5.2 Markov chains theoretical foundation
5.3 Application of Markov chains to the reliability and availability analysis of engineering systems
5.4 Importance measures using Markov chains
5.5 Uncertainty propagation in Markov chains
5.6 Multiphase Markov chains and their application to availability studies
5.7 Final considerations
References

6 A method of vulnerability analysis based on deep learning for open source software
Yoshinobu Tamura and Shigeru Yamada
Introduction
Deep learning approach to fault big data
Estimation of Vulnerability Based on Deep Learning
Numerical Examples for Estimation of Vulnerability
Concluding remarks
Acknowledgements
References

7 Mathematical and physical reality of reliability
Jezdimir Knezevic
Dedication
Introduction
Mathematical reality of reliability
Voyage to the ice
Physical meanings of mathematical reality of reliability
Physical reality of reliability
Mathematical versus physical reality of reliability
Closing Question
Acknowledgement
References

8 Optimum staggered testing strategy for 1- and 2-out-of-3 redundant safety instrumented systems
Sun-Keun Seo and Won Young Yun
8.1 Introduction
8.2 PFD of redundant safety systems
8.3 Staggered testing in 1-out-of-3 structure
8.4 Staggered testing in 2-out-of-3 structure
8.5 Conclusion
References

9 Modified failure modes and effects analysis model for critical and complex repairable systems
Garima Sharma and Rajiv Nandan Rai
9.1 Introduction
9.2 Repairable systems and imperfect repair
9.3 Fuzzy AHP
9.4 Estimation of RPN
9.5 Case study
9.6 Conclusion and future scope
Exercise
References

10 Methodology to select human reliability analysis technique for repairable systems
Garima Sharma and Rajiv Nandan Rai
10.1 Introduction
10.2 Selection of the best HRA technique for a particular case
10.3 Case study of space station
10.4 Conclusion and future scope
Exercise
Appendix
References

11 Operation risk assessment of the main-fan installations of mines in gas and nongas conditions
G.I. Grozovskiy, G.D. Zadavin and S.S. Parfenychev
11.1 Introduction
11.2 The ventilation system failures role in assessing the risk of flammable gases explosion
11.3 Analysis of the occurrence and development of accidents
11.4 Analysis of the probability of explosion of flammable gases/hydrogen sulfide at the mine from electrical equipment
11.5 The risk analysis results
11.6 Conclusion
References

12 Generalized renewal processes
Paulo R.A. Firmino, Cícero C.F. de Oliveira and Cláudio T. Cristino
12.1 Introduction
12.2 The GRP models
12.3 The UGRP modeling
12.4 The WGRP modeling
12.5 The Gumbel GRP (GuGRP) modeling
12.6 Conclusion
Acknowledgement
References

13 Multiresponse maintenance modeling using desirability function and Taguchi methods
Suraj Rane, Raghavendra Pai, Anusha Pai and Santosh B. Rane
13.1 Introduction
13.2 Related works
13.3 Methodology
13.4 Case study
13.5 Result analysis
13.6 Conclusion and future research directions
References

14 Signature-based reliability study of r-within-consecutive-k-out-of-n: F systems
Ioannis S. Triantafyllou
14.1 Introduction
14.2 The signature vector of the r-within-consecutive-k-out-of-n: F structure
14.3 Further reliability characteristics of the r-within-consecutive-k-out-of-n: F structure
14.4 Signature-based comparisons among consecutive-type systems
14.5 Discussion
References

15 Assessment of fuzzy reliability and signature of series–parallel multistate system
Akshay Kumar, Meenakshi Garia, Mangey Ram and S.C. Dimri
15.1 Introduction
15.2 Fuzzy Weibull distribution
15.3 Evolution of signature, tail signature, minimal signature, and cost from structure function of the system
15.4 Algorithm for computing the system availability (see Levitin, 2005) as
15.5 Example
15.6 Conclusion
References

Index
Preface

Safety and reliability analysis is definitely one of the most multidimensional topics in system reliability engineering nowadays. This rapid development creates many opportunities and challenges for both industrialists and academics, and has completely changed the global design and systems engineering environment. More of the modeling tasks can now be undertaken within a computer environment using simulation and virtual reality technologies.

During the last 50 years, numerous research studies have been published that focus on safety and reliability engineering. Supplementary experience has also been gathered from industry. Therefore, safety and reliability engineering has emerged as one of the main fields not only for scientists and researchers but also for engineers and industrial managers.

This book covers the recent developments in safety and reliability modeling and its applications. It presents new theoretical issues that were not previously presented in the literature, as well as the solutions of important practical problems and case studies illustrating the applications methodology.

The book Safety and Reliability Modeling and Its Applications is a combined work of a number of leading scientists, analysts, mathematicians, statisticians, and engineers who have been working on the front end of safety and reliability science and engineering. All chapters in the book are written by leading researchers and practitioners in their respective fields of expertise and present various innovative methods, approaches, and solutions not covered before in the literature.

Mangey Ram, Dehradun, India
Hoang Pham, New Jersey, USA
Acknowledgment

The editors acknowledge Elsevier and the editorial team for their adequate and professional support during the preparation of this book. Also, we would like to acknowledge all the chapter authors and the reviewers for their availability to work on this book project.

Mangey Ram, Graphic Era (Deemed to be University), India
Hoang Pham, Rutgers University, USA
About the Editors

Prof. Dr. Mangey Ram received the Ph.D. degree major in Mathematics and minor in Computer Science from G. B. Pant University of Agriculture and Technology, Pantnagar, Uttarakhand, India. He has been a faculty member for around twelve years and has taught several core courses in pure and applied mathematics at undergraduate, postgraduate, and doctorate levels. He is currently the Research Professor at Graphic Era (Deemed to be University), Dehradun, India. Before joining the Graphic Era, he was a deputy manager (probationary officer) with Syndicate Bank for a short period. He is the editor-in-chief of International Journal of Mathematical, Engineering and Management Sciences, Journal of Reliability and Statistical Studies; the editor-in-chief of six Book Series with Elsevier, CRC Press-A Taylor and Francis Group, Walter De Gruyter Publisher Germany, River Publisher; and the guest editor and member of the editorial board of various journals. He has published more than 250 research publications (journal articles/books/book chapters/conference articles) in IEEE, Taylor & Francis, Springer, Elsevier, Emerald, World Scientific, and many other national and international journals and conferences. Also, he has authored/edited more than 50 books for international publishers such as Elsevier, Springer Nature, CRC Press-A Taylor and Francis Group, Walter De Gruyter Publisher Germany, and River Publisher. His fields of research are reliability theory and applied mathematics. Dr. Ram is a Senior Member of the IEEE, Senior Life Member of Operational Research Society of India; Society for Reliability Engineering, Quality and Operations Management in India; Indian Society of Industrial and Applied Mathematics. He has been a member of the organizing committee of a number of international and national conferences, seminars, and workshops. He has been conferred with “Young Scientist Award” by the Uttarakhand State Council for Science and Technology, Dehradun, in 2009. He has been awarded the “Best Faculty Award” in 2011, “Research Excellence Award” in 2015, and “Outstanding Researcher Award” in 2018 for his significant contributions in academics and research at Graphic Era Deemed to be University, Dehradun, India.

Dr. Hoang Pham is a Distinguished Professor and former Chairman (2007–2013) of the Department of Industrial and Systems Engineering at Rutgers University, New Jersey. Before joining Rutgers, he was a Senior Engineering Specialist with the Idaho National Engineering Laboratory and Boeing Company. He received his Ph.D. in Industrial Engineering from the State University of New York at Buffalo. His research areas include reliability modeling of systems with competing risks and random environments, software reliability, and statistical inference. He is the editor-in-chief of the International Journal of Reliability, Quality and Safety Engineering and an associate editor and editorial board member of several journals, and the editor of Springer Series in Reliability Engineering. His numerous awards include the 2009 IEEE Reliability Society Engineer of the Year Award. Dr. Pham is the author/coauthor of 7 books and has published his work in over 190 journal articles, 100 conference papers, and edited 18 books including Springer Handbook in Engineering Statistics and Handbook in Reliability Engineering. He has delivered over 40 invited keynote and plenary speeches at many international conferences and institutions. He is a Fellow of the IEEE and IIE.
List of Contributors

Danilo T.M.P. Abreu, Analysis, Evaluation and Risk Management Laboratory (LabRisco), University of São Paulo, São Paulo, SP, Brazil
Danilo Colombo, Petrobras R&D Center (CENPES), Rio de Janeiro, RJ, Brazil
Cláudio T. Cristino, Department of Statistics & Informatics, Federal Rural University of Pernambuco, Recife-PE, Brazil
S.C. Dimri, Department of Mathematics, Computer Sciences and Engineering, Graphic Era (Deemed to be University), Uttarakhand, India
Paulo R.A. Firmino, Center for Science and Technology, Federal University of Cariri, Juazeiro do Norte-CE, Brazil
Rajesh S. Prabhu Gaonkar, School of Mechanical Sciences, Indian Institute of Technology Goa (IIT Goa), Farmagudi, Ponda, Goa, India
Meenakshi Garia, Department of Mathematics, M.B.P.G. College, Haldwani, Nainital, Uttarakhand, India
G.I. Grozovskiy, Deputy Director General on Science, Doctor of Engineering, Professor, OJSC Scientific Technical Centre (STC) Industrial Safety, Moscow, Russia
Hong-Zhong Huang, Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
Tudi Huang, Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
Jezdimir Knezevic, MIRCE Akademy, Exeter, UK
Akshay Kumar, Department of Mathematics, Graphic Era Hill University, Uttarakhand, India
Yan-Feng Li, Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
Jacek Malinowski, Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warszawa, Poland
Marcelo Ramos Martins, Analysis, Evaluation and Risk Management Laboratory (LabRisco), University of São Paulo, São Paulo, SP, Brazil
Abhishek Mittal, Principal Scientist, CSIR-Central Road Research Institute (CSIR-CRRI), New Delhi, India
Cícero C.F. de Oliveira, Federal Institute of Education, Science and Technology of Ceará, Crato-CE, Brazil
Anusha Pai, Associate Professor, Computer Engineering Department, Padre Conceicao College of Engineering, Verna, Goa, India, 403722
Raghavendra Pai, Project Management Office Lead (Asia Pacific), Syngenta, Corlim, Ilhas, Goa, India, 403110
S.S. Parfenychev, Researcher Junior, OJSC Scientific Technical Centre (STC) Industrial Safety; Master’s degree student of the Moscow Aviation Institute, Faculty № 3 Control Systems, Informatics and Power Engineering, Department 307 Digital Technologies and Information Systems, Moscow, Russia
Hua-Ming Qian, Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
Rajiv Nandan Rai, Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Mangey Ram, Department of Mathematics, Computer Sciences and Engineering, Graphic Era (Deemed to be University), Uttarakhand, India
Santosh B. Rane, Dean-Academics, Sardar Patel College of Engineering, Andheri, Mumbai, India, 400058
Suraj Rane, Professor, Mechanical Engineering Department, Goa College of Engineering, Farmagudi, Goa, India, 403401
Sun-Keun Seo, Department of Industrial and Management Systems Engineering, Dong-A University, Busan, Korea
Garima Sharma, Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Lei Shi, Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
Yoshinobu Tamura, Tokyo City University, Tamazutsumi 1-28-1, Setagaya-ku, Tokyo 158-8557, Japan
Ioannis S. Triantafyllou, Department of Computer Science & Biomedical Informatics, University of Thessaly, Lamia, Greece
Mahadev V. Verlekar, Deccan Fine Chemicals (India) Pvt. Ltd., Santa Monica Works, Corlim, Ilhas, Goa, India
Shigeru Yamada, Tottori University, Minami 4-101, Koyama, Tottori-shi, 680-8552 Japan
Won Young Yun, Department of Industrial Engineering, Pusan National University, Busan, Korea
G.D. Zadavin, Adviser to the Director General, Candidate of Engineering Science, OJSC Scientific Technical Centre (STC) Industrial Safety, Moscow, Russia
Chapter 1
Reliability analysis of asphalt pavements: concepts and applications
Abhishek Mittal
Principal Scientist, CSIR-Central Road Research Institute (CSIR-CRRI), New Delhi, India
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00009-X
Copyright © 2021 Elsevier Inc. All rights reserved.

1.1 Preamble

The development of infrastructure, in particular the transportation sector, plays a significant role in the economic growth of any country. Economic growth demands a good road network with good connectivity all over the country. With the reduced availability of funds, highway agencies are placing more emphasis on the design and construction of pavements that require minimum maintenance during their service life. For this, pavements should be designed such that a minimum design reliability (as specified in the country’s national specifications) is achieved, and pavement construction should be carried out with the latest machinery and under stringent quality control requirements.

In India, the majority of roads (more than 90%) are asphalt pavements, popularly known as flexible pavements. This is due to their low construction cost (in comparison to cement concrete/rigid pavements), ease of maintenance, and relatively easier construction procedure. To ensure that the pavement has adequate strength to cater to the expected traffic, it has to be designed properly in accordance with the national specifications. For example, IRC:37 (2018) is followed for the design of flexible pavements in India.

A flexible pavement is a multilayer structure, with layers of materials ranging from the subgrade at the bottom to the bituminous wearing course at the top. A typical three layer pavement structure is presented in Fig. 1.1. The structural design of pavements deals with determining the thicknesses of the various component layers, taking into consideration the material properties and the amount of traffic expected during the design life.

FIG. 1.1 A typical three layer pavement structure (Dilip et al., 2013)

The current Indian pavement design procedure (IRC:37, 2018) is a deterministic one wherein the various input variables, such as layer thicknesses, Poisson’s ratio, elastic modulus, and design traffic, are all considered fixed. In reality, however, none of them is deterministic; they are all stochastic (probabilistic). So, in order to develop reliable pavement designs, the uncertainty/variability of the input variables needs to be considered in the design process. This can be addressed through the use of reliability concepts within the pavement design process. The details of the reliability concepts and their application to the pavement design process, specifically in the context of flexible pavements, are discussed in the following sections.
1.2 Concepts of reliability

Reliability is defined as “the probability that a component or system will perform a required function for a given period of time when used under stated operating conditions” (Modarres et al., 1999). In mathematical notation, reliability can be expressed as:

$$R = P(T \ge t \mid c_1, c_2, \ldots) \tag{1.1}$$

where
$t$ = the designated period of time or cycles for the system’s operation
$T$ = time to failure or cycles to failure
$R$ = reliability of the system
$c_1, c_2, \ldots$ = designated conditions, such as environmental conditions

Often, in practice, the designated operating conditions for a system, $c_1, c_2, \ldots$, are implicitly considered in the probabilistic reliability analysis, and thus Eq. (1.1) reduces to

$$R = P(T \ge t) \tag{1.2}$$
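As a minimal sketch of Eq. (1.2), the snippet below evaluates $R = P(T \ge t)$ under the assumption that the time to failure $T$ follows a Weibull distribution. The distribution choice, the function name `weibull_reliability`, and the numerical values are illustrative assumptions and do not come from the chapter.

```python
import math


def weibull_reliability(t, shape, scale):
    """Return R(t) = P(T >= t) for a Weibull time to failure.

    The Weibull model is an assumption made for illustration only;
    `shape` and `scale` are its usual parameters (scale in the same
    units as t).
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return math.exp(-((t / scale) ** shape))


# Hypothetical example: characteristic life of 20 years, shape 2.0.
r_15 = weibull_reliability(15.0, shape=2.0, scale=20.0)
print(f"R(15 years) = {r_15:.3f}")
```

Any other lifetime distribution could be substituted; the only requirement of Eq. (1.2) is that the function returns the survival probability at the designated time $t$.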
Reliability is the probability of successful performance; thus, it is the complement of the probability of failure:

$$R = 1 - P_f \tag{1.3}$$

The probability of failure is always associated with a particular performance criterion. Mathematically, the probability of failure ($P_f$) may be stated as (Aguiar-Moya and Prozzi, 2011):

$$P_f = P(g(X) \le 0) = \int \cdots \int_{g(X) \le 0} f_X(x)\, dx \tag{1.4}$$

where $X$ is the vector of basic random variables, $g(X)$ is the limit state (or failure) function for the failure mode considered, and $f_X(x)$ is the joint probability density function of the vector $X$. The expression $g(X) < 0$ indicates the failure domain, $g(X) > 0$ indicates the safe domain, and $g(X) = 0$ denotes the failure surface.

Estimation of reliability therefore requires the solution of a multidimensional integral that can rarely be solved analytically. For this reason, other methods such as numerical integration become essential. Even numerical integration might not be practically feasible in probabilistic analysis, because the problem has one dimension for each basic variable and the region of interest usually lies in the tails of the distributions (Cronvall, 2011).

The uncertainty from all the sources that may affect the failure of the component (or system) should be considered for a rigorous structural reliability assessment. This clearly involves taking into account all fundamental quantities entering the problem, and also the uncertainties that arise from lack of knowledge and idealized modeling. The structural reliability procedure is outlined by the following steps (Cronvall, 2011):

(a) Identify all significant modes of failure of the structure or operation under consideration, and define failure events.
(b) Formulate a failure criterion or failure function for each failure event.
(c) Identify the sources of uncertainty influencing the failure of the events, model the basic variables and parameters in the failure functions, and specify their probability distributions.
(d) Calculate the probability of failure or reliability for each failure event, and combine these probabilities where necessary to evaluate the failure probability or reliability of the structural system.
(e) Consider the sensitivity of the reliability results to the input, such as basic variables and parameters.
(f) Assess whether the evaluated reliability is sufficient in comparison with a target.
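Steps (b) to (d) and the integral in Eq. (1.4) can be illustrated with a small Monte Carlo sketch. It assumes a hypothetical two-variable limit state $g(X) = C - D$ (capacity minus demand) with independent normal variables, chosen only because that case has a closed-form answer to compare against. The function names and all numbers are assumptions for illustration, not part of the chapter.

```python
import random
from statistics import NormalDist


def limit_state(capacity, demand):
    """Hypothetical limit state g(X) = capacity - demand; g <= 0 is failure."""
    return capacity - demand


def monte_carlo_pf(mu_c, sd_c, mu_d, sd_d, n_samples=200_000, seed=1):
    """Estimate P_f = P(g(X) <= 0) by sampling the basic variables."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        c = rng.gauss(mu_c, sd_c)  # sampled capacity
        d = rng.gauss(mu_d, sd_d)  # sampled demand
        if limit_state(c, d) <= 0:
            failures += 1
    return failures / n_samples


def exact_pf(mu_c, sd_c, mu_d, sd_d):
    """Closed-form P_f, valid only for independent normal C and D."""
    beta = (mu_c - mu_d) / (sd_c**2 + sd_d**2) ** 0.5
    return NormalDist().cdf(-beta)


pf_mc = monte_carlo_pf(10.0, 1.0, 7.0, 1.5)
pf_exact = exact_pf(10.0, 1.0, 7.0, 1.5)
print(f"Monte Carlo P_f = {pf_mc:.4f}, closed form P_f = {pf_exact:.4f}")
print(f"Reliability R = 1 - P_f = {1.0 - pf_mc:.4f}")
```

Sampling avoids evaluating the multidimensional integral directly, but because failures lie in the tails of the distributions, small failure probabilities need many samples for a stable estimate.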
1.2.1 Levels of Reliability Methods

Structural reliability methods are divided into various levels, which are characterized by the extent of information about the structural problem that is used and provided. The levels of reliability methods are given below (Madsen et al., 1986):

(a) Level I methods: Reliability methods that employ only one ‘characteristic’ value of each uncertain parameter are called level I methods. Examples include load and resistance factor formats, including the allowable stress formats.
(b) Level II methods: Reliability methods that employ two values of each uncertain parameter (commonly mean and variance), supplemented with a measure of the correlation between the parameters (usually covariance), are called level II methods. Reliability index methods are examples of level II methods.
(c) Level III methods: Reliability methods that employ probability of failure as a measure, and which therefore require a knowledge of the joint distribution of all uncertain parameters, are called level III methods.
(d) Level IV methods: A reliability method that compares a structural prospect with a reference prospect according to the principles of engineering economic analysis under uncertainty, considering costs and benefits of construction, maintenance, repair, consequences of failure, interest on capital, etc., is called a level IV method. Such design methods are still in the process of development.
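To make the level II description concrete, the sketch below computes a reliability index from only the means, standard deviations, and correlation of two parameters, for a hypothetical linear limit state $g = R_s - S$ (resistance minus load effect). The limit state, the function name `second_moment_beta`, and the numbers are assumptions for illustration; converting the index to a failure probability additionally assumes jointly normal variables.

```python
from statistics import NormalDist


def second_moment_beta(mu_r, sd_r, mu_s, sd_s, rho=0.0):
    """Reliability index for the linear limit state g = R_s - S.

    Uses only second-moment information: means, standard deviations,
    and the correlation coefficient rho between resistance and load.
    """
    cov_rs = rho * sd_r * sd_s
    var_g = sd_r**2 + sd_s**2 - 2.0 * cov_rs
    if var_g <= 0:
        raise ValueError("variance of g must be positive")
    return (mu_r - mu_s) / var_g**0.5


beta_uncorrelated = second_moment_beta(10.0, 1.0, 7.0, 1.5)
beta_correlated = second_moment_beta(10.0, 1.0, 7.0, 1.5, rho=0.3)
# The conversion below is exact only under an assumed joint normal model.
pf_correlated = NormalDist().cdf(-beta_correlated)
print(f"beta (rho = 0)   = {beta_uncorrelated:.3f}")
print(f"beta (rho = 0.3) = {beta_correlated:.3f}, P_f = {pf_correlated:.4f}")
```

A level III analysis of the same problem would instead need the full joint distribution of the two variables rather than just these moments.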
1.3 Literature regarding the application of reliability concepts for asphalt pavements

There is significant variability in the various input parameters involved in the pavement design process. Many of these design inputs cannot be predicted exactly due to lack of knowledge and information and uncertain future socioeconomic conditions. Also, due to nonhomogeneous materials and variable construction practices, there are inherent variations in pavement strength. This uncertainty in prediction and the natural variation of input parameters result in variable pavement system performance and early failures in pavements (Darter and Hudson, 1973). Historically, variation or uncertainty in design has been taken into consideration by the use of safety factors or arbitrary decisions based on experience. However, the use of such safety factors or experience-based decisions without due consideration of the uncertainty of the input variables has resulted in a few failures (Hudson, 1975).
Lemer and Moavenzadeh (1971) developed one of the first models dealing with the reliability of pavements. They pointed out that the factors affecting the degree of variation in pavement system parameters have a significant effect on system reliability. The limit state function for the pavement reliability problem can be written as:

$$D = \log N_F - \log N_A \tag{1.5}$$

where
$N_F$ = allowable number of axle load applications to failure
$N_A$ = number of actual axle load applications

The condition of the pavement is considered to have deteriorated below acceptable limits when $N_A$ exceeds $N_F$, or equivalently, $D < 0$. Assuming lognormal distributions for $N_F$ and $N_A$, the probability of failure is obtained as $P_F = \Phi(-\beta_C)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable and $\beta_C = E(D)/\sigma(D)$ is the reliability index, in which $E(D)$ and $\sigma(D)$ are the mean and standard deviation of $D$ (Darter and Hudson, 1973).

In the simulation model proposed by Alsherri and George (1988) for reliability evaluation of pavements, the following equation based on the present serviceability index was used:

$$R = P(p_f \ge p_t) \tag{1.6}$$

where $p_f$ = present serviceability index at time $t$, and $p_t$ = limiting (terminal) serviceability index, generally set at 2.5 for AASHTO’s design and 3.0 for premium design. The following expression was used to estimate reliability under the assumption that both $p_f$ and $p_t$ are normally distributed:

$$R = \Phi\left[\frac{\mu_{p_f} - \mu_{p_t}}{\left(\sigma_{p_f}^2 + \sigma_{p_t}^2\right)^{1/2}}\right] = \Phi(z_0) \tag{1.7}$$

where
$\Phi$ = standard normal distribution function
$\mu_{p_f}$ = mean value of $p_f$
$\mu_{p_t}$ = mean value of $p_t$
$\sigma_{p_f}$ = standard deviation of $p_f$
$\sigma_{p_t}$ = standard deviation of $p_t$
$z_0$ = standard normal deviate

In the AASHTO (1993) guide for the design of pavement structures, an overall standard deviation was obtained by including the errors in traffic prediction and in pavement performance prediction in order to analyze risk and reliability in the design, and a reliability design factor was determined. The reliability of design was defined as:

$$R\,(\text{percent}) = 100 \times P(N_t \ge N_T) \tag{1.8}$$

where $N_t$ is the actual pavement performance and $N_T$ is the actual design period traffic in ESAL. The overall variance ($S_0^2$) was defined as the sum of the variance in traffic prediction ($S_w^2$) and the variance in prediction of pavement performance ($S_N^2$):

$$S_0^2 = S_w^2 + S_N^2 \tag{1.9}$$

The following equation for the reliability design factor ($F_R$) was derived:

$$F_R = 10^{-Z_R \times S_0} \tag{1.10}$$
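A short numerical sketch of Eqs. (1.7), (1.9), and (1.10) is given here. It assumes that $p_f$ and $p_t$ are uncorrelated (as the form of Eq. (1.7) implies) and that $Z_R$ follows the AASHTO sign convention in which it is negative for reliabilities above 50%, so that $F_R > 1$. The function names and input values are hypothetical.

```python
from statistics import NormalDist


def serviceability_reliability(mu_pf, sd_pf, mu_pt, sd_pt):
    """Eq. (1.7): R = Phi(z0) for normal, uncorrelated p_f and p_t."""
    z0 = (mu_pf - mu_pt) / (sd_pf**2 + sd_pt**2) ** 0.5
    return NormalDist().cdf(z0)


def reliability_design_factor(reliability, s_w2, s_n2):
    """Eqs. (1.9)-(1.10): F_R = 10 ** (-Z_R * S_0).

    `reliability` is a fraction (0.90 for 90 %). Z_R is taken as the
    standard normal deviate with P(Z <= Z_R) = 1 - reliability, an
    assumed convention that makes Z_R negative for reliability > 0.5.
    """
    s0 = (s_w2 + s_n2) ** 0.5
    z_r = NormalDist().inv_cdf(1.0 - reliability)
    return 10.0 ** (-z_r * s0)


# Hypothetical inputs, for illustration only.
r_psi = serviceability_reliability(mu_pf=3.2, sd_pf=0.4, mu_pt=2.5, sd_pt=0.3)
f_r = reliability_design_factor(0.90, s_w2=0.04, s_n2=0.1625)
print(f"R from Eq. (1.7) = {r_psi:.4f}")
print(f"F_R at 90 % reliability, S_0 = 0.45: {f_r:.3f}")
```

With these inputs the overall standard deviation is $S_0 = 0.45$; a higher target reliability or a larger $S_0$ both increase the design factor.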
Where, S0 is the overall standard deviation of variation and ZR is the standard normal deviate. 2 ) Noureldin and his colleagues estimated the variance of traffic prediction (SW using the first order second moment approximation approach on the AASHTO’s traffic prediction equation and the following was derived (Noureldin et al., 1996): (COV. ADT∗ Dd )2 + (COV.P)2 + (COV.Ld )2 + (COV.TF)2 = 5.3 (1.11) Where ADT∗ Dd represents average daily traffic in a heavier direction; P is the percentage of trucks in the traffic mix; Ld is the lane distribution; TF is the truck factor (number of ESALs per truck). The growth factor and the design period were assumed to be constants. Using AASHTO’s flexible pavement performance
prediction model, the variance of the pavement performance prediction S2N may be obtained as (Noureldin et al., 1994) :
2 SW
2
2
2
SN2 = COV(MR) + P2 SN .COV(SN)
(1.12)
P2 = variance component of SN To determine the COV(SN), the variance of SN was estimated in the following way : 2 ¯ 2 Var(a1 ) + a¯ 2 m ¯2 ¯ 22 Var(m2 )D Var(SN) ∼ = a¯ 21 Var(D1 ) + D 1 2 ¯ 2 Var(D2 ) + a 2 2 ¯ 2 + a¯ 2 m ¯2 ¯ 2 + Var(a3 )m ¯ 22 D ¯ 23 Var(m3 )D ¯ 23 D + Var(a2 )m 2 3 ¯ 3 Var(D3 ) + a 3 3 (1.13) Kulkarni (1994) chose traffic as a design element for evaluating the reliability of alternate pavement designs with different types of pavements. The reliability
R of a pavement design was defined as:

$$R = \Pr\left(\text{actual traffic load capacity, } N > \text{actual cumulative traffic, } n\right) \tag{1.14}$$

It was suggested that both ln N and ln n ('ln' indicates natural logarithm) would follow a normal distribution, since N and n are log-normally distributed. The safety margin (SM) of the design was defined as:

$$SM = \ln N - \ln n \tag{1.15}$$

The reliability index (β) is defined as the ratio of the mean (E) to the standard deviation (SD) of the safety margin (SM):

$$\beta = \frac{E[SM]}{SD[SM]} = \frac{E[\ln N] - E[\ln n]}{\sqrt{\mathrm{var}[\ln N] + \mathrm{var}[\ln n]}} \tag{1.16}$$
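The reliability index in Eq. (1.16) can be evaluated numerically. The following is a minimal sketch, not Kulkarni's own procedure: it assumes that N and n are independent and lognormal, that each is described by a mean and a coefficient of variation, and that reliability follows as R = Φ(β). The function name and the input values are illustrative only.

```python
import math
from statistics import NormalDist


def reliability_index(mean_N, cov_N, mean_n, cov_n):
    """Return (beta, R) from Eqs. (1.15)-(1.16) for lognormal N and n.

    Assumption: N and n are independent, so the variances of ln N and ln n add.
    """
    # Lognormal identities: var[ln X] = ln(1 + COV^2),
    # E[ln X] = ln(mean) - 0.5 * var[ln X]
    var_ln_N = math.log(1.0 + cov_N ** 2)
    var_ln_n = math.log(1.0 + cov_n ** 2)
    mean_ln_N = math.log(mean_N) - 0.5 * var_ln_N
    mean_ln_n = math.log(mean_n) - 0.5 * var_ln_n

    beta = (mean_ln_N - mean_ln_n) / math.sqrt(var_ln_N + var_ln_n)
    # The safety margin SM is normal, so R = Pr(SM > 0) = Phi(beta)
    return beta, NormalDist().cdf(beta)


if __name__ == "__main__":
    # Hypothetical inputs: capacity of 20 million ESAL, demand of 10 million ESAL
    beta, R = reliability_index(mean_N=2.0e7, cov_N=0.5, mean_n=1.0e7, cov_n=0.3)
    print(f"beta = {beta:.3f}, R = {100 * R:.1f}%")
```

When capacity and demand have the same mean and COV, the index is zero and the reliability is 50%, which is a quick sanity check on the implementation.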
A mechanistic pavement model, WESLEA, and empirical transfer functions were used to assess the effect of input variability on fatigue and rutting failure models (Timm et al., 2000). The Monte Carlo simulation technique was used to study the uncertainty through a computer program called ROADENT. The reliability computed through ROADENT was reported to be lower than that computed through the AASHTO 1993 design guide.

Kim and Buch (2003) categorized the uncertainties affecting pavement performance into the following four broad groups:
1. The difference in the basic properties of materials from one point to another and fluctuation in material and cross-sectional properties due to construction quality, termed 'spatial variability'
2. Random measurement error in determining the subgrade soil strength, traffic volume estimation and other such factors, termed 'variability due to imprecision in quantifying the parameters affecting pavement performance'
3. Assumption and idealization of a complex pavement analysis model with a simple mathematical expression, termed 'model bias (error)'
4. Lack of fit of regression models, termed 'statistical error'

The first two groups are called 'uncertainties of design parameters' and the last two groups are called 'systematic errors'. Uncertainties of design parameters cause variation within the probability distribution of the performance function, whereas systematic errors cause variation in the possible location of the probability distribution of the performance function.

Kim and Buch (2003) proposed a practical reliability method based on the load and resistance factor design (LRFD) format. A total of 13 pavement sections were designed using the AASHTO method and were also redesigned using a reliability-based design (RBD) procedure in which the AC thickness was changed so that the revised section would accommodate the design traffic and satisfy the threshold rut depth at a given target reliability. The target reliability of 90% was assigned to both
design procedures. The reliability indices for the pavement sections determined by both methods were computed using the first order reliability method (FORM). It was indicated that the RBD procedure does successfully yield cross-sections whose reliability indices are close to the target reliability index, while the AASHTO method does not generally produce designs of uniform reliability for actual mechanistic failure criterion. To incorporate reliability in pavement design, Austroads used the laboratory fatigue relationship published by Shell Petroleum (Shell, 1978), which was further modified to include a reliability factor (RF) corresponding to the desired project reliability.
$$N = RF \left[\frac{6918\,(0.856\,V_B + 1.08)}{S_{mix}^{0.36}\,\mu\varepsilon}\right]^5 \tag{1.17}$$
where
N = allowable number of repetitions of the load
με = tensile strain produced by the load (in microstrain)
$V_B$ = percent by volume of bitumen in the asphalt (%)
$S_{mix}$ = asphalt modulus (in MPa), and
RF = reliability factor for asphalt fatigue

The value of the reliability factor (RF) varies from 2.5 at a desired project reliability of 80% to 0.67 at 97.5%. The higher the desired project reliability, the lower is the value of the reliability factor (RF). Permanent deformation was not considered as a distress mode in the Austroads design model due to the non-availability of an appropriate model which could reliably predict the development of rutting with the passage of traffic/time, as mentioned in the guide (Austroads, 2012).

The NCHRP (2004) guide for the mechanistic-empirical (M-E) design of new and rehabilitated pavement structures analyzes the reliability of flexible pavement design for individual pavement distresses, such as asphaltic concrete fatigue (bottom-up) cracking, longitudinal (top-down) cracking, rutting or asphaltic concrete thermal cracking. The reliability (R) in general is defined as the probability that the particular distress of a design project is less than the critical level of distress over the life of the design.

Kim (2006) presented a practical probabilistic design format to incorporate reliability in the M-E flexible pavement design procedure. It was suggested that uncertainties due to spatial variation and imprecision in quantifying parameters should be integrated as parameter uncertainties and quantified in terms of the standard deviation ($S_p$) of pavement performance. Similarly, it was suggested that model bias and statistical error should be integrated as systematic error and quantified in terms of the standard deviation ($S_m$) of pavement performance. The overall standard deviation ($S_0$) was determined as follows:

$$S_0 = \sqrt{S_p^2 + S_m^2} \tag{1.18}$$
The study suggested the following reliability-based pavement-design equation with a target reliability, R, using a rut prediction model:

$$RD_{max} = S_0 \times \beta_{target} + RD_{predicted} \tag{1.19}$$

where $S_0$ is the overall standard deviation and $\beta_{target}$ is the target reliability index. A rut depth of 12.7 mm was considered as the limit state ($RD_{threshold}$). The difference between $RD_{threshold}$ and $RD_{max}$ was computed and compared with the specified tolerance level. It was suggested that the design should be changed until this criterion is satisfied.
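The check described by Eqs. (1.18) and (1.19) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the original study: the target reliability index is taken as the standard normal quantile of the target reliability, the acceptance rule is interpreted as the absolute difference being within the tolerance, and the tolerance value and all inputs are hypothetical.

```python
import math
from statistics import NormalDist


def rut_design_check(rd_predicted, s_p, s_m, target_reliability,
                     rd_threshold=12.7, tolerance=0.5):
    """Return (rd_max, gap, accepted) for a trial section; all depths in mm.

    Assumptions: beta_target = Phi^-1(R), and a design is accepted when
    |RD_threshold - RD_max| <= tolerance (hypothetical tolerance value).
    """
    s0 = math.hypot(s_p, s_m)                        # Eq. (1.18)
    beta_target = NormalDist().inv_cdf(target_reliability)
    rd_max = s0 * beta_target + rd_predicted         # Eq. (1.19)
    gap = rd_threshold - rd_max
    return rd_max, gap, abs(gap) <= tolerance


if __name__ == "__main__":
    # Hypothetical trial section: predicted rut depth of 8 mm
    rd_max, gap, accepted = rut_design_check(
        rd_predicted=8.0, s_p=3.0, s_m=4.0, target_reliability=0.90)
    print(f"RD_max = {rd_max:.2f} mm, gap = {gap:.2f} mm, accepted = {accepted}")
```

In an iterative design loop, a trial section that is not accepted would have its AC thickness adjusted, the rut depth re-predicted, and the check repeated.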
1.4 Issues with estimation of pavement reliability
Although the available methods for reliability analysis can be applied to pavement reliability estimation, several concerns remain. The following subsections briefly discuss the issues related to the estimation of reliability for pavements and its incorporation in the pavement design process.

1.4.1 Input Parameters Variability
A number of parameters are required for the pavement design process. Reliability analysis would ideally require all of these inputs to be known accurately and without error, but in practice all of them are uncertain and have some variability. Variability exists in pavements due to material characteristics, traffic conditions, environmental conditions, construction practices and quality control (Darter and Hudson, 1973). Variability of the input variables may be described by statistical terms (such as mean and variance) and the associated probability distribution. A useful dimensionless parameter that indicates the variability of a material's property is the ratio of the standard deviation to the mean, known as the coefficient of variation (COV). Knowledge of the COV of each design input is extremely important for accurately estimating its influence on the predicted pavement life and thereby reducing the chances of premature failure. To simplify the calculation process, the input variables are considered to be normally distributed, which is not always true. Treating non-normal random variables as normal therefore introduces an approximation error in the reliability analysis. In addition, the interrelationship and interaction between the input variables is not known and, even if known, is generally not used in the reliability analysis. For the sake of simplicity, the input variables are considered to be independent of each other, which is not true. This 'blind' approach does not yield accurate reliability predictions. A summary of the variability associated with the pavement input parameters from the available literature is given in Table 1.1.
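As a small illustration of the coefficient of variation defined above, the sketch below computes the COV of a set of layer-thickness measurements. The measurements are hypothetical values chosen for the example, and the sample standard deviation is used as the estimator.

```python
import statistics


def coefficient_of_variation(values):
    """Return the COV (sample standard deviation divided by the mean)."""
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean


if __name__ == "__main__":
    # Hypothetical core thicknesses (mm) of a bituminous surface layer
    thickness_mm = [98.0, 102.0, 95.0, 105.0, 100.0]
    cov = coefficient_of_variation(thickness_mm)
    print(f"COV = {100 * cov:.1f}%")
```

For these five values the COV is about 3.8%.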
TABLE 1.1 Summary of pavement material variability from available literature

| Parameter | Description | Range of COV (%) | Typical COV (%) | Distribution type | Reference |
|---|---|---|---|---|---|
| Layer thickness | Bituminous surface | 3 – 12 | 7 | Normal | Timm et al., 1999; Noureldin et al., 1994 |
| | | 3.2 – 18.4 | 7.2 | Normal | Aguiar-Moya and Prozzi, 2011 |
| | Bituminous binder course | 11.7 – 16.0 | 13.8 | Normal | Aguiar-Moya and Prozzi, 2011 |
| | | 5 – 15 | 10 | Normal | Noureldin et al., 1994 |
| | Granular base | 10 – 15 | 12 | Normal | Timm et al., 1999; Noureldin et al., 1994 |
| | | 6.0 – 17.2 | 10.3 | Normal | Aguiar-Moya and Prozzi, 2011 |
| | Granular sub-base | 10 – 20 | 15 | Normal | Timm et al., 1999; Noureldin et al., 1994 |
| Elastic modulus | Bituminous layer | 10 – 20 | 15 | Normal | Noureldin et al., 1994 |
| | | 10 – 40 | | Lognormal | Timm et al., 1999 |
| | Granular base | 10 – 30 | 20 | Normal | Noureldin et al., 1994 |
| | | 5 – 60 | | Lognormal | Timm et al., 2000 |
| | Granular sub-base | 10 – 30 | 20 | Normal | Noureldin et al., 1994 |
| | | 5 – 60 | | Lognormal | Timm et al., 2000 |
| | Subgrade | 10 – 30 | 20 | Normal | Noureldin et al., 1994 |
| | | 5 – 50 | | Lognormal | Timm et al., 2000 |
| Traffic | | | | Extreme Value Type I | Timm et al., 2000 |
1.4.2 Performance models
As mentioned previously, the current Indian pavement design guidelines (IRC:37, 2018), which follow a mechanistic-empirical approach, are based on two performance criteria, viz., fatigue and rutting, for a conventional three-layered pavement structure. In simpler terms, the designed pavement must last
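till the fatigue cracking of the bituminous surface or rutting in the pavement reaches its terminal value, whichever happens earlier. The general form of these fatigue and rutting models is given below:

Fatigue Model

$$N_f = k_1 \times C \times \left(\frac{1}{\varepsilon_t}\right)^{k_2} \times \left(\frac{1}{M_R}\right)^{k_3} \tag{26}$$

$$C = 10^{M} \tag{27}$$

$$M = 4.84 \times \left(\frac{V_{be}}{V_{be} + V_a} - 0.69\right) \tag{28}$$

Rutting Model

$$N_r = k_4 \times \left(\frac{1}{\varepsilon_v}\right)^{k_5} \tag{29}$$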
where
$N_f$ = fatigue life of the bituminous layer in terms of cumulative repetitions of equivalent 80 kN standard axle load
$N_r$ = subgrade rutting life in terms of cumulative repetitions of equivalent 80 kN standard axle load
$\varepsilon_t$ = maximum horizontal tensile strain at the bottom of the bituminous layer
$\varepsilon_v$ = maximum vertical compressive strain at the top of the subgrade
$M_R$ = resilient modulus of the bituminous layer (MPa)
$V_{be}$ = percent volume of effective bitumen in the mix used in the bituminous layer (varies between 10.5 and 11.5 percent)
$V_a$ = percent volume of air voids in the mix used in the bituminous layer (varies between 3.5 and 4.5 percent)
$k_i$ = regression coefficients (i = 1 to 5)

The values of these regression coefficients are given as: $k_1 = 1.6064 \times 10^{-4}$ and $0.5161 \times 10^{-4}$ for design traffic < 20 msa and ≥ 20 msa respectively; $k_2 = 3.89$; $k_3 = 0.854$; $k_4 = 4.1656 \times 10^{-8}$ and $1.41 \times 10^{-8}$ for design traffic < 20 msa and ≥ 20 msa respectively; and $k_5 = 4.5337$.

These fatigue and rutting transfer functions were developed and calibrated using the database collected through the R-6 and R-19 research studies sponsored by MORTH. Around 120 bituminous concrete (BC) and 160 bituminous macadam (BM) road sections from the R-6 and R-19 studies were considered for development of the fatigue criterion, and 86 BC road sections from the R-6 study were analyzed for the development of the rutting criterion. These pavement sections consisted of bituminous surfacing with granular bases and subbases, and they were assumed to be three-layered structures. The average annual pavement temperature (AAPT) of all the sections was around 35°C, and bitumen of 80/100 penetration grade was used for both BC and BM surfacing. The maximum repetitions of equivalent single-axle load for the road sections were 50 msa only. The thickness of the BC layer was 40 mm on most of the sections and BM was used as the bituminous binder course, just below the
BC layer. In the collected performance data, the scatter of data points was quite large. This was attributed to the fact that the test pavements were located in different parts of the country, had been constructed in different climatic conditions, and probably did not have identical quality control during construction, resulting in wide variations. Because of the variabilities involved, wide scatter was considered to be quite in order. However, the present situation is totally different from what it was when these transfer functions were developed and calibrated. The traffic on the pavements has increased tremendously, both in terms of loading and number of repetitions. The specifications of bituminous materials have changed. Flexible pavements with thick bituminous layers are quite common these days. So, the extent to which the originally developed fatigue and rutting transfer functions remain valid is a matter of concern and debate. There is a need for recalibration of the fatigue and rutting transfer functions. Any changes to these transfer functions due to this recalibration may affect the overall pavement thickness and ultimately the reliability of the pavement. Different researchers have proposed different coefficients for the fatigue and rutting equations. The rutting and fatigue transfer functions are developed through field calibration, by fitting the field data with the ordinary least-square estimation (OLSE) technique. However, applying OLSE to these data violates a basic assumption of OLSE, because the field data contain measurement errors. A better statistical method, functional linear-measurement error (FLME), has been presented (Shukla and Das, 2008), which may be used for development of fatigue and rutting equations.
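The general forms of the fatigue and rutting models in Eqs. (26) to (29) can be sketched in code as follows. This is only an illustration of the functional forms: the regression coefficients are passed in as parameters (with the exponents defaulting to the values quoted in this section), and the strains, modulus, and mix volumetrics in the usage example are hypothetical inputs, not values from any calibrated design.

```python
def mix_factor(v_be, v_a):
    """Return C = 10**M with M = 4.84 * (Vbe / (Vbe + Va) - 0.69), Eqs. (27)-(28)."""
    m = 4.84 * (v_be / (v_be + v_a) - 0.69)
    return 10.0 ** m


def fatigue_life(eps_t, m_r, v_be, v_a, k1, k2=3.89, k3=0.854):
    """Return Nf, Eq. (26): eps_t is a strain (not microstrain), m_r is in MPa."""
    return k1 * mix_factor(v_be, v_a) * (1.0 / eps_t) ** k2 * (1.0 / m_r) ** k3


def rutting_life(eps_v, k4, k5=4.5337):
    """Return Nr, Eq. (29): eps_v is the vertical subgrade strain."""
    return k4 * (1.0 / eps_v) ** k5


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas
    n_f = fatigue_life(eps_t=200e-6, m_r=3000.0, v_be=11.5, v_a=4.5, k1=0.5161e-4)
    n_r = rutting_life(eps_v=400e-6, k4=1.41e-8)
    print(f"Nf = {n_f:.3e} repetitions, Nr = {n_r:.3e} repetitions")
    print("Governing life:", "fatigue" if n_f < n_r else "rutting")
```

Because both models are power laws in strain, small changes in the computed strains or in the recalibrated coefficients produce large changes in predicted life, which is why the recalibration issue discussed above feeds directly into the reliability estimate.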
1.4.3 Interaction between the failure modes
A pavement has many failure modes, and the interrelationship between these failure modes is quite complex, so it is difficult to capture this complexity in the reliability model. Often, this complexity is simplified, and for this reason the analysis is not very accurate. For pavements, the two primary modes of failure are fatigue cracking and rutting (IRC:37, 2018). Generally, these two are considered as independent and in series. In simple words, this indicates that the pavement may fail by either of the failure modes, viz. fatigue failure or rutting failure, and that the two failure modes occur totally independently of each other. Only a few studies have been done to establish the possible correlation between the two failure modes (Dilip et al. 2013; Gogoi et al. 2013; Liu and Xu 2014; Peddinti et al. 2020). These studies have indicated that there is a possible correlation between the two failure modes. This needs to be taken into consideration during the reliability analysis. However, the problem lies in the fact that the amount of correlation between the two failure modes is not precisely known, as it depends on many factors and is still a matter of research.
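The effect of the independence assumption can be illustrated with a short sketch. It treats fatigue and rutting as two failure modes in series and returns the system reliability under independence together with the value for fully (positively) correlated modes; for positively correlated modes the true value lies between these two simple bounds. The function name and the input reliabilities are hypothetical, and the sketch is not taken from the cited studies.

```python
def series_reliability_bounds(r_fatigue, r_rutting):
    """Return (independent, fully_correlated) reliabilities of a two-mode series system.

    independent:       R = R_fatigue * R_rutting (lower bound for positive correlation)
    fully_correlated:  R = min(R_fatigue, R_rutting) (upper bound)
    """
    independent = r_fatigue * r_rutting
    fully_correlated = min(r_fatigue, r_rutting)
    return independent, fully_correlated


if __name__ == "__main__":
    low, high = series_reliability_bounds(0.90, 0.95)
    print(f"System reliability between {100 * low:.1f}% and {100 * high:.1f}%")
```

The gap between the two values shows how much the unknown correlation between fatigue and rutting can matter for the computed system reliability.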
1.4.4 Material strength degradation
A typical pavement structure consists of one or two layers of bituminous materials and two or more layers of unbound granular materials. Due to the action of traffic and environmental conditions, a pavement deteriorates over a period of time and its reliability decreases with time or load history. This decrease in reliability may be a result of the degradation in strength of the constituent pavement materials. However, the mechanical behavior of the two materials, i.e. bituminous and granular, is quite different. Bituminous materials are subjected to traffic-induced fatigue cracking. The granular layer's resistance to traffic loading is commonly measured in terms of permanent deformation at the base, which also leads to permanent deformation on the surface. To capture the effect of material strength degradation, a time-dependent reliability analysis needs to be carried out for both the fatigue and rutting failure modes.
1.5 Conclusions
The available methods for reliability analysis can be applied for the reliability analysis of asphalt pavements. However, the issues indicated in the previous section need to be taken care of during the application of such reliability analysis for asphalt pavements.

Acknowledgements
The author would like to thank Prof. (Dr.) Satish Chandra, Director, CSIR-Central Road Research Institute (CSIR-CRRI), New Delhi for his kind permission to publish this paper.

Disclosure statement
No potential conflict of interest was reported by the authors.
References
AASHTO, 1993. AASHTO Guide for Design of Pavement Structures. American Association of State Highway and Transportation Officials, Washington, D.C.
Aguiar-Moya, J.P., Prozzi, J., 2011. Development of reliable pavement models. Texas Transportation Institute, Texas A&M University System, College Station, Texas, USA.
Alsherri, A., George, K.P., 1988. Reliability model for pavement performance. J. Trans. Eng. 114 (2), 294–306.
Austroads, 2012. Guide to Pavement Technology Part 2: Pavement Structural Design. AGPT 02-12. Austroads Ltd., Sydney, Australia.
Cronvall, O., 2011. Structural Lifetime, Reliability and Risk Analysis Approaches for Power Plant Components and Systems. In: VTT Publications, 775. VTT Technical Research Centre of Finland, Vuorimiehentie.
Darter, M.I., Hudson, W.R., 1973. Probabilistic Design Concepts Applied to Flexible Pavement System Design. Centre for Highway Research, University of Texas at Austin, Texas, USA.
Dilip, D.M., Ravi, P., Babu, G.L.S., 2013. System reliability analysis of flexible pavements. J. Transport. Eng. 139 (10), 1001–1009.
Gogoi, R., Das, A., Chakroborty, P., 2013. Are fatigue and rutting distress modes related? Int. J. Pavement Res. Technol. 6 (4), 269–273.
Hudson, W.R., 1975. State of the Art in predicting pavement reliability from input variability. Department of Transportation, Federal Aviation Administration, Washington, D.C.
IRC:37, 2018. Guidelines for the design of flexible pavements. Indian Roads Congress, New Delhi, India.
Kim, H.B., Buch, N., 2003. Reliability based pavement design model accounting for inherent variation of design parameters. 82nd Transport. Res. Board Annu. Meet., Washington, D.C.
Kulkarni, R.B., 1994. Rational approach in applying reliability theory to pavement structural design. Transportation Research Record 1449, Transportation Research Board, National Research Council, Washington D.C., USA, pp. 13–17.
Lemer, A.C., Moavenzadeh, F., 1971. Reliability of highway pavements. Highway Res. Rec. 36, 1–8.
Liu, H., Xu, X., 2014. Reliability analysis of asphalt pavement considering two failure modes. In: Mohammadian, K., Goulias, K.G., Cicek, E., Jieh-Jiuh Wang, Maraveas, C. (Eds.), Proceedings of 3rd International Conference on Civil Engineering and Urban Planning III. CRC Press, Balkema, The Netherlands, pp. 291–295.
Madsen, H.O., Krenk, S., Lind, N.C., 1986. Methods of Structural Safety. Prentice-Hall Inc., Englewood Cliffs, New Jersey, USA.
Modarres, M., Kaminskiy, M., Krivtsov, V., 1999. Reliability Engineering and Risk Analysis - A Practical Guide. Marcel Dekker Inc., New York, USA.
NCHRP, 2004. Guide for mechanistic-empirical design of new and rehabilitated pavement structures. NCHRP Research Report 1-37A, Transportation Research Board, National Research Council, Washington, D.C.
Noureldin, A.S., Sharaf, E., Arafah, A., Al-Sugair, F., 1996. Rational Selection of Factors of Safety in Reliability-Based Design of Flexible Pavements in Saudi Arabia. Transport. Res. Rec.: J. Transport. Res. Board, SAGE Publications 1540 (1), 39–47.
Noureldin, A.S., Sharaf, E., Arafah, A., Faisal, A.-S., 1994. Estimation of Standard Deviation of Predicted Performance of Flexible Pavements Using AASHTO Model. Transport. Res. Rec. 1449, 46–56.
Peddinti, P.R.T., Munwar Basha, B., Saride, S., 2020. System Reliability Framework for Design of Flexible Pavements. J. Transport. Eng. Part B: Pavements 146 (3). American Society of Civil Engineers (ASCE).
Shell, 1978. Shell Pavement Design Manual - Asphalt Pavements and Overlays for Road Traffic. Shell International Petroleum Limited, London, U.K.
Shukla, P.K., Das, A., 2008. A re-visit to the development of fatigue and rutting equations used for asphalt pavement design. Int. J. Pavement Eng. 9 (5), 355–364.
Timm, D.H., Birgisson, B., Newcomb, D.E., Galambos, T.V., 1999. Incorporation of reliability into the Minnesota mechanistic-empirical pavement design method. Final Report, Minnesota Department of Transportation, Minnesota.
Timm, D., Newcomb, D., Galambos, T., 2000. Incorporation of reliability into mechanistic-empirical pavement design. Transport. Res. Rec. 1730 (1), 73–80.
Abstract
The variability of the input parameters is not considered during the pavement design process. As the input parameters are uncertain, this needs to be accounted for during the design process through the reliability analysis approach. However, the application of reliability concepts to pavements is not very straightforward and many aspects need to be looked into. The present paper attempts to summarize the concepts and applications of reliability for asphalt pavement design.

Keywords
Asphalt; Material strength degradation; Mechanical behavior; Pavement design; Reliability; Flexible Pavement
Chapter 2
Markov modeling of multi-state systems with simultaneous component failures/repairs, using an extended concept of component importance
Jacek Malinowski
Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warszawa, Poland
2.1 Introduction
The concept of component importance is well known among reliability engineers and is mostly used in the context of two-state systems with two-state components. There exist several different importance measures, e.g. Birnbaum, Fussell-Vesely, Risk Achievement Worth, Risk Reduction Worth, etc. (see Hoyland and Rausand [2009] for definitions and explanations). A comprehensive synopsis of the topic is given in the recent survey [Kalpesh and Kirtee, 2017]. Several types of component importance are also discussed in the monograph [Kuo and Zhu, 2012]. Finding the importance values may not be a simple task for complex systems, but the effort can be worthwhile. For example, high values can indicate critical locations in the system structure, where highly reliable components should be placed in order to reduce or minimize the risk of system failure. Also, as shown in this chapter, they can be essential in computing useful reliability characteristics such as interstate transition intensities for multistate systems with repairable components. In this chapter, importance is attributed to a group of components rather than to a single component alone. It is defined as the probability that simultaneous failure or repair of all components in a set results in a system transition from state a to state b, provided that the components are in "up" or "down" state and a>b or a<b [...] x>y and $\lambda_{D(x,y)} > 0$, or x<y and $\mu_{D(x,y)} > 0$; a→b if a≠b and there exist x,y∈$\{0,1\}^n$ such that x→y, φ(x)=a and φ(y)=b.

The last four definitions require some clarification. The partial order introduced in S makes it possible to compare different levels of the system's operating ability. The order-preservation property is an extension of the monotonicity property of a binary structure function defined in [Barlow and Proschan, 1975]. This property says that failure/repair of a component cannot improve/deteriorate the system state, which is intuitively obvious, but has to be expressed mathematically.
is an order-preserving function if x 0 respectively, which explains the definition of the “→” relation in {0,1}n . The definition of “→” in S is a natural consequence of the previous one. According to it, a direct transition from a to b in S is realized by each direct transition from x to y in {0,1}n , such that (x) = a, (y) = b. Also, if no x and y exist such that x→y, (x) = a, (y) = b, then no direct transition from a to b can take place. Further, we assume that λ and μ defined by (2) and (3) are given data obtained by statistical estimation or experts’ elicitation. We now continue with the remaining notation used in the paper: Zt : system state at time t, i.e. Zt = (Xt ) a→b (t) – transition intensity with which the system changes its state from a to b, defined as follows: 1 Pr [Zt+t = b | Zt = a]; a, bS t→0 t
a→b (t ) = lim
(2.4)
(x, 1 ) – binary vector obtained from x, such that xi =1, i∈ (x, 0 ) – binary vector obtained from x, such that xi =0, i∈ a→b crit () – set of binary vectors x such that xi =1 for i∈, (x)=a and (x,0 )=b, where a>b; vectors in a→b crit () will be called critical to a direct transition from a to b, caused by simultaneous failure of all components in a→b crit () – set of binary vectors x such that xi = 0 for i∈, (x) = a and (x,1 ) = b, where ab, defined as follows: Ia→b (, t ) = Pr Xt ∈ crit (2.5) a→b () | Xi (t ) = 1, i ∈ i.e. Ia→b (,t) is the (conditional) probability that simultaneous failure of all components in causes a transition from a to b, given that these components are operable Ia→b (,t) – importance of to a transition from a to b, given that a < b, defined as follows: Ia→b (, t ) = Pr Xt ∈ crit (2.6) a→b () | Xi (t ) = 0, i ∈
FIG. 2.1 The RBD of a mission-critical power supply system (components 1, 2, and 3).
i.e. this importance is the (conditional) probability that simultaneous repair of all components in the considered set causes a transition from a to b, given that these components have failed. If φ is a binary function (i.e. S={0,1}) and the considered set consists of a single component i, then the quantities defined by (2.5) and (2.6) reduce to the Birnbaum importance of component i.

For better understanding of the introduced concepts, let us consider a power supply system for a critical device. It consists of mains (1), a standby generator (2), and a UPS (3) that, in case of mains outage, supplies power from batteries for the time necessary to start the generator. The system's RBD is shown in Fig. 2.1. Let S={0,1,2}, i.e. the system is in state 0 if power is not supplied to the device, in state 1 if supplied from UPS batteries or standby generator, and in state 2 if supplied from mains. The RBD in Fig. 2.1 implies that φ(0,0,0) = φ(0,0,1) = φ(0,1,0) = 0, φ(0,1,1) = 1, φ(1,0,0) = φ(1,0,1) = φ(1,1,0) = φ(1,1,1) = 2. If we assume that the system is fully/partly operational in state 2/1, then the inequalities 0 < 1 < 2 define the ensuing partial order in S (note that this order is linear, because each two elements of S are comparable). It is easily checked that φ is an order-preserving function from $\{0,1\}^3$ to S. We also assume for simplicity that the system components fail independently (no simultaneous failures) and are nonrepairable. The transition diagrams for Xt and Zt are shown in Figs. 2.2 and 2.3. An arrow between two states indicates that they are in the transition relation. Analyzing these figures, we conclude that (1,1,1)→(0,1,1) yields 2→1; (1,0,0)→(0,0,0), (1,0,1)→(0,0,1), or (1,1,0)→(0,1,0) yields 2→0; and (0,1,1)→(0,0,1) or (0,1,1)→(0,1,0) yields 1→0. Note that it is possible to define a non-order-preserving function from $\{0,1\}^n$ to S, but it may be difficult to interpret physically. For example, if n and S are as in the above example, then φ defined below is such a function:
φ(0,0,0) = φ(0,0,1) = φ(0,1,0) = 0, φ(0,1,1) = φ(1,0,1) = 1, φ(1,0,0) = φ(1,1,0) = φ(1,1,1) = 2. Indeed, (1,0,0) < (1,0,1), but φ(1,0,0) > φ(1,0,1), hence φ is not order-preserving.
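The order-preservation property of the two structure functions in this example can be verified by brute force. The sketch below is an illustration only: it stores each structure function as a dictionary from component-state vectors to system states (the names `PHI_RBD` and `PHI_ALT` are hypothetical) and checks that x ≤ y componentwise never gives a higher system state for x than for y.

```python
from itertools import product


def is_order_preserving(phi, n):
    """Return True if x <= y (componentwise) always implies phi[x] <= phi[y]."""
    vectors = list(product((0, 1), repeat=n))
    for x in vectors:
        for y in vectors:
            x_below_y = all(xi <= yi for xi, yi in zip(x, y))
            if x_below_y and phi[x] > phi[y]:
                return False
    return True


# Structure function implied by the RBD of the power supply example
PHI_RBD = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 2, (1, 0, 1): 2, (1, 1, 0): 2, (1, 1, 1): 2,
}

# Alternative function from the text, which maps (1,0,1) to state 1
PHI_ALT = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 1): 1, (1, 0, 0): 2, (1, 1, 0): 2, (1, 1, 1): 2,
}

if __name__ == "__main__":
    print(is_order_preserving(PHI_RBD, 3))  # True
    print(is_order_preserving(PHI_ALT, 3))  # False
```

The second function fails the check on the pair (1,0,0) and (1,0,1), which is the counterexample given in the text.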
FIG. 2.2 Transition diagram of Xt. States shown: (1,1,1); (0,1,1), (1,0,1), (1,1,0); (0,0,1), (0,1,0), (1,0,0); (0,0,0).

FIG. 2.3 Transition diagram of Zt = φ(Xt). States shown: φ=2 for (1,0,0), (1,0,1), (1,1,0), (1,1,1); φ=1 for (0,1,1); φ=0 for (0,0,0), (0,0,1), (0,1,0).
2.3 Theoretical background We begin this section with an auxiliary lemma that will be used to prove the main theorem. Lemma 1 Let be an order-preserving function from {0,1}n to S, a,b∈S, and a→b. Then either a>b or ay or xb or ab or a b λ Ia→b Pr [Zt = a] ⊆{1,...,n}
(2.7)
a→b (t ) =
1 (, t ), a < b μ Ia→b Pr [Zt = a] ⊆{1,...,n}
(2.8)
where λ in (7) or in (8) is the rate of simultaneous failure or repair of all components in , and (, t) = Ia→b (, t) · Pr[Xk (t ) = 1, k] = Pr Xt ∈ crit Ia→b a→b () , a > b (2.9) crit Ia→b (, t) = Ia→b (, t) · Pr[Xk (t ) = 0, k] = Pr Xt ∈ b→a () , a < b (2.10) It is more convenient to use I a→b () instead of Ia→b () in (7) and (8), because I a→b () is usually easier to compute than Ia→b (). It should be noted that the above formulas are generalizations of similar ones for a two-state system with independent components, to be found in [Korczak, 2007]. Proof: We only consider the cases a>b and ab or ab in order to prove (7). The law of total probability yields: Pr [Zt+t = b, Zt = a] = Pr [Xt+t = y, Xt = x] = x,y: (x)=a, (y)=b = Pr [Xt+t = y, Xt = x]+ x,y: x>y, (x)=a, (y)=b Pr [Xt+t = y, Xt = x] +
(2.11)
x,y: x>y, (x)=a, (y)=b
If x≯y and (x)=a>b=(y) then a transition from a to b cannot be realized as a single event in Xt . Indeed, each such event occurs as a simultaneous failure or repair of one or multiple components, i.e. as a direct transition from x to y in {0,1}n such that x>y or xy contradicts x≯y, and xb. Hence, if x≯y and (x)=a>b=(y) then a transition from a to b can only be realized as multiple direct transitions in {0,1}n . Since Xt is a Markov chain, the probability that such transitions occur in an infinitesimal time interval, divided by this interval’s length, is equal to zero. In consequence we have: 1 t→0 t lim
x,y: x>y, (x)=a, (y)=b
Pr [Xt+t = y | Xt = x] = 0
(2.12)
i.e. the second sum on the right hand side of (11) is equal to zero. Expanding the first sum we get: Pr [Xt+t = y, Xt = x] x,y: x>y, (x)=a, (y)=b = Pr [Xt+t = y, Xt = x] x>y, D(x,y)=, (x)=a, (y)=b ⊆{1,...,n}, =∅ x,y: = Pr [Xt+t = (x, 0 ), Xt = x] ⊆{1,...,n}, =∅ x∈crit () a→b = Pr [Xt+t = (x, 0 ) | Xt = x] Pr [Xt = x] ⊆{1,...,n}, =∅ x∈crit a→b ()
(2.13) Let us note that for x∈a→b crit () it holds that Pr [Xt+t = (x, 0 ) | Xt = x] = Pr [Xk (t + t ) = 0, k ∈ | Xk (t ) = 1, k ∈ ]
(2.14)
According to the definition of λ , the right hand side of (14), divided by t, converges to λ as t→0. Thus, for x∈a→b crit () we have: 1 Pr [Xt+t = (x, 0 ) | Xt = x] = λ t→0 t In consequence, from (1), (11)-(13), and (15) we obtain: lim
(15)
a→b (t ) = lim = = =
1 Pr [Zt+t = b | Zt = a] t→0 t 1 lim 1 Pr [Zt+t = b, Zt = a] Pr[Zt =a] t→0 t 1 λ Pr [Xt = x] Pr[Zt =a] ⊆{1,...,n}, =∅ x∈crit () a→b 1 λ Pr Xt ∈ crit a→b () Pr[Zt =a] ⊆{1,...,n}, =∅
(2.16)
This completes the proof of (7). The proof of (8) is analogous. Important remark: the formulas (7) and (8) are only valid if λ and μ do not depend on the states of components outside . Otherwise, according to the third equality in (16), the expression under the sum in (7) or (8) must be changed to either of the following ones: λ (x) Pr [Xt = x], a > b (2.17) crit () x∈ a→b μ (x) Pr [Xt = x], a < b (2.18) x∈crit a→b ()
where λ (x) and μ (x) depend on xi , i∈. However, (7) or (8) still holds if λ (x) or μ (x) are equal for x∈crit a→b () or x∈crit a→b (), because λ or μ can then be written without the variable x. For better explanation, this issue will also be addressed in the next two sections. An important conclusion can be drawn from Theorem 1. If is an orderpreserving function from {0,1} to S, then Zt is a (nonhomogenous) Markov process with the transition intensities given by (7) and (8). Indeed, Pr(Zt = a)
FIG. 2.4 Transition diagram of Xt. States shown: 111, 011, 101, 001, 110, 010, 100, 000; transition rates shown: λB(1−πS), λBπS, λB, λE, μB, μE, μS.
and Ia→b (,t) are functions of the state probabilities Pr(Xt =x), x∈{0,1}n , which, as solutions of the Kolmogorov equations, only depend on t and constant failure/repair rates λ ,/μ , ⊆{1,…,n}. In view of (7) and (8), the same property holds for transition intensities a→b (t), i.e. they do not depend on the history of Zt =(Xt ) before time t. Moreover, a→b (t) converge to constant values as t→∞, because the probabilities Pr(Xt =x) also do. Zt is thus asymptotically homogenous.
2.4 The illustrative model of an example system

For further considerations we will use the model of a three-state power supply system composed of basic (B) and emergency (E) power sources, and the switch (S) that automatically activates E when B fails and E is in operable condition. We assume that the emergency source is in cold standby while the basic source is in operation, that each source can only fail during operation, and that the switch can only fail, with probability πS, when activating the emergency source after a failure of the basic source. Thus, the components are mutually dependent, because a failure of S, if it occurs, follows that of B, and a failure of E can only occur if B is out of operation. We also assume that no simultaneous repairs are possible, and the order of repair priorities is B, E, and S. Let λB denote the failure rate of B; then λB·(1−πS) is the rate of failure of B alone, provided that S is operable, and λB·πS is the rate of simultaneous failure of B and S. The failure rate of E and the repair rates are denoted with λ and μ with the respective indices, i.e. λE, μB, μE, μS. As stated in Section 2, all the failure and repair rates are assumed to be given data.

In Fig. 2.4 we can see the transition diagram of X_t = [X1(t), X2(t), X3(t)], where X1(t), X2(t), X3(t) are respectively the states of B, E, S at time t. The adopted assumptions yield that λ{1} depends on the state of S, i.e. λ{1}(1,1,1) = λB·(1−πS) and λ{1}(1,1,0) = λ{1}(1,0,1) = λ{1}(1,0,0) = λB. Since a simultaneous failure of B and S can only happen when E is in operable condition, λ{1,3} depends on the state of E, i.e. λ{1,3}(1,1,1) = λB·πS and λ{1,3}(1,0,1) = 0. A failure of E is only possible when B is out of operation, hence λ{2}(0,1,1) = λ{2}(0,1,0) = λE, but λ{2}(1,1,1) = λ{2}(1,1,0) = 0. In turn, the repair policy yields that μ{1} = μB does not depend on the states of E and S, μ{2}(1,0,1) = μ{2}(1,0,0) = μE, μ{2}(0,0,1) = μ{2}(0,0,0) = 0, μ{3}(1,1,0) = μS, and μ{3}(1,0,0) = μ{3}(0,1,0) = μ{3}(0,0,0) = 0.

The transition diagram of X_t is helpful in obtaining the Kolmogorov equations from which the state probabilities for X_t can be computed. As shown further, these probabilities are used to find the state probabilities Pr(Z_t = a), a∈S, and the transition intensities Λ_{a→b}(t) for the process Z_t, which, in turn, are necessary to compute useful reliability characteristics of the considered system. Analyzing Fig. 2.3, we obtain the following Kolmogorov equations for X_t:

\[
\begin{aligned}
dP_{1,1,1}(t)/dt &= P_{0,1,1}(t)\mu_B + P_{1,0,1}(t)\mu_E + P_{1,1,0}(t)\mu_S - P_{1,1,1}(t)[\lambda_B(1-\pi_S) + \lambda_B\pi_S] \\
dP_{1,1,0}(t)/dt &= P_{0,1,0}(t)\mu_B + P_{1,0,0}(t)\mu_E - P_{1,1,0}(t)(\lambda_B + \mu_S) \\
dP_{0,1,1}(t)/dt &= P_{1,1,1}(t)\lambda_B(1-\pi_S) - P_{0,1,1}(t)(\lambda_E + \mu_B) \\
dP_{1,0,1}(t)/dt &= P_{0,0,1}(t)\mu_B - P_{1,0,1}(t)(\lambda_B + \mu_E) \\
dP_{0,1,0}(t)/dt &= P_{1,1,1}(t)\lambda_B\pi_S + P_{1,1,0}(t)\lambda_B - P_{0,1,0}(t)(\lambda_E + \mu_B) \\
dP_{1,0,0}(t)/dt &= P_{0,0,0}(t)\mu_B - P_{1,0,0}(t)(\lambda_B + \mu_E) \\
dP_{0,0,1}(t)/dt &= P_{0,1,1}(t)\lambda_E + P_{1,0,1}(t)\lambda_B - P_{0,0,1}(t)\mu_B \\
dP_{0,0,0}(t)/dt &= P_{0,1,0}(t)\lambda_E + P_{1,0,0}(t)\lambda_B - P_{0,0,0}(t)\mu_B
\end{aligned}
\tag{2.19}
\]
For simplicity, we will compute only the asymptotic values of the system parameters, i.e. the values of P_x(t), I_{a→b}(t) and Λ_{a→b}(t) for t→∞. This is sufficient for most practical purposes. Thus, from now on these parameters will be written without the variable t. Equating dP_x(t)/dt to 0 in (19) we obtain the following equations for the steady state probabilities P_x = lim_{t→∞} P_x(t), x∈{0,1}^n:

\[
\begin{aligned}
P_{0,1,1}\mu_B + P_{1,0,1}\mu_E + P_{1,1,0}\mu_S &= P_{1,1,1}\lambda_B \\
P_{0,1,0}\mu_B + P_{1,0,0}\mu_E &= P_{1,1,0}(\lambda_B + \mu_S) \\
P_{1,1,1}\lambda_B(1-\pi_S) &= P_{0,1,1}(\lambda_E + \mu_B) \\
P_{0,0,1}\mu_B &= P_{1,0,1}(\lambda_B + \mu_E) \\
P_{1,1,1}\lambda_B\pi_S + P_{1,1,0}\lambda_B &= P_{0,1,0}(\lambda_E + \mu_B) \\
P_{0,0,0}\mu_B &= P_{1,0,0}(\lambda_B + \mu_E) \\
P_{0,1,1}\lambda_E + P_{1,0,1}\lambda_B &= P_{0,0,1}\mu_B \\
P_{0,1,0}\lambda_E + P_{1,0,0}\lambda_B &= P_{0,0,0}\mu_B \\
P_{1,1,1} + P_{1,1,0} + P_{0,1,1} + P_{1,0,1} + P_{0,1,0} + P_{1,0,0} + P_{0,0,1} + P_{0,0,0} &= 1
\end{aligned}
\tag{2.20}
\]
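The last equation expresses the obvious fact that the total probability of all possible outcomes is equal to 1. However, we cannot solve Eq. (2.20) without it, because the first eight equations are not linearly independent. The solution of Eq. (2.20) is given below.

\[
\begin{aligned}
P_{1,1,1} &= \frac{\mu_B \mu_E \mu_S (\lambda_E + \mu_B)}{(\lambda_B\pi_S + \mu_S)(\lambda_B + \mu_B)(\lambda_B\lambda_E + \lambda_E\mu_E + \mu_B\mu_E)} \\
P_{1,1,0} &= P_{1,1,1}\,\frac{\lambda_B\pi_S}{\mu_S} \\
P_{1,0,1} &= P_{1,1,1}\,\frac{\lambda_B\lambda_E(1-\pi_S)}{\mu_E(\lambda_E + \mu_B)} \\
P_{1,0,0} &= P_{1,1,0}\,\frac{\lambda_E(\lambda_B + \mu_S)}{\mu_E(\lambda_E + \mu_B)} \\
P_{0,1,1} &= P_{1,0,1}\,\frac{\mu_E}{\lambda_E} \\
P_{0,1,0} &= P_{1,1,0}\,\frac{\lambda_B + \mu_S}{\lambda_E + \mu_B} \\
P_{0,0,1} &= P_{1,0,1}\,\frac{\lambda_B + \mu_E}{\mu_B} \\
P_{0,0,0} &= P_{1,0,0}\,\frac{\lambda_B + \mu_E}{\mu_B}
\end{aligned}
\tag{2.21}
\]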
Let us note that P_{1,1,1} is computed first, and each subsequent state probability of X_t is obtained from a previously computed one. This approach avoids unnecessarily complicated formulas. Clearly, closed formulas for P_x can only be derived for a small system; for a more complex one, a numerical method of solving linear equations would have to be used.

An average user (electric power consumer) usually pays no attention to the details of the system's operation. For the user it is essential that power be supplied in sufficient quantity. Thus, a user perceives the considered system as a three-state one with the state space {B,E,F}, where power is sufficiently supplied from the mains in state B, supplied from the emergency source in insufficient quantity in state E, and not supplied in state F. Let Z_t = Φ(X_t) be the power supply process as perceived by the user, i.e.

\[
\begin{aligned}
\Phi(1,1,1) = \Phi(1,1,0) = \Phi(1,0,1) = \Phi(1,0,0) &= B, \\
\Phi(0,1,1) = \Phi(0,1,0) &= E, \\
\Phi(0,0,1) = \Phi(0,0,0) &= F.
\end{aligned}
\tag{2.22}
\]
As is usually the case, S has far fewer elements than {0,1}^n, thus the states of Z_t = Φ(X_t) are obtained by merging the states of X_t. The transition diagram of Z_t is shown in Fig. 2.5.

FIG. 2.5 Transition diagram of Z_t = Φ(X_t). (Nodes B, E, F; edge labels ΛB→E, ΛE→B, ΛB→0, Λ0→B, ΛE→0.)
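The closed-form solution (2.21) and the merging rule (2.22) can be checked numerically. The following is a minimal sketch, not part of the original chapter: the function names and the numerical rates are hypothetical and chosen only for illustration. It evaluates (2.21), substitutes the result back into the first eight balance equations of (2.20), and merges the eight component-level states into the user-level states B, E and F.

```python
def steady_state(lam_b, lam_e, mu_b, mu_e, mu_s, pi_s):
    """Closed-form steady-state probabilities from Eq. (2.21).

    Keys are state vectors (B, E, S) with 1 = operable, 0 = failed.
    """
    p = {}
    p[1, 1, 1] = (mu_b * mu_e * mu_s * (lam_e + mu_b)) / (
        (lam_b * pi_s + mu_s) * (lam_b + mu_b)
        * (lam_b * lam_e + lam_e * mu_e + mu_b * mu_e))
    p[1, 1, 0] = p[1, 1, 1] * lam_b * pi_s / mu_s
    p[1, 0, 1] = p[1, 1, 1] * lam_b * lam_e * (1 - pi_s) / (mu_e * (lam_e + mu_b))
    p[1, 0, 0] = p[1, 1, 0] * lam_e * (lam_b + mu_s) / (mu_e * (lam_e + mu_b))
    p[0, 1, 1] = p[1, 0, 1] * mu_e / lam_e
    p[0, 1, 0] = p[1, 1, 0] * (lam_b + mu_s) / (lam_e + mu_b)
    p[0, 0, 1] = p[1, 0, 1] * (lam_b + mu_e) / mu_b
    p[0, 0, 0] = p[1, 0, 0] * (lam_b + mu_e) / mu_b
    return p


def balance_residuals(p, lam_b, lam_e, mu_b, mu_e, mu_s, pi_s):
    """Left side minus right side of the first eight equations of (2.20)."""
    return [
        p[0, 1, 1] * mu_b + p[1, 0, 1] * mu_e + p[1, 1, 0] * mu_s - p[1, 1, 1] * lam_b,
        p[0, 1, 0] * mu_b + p[1, 0, 0] * mu_e - p[1, 1, 0] * (lam_b + mu_s),
        p[1, 1, 1] * lam_b * (1 - pi_s) - p[0, 1, 1] * (lam_e + mu_b),
        p[0, 0, 1] * mu_b - p[1, 0, 1] * (lam_b + mu_e),
        p[1, 1, 1] * lam_b * pi_s + p[1, 1, 0] * lam_b - p[0, 1, 0] * (lam_e + mu_b),
        p[0, 0, 0] * mu_b - p[1, 0, 0] * (lam_b + mu_e),
        p[0, 1, 1] * lam_e + p[1, 0, 1] * lam_b - p[0, 0, 1] * mu_b,
        p[0, 1, 0] * lam_e + p[1, 0, 0] * lam_b - p[0, 0, 0] * mu_b,
    ]


def lump(p):
    """Merge the states of X_t into user-level states B, E, F as in Eq. (2.22)."""
    merged = {"B": 0.0, "E": 0.0, "F": 0.0}
    for (b, e, _s), prob in p.items():
        merged["B" if b else ("E" if e else "F")] += prob
    return merged


# Hypothetical rates (per hour), for illustration only.
RATES = dict(lam_b=0.01, lam_e=0.02, mu_b=0.5, mu_e=0.25, mu_s=1.0, pi_s=0.05)
P = steady_state(**RATES)
RESIDUALS = balance_residuals(P, **RATES)
MERGED = lump(P)
```

With these inputs, the residuals should vanish up to rounding error and the probabilities should sum to 1, which is exactly what the normalization equation in (2.20) requires.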
2.5 Intensities of transitions between the system states

For computing the transition intensities of Z_t we will use the extended Birnbaum importances defined by (5) and (6) in Section 2.2. First, we will calculate I_{a→b}(Γ) for each a,b∈S and Γ⊆{1,…,n} such that I_{a→b}(Γ) > 0. For this purpose we have to find the respective sets of critical vectors crit_{a→b}(Γ). Analyzing Figures 2.3 and 2.4 we conclude that

\[
\begin{aligned}
\mathrm{crit}_{B\to E}(\{1\}) &= \{(1,1,1), (1,1,0)\}; \quad \mathrm{crit}_{B\to E}(\{1,3\}) = \{(1,1,1)\} \\
\mathrm{crit}_{B\to F}(\{1\}) &= \{(1,0,1), (1,0,0)\} \\
\mathrm{crit}_{E\to F}(\{2\}) &= \{(0,1,1), (0,1,0)\} \\
\mathrm{crit}_{E\to B}(\{1\}) &= \{(0,1,1), (0,1,0)\} \\
\mathrm{crit}_{F\to B}(\{1\}) &= \{(0,0,1), (0,0,0)\}
\end{aligned}
\tag{2.23}
\]

which, in view of (9) and (10), yields:

\[
\begin{aligned}
I_{B\to E}(\{1\}) &= P_{1,1,1} + P_{1,1,0}; \quad I_{B\to E}(\{1,3\}) = P_{1,1,1} \\
I_{B\to F}(\{1\}) &= P_{1,0,1} + P_{1,0,0} \\
I_{E\to F}(\{2\}) &= P_{0,1,1} + P_{0,1,0} \\
I_{E\to B}(\{1\}) &= P_{0,1,1} + P_{0,1,0} \\
I_{F\to B}(\{1\}) &= P_{0,0,1} + P_{0,0,0}
\end{aligned}
\tag{2.24}
\]
Let us note that the remark to Theorem 2.1 only pertains to Γ={1} and crit_{B→E}({1}), because λ_Γ(x) or μ_Γ(x) are constant within every other set of critical vectors given by (23). Thus, Theorem 2.1 yields:

\[
\begin{aligned}
\Lambda_{B\to E} &= \frac{\lambda_{\{1\}}(1,1,1)P_{1,1,1} + \lambda_{\{1\}}(1,1,0)P_{1,1,0} + \lambda_{\{1,3\}} I_{B\to E}(\{1,3\})}{P_{1,1,1} + P_{1,1,0} + P_{1,0,1} + P_{1,0,0}}
= \frac{\lambda_B(1-\pi_S)P_{1,1,1} + \lambda_B P_{1,1,0} + \lambda_B\pi_S P_{1,1,1}}{P_{1,1,1} + P_{1,1,0} + P_{1,0,1} + P_{1,0,0}} \\
\Lambda_{B\to F} &= \frac{\lambda_{\{1\}} I_{B\to F}(\{1\})}{P_{1,1,1} + P_{1,1,0} + P_{1,0,1} + P_{1,0,0}}
= \lambda_B\,\frac{P_{1,0,1} + P_{1,0,0}}{P_{1,1,1} + P_{1,1,0} + P_{1,0,1} + P_{1,0,0}} \\
\Lambda_{E\to F} &= \frac{\lambda_{\{2\}} I_{E\to F}(\{2\})}{P_{0,1,1} + P_{0,1,0}} = \lambda_E \\
\Lambda_{E\to B} &= \frac{\mu_{\{1\}} I_{E\to B}(\{1\})}{P_{0,1,1} + P_{0,1,0}} = \mu_B \\
\Lambda_{F\to B} &= \frac{\mu_{\{1\}} I_{F\to B}(\{1\})}{P_{0,0,1} + P_{0,0,0}} = \mu_B
\end{aligned}
\tag{2.25}
\]

The steady state probabilities P_x, x∈{0,1}^3, appearing in the above formulas, are given by (21). Also note that we use (17) with Γ={1} in the first formula, because λ{1}(1,1,1) ≠ λ{1}(1,1,0).
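As a quick consistency check on (2.25), offered as a sketch rather than as part of the original derivation, the first two intensities can be added. The numerator of Λ_{B→E} simplifies to λB(P_{1,1,1} + P_{1,1,0}), so the total intensity of leaving state B does not depend on the steady state probabilities at all:

```latex
\Lambda_{B\to E} + \Lambda_{B\to F}
  = \lambda_B \,
    \frac{(P_{1,1,1} + P_{1,1,0}) + (P_{1,0,1} + P_{1,0,0})}
         {P_{1,1,1} + P_{1,1,0} + P_{1,0,1} + P_{1,0,0}}
  = \lambda_B
```

This agrees with the model assumptions: every failure of the basic source B, whatever the states of E and S, takes the user out of state B.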
2.6 Obtaining useful reliability parameters from transition intensities

The intensities of transitions between the system states, computed on the basis of Theorem 2.1, can be used to obtain several frequently used parameters that characterize the operation process of a system functioning in multiple operation modes. The definitions of these system parameters are given below.

T_a: the expected duration of a continuous sojourn in state a
N_{a→b}(u): the expected number of transitions from state a to state b in a time interval of length u
N_long(u): the expected number of long breaks in operation in a time interval of length u, where a long break is a sojourn in state F
N_short(u): the expected number of short breaks in operation in a time interval of length u, where a short break is an effect of manual switching from state B to state E, when S is failed and cannot automatically activate E

Let N_{a→b}(t, t+Δt) be the number of transitions from state a to state b in the time interval (t, t+Δt] in a Markov chain Z_t. It can be easily proved (see [Malinowski et al., 2013]) that

\[ E[N_{a\to b}(t, t+\Delta t)] = \int_t^{t+\Delta t} \Pr(Z_s = a)\,\Lambda_{a\to b}(s)\,ds \tag{2.26} \]
where E denotes the expected value. From (26) it follows that

\[ T_a = \Big( \sum_{b \in S,\ b \neq a} \Lambda_{a\to b} \Big)^{-1} \tag{2.27} \]

\[ N_{a\to b}(u) = u P_a \Lambda_{a\to b} \tag{2.28} \]

\[ N_{long}(u) = \sum_{a \neq F} N_{a\to F}(u) = N_{B\to F}(u) + N_{E\to F}(u) \tag{2.29} \]

\[ N_{short}(u) = u \left( P_{1,1,1}\lambda_B\pi_S + P_{1,1,0}\lambda_B \right) \tag{2.30} \]
In the formulas (2.26)–(2.30) a and b are elements of {B,E,F}. Also, PB = P1,1,1 + P1,1,0 + P1,0,1 + P1,0,0 , PE = P0,1,1 + P0,1,0 and PF = P0,0,1 + P0,0,0 , which equalities follow from Eq. (2.22). As shown in [Malinowski et al., 2015], the quantities defined by Eqs. (2.27)-(2.30) can be used to compute the SAIFI, MAIFI, ASAI, and SAIDI parameters characterizing the performance and reliability of power distribution networks. See [Chowdhury and Koval, 2009] for explanation of these parameters. Although the reliability model constructed and analyzed in [Malinowski et al., 2015] assumes independence of components, Eqs. (2.27)-(2.30) still hold in the case of dependent ones, provided that the system operates according to a Markov model. Readers wishing to explore in greater detail the topic of power systems reliability are referred to [Medjoudj et al., 2017, Singh et al., 2019] and [Tuinema, 2020].
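The chain from steady state probabilities to the user-level parameters can be sketched in a few lines. This is an illustrative example under stated assumptions, not the author's implementation: the function names, the rates and the interval length u = 8760 h are hypothetical. The sketch evaluates (2.21), then the intensities of (2.25), and finally the quantities (2.27) to (2.30).

```python
def steady_state(lam_b, lam_e, mu_b, mu_e, mu_s, pi_s):
    """Steady-state probabilities of X_t = (B, E, S) from Eq. (2.21)."""
    p = {}
    p[1, 1, 1] = (mu_b * mu_e * mu_s * (lam_e + mu_b)) / (
        (lam_b * pi_s + mu_s) * (lam_b + mu_b)
        * (lam_b * lam_e + lam_e * mu_e + mu_b * mu_e))
    p[1, 1, 0] = p[1, 1, 1] * lam_b * pi_s / mu_s
    p[1, 0, 1] = p[1, 1, 1] * lam_b * lam_e * (1 - pi_s) / (mu_e * (lam_e + mu_b))
    p[1, 0, 0] = p[1, 1, 0] * lam_e * (lam_b + mu_s) / (mu_e * (lam_e + mu_b))
    p[0, 1, 1] = p[1, 0, 1] * mu_e / lam_e
    p[0, 1, 0] = p[1, 1, 0] * (lam_b + mu_s) / (lam_e + mu_b)
    p[0, 0, 1] = p[1, 0, 1] * (lam_b + mu_e) / mu_b
    p[0, 0, 0] = p[1, 0, 0] * (lam_b + mu_e) / mu_b
    return p


def user_level_parameters(lam_b, lam_e, mu_b, mu_e, mu_s, pi_s, u):
    """Return intensities (2.25) and the parameters (2.27)-(2.30)."""
    p = steady_state(lam_b, lam_e, mu_b, mu_e, mu_s, pi_s)
    prob = {
        "B": p[1, 1, 1] + p[1, 1, 0] + p[1, 0, 1] + p[1, 0, 0],
        "E": p[0, 1, 1] + p[0, 1, 0],
        "F": p[0, 0, 1] + p[0, 0, 0],
    }
    # Eq. (2.25): steady-state transition intensities of Z_t.
    intensity = {
        ("B", "E"): lam_b * (p[1, 1, 1] + p[1, 1, 0]) / prob["B"],
        ("B", "F"): lam_b * (p[1, 0, 1] + p[1, 0, 0]) / prob["B"],
        ("E", "F"): lam_e,
        ("E", "B"): mu_b,
        ("F", "B"): mu_b,
    }
    # Eq. (2.27): expected sojourn time in each state.
    sojourn = {
        a: 1.0 / sum(v for (src, _dst), v in intensity.items() if src == a)
        for a in prob
    }
    # Eq. (2.28): expected number of a -> b transitions in an interval of length u.
    n = {(a, b): u * prob[a] * v for (a, b), v in intensity.items()}
    # Eq. (2.29) and Eq. (2.30).
    n_long = n["B", "F"] + n["E", "F"]
    n_short = u * (p[1, 1, 1] * lam_b * pi_s + p[1, 1, 0] * lam_b)
    return intensity, sojourn, n, n_long, n_short


# Hypothetical rates (per hour) and a one-year interval, for illustration only.
RATES = dict(lam_b=0.01, lam_e=0.02, mu_b=0.5, mu_e=0.25, mu_s=1.0, pi_s=0.05)
INTENSITY, SOJOURN, N, N_LONG, N_SHORT = user_level_parameters(u=8760.0, **RATES)
```

Two properties are useful as sanity checks on such a computation: the expected sojourn time in B equals 1/λB, and in steady state the expected number of entries into F over an interval equals the expected number of exits from F.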
2.7 Conclusion and future work The paper demonstrates, on the basis of a simple example, the construction of a Markov model of a multistate system whose components may not function independently. Next, a method of computing the system-level transition intensities defined by Eq. (2.4) is presented. This method assumes that the failure and repair rates of all the components are known, and uses multicomponent importances defined by Eqs.(2.5) and (2.6) to obtain the above intensities. However, this task may not be an easy one, because, according to Theorem 2.1, the state probabilities of Xt and the sets of critical vectors have to be found first, which can be tedious for complex systems. The state probabilities are computed from the Kolmogorov equations, while the critical vectors are obtained by analyzing the transition diagrams of Xt and Zt = (Xt ). All these operations have been illustrated on the provided example in Sections 2.4 and 2.5. The system-level transition intensities are an important characteristic of the system operation process, because they can be applied to calculate a number of essential reliability parameters, as shown in Section 2.6. The author believes that the presented method can be further developed to address types of components interdependence other than those considered herein, e.g. the load share dependence or delayed failures induced by other
components’ failures. It can however be difficult to construct an appropriate Markov model. Nevertheless, these issues will be a topic of future work.
References Barlow, R.E., Proschan, F., 1975. Statistical theory of reliability and life testing. Holt, Rinehart and Winston. Chowdhury, A., Koval, D, 2009. Power Distribution System Reliability: Practical Methods and Applications. JohnWiley & Sons. Eryilmaz, S., Oruc, O.-E., Oger, V., 2016. Joint reliability importance in coherent systems with exchangeable dependent components. IEEE Trans. Reliab. 65 (3), 1562–1570. Hoyland, A., Rausand, M., 2009. System Reliability Theory: Models and Statistical Methods. John Wiley & Sons. Kalpesh, P., Kirtee, K., 2017. An overview of various importance measures of reliability system. Int. J. Math., Eng. Manag. Sci. 2, 150–171. Korczak, E., 2007. New formula for the failure/repair frequency of multi-state monotone systems and its applications. Control Cybern. 36, 219–239. Kuo, W., Zhu, X., 2012. Importance Measures in Reliability, Risk and Optimization. John Wiley & Sons. Lin, Y.-H., Li, Y.-F., Zio, E., 2016. Component importance measures for components with multiple dependent competing degradation processes and subject to maintenance. IEEE Trans. Reliab. 65 (2), 547–557. Medjoudj, Rabah, Bediaf, Hassiba, Aissani, Djamil, 2017. Power system reliability: mathematical models and applications. In: Volosencu, Constantin (Ed.), System Reliability. IntechOpen, published online at https://doi.org/10.5772/66993. Malinowski, J., et al., 2013. A method of computing the inter-state transition intensities for multistate series-parallel systems. In: Steenbergen, et al. (Eds.), Safety, Reliability and Risk Analysis: Beyond the Horizon – Proc. ESREL 2013. Amsterdam, The Netherlands, CRC Press, Taylor & Francis Group, pp. 1213–1219. Malinowski, J., et al., 2015. Reliability analysis of a small power supply system with load points operating in normal and emergency modes. In: Podofillini, et al. (Eds.), Safety and Reliability of Complex Engineered Systems – Proc. ESREL 2015. Zurich, Switzerland, CRC Press, Taylor & Francis Group, pp. 1323–1328. 
Miziula, P., Navarro, J., 2019. Birnbaum Importance Measure for Reliability Systems With Dependent Components. IEEE Trans. Reliab. 68 (2), 439–449. Singh, C., Jirutitijaroen, P., Mitra, J., 2019. Electric Power Grid Reliability Evaluation: Models and Methods. Wiley-IEEE Press. Tuinema, B.W., et al., 2020. Probabilistic Reliability Analysis of Power Systems. A Student’s Introduction. Springer Nature Switzerland AG.
Non-Print Items Abstract This chapter presents a method of constructing Markov models of multicomponent systems with multiple operational states, admitting the possibility of simultaneous component failures and repairs, and dependence of the respective failure rates on the other components’ states. Thus, the widely adopted assumption of component independence is relaxed. It is also demonstrated how to compute the interstate transition intensities using an extended concept of component importance. These intensities are then applied to determine some basic reliability parameters of the considered systems. For greater clarity, the presented results are illustrated by a simple example from electrical engineering. Keywords Dependent components; Interstate transition intensity; Markov model; Multicomponent importance; Multistate system; Reliability parameters; Simultaneous failures/repairs
Chapter 3
Reliability analysis of solar array drive assembly by dynamic fault tree

Tudi Huang, Hong-Zhong Huang, Yan-Feng Li, Lei Shi and Hua-Ming Qian
Center of System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China

Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00008-8 Copyright © 2021 Elsevier Inc. All rights reserved.

3.1 Introduction

Man-made satellites play an increasingly important role in the continuous development of communication technology [Huang et al., 2013; Li et al., 2013]. The solar array drive assembly (SADA) is an important part of satellite systems, and it ensures that the solar wing fully captures solar energy. Currently, solar energy is the most technically mature cosmic energy source [Wu et al., 2011; Baghdasarian, 1998; Brophy et al., 2011]. To obtain the maximum amount of solar energy, a good solar panel is not enough: the normal direction of the solar wings also has to be parallel to the solar beam. This task is accomplished by the SADA [Sattar and Wei, 2019]. The solar wing is the source of energy for the satellite, and it is essential to the success of the entire satellite. Therefore, the reliability of the SADA is of great importance. According to the reported data, the solar wing has the highest failure frequency among all satellite subsystems, and a large part of these failures are caused by the failure of the SADA [Castet and Saleh, 2009]. Due to many hardware and software redundancies, the SADA has a complex structure and failure mechanism. To analyze its reliability accurately, the dynamic fault tree (DFT) and Markov model are introduced in this chapter.

The DFT and Markov model have been widely applied in reliability analysis, and many results have been obtained. The original DFT modeling techniques were proposed by Dugan [Dugan et al., 1992] to analyze the reliability of fault-tolerant computer systems, and they were further studied by Bouissou through Boolean logic driven Markov processes [Bouissou, 2007; Bouissou and Bon, 2003]. A framework for DFT analysis based on input/output interactive Markov chains (I/O-IMCs) was proposed and improved by Boudali [2007, 2010]. Chiacchio presented a dynamic fault tree resolution based on a conscious trade-off between analytical and simulative approaches, which can solve general DFTs including time dependencies, repeated events and generalized probability failure [Chiacchio et al., 2011, 2013]. An efficient approximate Markov chain method for DFT analysis was also proposed for both nonreparable and reparable systems by Yevkin [2016]. Kabir [2017] gave a good overview of fault tree analysis and summarized the DFT method. Improved sequential binary decision diagrams were used to perform the quantitative analysis of DFTs [Ge et al., 2015]. Meanwhile, a novel method for rare event simulation of DFTs with complex repairs was also proposed [Ruijters et al., 2017]. Zhang et al. [2016] used system grading and DFT to conduct reliability analysis for a floating offshore wind turbine.

To analyze the failure behavior of the SADA, the DFT and Markov model are combined to estimate the occurrence of the top failure event of the SADA. The main work is summarized in four aspects: (1) the structure and working principle of the SADA are analyzed; (2) the DFT model of the SADA is built; (3) the DFT of the SADA is transformed into the Markov model; (4) the reliability of the SADA is analyzed based on the Markov model.

The remainder of this chapter is organized as follows. In Section 3.2, the DFT method is briefly introduced. In Section 3.3, the DFT model of the SADA is built based on its structure and working principle. In Section 3.4, the DFT is transformed into the Markov model and the reliability of the SADA is estimated. In Section 3.5, we close this chapter with a brief conclusion.
3.2 DFT method

Traditional static fault tree analysis methods cannot handle systems with dynamic behaviors. For example, the solar sensors in the SADA form a cold standby system in which the failures of the primary and standby components occur in a fixed sequence, and it is difficult to depict this dynamic behavior using a traditional static fault tree. Priority gates, function-correlated gates, sequential-correlated gates, cold standby gates, and hot standby gates are common logic gates in dynamic fault tree models. The conversions of three dynamic logic gates to the Markov model are shown in Table 3.1. These gates are used for modeling the failure behavior of the SADA in this chapter.
3.3 DFT modeling for SADA

3.3.1 Structure and working principle of SADA

The SADA is the interface between the solar wing and the satellite, and is connected to the solar wing and the entire satellite. From a functional perspective, it is a necessary part of the energy system. The main function of the SADA is to drive the solar panel to rotate so that the solar panel normal is kept parallel to the solar beam, which allows the panel to obtain as much solar energy as possible and to provide enough power for the satellite. The SADA mainly consists of eight parts: the sun sensors, onboard computer, driving motor, harmonic reducer, conducting ring, electrical system, position sensor, and the transmission. Failure of any of these parts will cause the system to fail; thus, the SADA is a series system. In addition, as failures caused by humans cannot be ignored, human errors are considered in the failure analysis.

TABLE 3.1 The conversion of the dynamic logic gates to the Markov model
Logic gate                  Transformation of logic gate to Markov chain
Cold standby gate           [diagram]
Hot standby gate            [diagram]
Function-correlated gate    [diagram]
3.3.2 DFT model

In the DFT modeling and analysis of complex systems, the dynamic fault tree needs to be modularized according to the system structure. The system is decomposed into several independent static and dynamic subtrees, which are then solved by the BDD method and the Markov model, respectively. The results from the analysis of the independent subtrees are then combined to obtain the system reliability. The failure analysis of the dynamic parts in the SADA is performed as follows:

Step 1: The solar sensors, onboard computers, driving motor stator windings, and the position sensors are in the cold standby mode. The electrical system is modeled by a function-correlated gate because of its functionally related components. The conducting ring is in active standby mode.

Step 2: Select the SADA failure as the top event and analyze the top event in a top-down fashion. The code corresponding to each event is shown in Table 3.2. Suppose that all the events are statistically independent; the logical relationship between the top event and the events A to I can then be expressed by OR logic. Combining the failure analysis results from the nine modules A to I, the fault tree of the SADA is obtained as shown in Fig. 3.1. A1 and A2 in Fig. 3.1 are the primary and standby components of A, respectively. B1 and B2, R1 and R2, F1 and F2, I1 and I2, and K1 and K2 are the primary and standby components of B, R, F, I, and K, respectively. The dynamic subtrees of the system are shown in Fig. 3.2.

TABLE 3.2 The codes of events
Code   Event
A      Failure of the sun sensor
B      Failure of the onboard computer
C      Failure of the harmonic reducer
D      Failure of the driving motor
E      Failure of the human factor
F      Failure of the electrical system
G      Failure of the transmission
H      Failure of the conducting ring
I      Failure of the position sensor
K      Failure of the electric brush
R      Failure of the driving motor stator winding
3.4 Reliability analysis of SADA

3.4.1 Reliability analysis of the dynamic module based on Markov model

The onboard computer is designed with cold standby redundancy, including one working component (B1) and one backup (B2). The failure probability of switching is assumed to be 0. The system failure mechanism is summarized as follows. Component B1 works normally and the standby component B2 stays in standby mode at the beginning of use. There are two basic events leading to the failure of component B1. If either of them occurs, component B1 fails. The standby component B2 starts to work once component B1 fails. The failure mechanism of component B2 is the same as that of B1. The Markov model of the onboard computer is shown in Fig. 3.3.
FIG. 3.1 The fault tree of the SADA

FIG. 3.2 The dynamic fault tree of some subsystems

FIG. 3.3 The Markov model of onboard computer
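The state transition rate matrix of the Markov model is given by:

\[
Q = \begin{bmatrix}
-\lambda_{14}-\lambda_{40} & \lambda_{14} & 0 & 0 & \lambda_{40} & 0 & 0 \\
0 & -\lambda_{14}-\lambda_{40} & \lambda_{14} & \lambda_{40} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -\lambda_{14}-\lambda_{40} & \lambda_{40} & \lambda_{14} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\tag{3.1}
\]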
The corresponding Kolmogorov backward differential equation is given by Ikeda and Watanabe [2014]:

\[ \frac{d}{dt} p_{ij}(t) = -q_i\, p_{ij}(t) + \sum_{k \neq i} q_{ik}\, p_{kj}(t) \tag{3.2} \]

Eq. (3.2) can be expressed in a matrix form as follows:

\[ P'(t) = QP(t) \tag{3.3} \]

By the Laplace transform, Eq. (3.3) can be converted to:

\[ sP_i(s) - P_i(0) = QP_i(s) \tag{3.4} \]

As all the components work normally at the beginning of use, the initial system state probabilities are P_0(0) = 1, P_i(0) = 0, i > 0. According to the initial system state probabilities, the state distribution by the Laplace transformation can be obtained as follows:

\[ \pi Q = 0, \qquad \sum_{j \in E} \pi_j = 1 \tag{3.5} \]

\[
\begin{aligned}
sP_0(s) - P_0(0) &= (-\lambda_{14} - \lambda_{40})P_0(s); \\
sP_1(s) &= \lambda_{14}P_0(s) + (-\lambda_{14} - \lambda_{40})P_1(s); \\
sP_2(s) &= \lambda_{14}P_1(s); \\
sP_3(s) &= \lambda_{40}P_1(s); \\
sP_4(s) &= \lambda_{40}P_0(s) + (-\lambda_{14} - \lambda_{40})P_4(s); \\
sP_5(s) &= \lambda_{40}P_4(s); \\
sP_6(s) &= \lambda_{14}P_4(s)
\end{aligned}
\tag{3.6}
\]
By combining Eqs. (3.5) and (3.6), the system state probability at any time can be computed. Taking state 2 as an example, by solving Eq. (3.6), the probability of state 2 can be obtained as follows:

\[ P_2(s) = \frac{\lambda_{14}^2}{s\,(s + \lambda_{14} + \lambda_{40})^2} \tag{3.7} \]

By the inverse Laplace transformation, Eq. (3.7) can be expressed as:

\[ P_2 = \lambda_{14}^2 \cdot \left[ \frac{1 - e^{-(\lambda_{14}+\lambda_{40})t}}{(\lambda_{14} + \lambda_{40})^2} - \frac{t\,e^{-(\lambda_{14}+\lambda_{40})t}}{\lambda_{14} + \lambda_{40}} \right] \tag{3.8} \]
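The step from Eq. (3.7) to Eq. (3.8) can be verified by a partial fraction expansion. The shorthand a = λ14 + λ40 is introduced here only for this check and is not part of the original notation:

```latex
\frac{1}{s\,(s+a)^2}
  = \frac{1}{a^2}\cdot\frac{1}{s}
  - \frac{1}{a^2}\cdot\frac{1}{s+a}
  - \frac{1}{a}\cdot\frac{1}{(s+a)^2}
\quad\Longrightarrow\quad
\mathcal{L}^{-1}\!\left[\frac{1}{s\,(s+a)^2}\right]
  = \frac{1 - e^{-at}}{a^2} - \frac{t\,e^{-at}}{a}
```

Multiplying the right-hand side by λ14² gives the expression for P_2 in Eq. (3.8).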
FIG. 3.4 BDD analysis of subtree E

According to the system definition, states 2, 3, 5, and 6 are failure states, and the failure probability of the system is then:

\[ P = P_2 + P_3 + P_5 + P_6 \tag{3.9} \]
The analysis of other dynamic modules is similar to the above procedures.
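The cold standby computation above can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' code: the function name and the rate values are hypothetical (the chapter does not state which numerical values λ14 and λ40 take), and P3, P5 and P6 are obtained from Eq. (3.6) in the same way as P2 in Eq. (3.8). Since the four failure-state transforms share the factor 1/(s(s+a)²) with a = λ14 + λ40, their sum reduces to 1 − e^(−at)(1 + at), which the sketch uses as a cross-check.

```python
import math


def cold_standby_failure_probability(t, lam14, lam40):
    """Failure probability P2 + P3 + P5 + P6 of Eq. (3.9) at time t.

    Each failure-state probability is a product of two rates times the
    inverse Laplace transform of 1 / (s (s + a)^2), with a = lam14 + lam40.
    """
    a = lam14 + lam40
    decay = math.exp(-a * t)
    bracket = (1.0 - decay) / a ** 2 - t * decay / a  # bracket of Eq. (3.8)
    p2 = lam14 * lam14 * bracket   # Eq. (3.8)
    p3 = lam14 * lam40 * bracket   # from sP3(s) = lam40 * P1(s)
    p5 = lam40 * lam40 * bracket   # from sP5(s) = lam40 * P4(s)
    p6 = lam40 * lam14 * bracket   # from sP6(s) = lam14 * P4(s)
    return p2 + p3 + p5 + p6


# Hypothetical inputs: rates per hour and a mission time in hours.
LAM14, LAM40, T = 0.5e-6, 0.25e-6, 50_000.0
P_FAIL = cold_standby_failure_probability(T, LAM14, LAM40)

# Cross-check: the sum collapses to 1 - exp(-a t) (1 + a t).
A = LAM14 + LAM40
P_FAIL_CHECK = 1.0 - math.exp(-A * T) * (1.0 + A * T)
```

For comparison, a single component with the same total rate would fail with probability 1 − e^(−at), which is always larger; the difference is the benefit of the cold standby unit under the assumption of perfect switching.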
3.4.2 Analysis of the static module based on BDD

In this section, the subtree E is taken as an example to explain the BDD analysis process. According to the analysis procedure of the BDD, the basic events of subtree E are gradually expanded downward, and the BDD of the subtree E is shown in Fig. 3.4. According to Fig. 3.4, the system structure function of the static subtree E is:

\[ \varphi(E) = Y_5 \cup Y_6 \cup Y_7 \tag{3.10} \]

Similarly, the system structure functions of other static subtrees can be derived as:

\[
\begin{aligned}
\varphi(C) &= Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \\
\varphi(D) &= R \cup X_{17} \cup X_{18} \cup X_{19} \cup X_{20} \\
\varphi(G) &= Y_8 \cup Y_9 \cup Y_{10} \cup Y_{11} \cup Y_{12} \cup Y_{13} \cup Y_{14} \\
\varphi(H) &= X_{33} \cup X_{34} \cup X_{35} \cup K
\end{aligned}
\tag{3.11}
\]

Then, the structure function of the entire fault tree is:

\[ \varphi = A \cup B \cup C \cup D \cup E \cup F \cup G \cup H \cup I \tag{3.12} \]

3.4.3 System reliability calculation
Peng [1992] gave the failure rate of all the devices of SADA. Here we will empirically estimate the failure rate of some devices based on this table. Assume that all the failures comply with the exponential distribution; all basic events and their failure rates in the fault tree are shown in Table 3.3. For an event with a failure rate λ_i, the reliability function is:

\[ R_i(t) = e^{-\lambda_i t} \tag{3.13} \]

TABLE 3.3 Failure rate list (unit: 10^−6 ft/hr)
Event   Description                             Failure rate
X1      Optical head                            0.5
X2      Sensor                                  0.6
X3      Signal processing line                  0.12
X4      Onboard computer hardware               0.5
X5      Onboard computer software               0.25
X6      Winding fatigue                         0.25
X7      Winding is burned                       0.1
X8      Drive line failure                      0.12
X9      Locked rotor                            0.12
X10     Increased friction                      0.1
X11     Fatigue failure                         0.1
X12     Interface failure                       0.6
X13     Line failure                            0.125
X14     Transistor failure                      0.5
X15     Line is burned by discharge             0.1
X16     Bearing lubricant failure               0.1
X17     Insulation failure                      0.15
X18     Photosensitive element failure          0.5
X19     Light emitting diode failure            0.5
X20     Detection circuit failure               0.5
k1      Main components of brush                0.5
k2      Backup of brush                         0.5
Y1      Flexible wheel failure                  0.1
Y2      Gear abrasion                           0.6
Y3      Solid lubricating film failure          0.55
Y4      Grease failure                          0.5
Y5      Instruction runaway                     0.5
Y6      Hardware design error                   0.2
Y7      Software design error                   0.2
Y8      Clutch failure                          0.6
Y9      Bearing deadlocking                     0.5
Y10     Bond rupture                            0.1
Y11     Potentiometer rotation axis failure     0.5
Y12     Gear failure                            0.1
Y13     External failure                        0.5
Y14     Comprehensive failure                   0.5
The whole system is finally simplified as a series model, that is, the system reliability is the product of the reliabilities of all the events, which is expressed as:

\[ R(t) = \prod_{i=1}^{n} R_i(t) \tag{3.14} \]

FIG. 3.5 System reliability function over time

Suppose that the working time is t = 50000 h. The reliability of the SADA can be calculated as R(50000) = 0.6399. For a better comparison, we set the time interval as [10000, 50000], and the relationship between reliability and time is shown in Fig. 3.5. If there is no standby component in this system, the reliability is R = 0.5183. The comparison of the two cases is shown in Fig. 3.6. From Fig. 3.5, it can be seen that the system reliability gradually decreases with time. After calculation, we can also find that the reliability of the active standby gate of the brush is 0.9994, and the reliability of the single brush is 0.9753.
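The two brush figures quoted above follow directly from Eq. (3.13) and the brush failure rate in Table 3.3. The sketch below reproduces them; the helper names are hypothetical, and treating the active standby pair as two independent, identical units in parallel is an assumption made for this illustration.

```python
import math


def reliability(rate_per_hour, hours):
    """Exponential reliability function of Eq. (3.13)."""
    return math.exp(-rate_per_hour * hours)


def series(reliabilities):
    """Series system: product of component reliabilities, Eq. (3.14)."""
    return math.prod(reliabilities)


def active_parallel(reliabilities):
    """Active (hot) redundancy: the module fails only if every unit fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


T_HOURS = 50_000
BRUSH_RATE = 0.5e-6  # k1 and k2 in Table 3.3, in failures per hour

R_BRUSH = reliability(BRUSH_RATE, T_HOURS)
R_BRUSH_PAIR = active_parallel([R_BRUSH, R_BRUSH])

print(f"single brush:         {R_BRUSH:.4f}")
print(f"active standby pair:  {R_BRUSH_PAIR:.4f}")
```

The same `series` helper can be applied to the module reliabilities once every dynamic module has been reduced to a single equivalent event, which is the simplification described in the text.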
FIG. 3.6 The comparison for the cases with and without cold standby

3.5 Conclusion

In this chapter, the DFT for the SADA is established, and the fault tree of the system is divided into several static subtrees and dynamic subtrees. In this method, the Markov model is first used to solve the dynamic subtrees, so that each dynamic logic gate can be regarded as a basic event, which simplifies the fault tree to a static fault tree composed of OR-gates. This static tree is then solved by the BDD to obtain the reliability of the SADA. Finally, the system reliabilities with and without standby are compared. In our future work, the Bayesian network will be studied for the reliability analysis of the SADA.
Acknowledgements

The authors extend their sincere gratitude to the National Natural Science Foundation of China for financial support under the contract number 51775090 and sincere thanks to Technology Institute of Armored Force for providing the data in this work.
References Baghdasarian V. G. Hybrid solar panel array: U.S. Patent 5,785,280. 1998-7-28. Boudali, H., Crouzen, P., Stoelinga, M., 2007. Dynamic fault tree analysis using input/output interactive Markov chains. In: 37th Annu. IEEE/IFIP Int. Conf. Depend. Syst. Networks (DSN’07). IEEE, pp. 708–717. Boudali, H., Crouzen, P., Stoelinga, M., 2010. A rigorous, compositional, and extensible framework for dynamic fault tree analysis. IEEE Trans. Depend. Secure Comput. 7 (2), 128–143. Bouissou, M., Bon, J.L., 2003. A new formalism that combines advantages of fault-trees and Markov models: Boolean logic driven Markov processes. Reliab.Eng. System Safe. 82 (2), 149–163. Bouissou, M., 2007. A generalization of dynamic fault trees through Boolean logic driven Markov processes (BDMP). Proc. 16th Eur. Safe. Reliab. Conf. (ESREL’07). Brophy, J, Gershman, R, Strange, N, Landau, D, Merrill, R, Kerslake, T., 2011. 300-kW solar electric propulsion system configuration for human exploration of near-earth asteroids. 47th AIAA/ASME/SAE/ASEE Joint Propulsion Conf. Exhibit 5514.
Abstract
The solar array drive assembly (SADA) is an important part of the solar wing and has a complex structure, complex failure mechanisms, and dynamic behaviors. The dynamic fault tree (DFT) method is an effective tool for analyzing the reliability of complex systems with dynamic failure behaviors, so it is used in this chapter to analyze the reliability of the SADA. The Markov model and the binary decision diagram (BDD) are used jointly to handle the dynamic and static subtrees. The Markov model is used to solve the dynamic subtree. Each dynamic logic gate is then regarded as a basic event, which simplifies the fault tree into a static fault tree composed of OR-gates; this static tree is solved by the BDD to obtain the reliability of the SADA. Finally, the system reliability with and without standby is compared.
Key words
Binary decision diagram; Dynamic fault tree; Markov model; Reliability; Solar array drive assembly
Chapter 4
Reliability and maintainability of safety instrumented system
Rajesh S. Prabhu Gaonkar (a) and Mahadev V. Verlekar (b)
(a) School of Mechanical Sciences, Indian Institute of Technology Goa (IIT Goa), Farmagudi, Ponda, Goa, India
(b) Deccan Fine Chemicals (India) Pvt. Ltd., Santa Monica Works, Corlim, Ilhas, Goa, India
Abbreviations
EUC: Equipment under control
SIS: Safety instrumented system
SIF: Safety instrumented function
SIL: Safety integrity level
LS: Logic solver
DD: Dangerous detected
DU: Dangerous undetected
STR: Spurious trip rate
SO: Spurious operation
PST: Partial stroke test
GEC: Goa College of Engineering
PFSavg: Probability of failing safely (average)
PFDavg: Probability of failure on demand (average)
MRT: Mean repair time
MTTR: Mean time to restore
PTT: Proof test time
MTBF: Mean time between failures
FTC: Failure to close
4.1 Introduction
4.1.1 Preamble
A safety instrumented system (SIS) is used as an independent layer of protection that is called upon to control potentially hazardous deviations of the monitored process, that is, the equipment under control (EUC), and thereby to put it in a safe state.
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00005-2 Copyright © 2021 Elsevier Inc. All rights reserved.
The safety functions implemented in the SIS are called safety instrumented functions (SIFs). The SIS is made up of the following three subsystems:
S (Sensor): a set of input elements (sensors, detectors, transmitters, etc.) that monitor the evolution of the parameters representing the process behavior (temperature, pressure, flow, level, etc.). If at least one of the parameters exceeds a threshold level and remains there, this deviation constitutes the demand from the EUC.
LS (Logic Solver): a set of logic elements (e.g., a programmable logic controller, PLC) that collect information from the S subsystem and carry out the decision-making process, which may end by activating the third subsystem.
FE (Final Element): this subsystem acts directly (emergency shutdown valves) or indirectly (solenoid valves, alarms) on the process in order to neutralize its deviation, generally by putting it in a safe state within a specified time that must be identified for each safety function (Innal et al. 2015).
4.1.2 Failure modes and failure rates
D (Dangerous): the item fails to function and falls into a dangerous state. An example is a hazardous chemical valve that is jammed open and will fail to close (FTC) if there is a demand for the valve to close. This failure is dangerous because the chemical flow is a hazard. The rate of such failures is denoted λd.
DD (Dangerous detected): a dangerous failure of an item that can be detected through diagnostics. Detecting the dangerous state gives room for alternative protective actions to be taken. In the case of the hazardous chemical valve, a redundant valve may be closed to prevent the hazard from happening. The rate at which the item falls into the dangerous-detected state is denoted λdd.
DU (Dangerous undetected): a dangerous failure that is not revealed through diagnostics or other means. In the case of the hazardous chemical valve, the valve remains stuck open and its failure is not detected by any diagnostic system. This failure can only be revealed if a demand arises and the system does not respond to it, or when the system undergoes the next proof test. The rate of such failures is denoted λdu.
S (Safe): the item fails by falling into a safe state without any intended demand. This type of failure is commonly known as a spurious trip. For the valve in this example, a spurious trip might take place if the valve suddenly closes without receiving any command. The valve fails, but the failure is not dangerous. The main consequence of a spurious trip is loss of production.
The dangerous failure rate is equal to the sum of the detected and undetected dangerous failure rates: λd = λdd + λdu (Jahanian 2014).
Definition of safe state: The SIS must be designed to take the equipment to a safe state in response to a demand. It is not always straightforward to define the safe state, and the EUC may have different safe states during normal operation, start-up, or shutdown. In some cases, the safe state is to maintain the state before the demand occurred, while in other cases it means to stop the EUC. When the safe state is defined, the SIS design must also consider fail-safe operation, meaning that the SIS automatically takes the EUC to a safe state in response to foreseeable SIS failures such as loss of power, loss of signal, and loss of air supply to the pneumatic actuator.
FIG. 4.1 Typical risk reduction methods found in process plants. Source: IEC (61511).
Sequence of protection layers: The typical risk-reduction methods found in process plants are often depicted as an onion model, as shown in Fig. 4.1, which illustrates the sequence of protection layers. The sequence starts from the center and proceeds outwards, first with the frequency-reducing layers and then with the consequence-reducing layers. Fig. 4.2 shows the layers of protection.
Definition of hazardous event: A hazardous event is sometimes defined as the first significant deviation from a normal situation that may, if not controlled, develop into an accident. A high-demand SIS often contributes to reducing the likelihood of such events. For consequence-reducing SISs (usually low-demand SISs), it is the accident frequency and not the hazardous event frequency that is reduced.
Risk: The likelihood of a specified undesired event occurring within a specified period or in specified circumstances. Risk is also defined as a combination of the probability of occurrence of harm and the severity of that harm, or as a measure of the likelihood and consequence of adverse effects.
RISK = Likelihood × Consequence
FIG. 4.2 Layer of protection (Exida SIS Course).
4.1.3 Spurious Trips
Spurious activation of the SIS may lead to a partial or full process shutdown. The spurious activation may be due to false process demands or SIS element failures. A false process demand is one that is erroneously treated as a real process demand, for example, a stray ray of sunlight that is mistakenly read as a fire by a flame detector in a furnace. In the process industry, it is important to reduce the number of spurious activations in order to:
1. avoid unnecessary production loss,
2. reduce the risk related to stresses caused by spurious activation, and
3. avoid hazards during unscheduled system restoration and restart.
The different types of spurious activation are:
Spurious operation: A spurious operation is the activation of an SIS element without the presence of a specified process demand, for example: (i) a false signal about high level from a level transmitter due to an internal failure of the transmitter, (ii) premature closure of a spring-loaded, pneumatically operated, fail-safe close safety valve due to leakage in the pneumatic circuit, or (iii) a high-level alarm from a level transmitter without the liquid level having exceeded the upper limit, due to failure to distinguish foam from the real level of the liquid in the reactor.
TABLE 4.1 Safety integrity levels: probability of failure on demand for demand mode of operation.
Safety integrity level (SIL) | Target average probability of failure on demand
4 | 10^−5 to 10^−4
3 | 10^−4 to 10^−3
2 | 10^−3 to 10^−2
1 | 10^−2 to 10^−1
Spurious trip: A spurious trip is the activation of one or more SIS elements such that the SIS performs a SIF without the presence of a specified process demand. Examples: (i) two flame detectors in a 2-out-of-3 (2oo3) configuration give a false signal about fire, causing the final elements of the SIF to be activated, and (ii) one out of two shutdown valves in a 1-out-of-2 (1oo2) configuration of final elements closes prematurely due to an internal failure.
Spurious shutdown: A spurious shutdown is a partial or full process shutdown without the presence of a specified process demand (Lundteigen and Rausand 2008).
4.1.4 Probabilistic evaluation of SIS
The quantitative (probabilistic) evaluation of SIS performance is an important step in their validation. This validation provides assurance that they can properly perform their designed safety functions. The ability of an SIS to meet a given safety target (tolerable risk level) is called safety integrity, which is measured differently depending on the SIS mode of operation:
Average probability of failure on demand (PFDavg) for the low-demand mode. The SIS is normally in a passive state and is only activated when a demand occurs (Table 4.1, IEC 61511). The frequency of demands is assumed to be less than once a year. An SIS in a process plant is an example of a low-demand SIS. The reliability of a low-demand SIS is quantified as PFDavg: if a demand occurs, the SIS will fail to carry out its required SIF with an average probability of PFDavg. This mode is typical of safety systems that are activated only when a threshold value is exceeded (process upset), as shown in Table 4.2 (IEC 61511).
Probability of dangerous failure per hour (PFH) for the high or continuous demand mode. This mode of operation is typical of safety systems that have permanent or regular operation, as shown in Table 4.3 (IEC 61511).
TABLE 4.2 Safety integrity levels: probability of failure on demand for demand mode of operation.
Safety integrity level (SIL) | Target risk reduction
4 | ≥ 10000 to 100000
3 | ≥ 1000 to 10000
2 | ≥ 100 to 1000
1 | ≥ 10 to 100
TABLE 4.3 Safety integrity levels: frequency of dangerous failures of the SIF for continuous mode of operation.
Safety integrity level (SIL) | Target frequency of dangerous failure (per h)
4 | 10^−9 to 10^−8
3 | 10^−8 to 10^−7
2 | 10^−7 to 10^−6
1 | 10^−6 to 10^−5
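The bands in Tables 4.1 and 4.3 can be read as a simple lookup from a computed PFDavg or PFH to a SIL. The following is a minimal sketch of such a lookup, not a procedure taken from IEC 61511. The function names are hypothetical, the choice of an inclusive lower bound and exclusive upper bound is an assumption, and the risk reduction is taken as 1/PFDavg, which is an assumption that agrees with the correspondence between Tables 4.1 and 4.2.

```python
# Bands transcribed from Tables 4.1 and 4.3; each SIL covers one decade.
# Assumption of this sketch: lower bound inclusive, upper bound exclusive.
PFD_BANDS = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
PFH_BANDS = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}


def sil_from_value(value, bands):
    """Return the SIL whose band contains value, or None if no band matches."""
    for sil, (low, high) in bands.items():
        if low <= value < high:
            return sil
    return None


def risk_reduction(pfd_avg):
    """Risk reduction of a low-demand SIF, taken here as 1 / PFDavg (assumption)."""
    if pfd_avg <= 0:
        raise ValueError("pfd_avg must be positive")
    return 1.0 / pfd_avg


if __name__ == "__main__":
    print(sil_from_value(5e-3, PFD_BANDS))  # low-demand SIF with PFDavg = 5e-3
    print(sil_from_value(2e-8, PFH_BANDS))  # continuous-mode SIF with PFH = 2e-8
    print(risk_reduction(5e-3))
```

A value outside all four bands returns None, which leaves the decision about how to treat such cases to the caller.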
4.1.5 SIS design optimization
In looking for an optimal KooN architecture, we seek a criterion for identifying the optimal KooN architecture, integrated into an SIS, with respect to safety integrity. This is done on the basis of a simple analysis of the structure of conventional KooN architectures. With regard to safety integrity, we find that:
For a given value of K, the PFDavg decreases when the value of N increases: PFD1oo1 > PFD1oo2 > PFD1oo3 and PFD2oo2 > PFD2oo3 > PFD2oo4.
For a given value of N (here 2, 3, or 4), the PFDavg increases when the value of K increases: PFD1oo2 < PFD2oo2; PFD1oo3 < PFD2oo3; and PFD1oo4 < PFD2oo4 (Innal et al. 2015).
Consider a KooN system consisting of N identical, independent units. The system is functional (available) when at least K units are functional. That is, N-K+1 units must fail before the system becomes unavailable. In relation to DU failures, the system can be in one of the following three states at any given time:
1. Available state, where all the N units are available and none of them has fallen into DU state. This state starts at t = 0 when the system is placed into operation for the first time, or after a proof test where all the units are as good as new.
2. Degraded state, where the number of units in DU state is more than 0 and less than N-K+1. The system can still respond to a demand, if one takes place, but the system is now less tolerant to further failures of its units.
3. Unavailable state, where N-K+1 or more units are in DU state and the system will not be able to respond to potential demands.
It should be noted that, by its nature, a DU fault is an undetected fault. The fault is not revealed and therefore the system will remain in operation: no operator action will be taken to repair the faulty units or to shut the system down, as there is no indication of faults. The failure of the units is assumed to remain hidden until the next proof test is performed. We also assume that faulty units do not recover by themselves, and therefore when a unit falls into DU state it will remain in DU state until the problem is revealed by the next proof test (Jahanian 2014).
Process shutdown valves are normally operated in a so-called low-demand mode of operation. This means that the valves are kept idle in the open position for long periods and are designed to close and keep tight if a process demand occurs. The valves usually have pneumatic fail-safe close or fail-safe open actuators. Failures may occur while in the open or closed position and may cause the valve to fail to close (FTC) or fail to open in a demand situation. Such failures may remain hidden for a long period and are called DU failures (IEC 61508, 1998). Functional testing is required to reveal potential DU failures and involves full stroke operation and leakage testing. The process needs to be shut down during the functional test if no bypass of the valve is available.
In recent years, partial stroke testing (PST) has been introduced as a supplement to functional testing. PST means partially closing a valve and then returning it to its initial position. The valve movement is so small that the impact on the process flow or pressure is negligible, but the movement may still be sufficient to reveal several types of dangerous failures. PST may be suitable for processes where a small valve movement does not cause disturbances that may lead to process shutdowns. For such processes, it may be economically viable to run PST more frequently than functional testing (Lundteigen and Rausand 2008).
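The three KooN states defined earlier in this section (available, degraded, unavailable) depend only on N, K, and the number of units currently in DU state. The following is a minimal sketch of that classification; the function name `koon_state` and its arguments are hypothetical and are introduced only for illustration.

```python
def koon_state(n, k, failed_du):
    """Classify a KooN system by the number of units in the DU state.

    n: total number of units, k: units required for the system to function,
    failed_du: number of units currently in the DU state.
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if not 0 <= failed_du <= n:
        raise ValueError("failed_du must lie between 0 and n")
    if failed_du == 0:
        return "available"       # no unit has fallen into DU state
    if failed_du < n - k + 1:
        return "degraded"        # can still respond, but with less tolerance
    return "unavailable"         # n-k+1 or more units are in DU state


if __name__ == "__main__":
    for failed in range(4):
        print("2oo3 with", failed, "DU failures:", koon_state(3, 2, failed))
```

For a 2oo3 system, one DU failure leaves the system degraded, while two DU failures make it unavailable, because N-K+1 = 2.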
4.2 Literature review
4.2.1 Probability of failure on demand
The performance of a safety function is its ability to satisfy both the required SIL and the production target, that is, it should have both safety integrity and operational integrity. Regarding safety integrity, the PFDavg or PFH of the SIS is the relevant indicator. For operational integrity, the following indicators are proposed in the literature: the average probability of failing safely (PFSavg), the spurious trip rate (STR), and the mean time to trip spuriously. The literature provides a new formulation of the SIS performance indicators with respect to safety integrity, that is, PFDavg and PFH, together with a generic formulation of the SIS operational integrity indicators, that is, PFSavg and STR. It also discusses the SIS optimization problem, starting with a preliminary search for a balance between safety integrity and operational integrity. The IEC 61508 standard (part 6), being a safety-based standard, provides analytical expressions
related to the PFDavg and PFH only for the usual KooN architectures. The technical report ISA-TR 84.00.02 gathers the PFDavg and STR analytical expressions for several typical KooN architectures. The results provided by the new analytical equations for PFDKooN and PFHKooN are very close to those obtained with the corresponding fault tree models. Likewise, the results provided by the new analytical equations for PFSKooN and STRKooN are very close to those obtained with Markov models.
With regard to safety integrity, for a given value of K (here 1 or 2), the PFDavg decreases when the value of N increases:
PFD1oo1 > PFD1oo2 > PFD1oo3 and PFD2oo2 > PFD2oo3 > PFD2oo4
For a given value of N (here 2, 3, or 4), the PFDavg increases when the value of K increases:
PFD1oo2 < PFD2oo2; PFD1oo3 < PFD2oo3; and PFD1oo4 < PFD2oo4
With regard to operational integrity, the conclusions are reversed. For a given value of K (here 1 or 2), the PFSavg increases when the value of N increases:
PFS1oo1 < PFS1oo2 < PFS1oo3 and PFS2oo2 < PFS2oo3 < PFS2oo4
For a given value of N (here 2, 3, or 4), the PFSavg decreases when the value of K increases:
PFS1oo2 > PFS2oo2; PFS1oo3 > PFS2oo3; and PFS1oo4 > PFS2oo4
Safety integrity and operational integrity are conflicting, and hence a trade-off between them is necessary. The trade-off relies only on N and K.
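The opposing orderings of PFD and PFS stated above can be reproduced with a deliberately simplified model. The sketch below is not the analytical formulation from the literature cited in this section: it assumes independent, identical units, ignores common cause failures and time averaging, and uses a fixed per-unit probability `q_du` of being in DU state and a fixed per-unit probability `p_so` of spurious operation. Both parameters and all function names are hypothetical.

```python
from math import comb


def prob_at_least(m, n, p):
    """Probability that at least m of n independent units are in a given
    state, when each unit is in that state with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(m, n + 1))


def pfd_snapshot(k, n, q_du):
    """A KooN system fails dangerously once n-k+1 or more units are in DU state."""
    return prob_at_least(n - k + 1, n, q_du)


def pfs_snapshot(k, n, p_so):
    """A KooN system trips spuriously once k or more units operate spuriously."""
    return prob_at_least(k, n, p_so)


if __name__ == "__main__":
    for k, n in [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4)]:
        print(f"{k}oo{n}: PFD={pfd_snapshot(k, n, 0.01):.3e} "
              f"PFS={pfs_snapshot(k, n, 0.01):.3e}")
```

Under these assumptions, adding units for a fixed K lowers the dangerous-failure probability and raises the spurious-trip probability, which is the conflict between safety integrity and operational integrity described in the text.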
4.2.2 Spurious activation
The literature also examines spurious activation of an SIS and its causes, and addresses the relationship between spurious activation and DD failures. Spurious activation is known under several different names, for example, spurious operation (SO), spurious trip, spurious stop, nuisance trip, spurious actuation, spurious initiation, false trip, and premature closure. In this chapter, spurious activation is used as a collective term. "Spurious" indicates that the cause of activation is improper, false, or nongenuine, while "activation" indicates that there is some type of transition from one state to another.
There are three main types of spurious activation:
1. Spurious activation of individual SIS elements.
2. Spurious activation of SIS (i.e., of a SIF).
3. Spurious shutdown of the process.
SO: the activation of an SIS element without the presence of a specified process demand, for example:
1. a false signal about high level from a level transmitter due to an internal failure of the transmitter, or
2. a high-level alarm from a level transmitter without the liquid level having exceeded the upper limit, due to failure to distinguish foam from the real level of the liquid in the reactor.
Spurious trip: the activation of one or more SIS elements such that the SIS performs a SIF without the presence of a specified process demand, for example:
1. Two flame detectors in a 2oo3 configuration give a false signal about fire, causing the final elements of the SIF to be activated.
2. One out of two shutdown valves in a 1oo2 configuration of final elements closes prematurely due to an internal failure.
Spurious shutdown: a partial or full process shutdown without the presence of a specified process demand.
The main causes of spurious activation are identified and illustrated in the influence diagram in Fig. 4.3.
FIG. 4.3 Decisions and factors influencing spurious activation. Source: Mary Ann Lundteigen (2008).
Causes of spurious operation: There are two main causes of spurious operation of an SIS element:
1. An internal failure of the element (or its supporting equipment) leads to a spurious operation.
2. The input element responds to a false demand.
SO failures due to internal failures are often considered safe failures, since they do not stop the SIS from performing on demand. However, not all safe failures lead to SO, and it is therefore necessary to study the safe-failure modes of each element to determine which ones are relevant for SO. For example, an internal leakage in the valve actuator of a fail-safe close safety valve may lead to SO, while the failure of a valve position indicator (limit switch) will not. The IEC standard distinguishes between two categories of safe failures: safe random hardware failures and safe systematic failures. The safe random hardware failures are mainly due to normal degradation, while the safe systematic failures are due to causes such as design error, procedure deficiency, or excessive environmental exposure, which may only be removed by modifying the design, implementation, installation, operation or maintenance processes, tools, or procedures. We therefore distinguish between random hardware spurious operation failures and systematic spurious operation failures in Fig. 4.3.
The element design and the material selection influence the rate of random hardware spurious operation failures. This is illustrated by an arrow from element (component) quality to random hardware spurious operation failures in Fig. 4.3. A particular material may, for example, withstand high-temperature and high-pressure conditions better than another material, and a sensor principle used for a level transmitter may be more vulnerable to a specific operating condition than another. As indicated in Fig. 4.3, operation and maintenance procedures, tools and work processes, design, implementation and installation procedures, competence and training, and environmental exposure may influence the likelihood of systematic failures. The IEC standards consider systematic failures to be unpredictable, and therefore the rates of these failures do not need to be quantified.
Verification, validation, and testing may reduce the rate of occurrence of systematic failures. Verification is the activity of demonstrating, for each phase of the relevant safety life cycle, by analysis and/or tests, that for specific inputs the outputs meet in all respects the objectives and requirements set for that phase. Validation is the activity of demonstrating that the SIFs and SISs under consideration, after installation, meet the safety requirements specification in all respects. In the operational phase, the rate of systematic failures may be reduced by verification and validation of, for example, functional testing, visual inspection procedures, and work processes. The environmental conditions influence the occurrence of systematic failures if the conditions are outside the design limits.
A common cause failure (CCF) occurs when two or more elements fail due to a shared cause. The spurious CCFs do not have the same root causes and coupling factors as the dangerous CCFs. Two safety valves may, for example, fail to close (FTC) on demand due to scaling, while scaling will never lead to SO of the same valves. A leakage in a common pneumatic supply system for two (fail-safe-close)
safety valves may lead to SO but will not impede the safety valves from closing. IEC 61508 recommends that dangerous CCFs be modeled by a beta-factor model, and part 6 of the standard includes a procedure that can be used to estimate a plant-specific value of the parameter β for dangerous CCFs.
False demands are important contributors to SO of SIS elements. A false demand often shares some characteristics (e.g., visual appearance and composition) with a real process demand, and it may therefore be difficult for the input element to distinguish the two. A stray ray of sunlight may, for example, look like a small flame from certain angles, and a flame detector may therefore erroneously read it as a flame. It may not be possible to reduce the occurrence of false demands, but we may influence how the input elements respond to them. It may, for example, not be possible to remove sunlight or alter the density of foam, but we may select elements that are designed to better distinguish between false and real process demands, or we may relocate the elements to make them less vulnerable to false demands.
One of the main contributors to spurious trips is evidently SO of SIS elements. A spurious operation may lead to a spurious trip if the number of activated elements corresponds to the number of elements needed to perform the safety function. The selected hardware configuration therefore determines whether or not a spurious operation leads to a spurious trip. This conditional influence is illustrated by dashed arrows in Fig. 4.3. There are several other causes of spurious trips:
• Loss of utilities, such as pneumatic or power supply: Loss of utilities may directly lead to a spurious trip if the SIF is designed fail-safe.
• DD failures: In some cases, the SIS may be designed to spuriously activate a SIF when DD failures would otherwise prevent the SIF from functioning on demand. A 2oo3 configuration is still able to act if a single DD failure is present. If two DD failures are present, the SIF is unable to respond to a real process demand, and the time the SIS is in such a condition should be minimized. This spurious trip may be activated automatically or manually. If the spurious trip causes a process shutdown, the shutdown will usually be more controlled or smoother.
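The voting condition described above, where spurious operation of individual elements becomes a spurious trip only when enough elements are activated, can be sketched as a threshold on the number of activated elements. This is an illustration only; `sif_activates` is a hypothetical name and the sketch ignores loss of utilities and DD-failure handling.

```python
def sif_activates(element_signals, k):
    """Return True when at least k elements of a voted group signal a demand.

    element_signals: iterable of booleans, one per element in the KooN group.
    A signal may stem from a real demand, a false demand, or an internal failure;
    the voting logic cannot tell these apart.
    """
    return sum(bool(s) for s in element_signals) >= k


if __name__ == "__main__":
    # 2oo3 flame detectors: one false signal is a spurious operation only.
    print(sif_activates([True, False, False], k=2))
    # 2oo3 flame detectors: two false signals cause a spurious trip.
    print(sif_activates([True, True, False], k=2))
    # 1oo2 shutdown valves: one premature closure is already a spurious trip.
    print(sif_activates([True, False], k=1))
```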
4.2.3 Causes of spurious shutdowns
A spurious trip will usually, but not always, lead to a spurious shutdown of the process. If the SIF does not interact directly (or indirectly by activating other SIFs) with the process, the process may not be disturbed by a spurious trip. A dashed arrow is therefore used to indicate that a spurious trip may (but need not) lead to a spurious shutdown. Fig. 4.3 also indicates (by dashed arrows and nodes) that different types of SIFs may lead to process shutdowns. A spurious shutdown may be caused by a spurious closure or stop of non-SIS equipment that interacts with the process, such as control valves and pumps. A spurious closure of an on/off valve or a spurious stop of a pump may be due to element internal failures, human errors, or automatic control system errors. In Fig. 4.3, non-SIS element failures and automatic control system failures are represented by the chance node "process equipment failures", and human errors by the chance node "human errors".
Future research may provide more insight into the causes of SOs and spurious trips, and into how safety and availability may be balanced. In many industry sectors, the fail-safe state is not well defined, and a spurious trip or a spurious shutdown may lead to hazardous situations.
4.2.4 KooN configurations
The literature generalizes the PFD formulas of IEC 61508 to KooN configurations, using the concepts and terminology of IEC 61508. A KooN system is a system consisting of n units, of which at least k must be functional (available) for the whole system to be functional (available). Failure of n-k+1 units results in system failure (unavailability). The dangerous failure rate is equal to the sum of the detected and undetected dangerous failure rates:
$$\lambda_d = \lambda_{dd} + \lambda_{du} \tag{2.1}$$
Modeling a degraded KooN system: Consider a KooN system consisting of n identical, independent units. The system is functional (available) when at least k units are functional. That is, n-k+1 units must fail before the system becomes unavailable. With respect to DU failures, the system can be in one of the following three states at any given time:
1. Available state, where all the n units are available and none of them has fallen into DU state. This state starts at t = 0 when the system is placed into operation for the first time, or after a proof test where all the units are as good as new, as shown in Fig. 4.4.
2. Degraded state, where the number of units in DU state is more than 0 and less than n-k+1. The system can still respond to a demand, if one takes place, but the system is now less tolerant to further failures of its units, as shown in Fig. 4.4.
3. Unavailable state, where n-k+1 or more units are in DU state and the system will not be able to respond to potential demands.
It should be noted that, by its nature, a DU fault is an undetected fault. The fault is not revealed and therefore the system will remain in operation: no operator action will be taken to repair the faulty units or to shut the system down, as there is no indication of faults. The failures of the units are assumed to remain hidden until the next proof test is performed. We also assume that faulty units do not recover by themselves, and therefore when a unit falls into DU state it will remain in DU state until the problem is revealed by the next proof test.
At t = 0, all the n units are assumed to be functional. Assume that the first unit failure takes place at time S1. At S1 the system is degraded from n units to n-1 units. Similarly, the ith failure takes place at time Si, when the system is degraded to n-i units. By the same definition, the system becomes completely unavailable at Sn-k+1, when n-k+1 units have failed (see Fig. 4.4).
FIG. 4.4 Degraded states and mean time. Source: Jahanian (2015).
Now consider the mean time that the system could stay in each degraded state. We start from t = 0, when all the n units are functional, and call this state 0. At S1, the system moves to state 1, in which one unit is in DU state and the other n-1 units remain functional.
Effect of repair and fault detection: Consider the unavailability of the system due to DU failures during the time interval [0, τ]. Suppose a proof test is done at the end of the first time interval and i elements are found faulty. This means the system has been in state i, with i faulty elements, for a mean time of
$$D_i = \frac{\tau}{i + 1} \tag{2.2}$$
Although the previously undetected faults are now detected, the system will still be in state i until the faulty modules are repaired and restored. When all the units are restored, the entire system will be available and the second time interval begins. As shown in Fig. 4.5, let MRT denote the mean repair time (the time required at the end of the proof test to repair the faulty units and put them back in service), and let PTT denote the proof test time, that is, the time during which the system is being tested for detection of DU faults. Adding MRT and PTT, the total downtime of the system for the time interval [0, τ + MRT + PTT] becomes
$$D_i = \frac{\tau}{i + 1} + (MRT + PTT) \tag{2.3}$$
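Equations (2.2) and (2.3) can be evaluated directly. The sketch below is a plain transcription of those two expressions; the function name `mean_downtime_du` and the numerical values in the usage example are hypothetical and chosen only for illustration.

```python
def mean_downtime_du(i, tau, mrt=0.0, ptt=0.0):
    """Mean time in state i for DU failures.

    With mrt = ptt = 0 this is Eq. (2.2), tau / (i + 1).
    With repair and proof test time included it is Eq. (2.3).
    All times must use the same unit (hours in the example below).
    """
    if i < 0:
        raise ValueError("i must be non-negative")
    return tau / (i + 1) + mrt + ptt


if __name__ == "__main__":
    tau = 8760.0  # hypothetical proof test interval of one year, in hours
    print(mean_downtime_du(1, tau))                   # Eq. (2.2) for i = 1
    print(mean_downtime_du(2, tau, mrt=8.0, ptt=4.0)) # Eq. (2.3) for i = 2
```

For i = 1 and no repair or test time, the result is τ/2, half the proof test interval.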
FIG. 4.5 System down time due to repair. Source: Jahanian (2015).
Another factor that affects the downtime is the possibility of fault detection. When the system is capable of detecting a portion of the failures, that is, the DD component, the mean time that the system stays in a degraded state will be shorter. This is because the downtime of the individual units is reduced by early detection of the faults and by any actions taken to repair and restore the faulty units. Let the mean time to restore (MTTR) represent the mean time from when the DD failure occurs until the faults are detected and the units are repaired and restored. When all the faults of the units are of DU type,
$$\lambda_D = \lambda_{DU} \tag{2.4}$$
but if any dangerous faults are detected through diagnostics, then the failure rate can be broken down into two components:
$$\lambda_D = \lambda_{DU} + \lambda_{DD} \tag{2.5}$$
and the mean downtime D_i can be formulated as follows:
$$D_i = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{\tau}{i+1} + \mathrm{MRT} + \mathrm{PTT}\right) + \frac{\lambda_{DD}}{\lambda_D}\,\mathrm{MTTR} \tag{2.6}$$
CCF: In addition to the independent failure of the constituting units, a KooN system may also fail due to a common cause fault (CCF): a fault that affects all the constituting units at the same time. To model the CCF, the β-factor method is used. Based on this method, the time at which the system fails due to CCF is an exponential random variable with the rate
$$\beta\lambda \tag{2.7}$$
FIG. 4.6 PST concept (A) integrated with the SIS, (B) through an additional vendor PST package. Source: Mary Ann Lundteigen (2008).
where λ is the failure rate of a single unit and β is a factor between 0 and 1 that represents the proportion of failures that affect all identical units at the same time.
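The downtime and CCF expressions above can be sketched in a few lines of Python. This is only an illustration: it reads Eq. (2.6) as D_i = (λDU/λD)(τ/(i+1) + MRT + PTT) + (λDD/λD)·MTTR, which reduces to Eq. (2.3) when there are no DD failures. The function names and the numerical values are hypothetical and are not taken from the chapter.

```python
def mean_downtime(tau, i, mrt, ptt, lam_du, lam_dd=0.0, mttr=0.0):
    """Mean downtime D_i in state i, read from Eq. (2.6).

    With lam_dd == 0 this reduces to Eq. (2.3): tau/(i+1) + MRT + PTT.
    All times share one unit (for example hours); rates are per that unit.
    """
    lam_d = lam_du + lam_dd                 # Eq. (2.5)
    du_part = tau / (i + 1) + mrt + ptt     # downtime linked to DU faults
    return (lam_du / lam_d) * du_part + (lam_dd / lam_d) * mttr


def ccf_rate(beta, lam):
    """Rate of common cause failures in the beta-factor model, Eq. (2.7)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    return beta * lam


# Illustrative (assumed) values: yearly proof test, one faulty unit.
d_du_only = mean_downtime(tau=8760, i=1, mrt=8, ptt=2, lam_du=1e-6)
d_mixed = mean_downtime(tau=8760, i=1, mrt=8, ptt=2,
                        lam_du=1e-6, lam_dd=1e-6, mttr=8)
print(d_du_only, d_mixed, ccf_rate(0.1, 2e-6))
```

With equal DU and DD rates, the mean downtime in this example is roughly halved, which reflects the statement that early detection shortens the time spent in a degraded state.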
4.2.5 Partial stroke testing
The aim is to develop a procedure for determining the PST coverage, taking into account plant-specific conditions, valve design, and historical data on valve performance. The PST is introduced to detect, without disturbing the process, failures that otherwise require functional tests. Two variants of PST are implemented: (1) PST that is integrated with the SIS and (2) a separate vendor PST package, as shown in Fig. 4.6. In the first variant, the hardware and software necessary to perform the PST are implemented in the SIS logic solver. As shown in Fig. 4.6A, when the PST is initiated based
on a manual request, the logic solver deactivates its output for a certain period of time (typically a few seconds). The deactivated outputs cause the solenoid-operated valve to start depressurizing the shutdown valve, and the shutdown valve starts to move towards the fail-safe (closed) position. Just as the valve has started to move, the logic solver outputs are re-energized, and the valve returns to the normal (open) position. The test results may be monitored manually from the operator stations by verifying that the valve travels to the requested position and returns to the normal state when the test is completed. In some cases, an automatically generated alarm may be activated if the valve fails to move or to return to its initial position.
In variant (b), the vendor PST packages perform the same type of test sequence, but the hardware and software are implemented in separate systems. Some vendors interface with the existing solenoid-operated valve, while others install a separate solenoid for testing purposes, as shown in Fig. 4.6B. The vendor-supplied PST package may automatically initiate the PST at regular intervals. In many cases the control room operators want to be in control of the actual timing of the PST, and manual activation may therefore be preferred.
Test aspects: IEC 61508 and IEC 61511 distinguish between two main types of SIS-related tests in the operating phase:
1. Diagnostic tests that automatically identify and report certain types of failures and failure causes.
2. Functional tests that are performed to reveal all dangerous SIS failures, so that the SIS may be restored to its design functionality after the test.
The hardware and software necessary to implement the diagnostic test are sometimes referred to as diagnostics. A test that can be performed without process disturbance while the process is operational is called an online test. A diagnostic test is therefore an online test.
A test that reveals all dangerous failures, and after which the SIS is restored to an as-good-as-new condition, is called a perfect test.
Advantages and disadvantages of introducing PST: PST is a supplement to functional testing rather than a means to eliminate the need for it. The reliability improvement that is obtained by introducing PST may be used in two ways:
1. To improve safety: PST is added to the initially scheduled functional tests. This leads to a reduction in the calculated PFD. With a lower PFD, the safety improves.
2. To reduce costs: The potential reliability improvement is used to extend the interval between functional tests so that the calculated PFD is kept unchanged. As a result, the operating and maintenance costs may be reduced, as fewer man-hours and fewer scheduled production stops are required.
The reliability improvement is influenced by two factors: the PST coverage and the PST interval. The PST interval is a decision variable which is entirely up to the end user to follow up, while the PST coverage must be estimated
based on a number of assumptions. The assumptions related to the PST coverage should be made for the plant-specific application rather than for an average valve performance. The factors that may influence the PST coverage are:
1. Valve design: Different types of valves may have different failure properties. A valve of type A may have more failures related to leakage in closed position than a valve of type B, where most failures are caused by a stuck stem. For valve A, we expect a lower PST coverage than for valve B. The failure properties may be derived from an FMEA.
2. Functional requirements: PST is not able to verify all types of functional requirements. A PST reveals whether or not a valve starts to move, but it is not able to verify that the valve will continue to a fully closed position and that it stays tight in that position. The specified valve closing time may also impact the PST coverage.
3. PST technology: The PST technology may affect which failures are detected and to what extent. While a SIS-implemented PST with simple feedback of the valve position signal may detect that a valve fails to start closing, a more advanced PST technology solution with additional sensors may indicate whether other failures are present by analyzing performance deviations.
4. Operational and environmental conditions: Some operational and environmental conditions may lead to obstructions and build-up in the path towards the valve end position.
Concerns have been raised about the secondary effects of PST. Introducing PST may lead to more frequent operation of some components (e.g., solenoid-operated valves) and less frequent operation of others (e.g., full stroke operation of valves).
The reliability that may be gained by introducing the partial stroke test is influenced by two factors:
1. Partial stroke test coverage
2. Partial stroke test interval
The partial stroke test coverage is partly a design parameter (e.g., valve design, partial stroke test hardware, and software) and partly an operational parameter (e.g., operational and environmental conditions). While vendors may influence the partial stroke test coverage by selecting valve design and partial stroke test hardware and software in accordance with the specified operational and environmental conditions, the frequency with which the partial stroke test is executed is a decision that rests with the end user alone. To determine the reliability gained by introducing the partial stroke test, it is necessary for vendors and end users to collaborate. The valve manufacturer knows how the valve is designed, and the partial stroke test vendor or supplier understands the features of the partial stroke test hardware and software. This knowledge must be combined with the end user's insight into maintenance and testing strategies and operational and environmental conditions. The new procedure suggests a framework that
requires a joint effort from end users and valve vendors to estimate the partial stroke test coverage for a particular valve application. The procedure demonstrates some of the pitfalls of introducing the partial stroke test. The partial stroke test becomes a false comfort if the partial stroke test coverage is estimated from false premises. We may install the partial stroke test hardware and software, but refrain from implementing adequate follow-up of partial stroke test results. We may use the partial stroke test to save operational and maintenance costs, but fail to consider the secondary effects of extending the intervals between full-stroke tests. Our procedure asks for certain steps and analyses which may lead to a higher confidence in the selected partial stroke test coverage, and indicates where to put focus in order to maintain that confidence throughout the operation phase. We therefore propose that the PST reliability checklist be used in the design phase as well as in the operation and maintenance phases. The main disadvantages of the new procedure may be related to the practical implementation:
1. The checklist questions may be used to make erroneous improvements to the partial stroke test coverage, particularly if the implementation of the checklist questions does not correspond to how the test is performed in the operational phase. It is therefore necessary to review the checklist questions during the operational phase and verify that the test continues to meet the initially estimated coverage.
2. The procedure also requires that the end user spends more time on understanding the partial stroke test hardware and software than is normally done. The partial stroke test hardware and software should not be so complex that the end user is unable to verify the partial stroke test functionality under normal and failure conditions.
Further research may be necessary to develop a generic and widely accepted checklist for the partial stroke test reliability.
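The two uses of PST described in this section (lowering the PFD, or extending the proof test interval at an unchanged PFD) can be illustrated with a small calculation. The sketch below assumes the simplified single-valve approximation often used in the PST literature, PFDavg ≈ θ·λDU·τ_PST/2 + (1 − θ)·λDU·τ/2, where θ is the PST coverage. This formula is not derived in the chapter, and the function names and input values are hypothetical.

```python
def pfd_avg_no_pst(lam_du, tau):
    """Approximate PFDavg of a single valve with proof test interval tau."""
    return lam_du * tau / 2.0


def pfd_avg_with_pst(lam_du, tau, coverage, tau_pst):
    """Approximate PFDavg when a fraction `coverage` of the DU failures is
    revealed by PST (interval tau_pst) and the rest by proof tests (tau)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    revealed_by_pst = coverage * lam_du * tau_pst / 2.0
    revealed_by_proof_test = (1.0 - coverage) * lam_du * tau / 2.0
    return revealed_by_pst + revealed_by_proof_test


def extended_proof_test_interval(tau, coverage, tau_pst):
    """Proof test interval that keeps PFDavg unchanged after adding PST."""
    if not 0.0 <= coverage < 1.0:
        raise ValueError("coverage must be at least 0 and below 1")
    return (tau - coverage * tau_pst) / (1.0 - coverage)


# Assumed inputs: yearly proof test, monthly PST, 60 % coverage.
lam_du, tau, theta, tau_pst = 2e-6, 8760.0, 0.6, 730.0
print(pfd_avg_no_pst(lam_du, tau))                      # option 0: no PST
print(pfd_avg_with_pst(lam_du, tau, theta, tau_pst))    # option 1: safer
print(extended_proof_test_interval(tau, theta, tau_pst))  # option 2: cheaper
```

The result is only as good as the coverage estimate θ, which is why the section stresses a plant-specific estimate rather than an average value.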
4.2.6 SIS subject to degradation due to aging and external demands
SISs are widely used to prevent the occurrence of hazardous events. These systems are designed to perform specific SIFs to protect the EUC in different industries. Almost all reliability assessments of SIFs are based on the assumption that the failure rates of the components within the systems are constant throughout the life of the component, which is assumed to be 15 years in the SIL calculations. This means that all components or SIF channels are as-good-as-new as long as they are functioning. However, in practice, many mechanical actuators and valves of a SIF become more vulnerable with time, because they are chronically exposed to failure mechanisms such as corrosion, wear, and fatigue. The actual lifetimes of actuators are determined not only by their reliability but also by the operating conditions, and thus the assumption of a constant failure rate
is questionable. For such cases, researchers have identified that the failure rates of these items are non-constant. Redundant structures are often used in SIFs to improve the system availability and to satisfy the required SIL. For example, two shutdown valves are installed in series to stop the steam flow when the downstream temperature is too high. When one of them cannot be activated, the process, namely the EUC, is still safe if the other valve works. Such a configuration is called 1-out-of-2 (1oo2). Deterioration of the mechanical actuators and valves in a SIF is due not only to chronic mechanisms, for example, wear and material fatigue, but also to external shocks, namely demands for SIF activation. For example, in a high-integrity temperature-protection system (HITPS), the required function of the actuators (valves) is to shut off the steam flow in the pipeline when the temperature goes beyond the set value. Occasional high temperatures cause unprecedented stresses on the valve, so the effects of such demands on the degradation of the valves should be considered. Two degradation processes should therefore be considered in assessing the performance of a SIF that consists of actuators and valves: (a) continuous aging degradation, and (b) additional damage from the randomly occurring demands. It is also natural to assume that when the overall degradation of such components reaches a predefined level, they cannot be activated as expected when a new demand arrives. Degradation challenges the common assumption of as-good-as-new after each proof test in SIF reliability assessment. In general, the reliability of a system decreases as the degradation processes develop. Once the degradation reaches a specific level, the component will fail. For SIF actuators and valves, this specific level refers to a certain performance requirement, such as the closing time or the maximum leakage rate in closed position. The components in a SIF are simultaneously tested and maintained in most cases.
Failures and degradation always remain hidden until the periodic tests. The valves in an HITPS are mainly in a dormant state during normal operation, meaning that their performance cannot be estimated by visual inspection or diagnostic tests. SIFs are evaluated with different measures when they are operated in different modes, and the frequency of demands to activate the SIF is the key to deciding which measures can be used. Although more demands can obviously accelerate degradation, it is necessary to evaluate the effects of demands with the applicability of the measure in mind. The average probability of failure on demand (PFDavg) is a widely acknowledged measure to quantify the reliability of a low-demand SIF. Under the conventional assumption, all units are as-good-as-new as long as they are functioning at the proof tests, so the PFDavg is the same in each test interval. This is not realistic for SIFs with degradation. If no failure is revealed in a proof test, it only means that the unit is functioning, not that it is as-good-as-new. It is natural to suppose that the PFDavg increases step by step from one test interval to the next. The specific objectives include investigating the combined effects of continuous degradation and random demands on the reliability and availability of a SIF with hidden failures.
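The idea that the PFDavg grows from one test interval to the next can be illustrated with a simple stand-in model. The sketch below does not implement the chapter's degradation model. It assumes a Weibull-distributed time to a hidden failure and computes the average unavailability in the k-th test interval, conditional on the unit having been found functioning at the start of that interval. All names and parameter values are hypothetical.

```python
import math


def weibull_survival(t, scale, shape):
    """Survival function R(t) of a Weibull lifetime (assumed model)."""
    return math.exp(-((t / scale) ** shape))


def conditional_pfd_avg(k, tau, scale, shape, steps=2000):
    """Average unavailability in test interval k (k = 1, 2, ...), given that
    the unit was functioning at the proof test that starts the interval.

    PFD_k = 1 - (1/tau) * integral over the interval of R(t) / R((k-1)*tau).
    The integral is evaluated with the midpoint rule.
    """
    start = (k - 1) * tau
    r_start = weibull_survival(start, scale, shape)
    h = tau / steps
    area = h * sum(
        weibull_survival(start + (j + 0.5) * h, scale, shape)
        for j in range(steps)
    )
    return 1.0 - area / (tau * r_start)


tau = 8760.0  # assumed yearly proof test, in hours
for k in (1, 2, 3):
    constant_rate = conditional_pfd_avg(k, tau, scale=1e5, shape=1.0)
    aging = conditional_pfd_avg(k, tau, scale=1e5, shape=2.0)
    print(k, constant_rate, aging)
```

With shape 1 (constant failure rate) the value is identical in every interval, which corresponds to the as-good-as-new assumption. With shape above 1 the value rises in each successive interval even though every proof test found the unit functioning.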
FIG. 4.7 Redundant SIF loop. Source: Zhang et al. (2019).
We can take the HITPS as an example of a SIF; its architecture is shown in Fig. 4.7. As mentioned above, the two valves in this SIF are installed in series with a 1oo2 voting configuration to meet the SIL requirement. The fundamental task of the HITPS is to control high temperatures and keep the EUC under an acceptable risk level. In general, mechanical systems are designed with safety margins to meet the specified performance requirement. The performance criteria for the HITPS, for example, leakage rate and closing time, should be a target value with a deviation. In theory, the designed leakage rate should be 0 kg/s, but there is an acceptable deviation based on practical considerations. The performance criteria also differ between specific working scenarios. If the leakage rate is lower than this acceptable deviation, the performance of the valve is acceptable, and it can be stated that the valve is functioning. Higher internal leakage can weaken control and cause a failure in the control of temperature. If the actual leakage rate is higher than the acceptable value, the valve is no longer effective for risk control and is in a failed state. The failure mode is leakage in closed position (LCP). This failure mode is mainly caused by wear and tear of the seat. It is a DU failure and can only be revealed by proof tests or demands. A possible failure cause is normal wear due to a corrosive medium. Since the valve is installed to control the temperature, contact of its seat-sealing area with the erosive medium cannot be avoided. The intention of a shutdown valve is to shut off the steam flow in an emergency that could lead to a hazardous situation. Operating at higher temperatures can damage the seat of the valve, and damage to a valve seat can accelerate the existing wear process. Once a high temperature occurs in the pipeline, the stresses on the two valves in Fig. 4.7 will be the same or similar.
The high temperature could cause the same damage to the two valves simultaneously. Because of this coupling factor, the reliability analysis of the 1oo2 configuration cannot consider the two valves separately. For the LCP failure mode of valves, three factors are of interest: the acceptable deviation, the frequency of closing operations, and the effects of high temperature, which will be quantified in the following analysis. First, the acceptable deviation will be the failure threshold L. The valve will be activated
when a hazard or demand occurs, so the frequency of closing operations can be linked to a demand rate λde. The total degradation process of an actuator includes continuous deterioration and abrupt damage due to random demands. The random demands occur at times t1, t2, … with rate λde. Each demand immediately increases the degradation by some amount, denoted y1, y2, …. When the total degradation reaches the failure threshold L, the valve will fail. The total degradation of one unit, Z(t), is the sum of the degradation due to the aging process, X(t), and the instantaneous damage due to random demands, Y(t). The overall degradation of a unit is expressed as:
$$Z(t) = X(t) + Y(t) \tag{2.8}$$
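Eq. (2.8) can be explored with a small Monte Carlo sketch. The chapter does not specify the distributions here, so the code below makes three explicit assumptions: the aging part X(t) is a gamma process, demands arrive with exponentially distributed inter-arrival times at rate λde, and each demand adds an exponentially distributed damage. The function names and all parameter values are hypothetical.

```python
import random


def simulate_failure_time(threshold, horizon, dt, aging_shape_rate,
                          aging_scale, demand_rate, mean_damage, rng):
    """Simulate Z(t) = X(t) + Y(t) on a time grid and return the first grid
    time at which Z reaches the threshold L, or None if it never does."""
    n_steps = int(round(horizon / dt))
    z = 0.0
    next_demand = rng.expovariate(demand_rate)   # time of the first demand
    for step in range(1, n_steps + 1):
        t = step * dt
        # X(t): gamma-process increment over one time step (assumption)
        z += rng.gammavariate(aging_shape_rate * dt, aging_scale)
        # Y(t): add the damage of every demand that occurred up to time t
        while next_demand <= t:
            z += rng.expovariate(1.0 / mean_damage)
            next_demand += rng.expovariate(demand_rate)
        if z >= threshold:
            return t
    return None


def failure_probability(n_runs, seed=0, **model):
    """Fraction of simulated units that fail before the horizon."""
    rng = random.Random(seed)
    failures = sum(
        simulate_failure_time(rng=rng, **model) is not None
        for _ in range(n_runs)
    )
    return failures / n_runs


model = dict(horizon=10.0, dt=0.1, aging_shape_rate=1.0, aging_scale=0.5,
             demand_rate=0.5, mean_damage=1.0)
for threshold in (5.0, 15.0, 50.0):
    print(threshold, failure_probability(500, threshold=threshold, **model))
```

Raising the threshold L lowers the estimated failure probability, and raising `demand_rate` would push it up, which is the qualitative dependence on L and λde discussed in this section.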
After the installation of the valves, their reliability and availability should be assessed through periodic proof tests. In order to meet the required SIL, it is necessary to maintain an accurate record of not only the operating time and proof test results but also the previous operation history. With degradation taken into account, the time between proof tests becomes a question of interest. Given that degradation has been found to influence the decision-making for testing strategies, the main constraint is the SIL to be met. Normally, the EUC system is shut down for the proof test of the SIF. The shutdown and restart of the EUC cause an economic loss. In order to avoid unnecessary loss, the minimum proof test frequency should be determined. Based on the operational assumption at each test date, we found that the conditional PFDavg increases with time under the assumption that the unit is found functional in the proof tests. The PFDavg is negatively related to the value of the failure threshold L and positively related to the demand rate λde. According to the results of the sensitivity analysis, we propose to decide proof test intervals based on the testing results. Flexible proof test intervals could be a better option than fixed ones. At the early stage of the system, the reliability of the SIF is high, so the proof test interval could be longer, within the SIL acceptance criteria, to reduce operational costs. As the system becomes older, the proof test interval should be shorter to ensure safety.
4.3 Problem formulation and solution methodology
4.3.1 Problem formulation
SISs are widely used to prevent hazardous events. They are the last level of defense in the petrochemical, chemical, and nuclear industries. The SIS should be reliable, and this is usually achieved by carrying out a proper hazard and risk assessment, that is, process hazard analysis and SIL analysis. The next step is to develop the safety requirement specification. Once the SIL analysis and the safety requirement specification are available, each SIF has to be designed and evaluated to achieve the required SIL and also the PFDavg (average probability of failure on demand)
value. Proper instruments (sensor, logic solver, and final element) have to be selected to match the required SIL or PFDavg as per the design. The designed SIF then has to be installed as committed in the design specification, so that the required PFDavg is achieved and the system is reliable. After installation, the next step is the commissioning of the loop, in which the loop test is performed; after commissioning, the loop is validated. The installation, commissioning, and validation include the factory acceptance test, the site acceptance test, and the functional proof test. After the functional proof test has been performed, the system is released for operation. During operation, spurious trips should not occur, and if one occurs it should bring the equipment on which the SIF is installed to a safe state. The proof test is performed as defined in the design stage to detect dangerous failures that can prevent the SIS from performing on demand. Maintenance of the system is carried out, during which parts are replaced or adjustments are made before failure occurs. The objective of maintenance is to increase the reliability of the system over the long term by staving off the aging effects of wear, corrosion, fatigue, and related phenomena. To make a reliable system, it is important to know under which modes and conditions it will operate and how the system should respond to system failures and other foreseeable events. While designing the SIS, the important issues to be taken into consideration are the nature of the demands and the desired states for the various operating modes, such as start-up, normal operation, shutdowns, and foreseeable abnormal events. The following points should be addressed to make a reliable SIS:
1. Define the safe state, that is, the desired state of the EUC in response to a hazardous event or a SIS failure.
2. Define the position of the SIS in the sequence of protection layers.
3. Perform the process risk analysis and obtain the data on hazardous events that may occur if the SIS, and subsequent protection layers, fail to perform on demand.
4. Decide and record the testing strategy for the SIS.
5. Record the potential consequences of spurious activations, on the EUC and on the SIS components.
6. From the process risk assessment, find the demand rate and the demand duration.
7. The team should review the component functions and their architecture and voting.
4.3.2 Solution methodology
We have designed a SIS with a SIL2 loop, and while designing the system we have taken care of all the points defined in the problem formulation.
During the design we calculated the required PFDavg by considering the configuration, proof test interval, spurious trips, and instrument selection for that application. We used the process risk assessment matrix shown in Fig. 4.8 to calculate the process risk and decide the required SIL.
4.4 Reliability and maintainability
4.4.1 Reliability
Reliability is a measure of successful operation for a specified interval of time. It is defined as the probability that the system will perform its intended function when required to do so if operated within its specified limits for a specified operating time interval. The definition includes five important aspects of reliability:
1. The system's intended function must be known.
2. When the system is required to function must be judged.
3. Satisfactory performance must be determined.
4. The specified design limits must be known.
5. An operating time interval is specified.
The calculated reliability requires that a system be successful for an interval of time. While this probability is a valuable estimate for situations in which a system cannot be repaired during a mission, something different is needed for an industrial process control system where repairs can be made, often with the process operating.
Mean time to restore (MTTR): MTTR is the "expected value" of the random variable "restore time" (or time to repair). The definition includes the time required to detect that a failure has occurred as well as the time required to make a repair once the failure has been detected and identified. Like MTTF, MTTR is an average value. MTTR is the average time required to move from unsuccessful operation to successful operation.
Mean time between failures (MTBF): MTBF is defined as the average time period of a failure/repair cycle. It includes the time to failure, any time required to detect the failure, and the actual repair time. This implies that a component has failed and then has been successfully repaired. For a simple repairable component, MTBF = MTTF + MTTR.
While designing a reliable system, the following tasks should be considered:
1. Define realistic system requirements.
2. Define the system usage environment.
3. Identify the potential failure sites and mechanisms.
4. Characterize materials and processes.
5. Design within materials and processes capabilities.
6. Qualify processes.
7. Control processes.
8. Manage system life cycle.
FIG. 4.8 Process risk assessment matrix. Source: Drawing taken from Syngenta India Ltd. Safety Manual.
To reduce the PFDavg, we may increase the component reliability, increase the redundancy level, perform more frequent proof tests, and/or improve the system's protection against CCFs. A SIS may perform one or more SIFs. The international standards IEC 61508 and IEC 61511 give safety life cycle requirements for the SIS and use the safety integrity level (SIL) as a measure of SIS reliability. To comply with a SIL, it is necessary to:
1) Implement various measures to avoid, reveal, and control failures that may be introduced during the SIS safety life cycle, ranging from the initial specification to design, implementation, operation, maintenance, modifications, and finally decommissioning.
2) Select the hardware architecture according to the specified architectural constraints.
3) Demonstrate by calculations that the SIS reliability meets the specified reliability targets.
The IEC standards use the probability of failure on demand (PFD) as a measure of SIS reliability (Lundteigen and Rausand 2008). The quantification of the PFDavg considers several parameters: system configuration or architecture (k-out-of-N), restoration times, and CCFs. If the calculated PFDavg is above the target range of a specified SIL requirement, it is necessary to evaluate how the reliability can be improved. The main strategies to enhance reliability are to improve the inherent reliability (i.e., by introducing more reliable components), to add more redundancy, or to carry out regular proof testing more often. It has been observed that the latter strategy has some possible negative effects. Higher operational costs may follow from more frequent planned maintenance and production stops. The overall risk level may also increase due to more frequent interruption of normal operation. For some equipment, such as shutdown valves, it is possible to complement regular proof testing with partial testing.
PST of valves means operating the valve only partially, for example, by 20% from the normal position, so that failures related to sticking of valves or delayed operation may be detected. Partial testing may be introduced to improve safety (by complementing the existing proof-testing regime with partial testing) or to reduce costs (by compensating for an extension of proof test intervals with partial testing) (Lundteigen and Rausand 2008). In the FTA technique, an undesired event is first defined. The analyst then continues by identifying all events and combinations of events that result in the identified undesired event. The fault tree is therefore quite useful when modeling failures in a specific failure mode. Fault trees are useful in SIS verification and also for determining spurious trips.
In SISs, however, the failure mode is very important. It makes a difference if the system fails and causes a false trip versus a failure that prevents the automatic protection.
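The quantities introduced in this section can be tied together with a short sketch. The relation MTBF = MTTF + MTTR is from the text. The steady-state availability MTTF/MTBF and the risk reduction factor 1/PFDavg are standard relations that the section does not state explicitly, and the SIL bands are the low-demand PFDavg ranges of IEC 61508. Function names and the example values are hypothetical.

```python
def mtbf(mttf, mttr):
    """Mean time between failures of a simple repairable component."""
    return mttf + mttr


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time the component is up (standard relation)."""
    return mttf / mtbf(mttf, mttr)


# Low-demand mode bands of IEC 61508: (SIL, lower bound, upper bound).
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def sil_from_pfd_avg(pfd_avg):
    """Return the SIL whose band contains pfd_avg, or None if outside."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


def risk_reduction_factor(pfd_avg):
    """Risk reduction factor, the reciprocal of PFDavg."""
    return 1.0 / pfd_avg


# Assumed example: MTTF of 50,000 h, MTTR of 8 h, calculated PFDavg 5e-3.
print(mtbf(50_000, 8), steady_state_availability(50_000, 8))
print(sil_from_pfd_avg(5e-3), risk_reduction_factor(5e-3))
```

If the calculated PFDavg falls in a band below the required SIL, the options listed above apply: more reliable components, more redundancy, or more frequent proof testing.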
4.4.2 Maintainability
Maintainability is defined as the probability that a failed component or system will be restored or repaired to a specified condition within a specified period of time when maintenance is performed in accordance with prescribed procedures. Maintainability has the following quantifiable measures:
1. Mean time to repair
2. Median time to repair
3. Maximum time in which a certain percentage of the failures must be repaired
4. Mean system downtime
5. Mean time to restore
6. Maintenance work hours per operating hour
The maintainability design features are shown in Fig. 4.9. Maintainability has the following design methods:
1. Fault isolation and self-diagnostics:
a. Diagnosis is the process of locating the fault at the level at which restoration may be accomplished.
b. Diagnosis of the failure, with identification of the fault, is a major task in the repair process; it is often the longest task and the one with the greatest variability in task times.
2. Parts standardization and interchangeability:
a. Standardization reduces to a minimum the range of parts that must be maintained and stocked.
b. Interchangeability is a design policy that allows specified parts to be substituted within an assembly for any like part; it requires both functional and physical substitutability.
3. Modularization and accessibility:
a. Modularization (packaging of components in self-contained functional units) facilitates maintenance.
b. Design for accessibility is concerned with the configuration of the hardware down to the discard (replacement) level.
4. Repair versus replacement:
a. This concerns the indenture level at which it is no longer economical to repair the failed unit; instead, the failed unit is discarded and replaced with a new one.
b. The decision criterion is most often an economic one (Ebeling 2014).
Verification, validation, and testing may reduce the rate of occurrence of systematic failures. In the operation phase, the rate of systematic failures may be reduced by verification and validation of, for example, function testing, visual
FIG. 4.9 Maintainability design features. Source: Book on reliability and maintainability by Charles E. Ebeling.
inspection procedures, and work processes. Competence and training initiatives are important to reduce human errors during function tests. The environmental condition influences the occurrence of systematic failures if the conditions are outside the design limits. However, in most cases it is not possible to influence the environmental conditions. The contribution from the environment is therefore illustrated by a chance node in Fig. 4.3 (Lundteigen 2008). The component reliability has been improved to such a level that further improvement is difficult and may not be cost effective; more redundancy will lead to higher cost, a more complex system, and often more spurious trips. In some applications, the space is limited, and high redundancy may not be feasible. The positive effect of increased redundancy may also be lost due to CCFs. Increased frequency of proof testing may therefore be the preferred strategy
if the SIS reliability has to be improved, especially for existing SISs where modification of the hardware is expensive. Experience has shown that about 50% of the shutdown system failures are due to failure of the final elements, that is, the shutdown valves. It is therefore important to improve the reliability (availability) of these valves. A partial stroke test is sometimes implemented for this purpose. The main DU failure modes for a shutdown valve are FTC and leakage in closed position (LCP). A partial stroke test may detect FTC failures but not LCP failures, whereas a proof test can detect both failure modes. Therefore, an FTC failure is a type a failure, and an LCP failure is a type b failure (Jin and Rausand 2014). The FTA and FMEA of the SIS should be performed to find the failure modes of the system and to implement corrective actions. The reliability-centered maintenance concept is used to implement the actions generated from the FTA and FMEA in order to improve the reliability of the system.
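Three of the quantifiable maintainability measures listed in this section (the maintainability function itself, the median time to repair, and the maximum time within which a given percentage of failures is repaired) can be computed once a repair-time distribution is chosen. The sketch below assumes exponentially distributed repair times with mean MTTR. This is a simplifying assumption made for illustration; the chapter does not prescribe a distribution, and the function names and values are hypothetical.

```python
import math


def maintainability(t, mttr):
    """Probability that a repair is completed within time t, assuming
    exponentially distributed repair times with mean mttr."""
    return 1.0 - math.exp(-t / mttr)


def median_time_to_repair(mttr):
    """Time within which half of the repairs are completed."""
    return mttr * math.log(2.0)


def max_repair_time(mttr, fraction):
    """Time within which the given fraction of repairs is completed."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -mttr * math.log(1.0 - fraction)


mttr = 8.0  # assumed mean time to repair, in hours
print(maintainability(8.0, mttr))
print(median_time_to_repair(mttr))
print(max_repair_time(mttr, 0.95))
```

Under this assumption the median is always shorter than the mean, and the 95% repair time is about three times the MTTR, which shows why the mean alone can understate the longest repairs.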
4.5 Case study on reliability and maintainability of SIS
4.5.1 Short manufacturing procedure
A chemical company has developed a batch process to produce a product C. It is manufactured in a glass-lined reaction vessel with a weight indicator. The normal reaction starts with a heel of product C of approximately 6000 kg. Reactant A and Reactant B are charged simultaneously into this heel at a fixed ratio of flow rates, that is, 0.60–0.61. This addition of Reactant A and Reactant B is carried out at a temperature of 37–43°C and under a vacuum of -890 to -920 mbarg. During the reaction, HCl is generated, which is scrubbed using NaOH. At the end of the reaction, a sample of the reaction mass is drawn for analysis of unreacted Reactant A, and the mass is then maintained at the same temperature until the analysis report is received. After this, at the rate of 2900–3000 kg, the reaction mass is overflowed through a fixed overflow nozzle into the next vessel, and the remaining heel of 6000 kg is used for the next batch. A quantitative risk assessment was performed, and the requirement for an independent hard-wired SIL2 alarm and trip with a PFD of 0.001 was developed. The SIL2 loop with the required RRF was developed, installed, and commissioned. Fig. 4.10 shows the process flow diagram.
4.5.2 Hazard
The hazard is degassing of the reactor contents due to overheating as a result of hydrochloride salt build-up. The reaction produces a hydrochloride salt, a solid, which can build up in the reaction mass. If this is heated above 70°C, hydrogen chloride can be explosively released. Overheating may be exacerbated by the fact that the hydrochloride salt can form a crust on the vessel sides, insulating the vessel from cooling.
Reliability and maintainability of safety instrumented system Chapter | 4
FIG. 4.10 Process flow diagram.
4.5.3 Hazard consequences and targets
Loss of containment due to vessel rupture would potentially release missiles and toxic gas, both chlorine and hydrogen chloride. The potential for fatality is very high under these conditions and so a target frequency of 1/100,000 years has been set. It is assumed that anyone inside the building at the time will potentially be killed but that others will not enter the building.
4.5.4 Discussion and Recommendation
At any one point in time there are likely to be at least two people present on the plant carrying out tasks. There is the potential for other maintenance people to be present infrequently. Therefore, it is assumed that if there is a loss of containment, someone will be local to the reactor when this occurs. The release of hydrogen chloride and chlorine gas into the area is assumed to be a lethal dose and therefore anyone caught in this is likely to be killed. The company target frequency for a fatality from a single event is 1.0 × 10⁻⁵/yr. This analysis has shown that the likelihood of a fatality from this event is 2.5 × 10⁻⁷/yr, which is within the target and is therefore acceptable.
4.5.5 Description of operation
The reactor is a glass-lined agitated vessel fitted with external limpet coils for heating and cooling. Heating and cooling are provided by a secondary fluid circulated round the system and supported by a pumped heat exchanger system. This system is nitrogen inerted. Under normal conditions a heel of product C is maintained in the vessel, and a twin charge of Reactant A and Reactant B is fed to this heel under vacuum within a defined temperature range of 40–42°C. The reaction takes 8–9 h and on completion, after analysis, the batch is transferred out, leaving a heel of product C for the next batch to start. Vacuum is applied to the vessel, which is run at 90 to 110 mbar pressure (Sadiq and Tesfamariam, 2009). This removes hydrogen chloride from the system and prevents build-up of the hydrochloride salt.
4.5.6 Failure scenarios
The gradual build-up of hydrochloride salt during the production of product C has been considered. Loss of vacuum will lead to a more rapid formation of hydrochloride salt. This is a solid which, on accumulating within the vessel, may form a crust on the vessel wall and thus impair cooling. Above 70°C, the reaction mass degasses and the hydrochloride breaks down, releasing hydrogen chloride. The rate of gas evolution at this temperature is sufficient to overwhelm the vacuum and over-pressurize the vessel in a short period of time.
4.5.7 Protective arrangements
Plant conditions are monitored by the operator every 30 min. This gives three opportunities to pick up abnormal conditions before an event becomes critical. One critical normal control is the monthly crust removal from the vessel sides. This is performed regularly and is monitored by vessel weight. The reaction is DCS controlled, and preset software alarms and trips are built into the system. TIC02 high reactor jacket temperature >80°C shuts the hot water feed to the heating loop and prevents heating of the system. TIC02 low reactor jacket temperature 0.3kg/cm2 is a SIL2 rated and system which will close separate Reactant A and Reactant B feed valves and the heating loop valve.
4.5.8 Assumptions
There is sufficient time to detect an unexpected pressure/weight/temperature rise. The 30 min checks on reactor temperature are performed. Regular maintenance is carried out on all safety critical items identified in this assessment. Provided vacuum is consistently applied, insufficient salt build-up will occur to cause rapid, catastrophic gas evolution. It will take at least 30 min to heat from normal reaction temperature to >70°C, the temperature at which the gas evolution starts to occur.
4.5.9 Hazard analysis
Hydrogen chloride is a by-product of the main reaction. If left in solution, it forms the hydrochloride salt, which is a solid and can cause a heavy crust to build up on the vessel walls. This in turn impairs vessel cooling. If heated above 70°C, the solid sublimes and releases hydrogen chloride at a rate of 10 L/min/kg, which is associated with the reaction mass degassing and the breakdown of product C hydrochloride to product C, releasing HCl gas. This will be sufficient to significantly over-pressurize the reactor. The reactor vent is not sized for this duty and no credit is taken for relief in this case. It is expected that there will be at least one hour for this event to develop from loss of vacuum. During this time, the operator will have two opportunities to check conditions and be alerted to the failure, which should be detectable through pressure rise, temperature rise, and abnormal weight gain. Similarly, computer monitoring of the batch should reveal these failures. Provided either the operator or the computer stops the feeds (Reactant A/Reactant B), the temperature rise will be halted quickly, preventing the event from being realized. Supporting these is a SIS SIL2 high-temperature/high-pressure alarm and trip which will stop the feeds.
4.5.10 Notes
• Vacuum lost and not acted upon.
• Hardware components are normally reliable and a good maintenance scheme is assumed for this system, allowing a failure frequency of 1/10 years to be set.
• Computer control allows feeds despite high temperature.
• A standard 0.1 probability for the computer continuing to feed is given, as this is not SIL rated equipment.
• Operator fails to notice loss of vacuum/weight gain/temperature rise during checks.
• The operator is making hourly checks and has a set procedure for dealing with a failed component.
• This allows a probability of 0.01 to be set for the operator failing to carry this out.
• Likelihood of someone being present.
• It is assumed that someone will definitely be in the building and local to the reactor when the event occurs.
4.5.11 Likelihood of fatality
As well as missiles from a fragmenting vessel, HCl and Cl2 gas will be released at pressure. Whilst chlorine detectors should alert others not to enter, anyone in the area will have a high probability of being killed by the event, and a probability of 0.5 is set.
SIS high-temp trip fails (SIL 2): The independent hard-wired SIL2 alarm and trip is given a standard 0.001 PFD.
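The way the factors listed in Sections 4.5.10 and 4.5.11 combine can be illustrated with a short sketch. It assumes, as a simplification not stated in the chapter, that the factors are independent and simply multiply along a single scenario; the function and variable names are hypothetical. The sketch is not intended to reproduce the 2.5 × 10⁻⁷/yr reported in Section 4.5.4, whose derivation is not itemized in the text.

```python
# Illustrative single-scenario frequency estimate from the factors quoted in the notes.
# Assumption: all factors are independent, so the event frequency is their product.

TARGET_FREQUENCY = 1.0e-5  # company target for a fatality from a single event, per year


def event_frequency(initiating_frequency, probabilities):
    """Multiply an initiating frequency (per year) by conditional probabilities."""
    frequency = initiating_frequency
    for p in probabilities:
        frequency *= p
    return frequency


initiating = 1.0 / 10.0  # vacuum lost and not acted upon: 1 per 10 years
conditionals = [
    0.1,   # computer control continues to feed
    0.01,  # operator fails to notice during checks
    1.0,   # someone is present near the reactor
    0.5,   # fatality given the event
]
pfd_sis = 0.001  # independent hard-wired SIL2 alarm and trip

without_sis = event_frequency(initiating, conditionals)
with_sis = without_sis * pfd_sis

print(f"Without SIS: {without_sis:.1e} per year")
print(f"With SIS:    {with_sis:.1e} per year (target {TARGET_FREQUENCY:.1e})")
```

Under these assumptions the product of the quoted factors lies above the target without the SIS and below it once the SIS PFD is credited, which is consistent with the qualitative conclusion of the assessment.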
4.5.12 PFD calculation
A SIL2 loop with a risk reduction factor (RRF) higher than 1000 has to be designed. The Safety Requirement Specification (SRS) checklist was prepared as per the requirements of the SIL2 loop. The loop was designed and developed as per the SRS, and the instruments were selected to achieve the required PFD. Development of SIL2 loop. The exida software was used to design the SIL2 loop. The PFD calculation is shown in Fig. 4.11; the resulting RRF is 4669. The calculation takes the inputs from the pressure sensor and the temperature sensor. If the measured value of either of the two goes above its set point (i.e., 1oo2), the logic solver takes action and sends the signal to the final control elements. All the final control elements should be de-energized in case of an abnormal condition (i.e., 3oo3). Each output is connected to two valves in series, of which at least one should be de-energized (i.e., 1oo2). The details of the instruments and hardware used are given in Table 4.4.
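The relation between the RRF and the PFD used above can be sketched as follows. The helper names are hypothetical, and the band limits are the low-demand average PFD ranges of IEC 61508. The sketch looks at the PFD value only; the SIL that can actually be claimed for a loop also depends on architectural constraints and systematic capability, which are ignored here.

```python
# Convert a risk reduction factor into an average PFD and look up the
# IEC 61508 low-demand PFD band it falls into (PFD value only).

def pfd_from_rrf(rrf):
    """Average probability of failure on demand corresponding to an RRF."""
    return 1.0 / rrf


def pfd_band(pfd_avg):
    """Return the SIL number whose low-demand PFD band contains pfd_avg, or None."""
    bands = [
        (1e-5, 1e-4, 4),
        (1e-4, 1e-3, 3),
        (1e-3, 1e-2, 2),
        (1e-2, 1e-1, 1),
    ]
    for lower, upper, sil in bands:
        if lower <= pfd_avg < upper:
            return sil
    return None


required_rrf = 1000.0
achieved_rrf = 4669.0  # value reported from the calculation in Fig. 4.11

achieved_pfd = pfd_from_rrf(achieved_rrf)
print(f"Achieved PFD: {achieved_pfd:.2e}")
print("Meets required RRF:", achieved_rrf > required_rrf)
print("PFD band (by value only): SIL", pfd_band(achieved_pfd))
```

An RRF of 4669 corresponds to a PFD of about 2.1 × 10⁻⁴, which is lower than the 0.001 assumed for the trip in the risk assessment.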
4.5.13 Installation and Commissioning
The signal drawings for the pressure loop and the temperature loop are shown in Figs. 4.12 and 4.13, respectively. The interface drawings of the PLC SOV with the DCS valve for the temperature and pressure loops are shown in Figs. 4.14 and 4.15, respectively. In this calculation, two valves in series are considered on the feed line and on the hot-water line. The two valves (V01 and V01A) on the hot-water
FIG. 4.11 The PFD calculation.
TABLE 4.4 Details of instruments used.
Sr. No.  Instrument          Description          Model/make  SIL certificate
1        PI01                Pressure in reactor  EJX530A     SIL2
2        TI01                Temp. in reactor     3144P       SIL3
3        V01 (Actuator)      Hot water feed       Virgo       SIL3
4        V01 (Valve)         Hot water feed       Virgo       SIL3
5        V02 (Actuator)      Reactant A feed      Virgo       SIL3
6        V02 (Valve)         Reactant A feed      Virgo       SIL3
7        V03 (Actuator)      Reactant B feed      Virgo       SIL3
8        V03 (Valve)         Reactant B           Virgo       SIL3
9        KFD2-STC4-Ex1       Analog I/P barrier   P and F     SIL2
10       KFDO-SD2-Ex1.1045   Digital O/P barrier  P and F     SIL2
11       SOV                 Solenoid valve       ASCO        SIL3
12       V01A (Actuator)     Hot water feed       Virgo       SIL3
13       V01A (Valve)        Hot water feed       Virgo       SIL3
14       Logic solver        Prosafe RS           Yokogawa    SIL3
line are directly connected to the safety PLC. On the feed lines of Reactant A and Reactant B, two valves are also considered (V02 with SOV02 and V03 with SOV03), that is, one complete valve with its SOV, while for the second valve only the SOV is considered. The SOV is operated from the safety PLC: if the output from the safety PLC is normal, the SOV is operated and passes air at its output, which is connected to the input of the SOV of valve V03A, which in turn is connected to the DCS. Valves V02A and V03A are operated at the start of every batch, that is, each is closed and opened once. Operating the valve once every batch acts as a proof test of the valve, which gives credit in the calculation. Two batches are completed in a day, but the shortest proof-test interval available in exida is one month, hence a proof-test interval of one month is used in the calculation. For the process flow diagram with the SIL2 loop, refer to Fig. 4.16. The signals from the PLC to the two valves on the same line come from two different cards, that is, valve V01 is connected to slot 5 and valve V01A to slot 8. This reduces the failure probability: even if one card fails, the other card will perform the required function and bring the process to a safe state. The same philosophy is followed for connecting the other valves to the safety PLC. The logic in the safety PLC is configured and the required set points are entered. The set points are password protected, and changing them requires the supervisor password. The SIL2 loop will be tested once a year by simulation: an alarm will be displayed on
FIG. 4.12 Pressure loop signal drawing.
FIG. 4.13 Temperature loop signal drawing.
FIG. 4.14 PLC SOV with the DCS valve for reactant A.
the safety PLC monitor and on the DCS screen, and the hooter will sound. Pressing the accept button silences the hooter. On pressing the reset button, if the process value is below the set point, the alarm is reset and the loop is ready for the next cycle; if the process value is above the set point, the loop will not reset. This ensures the safe operation of the loop, which remains in the safe state until it is reset.
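The credit gained from frequent proof testing and from two valves in series can be illustrated with the widely used simplified approximations for the average PFD of a 1oo1 and a 1oo2 arrangement. This is a sketch only: it is not the model used by the exida tool, it ignores common cause failures, diagnostics, and repair time, and the dangerous undetected failure rate below is a hypothetical value, not data from the case study.

```python
# Simplified average-PFD approximations (valid when lambda_du * test_interval << 1):
#   1oo1: lambda_du * TI / 2
#   1oo2: (lambda_du * TI)**2 / 3   (no common cause failures)

HOURS_PER_MONTH = 730.0
HOURS_PER_YEAR = 8760.0


def pfd_avg_1oo1(lambda_du, test_interval_h):
    """Average PFD of a single component proof-tested every test_interval_h hours."""
    return lambda_du * test_interval_h / 2.0


def pfd_avg_1oo2(lambda_du, test_interval_h):
    """Average PFD of two identical, independent components in a 1oo2 arrangement."""
    return (lambda_du * test_interval_h) ** 2 / 3.0


LAMBDA_DU = 2.0e-6  # hypothetical dangerous undetected failure rate, per hour

for label, interval in (("monthly", HOURS_PER_MONTH), ("yearly", HOURS_PER_YEAR)):
    single = pfd_avg_1oo1(LAMBDA_DU, interval)
    redundant = pfd_avg_1oo2(LAMBDA_DU, interval)
    print(f"{label:8s} 1oo1: {single:.2e}   1oo2: {redundant:.2e}")
```

With these assumed numbers, both shortening the proof-test interval from one year to one month and adding the second valve lower the average PFD, which is the effect the loop design relies on.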
4.5.14 Conclusion
We have developed a SIL2 loop, which the process risk assessment required for safe operation of the process in an abnormal situation. The SIL2 loop has been installed and commissioned as per the required PFD. Installing the SIL2 loop has reduced the risk during abnormal situations. Hence, the process can be operated with minimum risk while maintaining the overall safety of the system.
4.6 Fault analysis
The FTA is a deductive methodology for determining the potential causes of incidents, or of system failures more generally, and for estimating the failure
FIG. 4.15 PLC SOV with the DCS valve for reactant B.
probabilities and reliability. The FTA is centered on determining the causes of an undesired event, referred to as the top event since fault trees are drawn with it at the top of the tree. We then work downwards, dissecting the system in increasing detail to determine the root cause or combination of causes of the top event. The traditional static fault tree with AND and OR gates cannot capture the dynamic behavior of system failure mechanisms, such as sequence-dependent events, spares and dynamic redundancy management, and priorities of failure events. To overcome this difficulty, the concept of dynamic FTs is introduced by adding a sequential notion to the traditional FT approach. This is done by introducing dynamic gates into FTs. The dynamic FT approach is applied to the electrical power supply system of a chemical plant having a SIS, and a case study on the same is carried out to demonstrate the application of the DFT approach.
4.6.1 Introduction
The fault tree is used as a qualitative tool to identify faults in complex systems and SISs. The FTA is very good at pinpointing weaknesses in a safety system and helps to identify which parts of a system are related to a particular failure. The end result of an FTA is a diagram that graphically shows the combinations of events that can cause system
FIG. 4.16 Process flow diagram with SIL2 loop.
failure in an identified failure mode and helps the analyst to focus on one failure at a time. The FTA is based on three assumptions: (1) events are binary, (2) events are statistically independent, and (3) the relationship between events is represented by means of logical Boolean gates (AND and OR) (Durga Rao et al., 2009: 873). The fault tree can be used to calculate the demand failure probability, unreliability, or unavailability of the system. The evaluation of a fault tree proceeds in two steps. First, a logical expression is constructed for the top event in terms of combinations of the basic events; this is referred to as qualitative analysis. Second, this expression is used to give the probability of the top event in terms of the probabilities of the primary events; this is referred to as quantitative analysis. Thus, by knowing the probabilities of the primary events, we can calculate the probability of the top event (Marquez et al. 2010; Manno et al. 2012).
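The quantitative step for a static fault tree can be sketched in a few lines under the three assumptions listed above (binary, statistically independent events combined through AND and OR gates). The tree structure and the basic-event probabilities below are hypothetical and are chosen only to show the arithmetic.

```python
import math

# Static fault tree quantification with independent basic events.

def and_gate(probabilities):
    """Output event occurs only if all input events occur."""
    return math.prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical tree: TOP = A OR (B AND C)
p_a, p_b, p_c = 0.01, 0.1, 0.2
p_top = or_gate([p_a, and_gate([p_b, p_c])])
print(f"Top event probability: {p_top:.4f}")
```

Here the logical expression TOP = A OR (B AND C) is the qualitative result, and evaluating it with the basic-event probabilities is the quantitative result.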
FIG. 4.17 PAND gate. Source: Durga Rao (2009).
The limitation of the traditional fault tree is that it cannot capture the dynamic behavior of the system, such as the time-dependent sequence of events, the replacement of spare parts, and the priorities of failure events. To overcome these limitations, the dynamic fault tree (DFT) was introduced. With the development of dynamic gates such as PAND, SPARE, SEQ, and FDEP, the reliability behavior of systems with time dependencies can be modeled. In particular, a fault tree that includes at least one dynamic gate becomes a dynamic fault tree (Durga Rao et al., 2009: 873; Manno et al., 2012: 10334).
4.6.2 Dynamic fault tree
Dynamic fault trees (DFTs) introduce four basic (dynamic) gates, the priority AND (PAND), the sequence enforcing (SEQ), the standby or spare (SPARE), and the functional dependency (FDEP).
4.6.2.1 The Priority AND (PAND) gate
The PAND gate shown in Fig. 4.17 is in the failure state if all of its input components have failed before the mission time in a predefined order (from left to right in graphical notation). Consider a PAND gate with two active components, A and B. Active components are those that are in working condition during normal operation of the system; they can be in either the success state or the failure state. The time to failure of a component is obtained from the PDF of its failure time. The failure is followed by a repair whose duration is obtained from the PDF of the repair time. State-time diagrams are developed in the same way for the second component. To generate the PAND gate state-time diagram, the state-time profiles of both components are compared. In the first scenario shown in Fig. 4.18, Component A has failed and, some time later, Component B has failed; Component B has recovered but Component A is still in the failure state, hence this is identified as a failure. In the second scenario shown in Fig. 4.18, Component A has failed and, some time later, Component B has failed; Component A has recovered
FIG. 4.18 PAND gate state time possibilities. Source: Durga Rao (2009).
FIG. 4.19 SEQ gate. Source: Durga Rao (2009).
but Component B is still in the failure state, hence this is also identified as a failure state (Durga Rao et al., 2009: 874; Manno et al., 2012: 10338).
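The order dependence of the PAND gate can be illustrated with a small Monte Carlo sketch. To keep it short, it makes simplifying assumptions that go beyond the text: both components are non-repairable and have exponentially distributed times to failure, so the gate fails when A fails before B and B fails within the mission time. The failure rates and function names are hypothetical. For this simplified case a closed-form result exists and is used as a cross-check.

```python
import math
import random

# PAND gate with two non-repairable components A and B (exponential lifetimes).
# Gate failure: A fails first, then B fails, both before the mission time.

def pand_analytic(lambda_a, lambda_b, mission_time):
    """P(T_A < T_B <= t) for independent exponential failure times."""
    total = lambda_a + lambda_b
    return (1.0 - math.exp(-lambda_b * mission_time)) - (
        lambda_b / total
    ) * (1.0 - math.exp(-total * mission_time))


def pand_monte_carlo(lambda_a, lambda_b, mission_time, n_trials, seed=0):
    """Estimate the same probability by sampling failure times."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        t_a = rng.expovariate(lambda_a)
        t_b = rng.expovariate(lambda_b)
        if t_a < t_b <= mission_time:
            failures += 1
    return failures / n_trials


LAMBDA_A, LAMBDA_B, MISSION = 1.0e-3, 2.0e-3, 1000.0  # hypothetical values
exact = pand_analytic(LAMBDA_A, LAMBDA_B, MISSION)
estimate = pand_monte_carlo(LAMBDA_A, LAMBDA_B, MISSION, n_trials=100_000)
print(f"analytic {exact:.4f}, simulated {estimate:.4f}")
```

Handling repairable components, as in the state-time diagrams of Fig. 4.18, would require sampling alternating failure and repair times and comparing the resulting profiles.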
4.6.2.2 The Sequence enforcing (SEQ) gate
A SEQ gate, shown in Fig. 4.19, forces its inputs to fail in a fixed, pre-assigned order; a failure sequence in any other order can never take place. It is generally used to represent different levels of degradation of a component. Consider a three-input SEQ gate having repairable components. The following steps are involved (Durga Rao et al., 2009: 876; Manno et al., 2012: 10340):
1. The component state-time profile is generated for the first component based on its failure and repair rates, as shown in Fig. 4.20. The down time of the first component is the mission time for the second component. Similarly, the down time of the second component is the mission time for the third component.
2. When the first component fails, operation of the second component starts. The failure instant of the first component is taken as t = 0 for the second component.
FIG. 4.20 SEQ gate state time possibilities. Source: Durga Rao (2009).
Time to failure (TTF2) and time to repair/component down time (CD2) are generated for the second component.
3. When the second component fails, operation of the third component starts. The failure instant of the second component is taken as t = 0 for the third component. Time to failure (TTF3) and time to repair/component down time (CD3) are generated for the third component.
4. The common period in which all the components are down is considered as the down time of the SEQ gate.
5. The process is repeated for all the down states of the first component.
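Steps 2 to 4 of the SEQ procedure can be sketched for a single down state of the first component. The function below is a hypothetical helper that takes already sampled values (the failure instant and down time of the first component, and TTF2, CD2, TTF3, CD3) and returns the common down period; in a full simulation these values would be drawn from the failure and repair distributions and the calculation repeated for every down state of the first component, as in step 5.

```python
# Down time of a three-input SEQ gate for one down state of the first component.

def seq_gate_downtime(fail_1, cd_1, ttf_2, cd_2, ttf_3, cd_3):
    """Common period during which all three components are down.

    fail_1: failure instant of component 1; cd_1: its down time.
    ttf_2, cd_2: time to failure and down time of component 2, with the clock
                 starting at the failure of component 1.
    ttf_3, cd_3: the same for component 3, starting at the failure of component 2.
    """
    fail_2 = fail_1 + ttf_2
    fail_3 = fail_2 + ttf_3
    # All three are down from the last failure until the first recovery.
    first_recovery = min(fail_1 + cd_1, fail_2 + cd_2, fail_3 + cd_3)
    return max(0.0, first_recovery - fail_3)


# Component 3 fails while components 1 and 2 are still down.
print(seq_gate_downtime(100.0, 50.0, 10.0, 30.0, 5.0, 20.0))

# Component 2 fails only after component 1 has been repaired: no gate down time.
print(seq_gate_downtime(100.0, 50.0, 60.0, 30.0, 5.0, 20.0))
```

The second call shows the role of the down time of the first component as the mission time for the second: if the second component does not fail inside that window, the gate never reaches the failed state for that cycle.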
4.6.2.3 Standby or SPARE gate
SPARE gates, shown in Fig. 4.21, are dynamic gates modeling one or more principal components that can be substituted by one or more backups (spares) with the same functionality. The SPARE gate fails when the number of operational powered spares and/or principal components is less than the minimum required. Spares can fail even while they are dormant, but the failure rate of an unpowered spare is lower than the failure rate of the corresponding powered one (Durga Rao et al., 2009: 874; Manno et al., 2012: 10339). A SPARE gate has one active component (say A), and the remaining components are spares (say B). Component state-time diagrams are generated in sequence, starting with the active component followed by the spare components in left-to-right order. The
FIG. 4.21 SPARE gate. Source: Durga Rao (2009).
FIG. 4.22 SPARE gate state-time possibilities. Source: Durga Rao (2009).
steps are as follows. Active components: times to failure and times to repair are generated alternately from their respective PDFs until the mission time is reached. Spare components: when there is no demand, a spare is in the standby state or may be in a failed state due to on-shelf failure. It can also be unavailable, because of scheduled test or maintenance, when there is a demand for it. The component therefore has multiple states, and this stochastic behavior needs to be modeled to represent the practical scenario. Down times due to the scheduled test and maintenance policies are first accommodated in the component state-time diagrams. In certain cases, a test override probability has to be taken into account for availability during testing. As failures that occur during the standby period cannot be revealed until the next test, the time from failure until identification has to be taken as down time. The spare becomes active when it replaces a failed active component or a failed active spare. The gate fails when the number of surviving components is less than the number of required components, which depends on the logic of the gate. Various scenarios for the SPARE gate are shown in Fig. 4.22. In the first scenario, the demand due to failure of the active component is met by the standby component, but the standby fails before the recovery of the active component. In the second scenario, the demand is met by the standby component; the standby failed twice while in dormant mode, but this has no effect on the success
FIG. 4.23 FDEP gate. Source: Durga Rao (2009).
FIG. 4.24 FDEP gate state-time possibilities. Source: Durga Rao (2009).
of the system. In the third scenario, the standby component is already in the failed mode when the demand comes, but its subsequent recovery reduces the overall down time.
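The three SPARE gate scenarios can be reproduced with a small interval calculation. The sketch assumes one active component and one spare, so the gate is down whenever both are down at the same time; the down intervals are hypothetical (start, end) pairs that would normally come from the sampled state-time diagrams, and the intervals within each list are assumed not to overlap one another.

```python
# SPARE gate with one active component and one spare: the gate is down only
# while the active component and the spare are down at the same time.

def common_downtime(active_down, spare_down):
    """Total overlap between two lists of (start, end) down intervals."""
    total = 0.0
    for a_start, a_end in active_down:
        for s_start, s_end in spare_down:
            total += max(0.0, min(a_end, s_end) - max(a_start, s_start))
    return total


active = [(10.0, 30.0)]  # active component is down for 20 time units

# Scenario 1: the spare takes over but fails before the active one recovers.
scenario_1 = common_downtime(active, [(20.0, 40.0)])

# Scenario 2: the spare fails twice while dormant, outside the demand period.
scenario_2 = common_downtime(active, [(2.0, 5.0), (40.0, 45.0)])

# Scenario 3: the spare is already failed at the demand but recovers later.
scenario_3 = common_downtime(active, [(5.0, 18.0)])

print(scenario_1, scenario_2, scenario_3)
```

In the third scenario the gate is down for only part of the 20 time units for which the active component is unavailable, which is the reduction in overall down time described in the text.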
4.6.2.4 Functional dependency (FDEP) gate
The FDEP gate, shown in Fig. 4.23, has one trigger input (either a basic event or the output of another gate in the tree) and one or more dependent events. The dependent events are functionally dependent on the trigger event: when the trigger event occurs, the dependent basic events are forced to occur. The output of the FDEP gate is a dummy output, as it is not taken into account during the calculation of the system failure probability. When the trigger event (T) occurs, it leads to the occurrence of the dependent events (say A and B) associated with the gate. During the down time of the trigger event, the dependent events are virtually in the failed state even though they are functioning. The feature of this gate is to force the input components to reach the failure state if the trigger event occurs before they fail by themselves. This scenario is depicted in Fig. 4.24. In the second scenario, the individual occurrences of the dependent events do not affect the trigger event (Durga Rao et al., 2009: 875; Manno et al., 2012: 10341).
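The effect of the trigger on a dependent event can be sketched by merging down intervals: the effective down time of a dependent component is the union of its own down intervals and those of the trigger, while the trigger itself is unaffected by the dependent events. The helper names and the interval values are hypothetical.

```python
# FDEP gate: a dependent component is treated as failed whenever the trigger
# is down, in addition to its own failures.

def merge_intervals(intervals):
    """Union of (start, end) intervals, returned sorted and non-overlapping."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


def effective_down_intervals(own_down, trigger_down):
    """Down intervals of a dependent component once the trigger is accounted for."""
    return merge_intervals(list(own_down) + list(trigger_down))


trigger_down = [(20.0, 30.0), (55.0, 70.0)]
component_a_down = [(50.0, 60.0)]

effective_a = effective_down_intervals(component_a_down, trigger_down)
total_down_a = sum(end - start for start, end in effective_a)
print(effective_a, total_down_a)
```

Component A is forced down during the first trigger outage although it is functioning, and its own failure overlaps the second trigger outage, so the two intervals merge into one.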
FIG. 4.25 Reliability block diagram of safety instrumented function.
FIG. 4.26 Fault tree of SIS failure.
4.6.3 Case study
4.6.3.1 Validation with Example
The safety instrumented function shown in the reliability block diagram (RBD) of Fig. 4.25 consists of a pressure sensor, a logic solver (that is, a PLC), and a final element (on/off valve). The SIS is the last level of defense. The pressure sensor senses the abnormal condition and sends the signal to the logic solver which, as per the logic set in the controller, drives the final element to close or open the valve according to the pre-defined condition. To ensure high reliability of the logic solver, redundancy is provided in the controller, so that if one controller fails the other takes over, keeping the system running and the safety system available. The fault analysis of the SIS is performed using the FTA technique to find the causes of failure and take appropriate action to eliminate them. The fault tree is shown in Fig. 4.26. This fault tree can be modeled with dynamic gates to calculate the unavailability of the overall SIS.
FIG. 4.27 Dynamic fault tree of SIS failure.
4.6.4 Dynamic fault tree of SIS failure
The dynamic fault tree shown in Fig. 4.27 has one SPARE gate, in which one CPU is in standby mode to the existing CPU. In the first scenario, the demand due to failure of the active CPU is met by the standby CPU, but the standby fails before the recovery of the active CPU, so the SIS is in the failure state. In the second scenario, the demand is met by the standby component; the standby failed twice while in dormant mode, but this had no effect on the success of the system. In the third scenario, the standby component is already in the failed mode when the demand comes, but its subsequent recovery reduces the overall down time. This indicates that the simultaneous unavailability of CPU1 and CPU2 will lead to the failure of the SIS.
Conclusion
The FTA can be used to identify faults in complex systems and SISs. The static fault tree cannot capture the dynamic behavior of the system, such as the time-dependent sequence of events, the replacement of spare parts, and the priorities of failure events; because of this limitation, the dynamic fault tree was introduced. The dynamic fault tree introduces new gates such as PAND, SEQ, SPARE, and FDEP. The dynamic FTA can be used to analyze faults in complex systems and SISs: it takes into consideration the sequence in which the failures have occurred and also records the time of failure of each input connected to these gates.
Bibliography
Durga Rao, K., Gopika, V., Sanyasi Rao, V.V.S., Kushwaha, H.S., Verma, A.K., Srividya, A., 2009. Dynamic fault tree analysis using Monte Carlo simulation in probabilistic safety assessment. Reliab. Eng. Syst. Saf. 94, 872–883.
Ebeling, C.E., 2014. An Introduction to Reliability and Maintainability Engineering, Indian Edition. McGraw Hill Publication.
Gruhn, P., Pittman, J., Wiley, S., LeBlanc, T., 1998. Quantifying the impact of partial stroke valve testing of safety instrumented systems. ISA Trans. 37, 87–94.
Innal, F., Dutuit, Y., Chebila, M., 2015. Safety and operational integrity evaluation and design optimization of safety instrumented systems. Reliab. Eng. Syst. Saf. 134, 32–50.
Jahanian, H., 2015. Generalizing PFD formulas of IEC 61508 for KooN configurations. ISA Trans. 55, 168–174.
Jin, H., Rausand, M., 2014. Reliability of safety instrumented systems subject to partial testing and common cause failures. Reliab. Eng. Syst. Saf. 121, 146–151.
Langeron, Y., Barros, A., Grall, A., Berenguer, C., 2008. Combination of safety integrity levels (SILs): a study of IEC61508 merging rules. J. Loss Prev. Process Ind. 21, 437–449.
Lundteigen, M.A., Rausand, M., 2008. Spurious activation of safety instrumented systems in the oil and gas industry: basic concepts and formulas. Reliab. Eng. Syst. Saf. 93, 1208–1217.
Lundteigen, M.A., Rausand, M., 2008. Partial stroke testing of process shutdown valves: how to determine the test coverage. J. Loss Prev. Process Ind. 21, 579–588.
Manian, R., Coppit, D.W., Sullivan, K.J., Dugan, J.B., 1999. Bridging the gap between Fault Tree analysis modeling tools and the systems being modeled. Proceedings Annual Reliability and Maintainability Symposium 105–111.
Manno, G., Chiacchio, F., Compagno, L., D’Urso, D., Trapani, N., 2012. MatCarloRe: an integrated FT and Monte Carlo Simulink tool for the reliability assessment of the dynamic fault tree. Exp. Syst. Appl. 39, 10334–10342.
Marquez, D., Neil, M., Fenton, N., 2010. 95, 412–425.
Prabhudeva, S., Srividya, A., Verma, A.K., Gopika, V., 2006. Comparative studies of various solution techniques for dynamic fault tree analysis of computer based systems. Reliab. Saf. Hazard 255–261.
Sadiq, R., Tesfamariam, S., 2009. Environmental decision-making under uncertainty using intuitionistic fuzzy analytic hierarchy process (IF-AHP). Stoch Environ. Res. Risk Assess. 23, 75–91.
Zhang, A., Barros, A., Liu, Y., 2019. Performance analysis of redundant safety-instrumented systems subject to degradation and external demands. J. Loss Prev. Process Ind. 62, 103946.
Abstract
A safety instrumented system (SIS) is an independent layer of protection. Safety instrumented systems have been used for many years in the process industries to detect hazardous events and to perform the required safety instrumented functions (SIFs) that maintain the process in, or bring it back to, a safe state. If instrumentation is to be effectively used for SIFs, it is essential that this instrumentation achieves certain minimum standards and performance levels. Safety instrumented systems are used in all process industries. A process hazard and risk assessment must also be carried out to enable the specification for SISs to be derived. Other safety systems are only considered so that their contribution can be taken into account when considering the performance requirements for the SISs. The SIS includes all components and subsystems necessary to carry out the SIF, from sensor(s) to final element(s). To achieve the required function, reliability and maintainability are very important. The aim of this chapter is to design a reliable system and perform regular maintenance to sustain the achieved reliability. To achieve reliability, we have to calculate the average probability of failure on demand, and in doing so we take into consideration the PFD value of all the components used in the SIF, the allowed spurious trips in a year, and the proof-test interval for testing the individual SIF in a SIS. The failure of the SIS to achieve the desired function could have severe consequences for the safety of the monitored system and also for the production availability due to spurious trips. The SIS in a chemical plant is used to automatically stop the final element (valve or pump) and get the process under control. Fault-tree analysis (FTA) is widely used for identifying the root causes of undesired failures in a system. The traditional static fault trees with AND and OR gates cannot capture the dynamic behavior of system failure, such as sequence-dependent events, spares, and dynamic-redundancy management. To overcome this difficulty, the concept of the dynamic fault tree is introduced by adding a sequential notion to the traditional FT approach. System failures can then depend on component failure order as well as combination. We have applied the dynamic fault tree concept in this chapter. In the traditional fault tree for the SIS, we usually assume that exact failure probabilities of events are available. However, the FTA is performed during the design or development stage, when a new component may be added that has no failure data and that may be affected by the environment during operation. In this chapter, we find the critical components in the SIS based on FTA and determine the weak paths in the SIS, where action should be taken to reduce failure.
Key words: Safety; Instrumented system
Chapter 5
Application of Markovian models in reliability and availability analysis: advanced topics
Danilo Colombo a, Danilo T.M.P. Abreu b and Marcelo Ramos Martins b
a Petrobras R&D Center (CENPES), Rio de Janeiro, RJ, Brazil.
b Analysis, Evaluation and Risk Management Laboratory (LabRisco), University of São Paulo, São Paulo, SP, Brazil
Abbreviations
BI Birnbaum importance
BN Bayesian network
cdf Cumulative distribution function
CI Criticality importance measure
CTMC Continuous-time Markov chain
DIM Differential importance measure
DTMC Discrete-time Markov chain
ETA Event tree analysis
FMEA Failure mode and effects analysis
FT Fault tree
FV Fussell–Vesely
HAZOP Hazard and operability studies
IM Importance measure
LDA Life data analysis
MA Markov analysis
MC Markov chain
MPMC Multiphase Markov chain
MTTF Mean time to failure
OREDA Offshore & onshore reliability data
pdf Probability density function
PFD Probability of failure on demand
PFH Probability of failure per hour
PSA Probabilistic safety assessment
RAM Reliability, availability and maintainability
RAW Risk achievement worth
RBD Reliability block diagram
RRW Risk reduction worth
SIL Safety integrity level
SIS Safety instrumented system
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00015-5 Copyright © 2021 Elsevier Inc. All rights reserved.
5.1 Introduction The analysis of the reliability and availability of complex systems is an important task for all industries. In the modern scenario, reliability has become one of the most challenging and demanding theories [Ram, 2013]. This analysis allows risk-informed decision-making and the identification of cost-effective solutions to ensure the expected system reliability with the lowest life-cycle cost [Compare et al., 2017]. Based on the results of these analyzes, it is possible to evaluate issues involving safety and economic aspects such as productivity, planning of maintenance activities, inventory control, among others. The interconnections of units in a system are very important for the performance of the system [Kumar et al., 2020]. Complex engineering systems are usually composed of several units or components and each of these components can be found under different conditions or states. One approach to accurately model such systems is to evaluate all possible combinations of the component states to assess the system state, as well as the way in which the system can move from one state to another. From the reliability perspective, the transition of the system from one state to another is governed by failure and repair of the components. Furthermore, the reliability model must be elaborated during the initial phases, at the system design stage, evaluating the possible configurations of the elements (e.g., series, parallel, combinations of both) in order to obtain the best performance expected for the system throughout its life cycle. However, as the system enters the operation stage, new information is acquired, as a result of monitoring, testing, or maintenance activities. The developed model must be able to incorporate these new measures, updating the knowledge about the system state. 
In the analysis of systems with a large number of components, it is useful to assess which equipment is most critical for the final performance of the system. Once such components are identified, it is possible to prioritize their testing and maintenance activities, as well as to plan the course of action when one of them fails. Knowing the critical components also informs the decision of where to allocate resources for improvement. The purpose of this chapter is to present a methodology that includes the aspects mentioned above. The techniques adopted involve the application of Markovian models in reliability, availability, and maintainability (RAM)
analysis and probabilistic safety assessment (PSA). This chapter also presents some advanced topics referring to Markov models, to deal with uncertainty propagation, importance measures (IMs), and Multiphase MC (MPMC), allowing modeling systems subjected to hidden failures and test policies. Unlike classical techniques for reliability and risk analysis such as fault trees (FT) and reliability block diagrams (RBD), which are essentially static, Markovian analysis is dynamic in its essence, thus allowing capturing important aspects of the temporal and sequential dependence among events. Moreover, while the traditional FT and RBD adopt a Boolean approach, considering only two possible states for the components (up-and-down), Markovian models can naturally include several component states. Therefore, in this latter case, it is easier to analyze intermediary scenarios, including aspects such as degradation and load sharing. The failure of any component of a system can lead to the failure of the system or its degradation [Kumar and Ram, 2015], so an effective maintenance and monitoring in high-risk industrial system plays a very crucial role [Kumar et al., 2020]. Markov models have proven to be good tools for solving this type of problem. For instance, Fleming [Fleming, 2004] uses a Markovian model to evaluate inspection strategies for nuclear power-plant piping systems, coupled with a Bayes’ uncertainty analysis of the reliability parameters (e.g., failure rates). The recent applications of Markovian models in the reliability analysis include, but are not limited to: (a) the performance evaluation of balanced systems [Wang et al., 2020, Zhao et al., 2020]; (b) reliability optimization problem focusing on the reliability allocation [Peiravi et al., 2020]; (c) the development of maintenance policies [Zhang, 2020]; and (d) to support the prognostics of degradation processes [Zhang, 2020]. These recent contributions highlight the current relevance of Markovian models. 
They are adopted due to several interesting features such as the ability to represent multi-state components, the closed formulas to model the system reliability and availability, the representation of dynamic contexts, and the possibility of modeling multiphase phenomena, which include, for instance, components under testing policies. However, despite the advantages, Markovian models cannot be applied to any system. The basic premise of the Markovian model is the assumption of a memoryless process, that is, the transition from one state to another does not depend on the history of the system. This property requires the transition rates among states to be constant. This premise may be valid, in practice, for several systems operating in their useful life; on the other hand, it fails to capture the effects of wear out and fatigue in some components, or even the occurrence of premature failures. In this case, more comprehensive models, known as semiMarkovian processes, may be required. These processes, despite not being part of this chapter, are mentioned in the final considerations. Related to using Markov process in reliability engineering, a lot of research has been done in order to analyze and improve reliability and availability performance of different systems. Goyal and Ram [Goyal and Ram, 2017] used a
Markovian process to model a wind-electric generating power plant. Kumar and Ram [Kumar and Ram, 2015] calculated various reliability characteristics of a marine power plant using the Markov process. Although several references can be found in the literature focusing on the use of MC in system-reliability analysis, none of them offers a holistic approach. The main contribution of this chapter is to consolidate in a single text a framework for the process of modeling a complex engineering system using MC, highlighting how to extract safety and reliability performance indicators through IMs and how to evaluate the effect of uncertainty propagation in the reliability parameters of the system components. This gives readers a toolset to perform reliability and safety assessment using Markov chains. Fig. 5.1 shows this framework and the number of the section addressing each step.
FIGURE 5.1 Flowchart of the process of modeling a system reliability and probabilistic safety analysis using Markov chains.
The process starts by gathering knowledge about the system configuration and the component reliability data. Based on the system configuration and operational modes, the state space is created representing all possible scenarios of the system (functional, degraded, or failed), and the reliability and maintainability data are used to generate the transition matrix. Nonetheless, it is not always possible to draw a state transition diagram, and sometimes it becomes computationally intractable to deal with a large number of states. For instance, Colombo et al. [Colombo et al., 2020] analyzed an offshore oil well with 25 safety barriers to avoid well leakage associated with 41 failure modes. If modeling is based just on the occurrence or not of each failure mode, $2^{41}$ possible states (about $2.2 \times 10^{12}$) should be considered for the whole system, considering that each barrier has
only two possible individual states—up and down. However, using the proposed strategy, the number of states in the Markov chain was only 531. This chapter also offers some tips and pseudocodes in order to facilitate the modeling of a complex system and allowing the readers to not depend on any specific software. Afterwards, with a combination of state space and transition matrix, the MC is built. The Markov model is used to perform RAM analysis, focused on operational issues of the system, and a PSA focused on safety issues. Aiming at evaluating the influence of the performance, function, and logical position of its components on the performance of the whole system, Quantitative Importance Measures have been introduced in the proposed framework for quantifying the relative importance of the components on the system performance (reliability, maintainability, safety, or any performance metrics of interest). The IMs are not typically included in the Markovian analysis, despite the fact that some authors have already emphasized on the relevance and contribution of those IMs [Compare et al., 2017, Do Van et al., 2010]. However, this chapter offers a novel approach to compute IMs within the Markov chain. A linear expression of the system failure probability is adopted – which is readily computable using the Markov Chains – and several IMs are computed based on the coefficients of this expression. Differently from the traditional approaches to compute IM, which are static, the Markovian approach is dynamic, and it is possible to compute the results for each instance of time. Another fundamental concern when using the Markov chain to model system reliability is that the model parameters, that is, component failure and repair rates, are seldom perfectly known [Dhople et al., 2012]. 
Dhople and Domínguez-García [Dhople et al., 2012] proposed a method involving the use of Taylor series expansions to approximate the entries of the Markov chain stationary distribution vector as polynomial functions of uncertain parameters. The proposed framework recommends the evaluation of the effect of the uncertainty propagation on those parameters and, alternatively, in this chapter, the uncertainty analysis is presented using a Monte Carlo procedure based on the calculation of the inverse cumulative distribution function ($\mathrm{cdf}^{-1}$) for the epistemic uncertainty distributions [Zio, 2013]. Therefore, as depicted in Fig. 5.1, the Markovian framework is versatile and allows computing several parameters that support the decision-making process. There are two distinct types of outputs, one related to safety aspects and the other related to RAM (reliability, availability, and maintainability) aspects. The first group is usually applied to safety critical systems, whose failure could result in catastrophic accidents, loss of containment, damage to the environment, or even loss of life. The second group is usually applied in industrial systems to control quality and performance. It is often related to the optimization of maintenance staff and inventory control, and to the calculation of system availability to determine the production uptime. In addition, it is important for planning the tasks of monitoring, inspection, testing, and maintenance of the systems. Although Fig. 5.1 shows
both groups in a separate way in order to draw attention to each aspect, in most practical systems they appear together. As examples to illustrate the application of the topics presented, a typical problem in the oil and gas industry will be modeled using MC. Information about reliability performance normally receives significant attention in design, operation, and maintenance of equipment in the oil and gas industry, particularly for safety critical systems [Selvik and Bellamy, 2020]. Many oil and gas installations rely on safety instrumented systems (SISs) to respond to hazardous events and mitigate their consequences to humans, the environment, and material assets [Lundteigen and Rausand, 2009]. These systems are usually complex and subjected to different operating conditions. Besides that, they have elements that are periodically tested, some monitored and others that are not repairable. This chapter is organized as follows. Section 5.2 introduces some basic background about MC theory. In Section 5.3 the application of MC in the complex system RAM analysis is described. Section 5.4 presents some IM definitions and explains how to calculate these IM when using MC. Section 5.5 is focused on the evaluation of uncertainty propagation using enhanced MC. Section 5.6 describes how to use MPMC to incorporate test results and monitor the system evaluation when they are available. Finally, in Section 5.7, a discussion about the limitations and other advanced topics is proposed.
5.2 Markov chains theoretical foundation Stochastic processes are tools used to describe the evolution in time of random phenomena [Baudoin, 2010]. They can be defined as a set of random variables representing the states of features of interest such as the component failures or the number of people in a queue. In the most generic case, the probability of each state changes over time. In discrete-time stochastic processes, the state of the system is observed after fixed-time intervals, for example, every hour, day, or year. Differently, in continuous-time stochastic processes, as the name suggests, the state of the system is continuously observed in the real domain. MCs are specific types of stochastic processes, since they follow the Markovian property. They are named in honor of the Russian mathematician Andrey Markov (1856–1922), who notably contributed to the knowledge about stochastic processes. The Markovian property claims that the phenomenon state in the future can be predicted based on its present state only, thus being independent of the past states. We can also refer to this class of phenomena as memoryless stochastic processes. Depending on the observation approach, these models can be classified as discrete-time Markov chains (DTMC) or continuous-time Markov chains (CTMC). This section presents the theoretical foundation of both types of Markov chains, which constitute the basis of the reliability models discussed in this chapter.
5.2.1 Discrete-time Markov chains
Let us assume a discrete-time stochastic process over a finite-size state space $E = \{1, \ldots, N\}$. Also, let $X_n$ denote the process state observed at the instant $n$, where $n$ is a non-negative integer. This stochastic process can be conveniently denoted by $\{X_n, n \ge 0\}$. If it follows the Markovian property, then Eq. (2.1) is applicable for the referred phenomenon, where $\Pr(\cdot)$ denotes the probability of an event.
\[ \Pr(X_{n+1} = j \mid X_n = i, X_{n-1}, \ldots, X_0) = \Pr(X_{n+1} = j \mid X_n = i) \tag{2.1} \]
A stochastic process following the conditions of Eq. (2.1) is a DTMC [Kulkarni, 2011]. It means that the probability of the system under analysis being in a given future state ($X_{n+1} = j$) depends only on its present state ($X_n$) and not on the previous states ($X_{n-1}, \ldots, X_0$). Note that the subscript $n$ denotes the present state, while $n + 1$ denotes the future state and any subscript smaller than $n$ denotes past states. Particularly, $X_0$ denotes the initial state. A typical example of a DTMC is the movement of cars in a car rental company. Suppose that the company has several car rental stations distributed along a city; also, suppose that a client can get the car at any station, as well as return it to any station. If the car rental company does not reallocate the cars, the probability of a given car being rented at station $i$ and returned to station $j$ does not depend on the previous stations this car passed. It depends essentially on the parameters related to the stations $i$ and $j$, for example, the distance between them. Furthermore, a DTMC is time homogeneous if the state transition probabilities are the same at all instants $n$ [Kulkarni, 2011]. In this case, Eq. (2.2) is applicable to the stochastic process.
\[ \Pr(X_{n+1} = j \mid X_n = i) = \Pr(X_n = j \mid X_{n-1} = i) = \ldots = \Pr(X_1 = j \mid X_0 = i) \tag{2.2} \]
The values of the transition probabilities are commonly arranged in a transition matrix $M$, of size $N \times N$, containing $N^2$ values. Denoting $\Pr(X_{n+1} = j \mid X_n = i)$ by $p_{i,j}$, the transition matrix is given by Eq. (2.3).
\[ M = \begin{bmatrix} p_{1,1} & \cdots & p_{1,N} \\ \vdots & \ddots & \vdots \\ p_{N,1} & \cdots & p_{N,N} \end{bmatrix} \tag{2.3} \]
In a time-homogeneous Markov chain, $M$ is constant over time and, therefore, does not depend on $n$. Each $p_{i,j}$ represents a conditional probability, thus $0 \le p_{i,j} \le 1$ for $i, j \in E$. Additionally, given that the process is in a specific state, on the next step it must either remain in the same state or jump to another state. Therefore, the condition of Eq. (2.4) should be satisfied.
\[ \sum_{j \in E} p_{i,j} = 1, \quad i \in E \tag{2.4} \]
The probabilities of the process being in a given state at each time can be expressed in the form of a vector, $P_n$, where the subscript $n$ denotes the instant at which we observe the process. Eq. (2.5) gives the expression for the vector $P_n$, where $p_i$ denotes the probability of the stochastic process being at the state $i$ and $N$ is the number of states.
\[ P_n = [p_1 \ p_2 \ \ldots \ p_N]_{1 \times N}, \quad n \ge 0 \tag{2.5} \]
Hence, the probability vector at the instant $n + 1$ can be computed by multiplying the probability vector of the previous instant, $n$, by the transition matrix. Thus:
\[ P_{n+1} = P_n M \tag{2.6} \]
Generally, the probability vector at the initial instant, that is, $P_0$, is known. Then, the probability vector at the instant $n = 1$ can be computed by:
\[ P_1 = P_0 M \tag{2.7} \]
Similarly, the probability vector at the instant $n = 2$ is given by:
\[ P_2 = P_1 M = (P_0 M) M = P_0 M^2 \tag{2.8} \]
Note that the probability vector at the instant $n = 2$ can be rewritten in terms of the initial probability vector, $P_0$, and the transition matrix. Generically, Eq. (2.9) gives the expression for calculating the probability vector at any instant $n$, given the initial probability vector and the transition matrix.
\[ P_n = P_0 M^n \tag{2.9} \]
Suppose, for instance, that an engineer checks the state of a machine at the end of each day. The machine can be in only two states: functioning or fault. Moreover, given that the machine was functioning on the previous day, it has a probability of 0.80 of staying in this state and 0.20 of migrating to the fault state, that is, failing. On the other hand, if it was found in a fault state on the previous day, the machine can stay in this state with a probability of 0.10 or be repaired with a probability of 0.90 (thus returning to the functioning state). Fig. 5.2 presents a DTMC model of this problem.
FIGURE 5.2 Example of a DTMC representing a two-state machine.
Eq. (2.10) gives the transition matrix for the DTMC of Fig. 5.2.
\[ M = \begin{bmatrix} p_{1,1} & p_{1,2} \\ p_{2,1} & p_{2,2} \end{bmatrix} = \begin{bmatrix} 0.80 & 0.20 \\ 0.90 & 0.10 \end{bmatrix} \tag{2.10} \]
Assuming that the machine was functioning on the initial day, the initial probability vector is $P_0 = [p_1 \ p_2] = [1 \ 0]$. At the end of the first day, the probability vector can be updated and will give new probabilities for each state:
\[ P_1 = [1 \ 0] \begin{bmatrix} 0.80 & 0.20 \\ 0.90 & 0.10 \end{bmatrix} = [0.80 \ 0.20] \tag{2.11} \]
At the end of the 10th day, the probabilities of each state are:
\[ P_{10} = [1 \ 0] \begin{bmatrix} 0.80 & 0.20 \\ 0.90 & 0.10 \end{bmatrix}^{10} = [1 \ 0] \begin{bmatrix} 0.82 & 0.18 \\ 0.82 & 0.18 \end{bmatrix} = [0.82 \ 0.18] \tag{2.12} \]
Therefore, the probabilities of the system being in the functioning and fault states after the 10th day are 0.82 and 0.18, respectively. At this point, it is important to note that further multiplying the transition matrix by itself does not lead to significantly different results. We can observe this by computing the probabilities after the 30th day:
\[ P_{30} = [1 \ 0] \begin{bmatrix} 0.80 & 0.20 \\ 0.90 & 0.10 \end{bmatrix}^{30} = [1 \ 0] \begin{bmatrix} 0.82 & 0.18 \\ 0.82 & 0.18 \end{bmatrix} = [0.82 \ 0.18] \tag{2.13} \]
Or, similarly, after the 100th day:
\[ P_{100} = [1 \ 0] \begin{bmatrix} 0.80 & 0.20 \\ 0.90 & 0.10 \end{bmatrix}^{100} = [1 \ 0] \begin{bmatrix} 0.82 & 0.18 \\ 0.82 & 0.18 \end{bmatrix} = [0.82 \ 0.18] \tag{2.14} \]
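The repeated multiplication of Eqs. (2.6) to (2.9) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `step` and `state_probabilities` are hypothetical helpers introduced here, and the matrix and initial vector are those of the two-state machine example.

```python
def step(p, m):
    """One transition: return the row vector p M."""
    n = len(p)
    return [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]


def state_probabilities(p0, m, n_steps):
    """Return P_n = P_0 M^n by applying Eq. (2.6) n_steps times."""
    p = list(p0)
    for _ in range(n_steps):
        p = step(p, m)
    return p


# Two-state machine of Fig. 5.2: state 1 = functioning, state 2 = fault.
M = [[0.80, 0.20],
     [0.90, 0.10]]
P0 = [1.0, 0.0]  # functioning on the initial day

for day in (1, 10, 30, 100):
    probs = state_probabilities(P0, M, day)
    print(day, [round(x, 2) for x in probs])
```

Applying the matrix step by step avoids forming the matrix power explicitly, which is sufficient when only the probability vector is of interest.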
After a significant number of days, the probability of the system being in each state no longer changes, and we can say that this Markov chain reaches a steady state. This property does not mean that the system state stops changing over time, but that the probability of finding it in each state at each observation is constant. In practice, we can find the steady-state probability distribution, denoted here by $\pi$, by solving the system of linear equations that results from Eq. (2.15).
\[ \pi = [\pi_1 \ \pi_2 \ \ldots \ \pi_N]_{1 \times N} = \pi M \tag{2.15} \]
However, Eq. (2.15) alone does not allow finding $\pi$, since one of its equations is linearly dependent on the remaining ones. Therefore, Eq. (2.16) should also be included, referring to the sum of the state probabilities of $\pi$.
\[ \sum_{i} \pi_i = 1 \tag{2.16} \]
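As a minimal sketch of how Eqs. (2.15) and (2.16) can be solved together, the plain-Python function below replaces one redundant balance equation by the normalization condition and applies Gauss-Jordan elimination. The name `steady_state` is a hypothetical helper, and the sketch assumes an irreducible chain so that the linear system has a unique solution.

```python
def steady_state(m):
    """Solve pi = pi M with sum(pi) = 1 (assumes a unique solution exists)."""
    n = len(m)
    # Equation j: sum_i pi_i * (m[i][j] - delta_ij) = 0, stored as an augmented row.
    a = [[m[i][j] - (1.0 if i == j else 0.0) for i in range(n)] + [0.0]
         for j in range(n)]
    # One balance equation is redundant; replace the last one by Eq. (2.16).
    a[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        # Partial pivoting for numerical stability.
        piv = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[piv] = a[piv], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Two-state machine of Fig. 5.2.
M = [[0.80, 0.20],
     [0.90, 0.10]]
pi = steady_state(M)
print([round(x, 4) for x in pi])
```

For the two-state machine the result agrees, after rounding, with the values 0.82 and 0.18 obtained by repeated multiplication.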
Note that, in principle, $\pi_i$ is independent of the initial states' probability distribution. However, this is not always true. Some DTMCs can fluctuate between two limiting distributions depending on the instant we observe the process (if it is even or odd). The steady-state probability distribution will be unique if the DTMC is irreducible and aperiodic [Kulkarni, 2011]. A DTMC is said to be irreducible if, for every $i, j \in E$, there is a $k > 0$ such that Eq. (2.17) is true.
\[ \Pr(X_k = j \mid X_0 = i) > 0 \tag{2.17} \]
Eq. (2.17) indicates that every state can be reached from every other state after some finite number $k$ of transitions, whatever the starting state. Additionally, for an irreducible DTMC, let $d$ be the largest integer such that, for every $i \in E$, Eq. (2.18) can hold only when $n$ is an integer multiple of $d$:
\[ \Pr(X_n = i \mid X_0 = i) > 0 \tag{2.18} \]
The value of $d$ determines the spacing of the possible return times to the initial state $i$ once the DTMC leaves it: the DTMC can return to the initial state only after $d, 2d, 3d, \ldots$ discrete time transitions. If $d > 1$, the DTMC is said to be periodic. This means that the process can only return to the initial state at times that are multiples of $d$. On the other hand, if $d = 1$, the DTMC is called aperiodic, meaning that the return times are not restricted to the multiples of any integer greater than one. Another interesting feature to compute using a DTMC is the occupancy time of each state, that is, the total time the process spends in a specific state during the time we observe it. Specifically, for a DTMC, it is equivalent to the number of times the state is visited, since the process is subjected to one transition per time unit. Let $N_j(n)$ denote the number of times the process visits the state $j$ after $n$ units of time. We are interested in computing the occupancy time of state $j$ after $n$ units of time, given that the process started at state $i$, which is denoted here by $q_{i,j}(n)$. As indicated in Eq. (2.19), it is equal to the expected value of $N_j(n)$ given that the process started at state $i$.
\[ q_{i,j}(n) = E\left[ N_j(n) \mid X_0 = i \right] \tag{2.19} \]
The occupancy times after $n$ units of time can be arranged in matrix format, which is called the occupancy times matrix, $Q(n)$, and is given by Eq. (2.20).
\[ Q(n) = \left[ q_{i,j}(n) \right]_{N \times N} \tag{2.20} \]
If the DTMC is time-homogeneous, then the occupancy times matrix is given by Eq. (2.21). Kulkarni [Kulkarni, 2011] develops the proof of this relation. It is interesting to observe that the occupancy times depend essentially on the transition probabilities matrix, $M$, that characterizes the DTMC.
\[ Q(n) = \sum_{r=0}^{n} M^r \tag{2.21} \]
Back to the example of Fig. 5.2, suppose that the machine is functioning at the beginning of the initial day; after five days, it is possible to compute the expected amount of time for which the machine is functioning using the occupancy times matrix for $n = 4$, see Eq. (2.22). It is not computed for $n = 5$ because Eq. (2.21) considers the initial day as represented by $r = 0$.
\[ Q(4) = \sum_{r=0}^{4} M^r = \begin{bmatrix} q_{1,1} & q_{1,2} \\ q_{2,1} & q_{2,2} \end{bmatrix} = \begin{bmatrix} 4.256 & 0.744 \\ 3.347 & 1.653 \end{bmatrix} \tag{2.22} \]
Therefore, from Eq. (2.22), it can be inferred that, since the machine was functioning on the initial day, the occupancy of the functioning state is expected to be 4.256 days over the 5-day period, as indicated by the term $q_{1,1}$. Complementarily, the system is expected to be in the fault state for 0.744 days during the same 5-day period, as indicated by the term $q_{1,2}$. On the other hand, if the machine was observed to be in the fault state on the initial day, the expected times in the functioning and fault states over the five days would be equal to 3.347 days (term $q_{2,1}$) and 1.653 days (term $q_{2,2}$), respectively. Note that, for consistency, the sum of the elements in each row of the occupancy times matrix must be equal to $n + 1$ (here, 5 days), since the initial day, $r = 0$, is also counted.
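The sum of matrix powers in Eq. (2.21) can be accumulated iteratively. The sketch below is an illustrative plain-Python version, where `mat_mul` and `occupancy_matrix` are hypothetical helper names; it reproduces the calculation of Eq. (2.22) for the two-state machine.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def occupancy_matrix(m, n):
    """Return Q(n) = sum_{r=0}^{n} M^r, as in Eq. (2.21)."""
    size = len(m)
    power = [[1.0 if i == j else 0.0 for j in range(size)]
             for i in range(size)]          # M^0 = identity
    q = [row[:] for row in power]           # running sum, starts at M^0
    for _ in range(n):
        power = mat_mul(power, m)           # next power of M
        q = [[q[i][j] + power[i][j] for j in range(size)]
             for i in range(size)]
    return q


M = [[0.80, 0.20],
     [0.90, 0.10]]
Q4 = occupancy_matrix(M, 4)
for row in Q4:
    print([round(x, 3) for x in row], "row sum:", round(sum(row), 3))
```

Each row of the result sums to five, that is, to the number of observed days including the initial one.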
5.2.2 Continuous-time Markov chains
In the previous section, we introduced the DTMC, a stochastic process that follows the Markovian property and is observed at discrete times. This section extends this idea to cover a process that still respects the Markovian property but is observed continuously. Such processes are known as continuous-time Markov chains (CTMC). Consider a continuous-time stochastic process over a finite-size state space $E = \{1, \ldots, N\}$, where $X(t)$ denotes the process state observed at a specific instant $t$, $t \ge 0$. The stochastic process $\{X(t), t \ge 0\}$ is a CTMC if, for all $i, j \in E$ and for $t, s \ge 0$, the relationship in Eq. (2.23) is applicable [Kulkarni, 2011] given the history of the CTMC up to $s$, denoted by $X(u)$, with $0 \le u \le s$.
\[ \Pr(X(s + t) = j \mid X(s) = i, X(u)) = \Pr(X(s + t) = j \mid X(s) = i) \tag{2.23} \]
Eq. (2.23) refers to the Markovian property, by stating that the system state at the instant $s + t$ depends only on the present state, $X(s)$, and not on the system state history, $X(u)$. Additionally, similarly to the definition for DTMC, the CTMC is said to be time homogeneous if the probability of the system transiting to state $j$ given that it is in state $i$ is independent of $s$, as suggested by Eq. (2.24).
\[ \Pr(X(s + t) = j \mid X(s) = i) = \Pr(X(t) = j \mid X(0) = i) \tag{2.24} \]
Note that a time-homogeneous CTMC is not time independent. Each transition probability in a homogeneous CTMC should still be specified for each $t$. They can be arranged conveniently in a transition-probability matrix, $M(t)$, see Eq. (2.25).
\[ M(t) = \begin{bmatrix} p_{1,1}(t) & \cdots & p_{1,N}(t) \\ \vdots & \ddots & \vdots \\ p_{N,1}(t) & \cdots & p_{N,N}(t) \end{bmatrix} \tag{2.25} \]
Where:
\[ p_{i,j}(t) = \Pr(X(t) = j \mid X(0) = i) \tag{2.26} \]
The transition time between any two states of the CTMC follows an exponential distribution, which is the only memoryless continuous probability distribution. In other words, the probability distribution of the transition time depends only on the present state and not on the past states, nor on the time already spent in the current state. The exponential probability density function depends on a single parameter, $\lambda$. Eqs. (2.27) and (2.28) present, respectively, the probability density function and the cumulative distribution function of an exponentially distributed random variable $T$.
\[ f(t) = \lambda \exp(-\lambda t), \quad t \ge 0 \tag{2.27} \]
\[ \Pr(T \le t) = F(t) = \int_0^t f(\tau)\, d\tau = 1 - \exp(-\lambda t) \tag{2.28} \]
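The memoryless property mentioned above follows directly from Eq. (2.28). As a short worked derivation (the symbols $s$ and $t$ denote, respectively, the time already spent in the current state and the additional time, both non-negative):

```latex
\[
\Pr(T > s + t \mid T > s)
  = \frac{\Pr(T > s + t)}{\Pr(T > s)}
  = \frac{\exp(-\lambda (s + t))}{\exp(-\lambda s)}
  = \exp(-\lambda t)
  = \Pr(T > t)
\]
```

The remaining time to the transition therefore has the same distribution regardless of how long the process has already been in the state.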
The parameter $\lambda$ is constant and refers to the rate at which the analyzed phenomenon occurs. If the random variable is the time at which the event of interest occurs, as is the case for a CTMC, then the unit of $\lambda$ is the inverse of the adopted unit of time (e.g., hour$^{-1}$, day$^{-1}$, year$^{-1}$). It represents the expected number of occurrences per unit of time. A CTMC can be characterized by its transition rates matrix, $R$, as in Eq. (2.29). Each element $r_{i,j}$ represents the transition rate from state $i$ to state $j$, that is, the $\lambda$ parameter of the corresponding exponentially distributed time to transition. Notably, the diagonal elements are null in the transition rates matrix.
\[ R = \begin{bmatrix} r_{1,1} & \cdots & r_{1,N} \\ \vdots & \ddots & \vdots \\ r_{N,1} & \cdots & r_{N,N} \end{bmatrix}, \quad \text{where } r_{i,j} = 0 \text{ if } i = j \tag{2.29} \]
By modifying the diagonal elements of the transition rates matrix, the generator matrix of the Markov chain, $G$, can be created. Eq. (2.30) presents the expression of the generator matrix. The terms of the generator matrix are essentially related to the derivatives of the states' probabilities. All terms corresponding to the arrival at a given state ($i \ne j$) are non-negative, indicating that they increase the probabilities of these states over time. On the other hand, the main diagonal terms ($i = j$) are the negative of the sum of all rates leaving the referred state, indicating that they decrease its probability over time.
\[ G = \begin{bmatrix} g_{1,1} & \cdots & g_{1,N} \\ \vdots & \ddots & \vdots \\ g_{N,1} & \cdots & g_{N,N} \end{bmatrix}, \quad \text{where } \begin{cases} g_{i,j} = r_{i,j}, & \text{if } i \ne j \\ g_{i,j} = -\sum_{j=1}^{N} r_{i,j}, & \text{if } i = j \end{cases} \tag{2.30} \]
Once we know the transition rates matrix, it is possible to obtain the generator matrix and vice versa. Therefore, just one of them is necessary to define the CTMC. The generator matrix, however, allows computing the evolution of the CTMC over time. At any instant of time $t$, the states probability distribution vector, $P(t)$, can be obtained by solving the system of differential equations presented in Eq. (2.31). To solve this equation, it is necessary to know only the state probabilities at the instant $t = 0$, denoted by $P(0)$. Currently, there are several computational tools dedicated to numerical integration that can support solving Eq. (2.31).
\[ \frac{dP(t)}{dt} = P(t)\, G \tag{2.31} \]
By observing Eq. (2.31), the expression of the generator matrix in Eq. (2.30) is clarified. The non-diagonal elements ($i \ne j$) indicate the rate at which the probability of the arrival states increases. Therefore, they are non-negative elements referring to the arrival rates of each state. On the other hand, the diagonal elements ($i = j$) indicate the rate at which the probability of the departing state decreases and, thus, are negative elements accounting for the sum of all departing rates. In order to illustrate this, Fig. 5.3 presents an example of a machine with three possible states: functioning, degraded, and fault. The transition rates among these states are known and given in units of day$^{-1}$ (i.e., the expected number of occurrences per day). Suppose we want to compute the states' probability distribution for a time span of 10 days, knowing that the system starts from the functioning state at day 0. Initially, the generator matrix, $G$, should be defined, as follows:
\[ G = \begin{bmatrix} -0.15 & 0.1 & 0.05 \\ 0 & -0.3 & 0.3 \\ 1.0 & 0 & -1.0 \end{bmatrix} \text{day}^{-1} \tag{2.32} \]
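The construction of Eq. (2.30) can be sketched as follows. The function name `generator_from_rates` is a hypothetical helper, and the rates matrix `R` is an assumption read off the off-diagonal entries of Eq. (2.32), since the values of Fig. 5.3 are not reproduced in the text.

```python
def generator_from_rates(r):
    """Build the generator matrix G of Eq. (2.30) from the rates matrix R."""
    n = len(r)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                g[i][j] = r[i][j]            # arrival rates stay unchanged
        # Diagonal: minus the sum of all rates leaving state i.
        g[i][i] = -sum(r[i][j] for j in range(n) if j != i)
    return g


# States: 1 = functioning, 2 = degraded, 3 = fault (rates in 1/day).
R = [[0.0, 0.10, 0.05],
     [0.0, 0.00, 0.30],
     [1.0, 0.00, 0.00]]
G = generator_from_rates(R)
for row in G:
    print([round(x, 2) for x in row])
```

Every row of a generator matrix sums to zero, which is a convenient consistency check after building it.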
FIGURE 5.3 Example of a CTMC representing a three-state machine.
FIGURE 5.4 State probabilities over time for the three-state machine CTMC of Fig. 5.3.
Additionally, since the system starts from the functioning state, the initial probability vector is given by:
\[ P(0) = [1 \ 0 \ 0] \tag{2.33} \]
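One possible way to integrate Eq. (2.31) numerically is a classical fourth-order Runge-Kutta scheme. The sketch below is an assumption about the integration method (the chapter does not state which solver was used), with hypothetical helper names `derivative` and `integrate` and a fixed step of 0.01 day chosen for illustration.

```python
def derivative(p, g):
    """Right-hand side of Eq. (2.31): the row vector P(t) G."""
    n = len(p)
    return [sum(p[i] * g[i][j] for i in range(n)) for j in range(n)]


def integrate(p0, g, t_end, steps):
    """Integrate dP/dt = P G from 0 to t_end with fixed-step RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(p, g)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], g)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], g)
        k4 = derivative([x + h * k for x, k in zip(p, k3)], g)
        p = [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


# Generator matrix of Eq. (2.32), in 1/day, and initial vector of Eq. (2.33).
G = [[-0.15, 0.10, 0.05],
     [0.00, -0.30, 0.30],
     [1.00, 0.00, -1.00]]
P10 = integrate([1.0, 0.0, 0.0], G, t_end=10.0, steps=1000)
print([round(x, 4) for x in P10])
```

In practice a library ODE solver or the matrix exponential would serve equally well; the fixed-step scheme is used here only to keep the sketch free of dependencies.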
Fig. 5.4 presents the results of the state probabilities over time, obtained by solving numerically Eq. (2.31) for the time span of 10 days. At the end of the
10th day, the probabilities of the system being in the functioning, degraded, and fault states are, respectively, 0.6780, 0.2220, and 0.1000. After sufficient time, the system enters a steady state, in which the probability of each state tends asymptotically to a constant value.

As in the DTMC, it is also of interest in the CTMC to know the occupancy time of each state. The occupancy times are obtainable from an occupancy times matrix, as presented in Eq. (2.20) for the DTMC. However, for the CTMC, obtaining the occupancy times matrix, Q(t), is not as simple as in the DTMC case. In this section, the authors present a methodology for obtaining Q(t) from the uniformization of the CTMC, as proposed by [Kulkarni, 2011]. Initially, define a matrix $\hat{P} = [\hat{p}_{i,j}]$, such that each $\hat{p}_{i,j}$ is defined as in Eq. (2.34):

$$\hat{p}_{i,j} = \begin{cases} 1 - \dfrac{r_i}{r}, & \text{if } i = j \\[2mm] \dfrac{r_{i,j}}{r}, & \text{if } i \neq j \end{cases} \tag{2.34}$$

Where:

$$r_i = \sum_{j=1}^{N} r_{i,j} \tag{2.35}$$

$$r > \max\{r_i\}, \quad 1 \le i \le N$$

Note that the term $r_i$ is the total rate of departure from state i, that is, the magnitude of the corresponding diagonal element of the generator matrix. Then, the transition probabilities matrix, M(t), of the CTMC is given by Eq. (2.36). Kulkarni [Kulkarni, 2011] presents the detailed deduction of this expression. The idea behind M(t) for the continuous-time case involves an analogy with the discrete-time case and the adoption of the Poisson distribution. The matrix $\hat{P}$ is a one-step transition matrix of a DTMC. Additionally, the observation process is a Poisson process with rate r. The matrix M(t) is then the result of the sum over all possible numbers of transitions (k = 0, 1, 2, …) of the corresponding process.

$$M(t) = \sum_{k=0}^{\infty} \exp(-rt)\,\frac{(rt)^k}{k!}\,\hat{P}^k \tag{2.36}$$
Furthermore, we can compute the occupancy times of the CTMC by the corresponding occupancy times matrix, given by Eq. (2.37), where t is the total time for which the process is observed.

$$Q(t) = \frac{1}{r}\sum_{k=0}^{\infty} \Pr(rt > k)\,\hat{P}^k \tag{2.37}$$

Where each term of Q(t) is given by Eq. (2.38):

$$q_{i,j}(t) = \int_0^t p_{i,j}(\tau)\,d\tau \tag{2.38}$$
Particularly, as indicated in Eq. (2.39), Pr(rt > k) indicates the probability that the number of transitions of the observation process is larger than k.

$$\Pr(rt > k) = 1 - \sum_{l=0}^{k} \exp(-rt)\,\frac{(rt)^l}{l!} \tag{2.39}$$
Proving the relations of Eqs. (2.36) and (2.37) is out of the scope of this chapter, but [Kulkarni, 2011] demonstrates how to obtain these relations. In practice, the term $q_{i,j}(t)$ gives the expected occupancy time at state j given that the observed stochastic process started at state i and is observed for a time span of t. Kulkarni [Kulkarni, 2011] also gives an algorithm to compute Q(t) efficiently to a precision ε, which includes the following steps:

1. Define R, t and 0 < ε < 1.
2. Compute r, as defined after Eq. (2.34).
3. Compute $\hat{P}$, as defined in Eq. (2.34).
4. Set A = $\hat{P}$ and k = 0.
5. Set yek = exp(−rt), ygk = 1 − yek, and sum = ygk.
6. Set B = ygk · I, where I is the identity matrix with size equal to R.
7. While sum/r < t − ε, do:
8.     k = k + 1
9.     yek = yek · (rt)/k
10.    ygk = ygk − yek
11.    B = B + ygk · A
12.    A = A$\hat{P}$
13.    sum = sum + ygk
14. B/r approximates Q(t) with a precision ε, using the first k + 1 terms of Eq. (2.37).
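The steps above can be sketched in Python as follows. This is an illustrative transcription, not Kulkarni's reference implementation: the function names, the choice r = 1.1 · max{r_i}, and the iteration cap are assumptions, and the rate matrix R is assumed to have a zero diagonal. The example applies it to the three-state machine of Fig. 5.3.

```python
import math


def mat_mul(X, Y):
    """Product of two square matrices stored as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def occupancy_times(R, t, eps=1e-3, max_terms=100000):
    """Approximate Q(t) by truncating the series of Eq. (2.37)."""
    n = len(R)
    r_i = [sum(row) for row in R]      # total departure rate of each state
    r = 1.1 * max(r_i)                 # assumed choice; any r > max(r_i) works
    # Uniformized one-step matrix P_hat of Eq. (2.34).
    P_hat = [[1.0 - r_i[i] / r if i == j else R[i][j] / r
              for j in range(n)] for i in range(n)]
    A = [row[:] for row in P_hat]
    yek = math.exp(-r * t)             # Poisson probability of k transitions
    ygk = 1.0 - yek                    # probability of more than k transitions
    total = ygk
    B = [[ygk if i == j else 0.0 for j in range(n)] for i in range(n)]
    k = 0
    while total / r < t - eps and k < max_terms:
        k += 1
        yek *= r * t / k
        ygk -= yek
        B = [[B[i][j] + ygk * A[i][j] for j in range(n)] for i in range(n)]
        A = mat_mul(A, P_hat)
        total += ygk
    return [[b / r for b in row] for row in B]


# Transition rates (day^-1) of the three-state machine; zero diagonal assumed.
R = [[0.0, 0.10, 0.05],
     [0.0, 0.00, 0.30],
     [1.0, 0.00, 0.00]]

Q = occupancy_times(R, t=100.0, eps=1e-3)
for row in Q:
    print([round(q, 1) for q in row])
```

Each row of the result should sum to approximately the observation time t, since the process must be in exactly one state at every instant; this gives a quick sanity check on the truncation.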
Back to the example of Fig. 5.3, suppose we want to know the occupancy time of each state after observing the system for 100 days. By applying Eq. (2.37) and computing it using the algorithm described above with ε = 0.001, we obtain:

$$Q(t = 100\ \text{days}) = \begin{bmatrix} 68.1 & 22.0 & 9.9 \\ 65.2 & 24.3 & 10.5 \\ 67.5 & 21.7 & 10.8 \end{bmatrix}\ \text{days} \tag{2.40}$$
Since the system starts from the functioning state, we should look at the first row of the obtained occupancy times matrix. The expected occupancy times in the functioning, degraded, and fault states are 68.1 days, 22.0 days, and 9.9 days, respectively. The number of transitions from a state i to a state j follows a Poisson distribution with expected value $r_{i,j} T_i$, where $T_i$ is the expected occupancy time of state i. Therefore, once we know the occupancy times, it is also possible to compute the expected number of transitions between any two states. In this case,
it is convenient to define a matrix containing the expected number of transitions between each pair of states after a time span of t. This matrix is denoted here by ν(t) and is defined by Eq. (2.41).

$$\nu(t) = [\nu_{i,j}] \tag{2.41}$$

Where:

$$\nu_{i,j} = r_{i,j}\,T_i \tag{2.42}$$
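A minimal sketch of Eq. (2.42) follows, assuming the three-state machine of Fig. 5.3 and taking the occupancy times from the first row of Eq. (2.40), that is, a system that starts in state 1. The variable and function names are illustrative.

```python
# Transition rates r_ij (day^-1) of the three-state machine of Fig. 5.3.
R = [[0.0, 0.10, 0.05],
     [0.0, 0.00, 0.30],
     [1.0, 0.00, 0.00]]

# Expected occupancy times T_i (days): first row of Eq. (2.40).
T = [68.1, 22.0, 9.9]


def expected_transitions(R, T):
    """Apply Eq. (2.42) element-wise: nu_ij = r_ij * T_i."""
    n = len(T)
    return [[R[i][j] * T[i] for j in range(n)] for i in range(n)]


nu = expected_transitions(R, T)
for row in nu:
    print([round(v, 1) for v in row])
```

Each row i is scaled by the time spent in state i, so a state that is rarely occupied contributes few transitions even if its outgoing rates are high.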
For the example of the three-state machine presented above, the expected number of transitions matrix is given by:

$$\nu(t) = \begin{bmatrix} 0 & 6.8 & 3.4 \\ 0 & 0 & 6.6 \\ 9.9 & 0 & 0 \end{bmatrix} \tag{2.43}$$

In this case, ν(t) was computed assuming that the system started at state 1. Thus, the values of $T_i$ for i = 1, 2, 3 are those presented in the first row of Q(t = 100 days), see Eq. (2.40). These values can help understand the expected system behavior over time. For instance, the term $\nu_{3,1}$ indicates the expected number of transitions from state 3 to state 1, accounting for the repair activities. The value $\nu_{3,1}$ = 9.9 indicates that 9.9 repairs are expected to occur. If the repair involves, for instance, the replacement of component parts, it would be reasonable to keep about 10 spare parts in stock for a period of 100 days. The next section focuses on the failure and repair of engineering systems, giving an overview from the reliability engineering point of view.
5.3 Application of Markov chains to the reliability and availability analysis of engineering systems

Reliability engineering is a vast field. In a few words, it is the area of knowledge concerned with the ability of engineering systems to perform the functions they were designed for. It covers, for instance, the analysis of failures of structures under stress or subjected to shocks, and even of those vulnerable to human errors. This section focuses on a specific category of engineering systems: those whose failure mechanisms depend essentially on time. Markovian models are especially suitable to represent the behavior of such items over time, but only if specific conditions are met, as will be discussed below. Firstly, we introduce and formalize basic reliability engineering concepts. Then, several Markovian models are presented to illustrate their applications.
5.3.1
Basics of reliability engineering
The reliability of a system is defined as the probability that the system performs its function properly during a predefined period of time under the condition
that the system behavior is fully characterized in the context of probability measures [Ram, 2013; Ram and Manglik, 2014]. Beginning with the basic concepts, suppose that the time to failure of an item is represented by the random variable T when it is operated under specific operational conditions c1, c2, …, cn. These conditions are context dependent and refer to factors influencing the item performance over time, such as the vibrations of the operational environment, pH, temperature, pressure, and so on. The random variable T can be associated with a probability density function (pdf), which is called the “failure probability distribution”, f(t|c1, c2, …, cn). Assuming that the relevant operational conditions are constant, we can denote the pdf by f(t), for simplicity. It is possible to compute the item’s failure probability for a given mission time, t, by the corresponding cumulative distribution function (cdf), as depicted in Eq. (3.1).

$$F(t) = \Pr(T \le t) = \int_0^t f(\tau)\,d\tau \tag{3.1}$$
The item’s reliability for the same mission time is the complement of the corresponding failure probability, as depicted in Eq. (3.2). The reliability of an item is the probability of the successful achievement of an item’s intended function [Modarres et al., 2009].

$$R(t) = 1 - F(t) \tag{3.2}$$
Additionally, we can compute the mean time to failure (MTTF), which is the expected time until the item fails. It corresponds to the expected value of T, as indicated in Eq. (3.3).

$$\text{MTTF} = \int_0^{\infty} t\,f(t)\,dt \tag{3.3}$$
Based on the failure probability distribution, it is possible to define the failure rate or hazard rate function, λ(t), which is the conditional probability per unit of time that the item fails in a small time interval, given that it was working from time zero to the beginning of that interval [Modarres et al., 2009].

$$\lambda(t) = \lim_{\tau \to 0} \frac{1}{\tau}\,\frac{F(t+\tau) - F(t)}{R(t)} = \frac{f(t)}{R(t)} \tag{3.4}$$
The hazard rate of an item is often represented by the bathtub curve, illustrated in Fig. 5.5, which gets its name from its shape. The bathtub curve contains three distinct regions, representing the failure rate tendency along the item’s lifecycle. At the beginning, the failure rate is decreasing, representing the early-life failures, mainly due to construction and design flaws; it is a burn-in period. After this, the item enters the service lifetime, in which the failure rate is constant and random failures occur due to complex physical and chemical mechanisms
FIGURE 5.5
The bathtub curve. Source: Colombo et al. (2020).
affecting the item. Finally, the third region refers to the late-life failures, with an increasing failure rate, corresponding to the wear-out of the item. It is important to state that the bathtub curve is representative of a population of similar items and not of individuals. For instance, a component that fails during its early life does not enter the following lifecycle phases, therefore not being subjected to the remaining types of failures. Specifically, for the items in their service lifetime, it is possible to model them with an exponential failure probability distribution, which presupposes a constant failure rate. Eqs. (3.5) and (3.6) present, respectively, the pdf and the cdf for this distribution, assuming a failure rate λ as the parameter.

$$f(t) = \lambda \exp(-\lambda t) \tag{3.5}$$

$$F(t) = 1 - R(t) = 1 - \exp(-\lambda t) \tag{3.6}$$
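The relations of Eqs. (3.2) to (3.6) can be checked numerically for the exponential case with a short sketch such as the one below. The failure rate used is a hypothetical value, and the trapezoidal integration with a truncated upper limit is an assumption made only to illustrate Eq. (3.3); for the exponential distribution the MTTF is known to equal 1/λ.

```python
import math


def pdf(t, lam):
    """Eq. (3.5): f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def cdf(t, lam):
    """Eq. (3.6): F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)


def reliability(t, lam):
    """Eq. (3.2): R(t) = 1 - F(t)."""
    return 1.0 - cdf(t, lam)


def hazard(t, lam):
    """Eq. (3.4): lambda(t) = f(t) / R(t)."""
    return pdf(t, lam) / reliability(t, lam)


def mttf(lam, t_max_factor=50.0, steps=200000):
    """Eq. (3.3) by the trapezoidal rule, truncating the integral at t_max."""
    t_max = t_max_factor / lam
    h = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * t * pdf(t, lam)
    return total * h


lam = 1.0e-3  # hypothetical failure rate, in h^-1
print(reliability(1000.0, lam))                # reliability for a 1000 h mission
print(hazard(10.0, lam), hazard(5000.0, lam))  # the hazard rate stays at lam
print(mttf(lam))                               # numerical estimate of 1 / lam
```

The constant hazard rate returned at both instants reflects the memoryless property that makes the exponential distribution compatible with Markovian models.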
As presented in Section 5.2, the exponential probability distribution has the special attribute of being the only memoryless continuous probability distribution. Therefore, CTMCs are adequate to model engineering systems during their service lifetime. The remaining part of this section illustrates the application of several types of Markovian models in the reliability engineering area. It begins with the classical series and parallel configurations and then presents other interesting applications for other types of configurations, including maintainable systems.
5.3.2
Series and parallel configurations
Systems with components arranged in series or in parallel are the simplest system configurations. The former represents systems with components forming a kind of functional chain, in which a single component failure leads to the system failure; the latter represents redundant components, where the system fails only if all components fail. The ideas behind these configurations are indeed simple, but representative of several engineering systems.
FIGURE 5.6
Markovian model of a system with series configuration.
Beginning with the series configuration, Fig. 5.6 presents a series system with two components, A and B, and the corresponding Markovian model. It is assumed that the components have only two possible states: up and down. The failure rates for the components A and B are, respectively, λA and λB. Considering every combination of components’ states, there are four possible CTMC states:

• State 1: A up, B up → system up;
• State 2: A down, B up → system down;
• State 3: A up, B down → system down;
• State 4: A down, B down → system down.

The generator matrix for this system is given by Eq. (3.7).

$$G = \begin{bmatrix} -(\lambda_A + \lambda_B) & \lambda_A & \lambda_B & 0 \\ 0 & -\lambda_B & 0 & \lambda_B \\ 0 & 0 & -\lambda_A & \lambda_A \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{3.7}$$

Once the generator matrix is determined, it is possible to simulate the system behavior over time. Fig. 5.7 presents the results for each CTMC state probability over time. The parameters adopted are: λA = 1 × 10⁻³ h⁻¹; λB = 4 × 10⁻³ h⁻¹; and a mission time of 1000 h. It is possible to observe that the probability of the CTMC being in state 1 decreases, while the probabilities of states 2 and 3 increase until a maximum and then decrease, as these are transient states. The probability of state 4 is strictly increasing, since it is the absorbing state¹ of the CTMC. It means that once the system enters state 4, it cannot transit to other states. In practice, this means that the system reaches an unrepairable failure state. As will be shown in Section 5.3.5, the consideration of maintenance features can transform such states into non-absorbing states.

¹ Once the system enters an absorbing state, it cannot exit.

FIGURE 5.7
State probability over time for the series configuration.

Fig. 5.8 presents the probability over time regarding the engineering system state.² Among the four CTMC states depicted in Fig. 5.6, only state 1 corresponds to the system up state. The remaining states (2, 3, and 4) correspond to the system down state, since they include at least one component in the failure state. Therefore, the probability of the system being down in Fig. 5.8 is equal to the sum of the probabilities of states 2, 3, and 4. Note that the probability of the system up state is strictly decreasing, while the probability of the system down state is strictly increasing, because we are not yet considering the maintenance events.

² Note that the engineering system can assume only two states: up and down. These are different from the corresponding CTMC states, which represent the possible combinations of each component state.

FIGURE 5.8
System state (up/down) probability over time for the series configuration.

The parallel configuration, on the other hand, refers to systems in which the components are redundant. The failure of a single component does not lead to the system failure. Fig. 5.9 presents a system of two components, A and B, similar to that of Fig. 5.6, but assuming redundancy, together with the equivalent Markovian model. In this case, the system transits to the down state if and only if both components are in their down state. The corresponding generator matrix is the same as presented in Eq. (3.7). We can model the corresponding CTMC using the same parameters as in the previous simulation. Naturally, the CTMC state probabilities over time are identical to those presented in Fig. 5.7. The main difference in this case is the system up/down state probabilities. Fig. 5.10 presents the tendencies over time. In the parallel configuration case, only state 4 represents the system-down state, while the remaining states (1, 2, and 3) represent the system-up state. When we compare the results of Figs. 5.8 and 5.10 (i.e., the series configuration and the parallel configuration with identical failure rates), as expected, the down-state probability over time of the parallel configuration is significantly lower than that of the series configuration.

Summarizing, the series and parallel configurations are two of the most basic systems that can be modeled by Markovian models. Despite the apparent simplicity, they are very useful in several cases, since these configurations are widely found in engineering systems. This section detailed systems composed of only two components, but analogous models can be developed for systems including any number of components, including the combination of both configurations.
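To illustrate how the same CTMC serves both configurations, the sketch below integrates the generator matrix of Eq. (3.7) with the parameters quoted for Fig. 5.7 and then groups the CTMC states into system up/down states in the two ways described above. The Runge-Kutta integrator, the step count, and all names are assumptions made for illustration; the closed-form values printed at the end follow from the independence of the two exponential failure times.

```python
import math


def rk4_step(p, G, h):
    """One RK4 step of dP/dt = P G for a row vector p."""
    n = len(G)

    def f(v):
        return [sum(v[i] * G[i][j] for i in range(n)) for j in range(n)]

    k1 = f(p)
    k2 = f([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = f([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = f([x + h * k for x, k in zip(p, k3)])
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


lam_a, lam_b = 1.0e-3, 4.0e-3      # failure rates (h^-1) used for Fig. 5.7
t_end, steps = 1000.0, 2000        # mission time (h); step count is assumed

# Generator matrix of Eq. (3.7).
G = [[-(lam_a + lam_b), lam_a, lam_b, 0.0],
     [0.0, -lam_b, 0.0, lam_b],
     [0.0, 0.0, -lam_a, lam_a],
     [0.0, 0.0, 0.0, 0.0]]

p = [1.0, 0.0, 0.0, 0.0]           # both components up at t = 0
for _ in range(steps):
    p = rk4_step(p, G, t_end / steps)

series_up = p[0]                   # series: only state 1 is an up state
parallel_up = p[0] + p[1] + p[2]   # parallel: only state 4 is a down state
print(series_up, parallel_up)

# Closed-form values for comparison (independent exponential failures).
print(math.exp(-(lam_a + lam_b) * t_end))
print(1.0 - (1.0 - math.exp(-lam_a * t_end)) * (1.0 - math.exp(-lam_b * t_end)))
```

Only the grouping of states changes between the two configurations; the state probabilities themselves come from a single integration.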
5.3.3
Standby systems
Another recurrent configuration refers to standby systems. In these systems, one or more components are initially designated as active and perform the
FIGURE 5.9
Markovian model of a system with parallel configuration.
desired function. Additionally, one or more similar (but not necessarily identical) components remain in an idle state, ready to take over the function of the active components once they fail. It is a kind of redundant system in which the redundant component does not need to be continuously active. The main advantage of this type of configuration is that the failure rate of idle components is generally lower than their failure rate when active. Therefore, the failure probability while offline is significantly lower, thus improving the system’s overall reliability. Markovian models are adequate to model this kind of behavior. Fig. 5.11 presents a Markovian model for a typical standby system. In this example, component A is active and component B is in standby. When component A fails, component B is set online through the activation of a switch. As in the examples of Section 5.3.2, the failure rates of components A and B are λA and λB, respectively. In this case, it is important to note that these rates are applicable only if the components are active. Additionally, while in standby, the failure rate of component B is given by λSB, which is lower than λB. We can also assume that the switch has an on-demand failure probability of p. Therefore, the system failure occurs in one of the following scenarios:

• Component A fails while active, the switch is activated successfully, and component B fails while active.
• Component A fails while active and component B fails while in standby.
• Component A fails while active and the switch is not activated successfully.
FIGURE 5.10
System state (up/down) probability over time for the parallel configuration.
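The generator matrix for this example is given by Eq. (3.8).

$$G = \begin{bmatrix} -[\lambda_A(1-p) + \lambda_{SB} + \lambda_A p] & \lambda_A(1-p) & \lambda_{SB} & \lambda_A p \\ 0 & -\lambda_B & 0 & \lambda_B \\ 0 & 0 & -\lambda_A & \lambda_A \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{3.8}$$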
For simplicity, it is assumed that the switch failure leads to the down state of B. Note that, when component A fails while the system is in state 1, the system transits to state 2 with a probability of (1 − p) and to state 4 with a probability of p. Some simplifications regarding the presented example are possible [Ebeling, 1997], including: (a) identical components (make λB = λA); and (b) no switching failure (make p = 0). In qualitative terms, the shape of the probability curves would be similar to that of the ordinary parallel system. However, as the probability of switch failure increases, the system naturally tends to fail more often.
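A minimal sketch of the standby model is given below. It builds the generator matrix of Eq. (3.8) and compares an idealized case (identical components, perfect switch, and, as an additional assumption, no failures in standby) with a case that includes switch and standby failures. All numerical parameters are hypothetical, and the function names and the Runge-Kutta integrator are illustrative choices. For the idealized case, the result can be compared with the well-known closed form exp(−λt)(1 + λt).

```python
import math


def standby_generator(lam_a, lam_b, lam_sb, p):
    """Generator matrix of Eq. (3.8) for the standby system of Fig. 5.11."""
    leave_1 = lam_a * (1.0 - p) + lam_sb + lam_a * p
    return [[-leave_1, lam_a * (1.0 - p), lam_sb, lam_a * p],
            [0.0, -lam_b, 0.0, lam_b],
            [0.0, 0.0, -lam_a, lam_a],
            [0.0, 0.0, 0.0, 0.0]]


def vec_mat(p, G):
    """Row vector times matrix: returns p G."""
    n = len(G)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]


def transient(G, p0, t, steps=2000):
    """Integrate dP/dt = P G with fixed-step RK4 (illustrative solver)."""
    h, p = t / steps, list(p0)
    for _ in range(steps):
        k1 = vec_mat(p, G)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], G)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], G)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], G)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


# Hypothetical parameters, chosen only for illustration.
lam, t = 1.0e-3, 1000.0            # failure rate (h^-1) and mission time (h)
start = [1.0, 0.0, 0.0, 0.0]

# Simplifications (a) and (b), plus an assumed zero standby failure rate.
ideal = transient(standby_generator(lam, lam, 0.0, 0.0), start, t)
rel_ideal = 1.0 - ideal[3]

# Same system with an imperfect switch and failures while in standby.
real = transient(standby_generator(lam, lam, 2.0e-4, 0.1), start, t)
rel_real = 1.0 - real[3]

print(rel_ideal, math.exp(-lam * t) * (1.0 + lam * t))
print(rel_real)
```

The comparison shows the qualitative effect described in the text: switch failures and standby failures both lower the system reliability relative to the idealized case.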
5.3.4
Load-sharing systems
Components in parallel configuration can often share stress loads or deliver a fraction of the system demand for their functions. For instance, if two centrifugal water pumps are working simultaneously, it is possible that each one delivers 50% of the demanded water flow. These are called load-sharing systems. It is reasonable to suppose that the components’ failure rates depend on the regime in which the component is operating. Higher loads can induce larger stresses, thus significantly reducing the component’s reliability. There are
FIGURE 5.11
Markovian model for a typical standby system.
even cases in which the opposite occurs: the component operation is optimized for elevated loads and operating it with low loads may increase the component degradation. This latter scenario is the case of maritime engines, for instance, if no preventive measures are taken [MAN Diesel & Turbo, 2011]. Markovian models are also adequate to model the reliability of this type of configuration. Fig. 5.12 presents the Markovian model for a typical load-sharing system. The system is composed of two components, A and B, operating in parallel. Each component can deliver 100% of the system demand alone (high-load mode), but when both are online, they deliver only 50% each (low-load mode). When operating in the low-load mode, the failure rates for the components A and B are $\lambda_A^-$ and $\lambda_B^-$, respectively. When in the high-load mode, they are subjected to different failure rates, $\lambda_A^+$ and $\lambda_B^+$. Additionally, the system fails only if both components are down. The corresponding generator matrix is given by Eq. (3.9).

$$G = \begin{bmatrix} -(\lambda_A^- + \lambda_B^-) & \lambda_A^- & \lambda_B^- & 0 \\ 0 & -\lambda_B^+ & 0 & \lambda_B^+ \\ 0 & 0 & -\lambda_A^+ & \lambda_A^+ \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{3.9}$$

FIGURE 5.12
Markovian model for a typical load-sharing system.

FIGURE 5.13
State probability over time for the load-sharing system.

Fig. 5.13 presents the system behavior over time for a mission time of 1000 h and the following failure rates: $\lambda_A^-$ = 3 × 10⁻³ h⁻¹; $\lambda_B^-$ = 5 × 10⁻³ h⁻¹; $\lambda_A^+$ = 3 × 10⁻² h⁻¹; and $\lambda_B^+$ = 5 × 10⁻² h⁻¹. In this case, it is notable how the system tends to fail quickly right after one component fails, since the failure rate of the remaining component is dramatically increased when subjected to high loads. This is observable in Fig. 5.13 by the low probabilities associated with states 2 and 3, which represent the high-load mode of operation.
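The behavior described for the load-sharing system can be reproduced with a short sketch that integrates the generator matrix of Eq. (3.9) using the failure rates quoted above and records the largest probability reached by states 2 and 3. The integrator, the step count, and the variable names are assumptions made for illustration.

```python
import math

# Failure rates (h^-1) adopted for Fig. 5.13.
lam_a_low, lam_b_low = 3.0e-3, 5.0e-3      # both components online
lam_a_high, lam_b_high = 3.0e-2, 5.0e-2    # surviving component alone

# Generator matrix of Eq. (3.9).
G = [[-(lam_a_low + lam_b_low), lam_a_low, lam_b_low, 0.0],
     [0.0, -lam_b_high, 0.0, lam_b_high],
     [0.0, 0.0, -lam_a_high, lam_a_high],
     [0.0, 0.0, 0.0, 0.0]]


def vec_mat(p, G):
    """Row vector times matrix: returns p G."""
    n = len(G)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]


def rk4_step(p, G, h):
    """One RK4 step of dP/dt = P G."""
    k1 = vec_mat(p, G)
    k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], G)
    k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], G)
    k4 = vec_mat([x + h * k for x, k in zip(p, k3)], G)
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


t_end, steps = 1000.0, 2000    # mission time (h); the step count is assumed
p = [1.0, 0.0, 0.0, 0.0]
peak_2 = peak_3 = 0.0
for _ in range(steps):
    p = rk4_step(p, G, t_end / steps)
    peak_2 = max(peak_2, p[1])   # largest probability of state 2 so far
    peak_3 = max(peak_3, p[2])   # largest probability of state 3 so far

# Closed-form probability of state 1, used as a sanity check.
p1_exact = math.exp(-(lam_a_low + lam_b_low) * t_end)
print(p[0], p1_exact)
print(peak_2, peak_3)
```

The small peak values of states 2 and 3 are the numerical counterpart of the observation that the system spends little time in the high-load mode before failing.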
5.3.5
Repairable systems
When analyzing the reliability of engineering systems, we may also want to evaluate the system behavior when it is subjected to repairs. In this case, it is necessary to consider the concept of maintainability, which refers essentially to the ability to repair a system within a given period of time. Let T denote the random variable representing the time to repair of an item. Then, T follows a probability distribution, which is called the item’s repair probability distribution. We may denote its pdf by m(t|c1, c2, …, cn), where c1, c2, …, cn refer to the conditions of repair (e.g., availability of spare parts, tools, and trained workers). For convenience, from now on, the pdf will be denoted simply by m(t), assuming that the conditions of repair are known and do not change during the system mission time. Additionally, the cdf is given by Eq. (3.10).

$$M(t) = \Pr(T \le t) = \int_0^t m(\tau)\,d\tau \tag{3.10}$$
If the repair time follows an exponential probability distribution, it is possible to define the repair pdf and cdf based on a single parameter, the repair rate, μ. The repair rate is constant and given in number of repairs per unit of time. In this case:

$$m(t) = \mu \exp(-\mu t) \tag{3.11}$$

$$M(t) = \Pr(T \le t) = \int_0^t m(\tau)\,d\tau = 1 - \exp(-\mu t) \tag{3.12}$$
When considering repairs, the reliability concept is replaced by the availability concept. The reliability tends to zero as the mission time tends to infinity, but the availability can tend to a finite value between 0 and 1, since it is possible to recover from a failure. Formally, the availability is defined as in Eq. (3.13) [Ebeling, 1997], that is, it is the fraction of the mission time in which the system remains in the up state.

$$\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} = \frac{\text{Uptime}}{\text{Mission time}} \tag{3.13}$$
If both the failure and repair times are exponentially distributed, then we can model the system behavior using Markovian models. By adopting the formulas
FIGURE 5.14
Markovian model for a repairable system.
presented in Section 5.2, it is possible to compute not only the state probabilities over time, but also the system availability (using the occupancy times). Finally, it is possible to compute the expected number of repairs through the expected number of transitions from one state to another, thereby supporting decisions about the number of spare parts necessary for a given mission time. As in the previous sections, it is instructive to present the modeling of repairable systems using a didactic example. Fig. 5.14 illustrates a system with two repairable components, A and B, in series and the corresponding Markovian model. Their failure rates are λA and λB, respectively. Additionally, the repair rates are μA and μB, respectively. Eq. (3.14) gives the generator matrix for the example of Fig. 5.14:

$$G = \begin{bmatrix} -(\lambda_A + \lambda_B) & \lambda_A & \lambda_B & 0 \\ \mu_A & -(\lambda_B + \mu_A) & 0 & \lambda_B \\ \mu_B & 0 & -(\lambda_A + \mu_B) & \lambda_A \\ 0 & \mu_B & \mu_A & -(\mu_A + \mu_B) \end{bmatrix} \tag{3.14}$$

Figs. 5.15 and 5.16 present, respectively, the probability of each state of the Markovian model over time and the system state probabilities over time. For these simulations, the following parameters were adopted: λA = 3 × 10⁻³ h⁻¹; λB = 5 × 10⁻³ h⁻¹; μA = 1 × 10⁻² h⁻¹; and μB = 2 × 10⁻² h⁻¹. Note that state 1 is the only up state, while the remaining are down states. It is notable how these probabilities approach a constant value as the time increases. In other
FIGURE 5.15
State probability over time for a repairable system.
FIGURE 5.16
System state (up/down) probability over time for the repairable system.
words, the CTMC approaches a steady state. It indicates that the failure and repair events reach a kind of equilibrium after enough time. Since the system starts from state 1, it is possible to compute the occupancy time of each state for a mission time of 1000 h. Remember that Section 5.2 presents an algorithm to compute the occupancy times. Let OTi denote the occupancy time of the ith state. We have that: OT1 = 636.95 h (system up time); OT2 = 171.05 h; OT3 = 150.03 h; OT4 = 41.97 h. These four values add up to the mission time of 1000 h. Therefore, the system availability can be computed using Eq. (3.13):

$$\text{Availability} = \frac{OT_1}{\text{Mission time}} = \frac{636.95\ \text{h}}{1000\ \text{h}} = 63.695\% \tag{3.15}$$
It is also possible to compute the number of times each component is repaired. Component A is repaired whenever a transition occurs from state 2 to state 1 or from state 4 to state 3. In turn, component B is repaired whenever a transition occurs from state 3 to state 1 or from state 4 to state 2. Denoting by NRA and NRB the expected number of repairs for components A and B, respectively, we have that:

NRA = μA OT2 + μA OT4 = 2.13 transitions;
NRB = μB OT3 + μB OT4 = 3.84 transitions.
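The occupancy-time, availability, and repair-count calculations of this example can be sketched as follows, using the generator matrix of Eq. (3.14) and the parameters adopted for Figs. 5.15 and 5.16. Instead of the uniformization algorithm of Section 5.2, this sketch accumulates the occupancy times by integrating the state probabilities with the trapezoidal rule, which is an assumed, simpler stand-in; the names and the step count are also illustrative.

```python
lam_a, lam_b = 3.0e-3, 5.0e-3   # failure rates (h^-1)
mu_a, mu_b = 1.0e-2, 2.0e-2     # repair rates (h^-1)

# Generator matrix of Eq. (3.14).
G = [[-(lam_a + lam_b), lam_a, lam_b, 0.0],
     [mu_a, -(lam_b + mu_a), 0.0, lam_b],
     [mu_b, 0.0, -(lam_a + mu_b), lam_a],
     [0.0, mu_b, mu_a, -(mu_a + mu_b)]]


def vec_mat(p, G):
    """Row vector times matrix: returns p G."""
    n = len(G)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]


def occupancy(G, p0, t_end, steps=4000):
    """Integrate dP/dt = P G (RK4) and accumulate the time spent in each
    state with the trapezoidal rule."""
    h = t_end / steps
    p = list(p0)
    ot = [0.0] * len(p)
    for _ in range(steps):
        k1 = vec_mat(p, G)
        k2 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k1)], G)
        k3 = vec_mat([x + 0.5 * h * k for x, k in zip(p, k2)], G)
        k4 = vec_mat([x + h * k for x, k in zip(p, k3)], G)
        new_p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
        ot = [o + 0.5 * h * (a + b) for o, a, b in zip(ot, p, new_p)]
        p = new_p
    return p, ot


mission = 1000.0                       # mission time (h)
p_end, ot = occupancy(G, [1.0, 0.0, 0.0, 0.0], mission)

availability = ot[0] / mission         # Eq. (3.13): only state 1 is up
nr_a = mu_a * (ot[1] + ot[3])          # repairs of A: 2 -> 1 and 4 -> 3
nr_b = mu_b * (ot[2] + ot[3])          # repairs of B: 3 -> 1 and 4 -> 2

print([round(x, 2) for x in ot], round(availability, 5))
print(round(nr_a, 2), round(nr_b, 2))
```

A useful consistency check is that the four occupancy times add up to the mission time, since the system is always in exactly one state.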
5.3.6
State-space reduction for reliability analysis
The examples of Markovian models presented in this section so far are quite simple. They consider only two components, each with only two possible states: up and down. This leads to a state space of 2² = 4 states. Since the state space grows exponentially with the number of components, the inclusion of more components tends to significantly increase the state-space size. For instance, a system of 10 components with two states each would have a state space of 2¹⁰ = 1024 states. If we consider an additional state for each component (e.g., up, down, and degraded), this will lead to a state space composed of 3¹⁰ = 59049 states. Therefore, the Markovian models can quickly become computationally intractable. In order to avoid this problem, it is possible to apply state-space reduction techniques. They can significantly reduce the state-space size and are applicable particularly when the system or part of it is composed of components in series configuration. Consider, for instance, the system of Fig. 5.6, composed of two components in series. The system failure probability, FS(t), can be computed analytically by applying Eq. (3.16), where Pr(A) and Pr(B) denote the probabilities of failure
of components A and B, respectively.

$$F_S(t) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) \tag{3.16}$$
Assuming that the events are independent and that the failure rates are constant, developing Eq. (3.16) leads to:

$$F_S(t) = \Pr(A) + \Pr(B) - \Pr(A)\Pr(B)$$

$$F_S(t) = [1 - \exp(-\lambda_A t)] + [1 - \exp(-\lambda_B t)] - [1 - \exp(-\lambda_A t)][1 - \exp(-\lambda_B t)] \tag{3.17}$$
Simplifying the expression of Eq. (3.17) leads to:

$$F_S(t) = 1 - \exp[-(\lambda_A + \lambda_B)t] \tag{3.18}$$
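As a quick numerical check of this simplification, the sketch below evaluates Eq. (3.17) and Eq. (3.18) at a few instants. The failure rates are those of the series example of Fig. 5.7; the function names and the chosen instants are illustrative.

```python
import math

lam_a, lam_b = 1.0e-3, 4.0e-3   # rates (h^-1) of the series example (Fig. 5.7)


def fs_expanded(t):
    """Eq. (3.17): inclusion-exclusion with independent failures."""
    fa = 1.0 - math.exp(-lam_a * t)
    fb = 1.0 - math.exp(-lam_b * t)
    return fa + fb - fa * fb


def fs_reduced(t):
    """Eq. (3.18): single equivalent component with rate lam_a + lam_b."""
    return 1.0 - math.exp(-(lam_a + lam_b) * t)


for t in (10.0, 100.0, 1000.0):
    print(t, fs_expanded(t), fs_reduced(t))
```

Both expressions agree up to floating-point rounding, which is what allows the series pair to be replaced by a single equivalent component.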
The expression of Eq. (3.18) is equivalent to the failure probability of a component whose failure rate is equal to λA + λB. Therefore, it is possible to approach the same problem by considering only two states with a transition rate of λA + λB from the up state to the down state. In this case, we have reduced the state space from four to two states.

In order to further present the benefits of state-space reduction, the remaining part of this section is dedicated to a practical example. Valve assemblies are commonly used in oil and gas production to control, direct, and regulate the flow. Depending on the phase of the well life cycle, the assembly can be different. In the production phase, in onshore wells, it is called a Christmas Tree, owing to its resemblance to the decorated tree traditionally exhibited during Christmas celebrations, as shown in Fig. 5.17. During maintenance or tests of the well, it is sometimes necessary to install a similar assembly, called a surface-test tree. Regardless of the name or the phase of the well, there are basically redundant valves that allow access to the well through both the production column and the annular space. Fig. 5.18 presents a simplified diagram of the Christmas Tree. This diagram shows the physical arrangement of the valves. Although the equipment has several functions, let us consider only the safety function of isolating the well from the environment, which avoids external leakage (EL). According to the diagram, the components have the following functions:

• Valve V1 isolates the well from the region “A” of the Christmas Tree;
• Valve V2 isolates the region “A” of the Christmas Tree from the external environment;
• Valve V3 isolates the region “A” of the Christmas Tree from the external environment;
• Valve V4 isolates the region “A” of the Christmas Tree from the region “B” of the Christmas Tree;
• Valve V5 isolates the region “B” of the Christmas Tree from the external environment;
FIGURE 5.17
Christmas tree examples.
FIGURE 5.18
Christmas tree assembly diagram.
TABLE 5.1 Valves communicating each space in the Christmas tree example.

| From ↓ / To → | Well | Region A | Region B | External environment |
| Well | - | V1 | - | - |
| Region A | - | - | V4 | V2, V3 |
| Region B | - | - | - | V5, V6, V7 |
| External environment | - | - | - | - |
FIGURE 5.19
Part of the reduced Markovian model, focusing on the transition between two states.
• Valve V6 isolates the region “B” of the Christmas Tree from the external environment;
• Valve V7 isolates the region “B” of the Christmas Tree from the external environment.

If we attempt to model this problem by the conventional Markovian approach, considering only two states for each valve (up/down), this would lead to a state space of size 2⁷ = 128. However, we can significantly reduce the state-space size if we take into account the four regions that the valves isolate from each other: (a) the well; (b) region “A” of the Christmas Tree; (c) region “B” of the Christmas Tree; and (d) the external environment. Instead of considering the component states, we can consider the state of each region as isolated or not isolated. This would reduce the state-space size to 2⁴ = 16 states. Table 5.1 presents the valves that isolate one region from another. The transition rates for the reduced model are equivalent to the sum of the failure rates of the valves that may cause the pressure communication. Fig. 5.19 presents a part of the reduced Markovian model, focusing on a specific transition. One state considers that region A is not isolated, but the
FIGURE 5.20
Reliability of the Christmas tree.
environment is. The other state considers that the environment is not isolated (i.e., there is a leakage of hydrocarbons to the external environment). In this case, there are two valves that can cause this leakage: V2 and V3. Therefore, the transition rate is equal to the sum of the failure rates of both valves. Fig. 5.20 presents the reliability of the Christmas tree for a mission time of one year (8760 h). For each valve, a failure rate of 1.00E-05 h⁻¹ was assumed. By including the repair rate of each valve, assumed as 1.00E-02 h⁻¹, it is also possible to compute the unavailability of the Christmas tree, as presented in Fig. 5.21. Both assumed rates can be considered as reference values in the offshore field.
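The aggregation of valve failure rates into region-to-region transition rates can be sketched as follows. The valve lists follow the valve functions described above and Table 5.1, and the failure rate is the value assumed in the text; the region labels, dictionary layout, and function name are hypothetical choices made for illustration, and the sketch does not attempt to build the full 16-state model.

```python
VALVE_FAILURE_RATE = 1.0e-5   # h^-1, value assumed in the text for every valve

# Valves whose failure lets pressure pass from one region to another
# (region names are labels chosen here; the valve lists follow Table 5.1).
BARRIERS = {
    ("well", "region_A"): ["V1"],
    ("region_A", "region_B"): ["V4"],
    ("region_A", "external"): ["V2", "V3"],
    ("region_B", "external"): ["V5", "V6", "V7"],
}


def reduced_rates(barriers, rate_per_valve):
    """Transition rate between regions = sum of the valve failure rates."""
    return {pair: len(valves) * rate_per_valve
            for pair, valves in barriers.items()}


rates = reduced_rates(BARRIERS, VALVE_FAILURE_RATE)
n_valves = sum(len(valves) for valves in BARRIERS.values())
n_regions = len({region for pair in BARRIERS for region in pair})

# Rate of the transition highlighted in Fig. 5.19 (leakage through V2 or V3).
print(rates[("region_A", "external")])
# Component-level versus region-level state-space sizes.
print(2 ** n_valves, "->", 2 ** n_regions)
```

With equal failure rates, the rate of each reduced transition is simply the number of valves on that barrier times the common rate; unequal rates would require summing the individual values instead.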
5.3.7
A practical application of state-space reduction
State-space reduction can make it feasible to model complex engineering systems that would otherwise be intractable. One practical application, closely related to the example in Section 5.3.6, is the Markovian model for computing the reliability of a subsea oil well presented by Colombo et al. [Colombo et al., 2020]. As presented in Fig. 5.22, an oil well is composed of several components, the function of which is to: (a) deliver the hydrocarbons from the reservoir to the production lines (PLs); and (b) prevent uncontrolled hydrocarbon leakages from the reservoir to the environment. This latter function is referred to as “well integrity”.
FIGURE 5.21 Unavailability of the Christmas tree.
The oil well modeled contains dozens of components, each with its own failure modes. The total number of failure modes is equal to 41. Considering a binary approach for each failure mode (occurred or not occurred), the state space would include $2^{41}$ states, that is, more than 1 trillion states. Such a model is not computationally feasible. The alternative was to adopt an approach based on the oil-well cavities. Each cavity is a region of the well delimited by oil-tight components. Pressure communication between cavities occurs through the component failure modes.

Before detailing the cavities, it is important for the reader who is not familiar with the oil and gas industry to understand how a subsea production well works. The oil flows from the reservoir to a subsea manifold or to the floating production unit through the production tubing (PT), the Christmas tree (for short, Xmas tree, or XT), and the PL. The well is constructed using concentric casings, fixed with cementing. The casings are deployed on the subsea wellhead system (SWHS), which is fixed on the seabed. The last casing string is the production casing (PC). The void space between two consecutive casings is named “annulus”, and the annuli are identified in alphabetical order from inside to outside (annulus A, annulus B, and so on). To avoid leakage from the reservoir to annulus A, the oil well contains an elastomeric element, the packer, which is placed between the PT and the PC. The PT contains a safety valve inside it, the downhole safety valve (DHSV), to prevent undesired flow from inside the tubing, if necessary. The tubing is deployed on the tubing hanger (TH), which is deployed on the production adapter base (PAB), connected to the SWHS. It is also possible to control the flow using the Xmas tree production master valve (M1) and the production wing valve (W1). The Xmas tree and the PAB allow access to annulus A for maintenance purposes. This is achieved through the annulus access line (AAL), the annulus wing valve (W2), the annulus master valve (M2), and the annulus intervention valves (AI1 and AI2). It is also possible to access the well through the swab valves above the Xmas tree (S1 and S2), by removing the tree cap. Finally, the gas-lift valve (GLV) is adopted to allow injecting gas into the oil stream to reduce the mixture density and improve the oil flow.

FIGURE 5.22 Schematic drawing of an oil well and its components. Source: Colombo et al. (2020).

As presented in Fig. 5.23, the model contains 13 cavities: 12 cavities corresponding to the well and 1 cavity representing the external environment. The state space is then built based on the tightness of each cavity: whether it is pressure communicated or not. Since cavity 1 is always communicated, the state space can disregard its status. Therefore, the initial state-space size is $2^{12} = 4096$ states. This is significantly less than the initial proposal. However, it is possible to further decrease the number of states by removing impossible states. For instance, it is not possible to have a pressure communication between two cavities that do not share common components in their boundaries. By removing these impossible states, the state space is reduced to only 531 states, leading to a model that is computationally feasible for ordinary computers.

FIGURE 5.23 Oil well cavities. Source: Colombo et al. (2020).

For each component, the failure modes that may affect the well integrity were considered. The failure rates adopted are presented in Table 5.2. The failure modes are presented as acronyms for simplicity. They refer to: external leakage (EL), annulus leakage (AL), leakage in closed position (LCP), and failure to close (FTC). Once the failure rates are known, the failure rate between two cavities is equal to the sum of the failure rates of each component in their common boundary. Fig. 5.24 presents the reliability of the well barriers for 30 years of mission time. After 30 years, the reliability is equal to 0.877. If the repair rates of each component are taken into account, it is also possible to compute the availability of the safety barriers, and not only the reliability. Additionally, if the model is combined with MPMC (see Section 5.6), it is also possible to consider the impact of test policies.
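The arithmetic behind the state-space reduction and the rule for building a transition rate between cavities can be checked with a short sketch. The two failure rates are taken from Table 5.2, but their grouping into a single cavity boundary is a hypothetical assumption made only for illustration; the actual boundaries are those of Fig. 5.23.

```python
# State-space sizes quoted in the text.
states_per_failure_mode = 2 ** 41   # one occurred/not-occurred flag per failure mode
states_per_cavity = 2 ** 12         # one communicated/tight flag per cavity

# Hypothetical boundary between two cavities: the grouping is an assumption,
# the rates (1/h) are the Table 5.2 values for the M1 valve.
boundary_failure_rates = {
    "M1 valve - LCP": 1.77e-07,
    "M1 valve - FTC": 2.37e-07,
}


def boundary_rate(failure_rates):
    """Transition rate between two cavities: sum of the failure rates of the
    failure modes on their common boundary."""
    return sum(failure_rates.values())


print(f"binary failure modes: {states_per_failure_mode:,} states")
print(f"binary cavities:      {states_per_cavity:,} states")
print(f"boundary rate:        {boundary_rate(boundary_failure_rates):.2e} 1/h")
```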
TABLE 5.2 Failure rates adopted for the oil well model.

Failure mode                            Failure rate (1/h)
VX Ring – EL                            1.30E-07
XT block – EL                           3.00E-07
PAB block – EL                          3.00E-07
XT cap – EL                             5.00E-08
Connection between PAB and XT – EL      1.40E-07
PS below DHSV – AL                      1.57E-08
PS above DHSV – AL                      3.20E-08
DHSV – FTC                              6.18E-07
DHSV – LCP                              7.17E-07
Annulus access line – EL                3.70E-08
Production line – EL                    1.70E-08
Packer – AL                             3.00E-08
Production casing – EL                  8.20E-08
SWHS – EL                               5.00E-07
TH seal – EL                            2.14E-07
AI1 valve – FTC                         1.00E-07
AI1 valve – EL                          1.80E-07
AI1 valve – LCP                         2.70E-07
AI2 valve – FTC                         1.00E-07
AI2 valve – EL                          1.80E-07
AI2 valve – LCP                         2.70E-07
M1 valve – FTC                          2.37E-07
M1 valve – EL                           1.00E-08
M1 valve – LCP                          1.77E-07
M2 valve – FTC                          2.37E-07
M2 valve – EL                           1.00E-08
M2 valve – LCP                          1.77E-07
S1 valve – FTC                          2.37E-07
S1 valve – EL                           1.00E-08
S1 valve – LCP                          1.77E-07
S2 valve – FTC                          2.37E-07
S2 valve – EL                           1.00E-08
S2 valve – LCP                          1.77E-07
W1 valve – FTC                          2.37E-07
W1 valve – EL                           1.00E-08
W1 valve – LCP                          1.77E-07
W2 valve – FTC                          2.37E-07
W2 valve – EL                           1.00E-08
W2 valve – LCP                          1.77E-07
XO valve – FTC                          2.37E-07
XO valve – EL                           1.00E-08
XO valve – LCP                          1.77E-07
GLV – AL                                1.20E-05
FIGURE 5.24 Reliability of the well barriers.

5.4 Importance measures using Markov chains

Importance measures (IMs) are used in various fields to evaluate the relative importance of events or components [Zhu and Kuo, 2014]. When analyzing the reliability, the availability, or the risk posed by the operation of an engineering system, it is essential to identify those events or components that contribute the most to the outcome. Since components do not contribute to the system in the same way [Cheok et al., 1998], the reliability engineer needs a technique that allows the identification and ranking of the most influential components. Based on this, the engineer can answer several questions, such as: (a) Which component contributes the most to system downtime? (b) Which component, if failed, will increase the total risk posed by the system operation the most? (c) What would be the optimal choice to invest resources and time to improve the system?

In order to identify the contribution of each component or event to the total risk (or another output), the method of importance ranking can be used [Modarres, 2006]. IMs are used to support decision making involving system operation, safety, and maintenance [Noroozian et al., 2018]. Some applications of IMs in reliability design include component assignment problems, redundancy allocation, system upgrading, and fault diagnosis and maintenance [Zhu and Kuo, 2014]. The most common inputs used for IMs are the probabilities or frequencies of component failure modes. It is also possible to evaluate the importance of external events to the system. The most common output evaluated in importance ranking is the total risk; nevertheless, other parameters such as reliability, availability, or unavailability can also be used.

Traditionally, IMs are calculated within FT analysis (FTA), which restricts the evaluation to a static analysis that does not capture aspects related to temporal and sequential dependence among events. The purpose of this section is not only to present and discuss the most common IMs, but also to extend their definition and calculation methods to Markovian models, in order to improve the modeling and analysis power of these tools. Section 5.4.1 summarizes each of the most used IMs proposed in the literature and their mathematical formulation. Section 5.4.2 presents the methodology developed for the calculation of these IMs using MC. Finally, Section 5.4.3 illustrates the application of IM computation to an engineering system.
5.4.1 Traditional importance measures
Birnbaum [Birnbaum, 1969] was the first to introduce the concept of importance measure, in 1969; nowadays, however, several IMs can be found in the literature. There are absolute and relative kinds of IMs: the first evaluates the contribution of each element (failure mode, component, or event) to the outcome of interest (reliability, availability, or risk), and the second compares the contribution of each element to those outcomes with that of another element. The most traditional measures are:
• Birnbaum importance (BI)
• Criticality importance measure (CI)
• Risk achievement worth (RAW)
• Risk reduction worth (RRW)
• Fussell–Vesely (FV)
• Differential importance measure (DIM)

These IMs can be divided into two classes [Modarres, 2006; Borgonovo and Apostolakis, 2001; Bhattacharya and Roychowdhury, 2014]:
• Design centered (DC) IMs: used to support decisions about system design and re-design, for example, BI and RRW.
• Test and maintenance-centered (TMC) IMs: used to evaluate the impact on the system performance of the maintenance strategies and changes in components, for example, FV, CI, and RAW.

Next, each of the above IMs is defined and its mathematical formulation is presented. Total risk is used as the output of interest, and the probability, frequency, or rate of occurrence of basic events is used as input. In practice, the risk can be replaced by other variables such as system reliability or availability, and other parameters can be used as input.
5.4.1.1 Birnbaum importance

The BI of a specific event can be defined as the rate of change of the output of the system with respect to the variation of the probability of occurrence of the event. Mathematically, it is the derivative of the risk $R$ with respect to the probability of occurrence $P_i$ of a specific event, as indicated in Eq. (4.1):

$$I_{B_i} = \frac{dR}{dP_i} \tag{4.1}$$

The above expression is BI’s absolute form. However, it can also be calculated in relative form, as in Eq. (4.2).

$$I^{R}_{B_i} = \frac{dR/dP_i}{R} \tag{4.2}$$
As BI measures the marginal increase or decrease of the outcome of interest of the system when the probability of a basic event varies, it can be used for a sensitivity analysis. In Section 5.4.2 it will become clear that the BI does not depend on the performance of the element itself, but on how this element contributes to or impacts the system configuration, that is, whether it has redundancies or whether its failure can lead to the failure of the entire system. This IM is useful during the design phase of a system, since it shows on which components the design should focus to allocate redundancies or concentrate improvement efforts. This is important to reduce the impact of component failures on the system performance.
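The claim that BI does not depend on the element's own performance can be checked numerically. The sketch below assumes a small hypothetical system (component 0 in series with a parallel pair of components 1 and 2) with made-up failure probabilities, and estimates Eqs. (4.1) and (4.2) by central finite differences; none of these names or values come from the chapter.

```python
def system_failure_probability(p):
    """Hypothetical system: component 0 in series with the parallel pair
    (1, 2); p[i] is the failure probability of component i."""
    return 1.0 - (1.0 - p[0]) * (1.0 - p[1] * p[2])


def birnbaum(risk, p, i, h=1e-6):
    """Central-difference estimate of dR/dP_i, Eq. (4.1)."""
    up, down = list(p), list(p)
    up[i] += h
    down[i] -= h
    return (risk(up) - risk(down)) / (2.0 * h)


p = [0.01, 0.10, 0.20]   # hypothetical failure probabilities
for i in range(len(p)):
    absolute = birnbaum(system_failure_probability, p, i)
    relative = absolute / system_failure_probability(p)   # Eq. (4.2)
    print(f"component {i}: I_B = {absolute:.4f}, relative I_B = {relative:.4f}")

# Changing the failure probability of component 0 leaves its own BI unchanged,
# because R is linear in each individual P_i.
bi_low = birnbaum(system_failure_probability, [0.01, 0.10, 0.20], 0)
bi_high = birnbaum(system_failure_probability, [0.50, 0.10, 0.20], 0)
print(f"I_B of component 0 at P_0 = 0.01: {bi_low:.4f}, at P_0 = 0.50: {bi_high:.4f}")
```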
5.4.1.2 Criticality importance measure

As BI does not take into account the performance of the element itself, its application in the decision-making process is limited. It is well known that improving the performance of a component with a high probability of failure is easier than improving a component that already has a low probability of failure. In order to take into account the performance of the element itself, an extended version of the relative BI definition can be used, as proposed in Eq. (4.3).

$$I_{C_i} = \frac{dR}{dP_i} \times \frac{P_i}{R} \tag{4.3}$$

This new measure is called the criticality importance (CI) measure.
5.4.1.3 Risk achievement worth

The RAW IM of an event $i$ ($I_{RAW}$) represents the change in the total risk (or any other performance measure of the system, if defined properly), given the occurrence of the event. It is given by Eq. (4.4).

$$I_{RAW} = \frac{R(P_i = 1)}{R} \tag{4.4}$$
This IM is especially important during the operational phase of a system. The higher the value of IRAW for a given component, the more critical its failure or degradation is for the risk posed by the system operation or its unavailability. Thus, IRAW could support the decision about the action to be taken when a component fails. In addition, if any component has a high RAW value, additional protections, such as redundancies, of the system against its failure may be necessary.
5.4.1.4 Risk reduction worth

The RRW IM of an event $i$ ($I_{RRW}$) represents the potential change in the total risk when the event probability (or probability of failure of an element) is set to zero. Hence, it is the opposite of the RAW. Mathematically, it is given by Eq. (4.5).

$$I_{RRW} = \frac{R}{R(P_i = 0)} \tag{4.5}$$
This IM shows the theoretical limit of risk reduction or reliability improvement of a system given the elimination of a failure mode. In other words, it shows what the system performance would be if the component were perfect. Using $I_{RRW}$, it is possible to identify the potential benefits of improving a specific element.
5.4.1.5 Fussell–Vesely

This IM was proposed by Vesely et al. in 1983 [Vesely et al., 1986] and is defined as the fractional contribution of all scenarios containing a specific event or element to the total risk. Mathematically this is the same as setting to zero the contribution of all other scenarios, and calculating the fraction that this result represents of the original risk, as in Eq. (4.6).

$$R = R(\text{scenarios that contain the specific element}) + R(\text{other scenarios})$$

$$I_{FV} = \frac{R(\text{scenarios that contain the specific element})}{R} \tag{4.6}$$

This result can be easily obtained by evaluating the risk related to the occurrence of the minimal cut sets that contain the specific event and the total risk.
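The cut-set evaluation of Eq. (4.6) can be sketched as follows. The minimal cut sets and event probabilities are hypothetical, the events are assumed to be independent, and the probability of a union of cut sets is computed exactly by enumerating all event states, which is only practical for a handful of events.

```python
from itertools import product


def union_probability(cut_sets, p):
    """Exact probability that at least one cut set has all of its events
    occurred, assuming independent events (full state enumeration)."""
    total = 0.0
    for state in product([False, True], repeat=len(p)):
        if any(all(state[i] for i in cut_set) for cut_set in cut_sets):
            weight = 1.0
            for occurred, p_i in zip(state, p):
                weight *= p_i if occurred else 1.0 - p_i
            total += weight
    return total


def fussell_vesely(cut_sets, p, i):
    """Eq. (4.6): risk of the scenarios containing event i over the total risk."""
    containing = [cut_set for cut_set in cut_sets if i in cut_set]
    return union_probability(containing, p) / union_probability(cut_sets, p)


cut_sets = [{0}, {1, 2}]    # hypothetical minimal cut sets
p = [0.01, 0.10, 0.20]      # hypothetical event probabilities
for i in range(len(p)):
    print(f"event {i}: I_FV = {fussell_vesely(cut_sets, p, i):.4f}")
```

In this sketch the FV values of different events need not add up to one, since a scenario in which several cut sets occur together is counted for each event involved.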
5.4.1.6 Differential importance measure

The DIM was introduced by Borgonovo and Apostolakis in 2001 [Borgonovo and Apostolakis, 2001] as the fraction of the differential variation of the total risk that is due to the differential variation of the probability of occurrence of a specific failure (failure mode or external event). It is defined as in Eq. (4.7).

$$DIM_i = \frac{(\partial R/\partial P_i)\, dP_i}{\sum_{j=1}^{n} (\partial R/\partial P_j)\, dP_j} = \frac{I_{B_i}\, dP_i}{\sum_{j=1}^{n} I_{B_j}\, dP_j} \tag{4.7}$$
where $dP_i$ indicates a variation in the probability of the $i$th component. The evaluation of DIM requires only the first-order partial derivatives of the chosen output with respect to the component inputs, as seen in Eq. (4.7). DIM can be used, in risk-informed decision-making, to quantify the relative contribution of a component of the system to the total variation of system performance provoked by changes in the values of the system parameters [Do Van et al., 2010]. This IM is related to the Birnbaum (or Criticality) IMs, producing the same ranking order based on the importance of elements or events. However, the DIM has the important property of being additive. Eq. (4.8) describes this property mathematically.

$$DIM(e_1, e_2, \ldots, e_n) = DIM_{e_1} + DIM_{e_2} + \ldots + DIM_{e_n} \tag{4.8}$$

5.4.2 Importance analysis using Markov chains
Although IMs have traditionally been evaluated within FTA, the concepts can be extended to other methods of reliability and risk analysis, such as MC. Due to the limitations of traditional FT models, more complex models have been used over the years. With the increase in computational power, it is now possible to model systems using MC with dozens, hundreds, or even thousands of states. It is therefore useful to have a way of calculating IMs using these models. Most IMs are obtained through so-called “one-at-a-time” methods, that is, methods in which each parameter of the model is varied individually while keeping the others constant. The final effect on the output is evaluated for each input. Although this method neglects dependencies and interactions between variables, it is simple and provides good indicators to the analysts. To derive the method of calculating IMs within MC, the total risk of the system, $R$, is represented as a linear combination of risk scenarios. Proposed by Wall et al. in 2001 [Wall et al., 2001], this expression is presented in Eq. (4.9).

$$R = a \cdot P_i + b \tag{4.9}$$
where the term $a \cdot P_i$ is the risk contribution of all scenarios containing the $i$th event and $b$ is the contribution of the scenarios that do not contain it. Therefore, $P_i$ is the probability of the $i$th event and $a$ accounts for the terms multiplying this probability (e.g., the probability of other events). As mentioned before, the term $R$ can be understood as some other parameter of interest, such as system reliability or availability. In the same way, the term $P$ can be considered as the probability or frequency of occurrence of some failure mode. Fricks and Trivedi [Fricks and Trivedi, 2003] highlight that the two main factors that determine the importance of a component in the system are:
1. The structure of the system; and
2. The reliability/availability of the component.

The terms $a$ and $b$ of Eq. (4.9) correspond to factor (1) and the term $P_i$ corresponds to factor (2). If the values of $a$ and $b$ could be determined for each event $i$, it would be trivial to compute all the IMs presented in Section 5.4.1. Therefore, the step-by-step process described below, which aims at calculating the values of $a$ and $b$, can improve the analysis using a MC model, allowing the evaluation of the IMs and thus supporting a more informed decision-making process. For these steps, let us assume a system composed of several components, whose probability of failure depends on the probability of failure of each component, $P_i$, $i = 1, 2, \ldots$. The risk function, $R$, is given by the system’s probability of failure. We then define the following quantities:
• $R(\text{base})$: the system failure probability computed using the base MC, that is, adopting the estimated value of probability for each element.
• $R(P_i = 0)$: the system failure probability considering the probability of failure of the $i$th component as being zero. This is easily implemented in the MC by making the transition rate of the respective event equal to zero.
• $R(P_i = 1)$: the system failure probability considering the probability of failure of the $i$th component as being unitary. This is implemented by assuming that the system’s initial state is the one in which the $i$th component is down. As a first approach, it is common to try the analogue of the $R(P_i = 0)$ case, making the failure rate of a single component so large that its failure probability becomes unitary. However, when dealing with numerical integration algorithms, this can easily lead to numerical instability.
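The three quantities above can be sketched on a deliberately small chain. The example assumes a hypothetical barrier made of two non-repairable valves A and B that fails only when both are down, with made-up failure rates. The transient solution uses uniformization written in pure Python, chosen here for self-containment; it is not necessarily the integration scheme used by the authors.

```python
import math


def build_generator(lam_a, lam_b):
    """Generator matrix of a hypothetical two-valve barrier without repair.
    States: 0 both up, 1 A down, 2 B down, 3 both down (system failed)."""
    return [
        [-(lam_a + lam_b), lam_a, lam_b, 0.0],
        [0.0, -lam_b, 0.0, lam_b],
        [0.0, 0.0, -lam_a, lam_a],
        [0.0, 0.0, 0.0, 0.0],
    ]


def transient(q, p0, t, max_terms=500, tol=1e-12):
    """State probabilities at time t by uniformization; adequate for this
    sketch, where (largest exit rate) * t is small."""
    n = len(q)
    rate = max(-q[i][i] for i in range(n))
    if rate == 0.0:
        return list(p0)
    step = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
            for i in range(n)]
    weight = math.exp(-rate * t)          # Poisson weight for k = 0
    accumulated = weight
    term = list(p0)
    result = [weight * x for x in term]
    for k in range(1, max_terms):
        term = [sum(term[i] * step[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, term)]
        if 1.0 - accumulated < tol:
            break
    return result


lam_a, lam_b, t = 1.0e-5, 5.0e-6, 8760.0   # hypothetical rates (1/h), one year
start_up = [1.0, 0.0, 0.0, 0.0]
start_a_down = [0.0, 1.0, 0.0, 0.0]

r_base = transient(build_generator(lam_a, lam_b), start_up, t)[3]
r_zero = transient(build_generator(0.0, lam_b), start_up, t)[3]       # rate of A set to zero
r_one = transient(build_generator(lam_a, lam_b), start_a_down, t)[3]  # start with A down

b = r_zero
a = r_one - r_zero
p_a = 1.0 - math.exp(-lam_a * t)
print(f"R(base) = {r_base:.6e}, a * P_A + b = {a * p_a + b:.6e}")
```

Starting the chain with valve A already down avoids the very large failure rate that would otherwise be needed to force its failure probability to one.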
Based on the values defined above, calculated with the MC, and on the expression of Wall et al. [Wall et al., 2001] for the total risk, we obtain the expressions in Eq. (4.10).

$$\begin{aligned} R(\text{base}) &= a \cdot P_i + b \\ R(P_i = 0) &= a \cdot 0 + b = b \\ R(P_i = 1) &= a \cdot 1 + b = a + b \end{aligned} \tag{4.10}$$
Rearranging Eq. (4.10), one can find the terms $a$ and $b$, as in Eq. (4.11).

$$\begin{aligned} b &= R(P_i = 0) \\ a &= R(P_i = 1) - R(P_i = 0) \end{aligned} \tag{4.11}$$
It is then possible to calculate, through Eqs. (4.12)–(4.16), the IMs defined in Section 5.4.1.

$$I_B = \frac{dR}{dP_i} = a \tag{4.12}$$

$$I_C = \frac{dR}{dP_i} \times P_i = a P_i \tag{4.13}$$

$$I_{RAW} = \frac{R(P_i = 1)}{R(\text{base})} = \frac{a + b}{a P_i + b} \tag{4.14}$$

$$I_{RRW} = \frac{R(\text{base})}{R(P_i = 0)} = \frac{a P_i + b}{b} \tag{4.15}$$

$$I_{FV} = \frac{R(\text{scenarios that contain the specific element})}{R} = \frac{a P_i}{a P_i + b} \tag{4.16}$$

The DIM can also be calculated based on Birnbaum’s IM.
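Eqs. (4.11) to (4.16) translate almost directly into code once a risk function is available. In the sketch below the risk function is a static structure function that is an assumption of this example, not something stated in the text: leakage is taken to require the failure of V1 together with V2, or V3, or V4 combined with any of V5, V6, V7. Valves are assumed independent and non-repairable, failure probabilities are evaluated at one year from hypothetical failure rates, and the DIM is computed from Eq. (4.7) assuming the same $dP_j$ for every valve.

```python
import math

# Hypothetical failure rates (1/h) and mission time (h) for seven valves.
RATES = {"V1": 1.0e-5, "V2": 1.0e-5, "V3": 5.0e-6, "V4": 1.0e-5,
         "V5": 1.0e-5, "V6": 5.0e-6, "V7": 1.0e-6}
MISSION_TIME = 8760.0


def risk(p):
    """Assumed structure function (hypothetical): leakage needs V1 plus
    V2, or V3, or V4 together with any of V5, V6, V7."""
    region_b = 1.0 - (1.0 - p["V5"]) * (1.0 - p["V6"]) * (1.0 - p["V7"])
    path = 1.0 - (1.0 - p["V2"]) * (1.0 - p["V3"]) * (1.0 - p["V4"] * region_b)
    return p["V1"] * path


def importance_measures(p):
    """Importance measures from R(base), R(P_i = 0) and R(P_i = 1)."""
    r_base = risk(p)
    rows = {}
    for name in p:
        b = risk({**p, name: 0.0})        # Eq. (4.11): b = R(P_i = 0)
        a = risk({**p, name: 1.0}) - b    # Eq. (4.11): a = R(P_i = 1) - b
        rows[name] = {
            "I_B": a,                                          # Eq. (4.12)
            "I_C": a * p[name],                                # Eq. (4.13)
            "I_RAW": (a + b) / r_base,                         # Eq. (4.14)
            "I_RRW": r_base / b if b > 0.0 else math.inf,      # Eq. (4.15)
            "I_FV": a * p[name] / r_base,                      # Eq. (4.16)
        }
    total_birnbaum = sum(row["I_B"] for row in rows.values())
    for row in rows.values():
        row["DIM"] = row["I_B"] / total_birnbaum   # Eq. (4.7), equal dP_j assumed
    return rows


p = {name: 1.0 - math.exp(-rate * MISSION_TIME) for name, rate in RATES.items()}
measures = importance_measures(p)
for name, row in measures.items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

With equal $dP_j$ the DIM reduces to a normalized Birnbaum importance, which is why the two measures rank the valves identically in this sketch and why the DIM values add up to one.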
5.4.3 Example
Returning to the Onshore Christmas Tree example presented in Section 5.3.6, it should be observed that valve number one (V1) is present in all possible scenarios of EL. This could be an indicator that this is the most important valve in the system; therefore, it should intuitively receive more attention for testing and maintenance. Nonetheless, this simple analysis ignores possible differences in the failure rate of each valve and the question of which IM would be more appropriate to analyze this specific operation. It is now possible to compute the IMs of each valve using the techniques presented previously. The importance of the IMs is related to the fact that some components have more significance in the functioning of the system than others [Bhattacharya and Roychowdhury, 2014]. Table 5.3 shows the results of the IM computation adopting the MC model. For comparison purposes, the same IMs were calculated using a FT model, as presented in Table 5.4. The failure rates represent typical values obtained in the industry for the components.

TABLE 5.3 IM computed using MC.

Component  Failure rate (1/h)  I_FV    I_RAW    I_B     I_RRW   I_C     DIM
V1         1.00E-05            1.0000  11.9228  0.1328  Inf     1.0000  0.4219
V2         1.00E-05            0.6318  7.5329   0.0794  2.4881  0.5981  0.2524
V3         5.00E-06            0.3228  7.5329   0.0760  1.4134  0.2925  0.2415
V4         1.00E-05            0.1502  1.7914   0.0096  1.0781  0.0725  0.0306
V5         1.00E-05            0.1243  1.4816   0.0059  1.0461  0.0441  0.0186
V6         5.00E-06            0.0635  1.4816   0.0056  1.0220  0.0216  0.0178
V7         1.00E-06            0.0129  1.4816   0.0054  1.0043  0.0042  0.0172

TABLE 5.4 IM computed using FT analysis.

Component  Failure rate (1/h)  I_FV    I_RAW    I_B     I_RRW   I_C     DIM
V1         1.00E-05            1.0000  11.9228  0.1328  Inf     1.0000  0.4219
V2         1.00E-05            0.6318  7.5329   0.0794  2.4881  0.5981  0.2524
V3         5.00E-06            0.3228  7.5329   0.0760  1.4134  0.2925  0.2415
V4         1.00E-05            0.1502  1.7914   0.0096  1.0781  0.0725  0.0306
V5         1.00E-05            0.1243  1.4816   0.0059  1.0461  0.0441  0.0186
V6         5.00E-06            0.0635  1.4816   0.0056  1.0220  0.0216  0.0178
V7         1.00E-06            0.0129  1.4816   0.0054  1.0043  0.0042  0.0172

The first conclusion is that there is no difference between the results obtained by the Markovian approach presented in Section 5.4.2 and those obtained by the traditional methods using the FT models. As intuitively expected, valve V1 has the greatest value for all six IMs. The $I_{RRW}$ of valve V1 is infinite because, if it were possible to make this valve perfect, that is, with probability of failure equal to zero, the total risk of the system would also be zero. The $I_{RAW}$ values of valves V5, V6, and V7 are identical, even though they have different failure rates. This is because when one of these valves fails, the others become useless. The same goes for V2 and V3, but not for V4. The valve V4 has a different FV importance than V2 and V3 because it has redundancy in the set of valves V5, V6, and V7. It is possible to see in this example that the FV importance depends not only on the characteristic of the component (e.g., failure rate) but also on the position of the component in the system configuration.

The BI works like a sensitivity measure of the system, showing the change in the probability of failure (or the unavailability) of the system in relation to the failure probability of the item. As demonstrated in Section 5.4.2, the Birnbaum IM does not depend on the component characteristics, but only on the remaining configuration when the specific component is eliminated. This explains why valve V5 has a greater Birnbaum IM value than V6 and V7. Because V6 and V7 have lower failure rates than V5, and therefore lower probabilities of failure, the system depends more on V5. The DIM leads to the same conclusions as the Birnbaum IM, with the additional property of being additive. In fact, the DIMs of all valves must add up to one. The $I_C$, besides the configuration of the system, considers the failure probability of the component itself. The CI is presented in its relative form.

The RAW IM can be used as an indicator for decision-making when a failure occurs. For instance, if valve V1 fails during operation of the evaluated Christmas tree, the EL risk is multiplied by a factor of approximately 12. Considering this, it would be highly recommended to repair this valve as fast as possible. The RRW IM can be used to identify possible candidates for improvement. Would it be a good decision to put money and time into improving the performance of valve V7? It would probably not be a good investment decision, because the total risk would remain practically the same. Observing the FV importance measure, it is possible to see that valve V1 has a value of one, indicating that the failure of this valve is in all possible scenarios of EL. The valves V2 and V3 occupy a symmetric position in the assembly; however, they have a different FV importance due to their different failure rates. The FV IM depends on the valve position in the assembly, because this impacts the number of EL scenarios in which the valve participates, and also on the valve failure rate, because it affects the probability of these scenarios.
Engineers intuitively believe that increasing the number of redundancies in a system increases its reliability or reduces the risk. However, this example shows that this may not be the optimal solution. Given that valve V1 is a critical item to the system operation and that the importance of redundant valves decreases as the number of redundancies increases, limiting the overall gain for the system, the best solution may be to improve the performance of the valve.
5.5 Uncertainty propagation in Markov chains

Both the model choice and the input data used when modeling the reliability and availability of systems are subject to uncertainties, which may arise from several sources. When choosing a mathematical model to structure the problem or physical situation, analysts rely on assumptions and simplifications of the real world. For instance, to perform a PSA or a RAM analysis, several support models and techniques are at our disposal, both qualitative and quantitative.
For example, Failure Mode and Effect Analysis (FMEA), Hazard and Operability Study (HAZOP), FT, RBD, MC, Bayesian network (BN), and Event Tree Analysis (ETA). Each of these has its own hypotheses and limitations. No model can perfectly describe reality, so there will always be uncertainties. For instance, Section 5.2 presented some limitations and assumptions to which Markovian models are subjected. Besides the choice of the model itself, other uncertainties may be present in the model, such as the system configuration and the operational and environmental conditions. This uncertainty depends on the capacity to represent the physics of the relevant process and the behavior of the system under given conditions [NASA, 2011]. Moreover, the model development includes parameters whose numerical values are sometimes unknown and must be estimated. This process of estimating the characteristics of the components of the system, such as their reliability performance, is an important source of uncertainty. To ensure the quality of the analysis of the system, it is important to check how these uncertainties influence the outputs of the model. In fact, engineering risk analysis generally relies on probabilistic tools designed to quantify and display uncertainties when the information base is incomplete [Paté-Cornell, 1996]. So, firstly, it is important to explain which uncertainties are within the analysis conducted in this chapter. Uncertainties are related to situations involving lack of information or imperfect information. In a way, uncertainty can be understood as a measure of how good a piece of information is. Another important step is to define how to express the uncertainty mathematically, the use of probability distributions being one of the most common ways. The uncertainties can be classified in two groups: aleatory uncertainties and epistemic uncertainties. The aleatory uncertainty is random and inherent in the physical process itself.
For example, if a life test is performed with identical equipment under the same conditions, different failure times may be obtained. The aleatory uncertainties deal with observable quantities, like time, temperature, velocity, or distance. Epistemic uncertainties are related to the state of knowledge about building a representation of reality. They deal with nonobservable quantities, like the dependence or independence of components of a system or the choice of the probability distribution of times to failure of a component. The model or epistemic uncertainty arises from the assumptions made during the modeling process, and the parameter or aleatory uncertainty arises from the data used to estimate the input parameters of the model. When developing a MC model to perform RAM analysis or PSA, we are subject to both model and parametric uncertainty, and also to both epistemic and aleatory uncertainty. As discussed in Section 5.2, MCs are defined basically by two elements: (a) the state space (qualitative); and (b) the transition matrix (quantitative). These two pieces of the MC are subject to epistemic uncertainty. The occupancy time, the number of visits to a specific state, and the number of transitions are measurable quantities subject to aleatory uncertainty. It is worth highlighting that when using MC, the probability distribution of the time to occurrence of an event (e.g., failure, degradation, repair) is given and must follow an exponential distribution.

Rouvroye and van den Bliek [Rouvroye and van den Bliek, 2002] compared different safety analysis techniques. These authors point out that Markov analysis (MA) has the most modeling power and complexity compared to other techniques such as parts count analysis, RBD, and FTA. They argue that MA covers most aspects of quantitative safety evaluation except uncertainty analysis. This aspect would be covered by a technique named Enhanced Markov Analysis, which includes the uncertainty analysis using the Markov chain. This is exactly what is discussed in this section: how to improve the MC model by performing uncertainty propagation. Going beyond what was proposed for the Enhanced Markov Analysis, this chapter also discusses the importance analysis, as seen in the previous section, and the incorporation of tests in a multiphase analysis, in the next section. The importance analysis allows the identification of each component’s contribution to the final result. Hence, the IMs highlight which input parameters the uncertainty propagation analysis should focus on. This means that if it is not feasible to investigate the uncertainties of all the parameters, at least the most important ones should be evaluated. Another possibility is to perform a sensitivity analysis, showing the change in the result due to a variation in the inputs and prioritizing the parameters whose variation most impacts the result. As this chapter discusses system reliability modeling, it will first focus on the epistemic model, which represents the state of knowledge regarding the numerical values of the parameters and the validity of the model assumptions [NASA, 2011].
Besides that, assuming that, after an evaluation, the MC was chosen to represent the system, it will investigate the uncertainty of the input parameters. This section does not intend to be a fundamental theoretical reference on uncertainty modeling, but a guide on how to evaluate parameter uncertainties, propagate them through the MC, and present their impact on the outputs. Section 5.5.1 introduces a basic procedure to determine the failure rates and parameter uncertainties. Section 5.5.2 presents a probabilistic method based on Monte Carlo simulation to perform uncertainty propagation. Finally, Section 5.5.3 illustrates the use of the method in a case study.
5.5.1 Failure rates and their uncertainties
Parametric uncertainty is a significant concern for reliability models based on Markov chains, since component failure and repair rates are seldom perfectly known [Dhople et al., 2012]. In this section, we focus on how failure rates are generated and how we can evaluate the uncertainty in these parameters. The same is valid for other parameters, such as the repair rate or the probability of occurrence of an external event. All these parameters are evaluated based on historical data or experience, and are affected by the lack of knowledge about similar components or events. When assuming that the behavior of the component or event in question can be represented by knowledge of the behavior of similar components or equipment, we incur uncertainties. The purpose of this section is to exemplify this process of estimating parameters from past experience and to show how to represent mathematically the uncertainty generated in this process.
5.5.1.1 Data collection and parameters estimation
This section deals with the problem of collecting and analyzing data to estimate parameters of the MC model. There are several kinds of input parameters in a reliability model, such as:
• external event frequencies;
• component failure frequencies;
• component test and maintenance data; and
• human error rates or probabilities.
To exemplify the process of gathering data and estimating parameters, the case of estimating the failure rate will be considered. In lieu of precise numbers, failure and repair rates can be modeled as random variables and their distributions can be determined by various methods (Wall et al., 2001). Usually a probability distribution of time to failure is used to represent the reliability performance of an equipment, and this probability distribution is specified based on a life-data analysis (LDA). There are several sources of equipment life data:
• data obtained in the field: identical components and identical conditions;
• data obtained in the field: identical components under different conditions;
• data obtained in the field: similar components under identical conditions;
• data obtained in the field: similar components under different conditions;
• laboratory data;
• general engineering or scientific knowledge about design, manufacture and operation of the equipment; and
• expert opinion or experience.
Life data gathered from equipment identical to the one being analyzed, under the same operational and environmental conditions, give the most accurate information. When moving to similar equipment and different operating conditions, the uncertainty increases. In the absence of data, it is possible to use expert opinion or knowledge about the development and construction of the equipment. When using data specific to the system or plant being investigated, the level of uncertainty is lower than when using generic data or expert opinion.
The international standard ISO 14224 provides key guidance on how to achieve quality information about equipment failures for decision-making in the oil and gas industry, including specific guidance on data collection concepts and how to record and categorize causes of failure [Selvik and Bellamy, 2020]. This international standard was influenced by the experience gained with two generic reliability databases: OREDA [ORED Participants, 2015] and Wellmaster [Exprosoft, 2020]. OREDA (Offshore & Onshore Reliability Data) is one of the main reliability data sources in the oil and gas industry. Since 1981, OREDA participants have been collecting data regarding time to failure, failure modes and repair times of equipment used in exploration and production (E&P). Another database, specific to oil and gas well equipment, is Wellmaster. Both databases present data in the form of number of failures per year or per hour of equipment operation. Statistical models are one of the most widely pursued frameworks to cope with the problem of reliability estimation [Colombo et al., 2020]. To illustrate how to estimate the failure rate of a valve from the Christmas tree used in the example of Section 5.3, consider the following data:
• Total operational time: 20,000 years³;
• Number of failures registered: 500 failures.
Assuming that all monitored valves are identical or similar and that the valves are in their useful life period, we can use the exponential distribution, whose only parameter is the failure rate λ. An unbiased estimator can be obtained by Eq. (5.1) [Epstein, 1960]:
\[ \hat{\lambda} = \frac{N(t)}{t} = \frac{\text{Number of failures}}{\text{Accumulated life on test}} \qquad (5.1) \]
Applying Eq. (5.1) to the previous data, we obtain a failure rate of $2.85 \times 10^{-6}\ \text{h}^{-1}$:
\[ \hat{\lambda} = \frac{500}{20000 \times 365 \times 24\ \text{h}} = 2.85 \times 10^{-6}\ \text{h}^{-1} \qquad (5.2) \]
Consequently, Eq. (5.3) presents an estimator of the MTTF. For the exponential distribution, it is simply the inverse of the failure rate.
\[ \widehat{MTTF} = \frac{1}{\hat{\lambda}} = \frac{20000\ \text{years}}{500} = 40\ \text{years} \qquad (5.3) \]
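The arithmetic of Eqs. (5.1) to (5.3) can be reproduced with a few lines of Python. This is only a minimal sketch: the function names and the `HOURS_PER_YEAR` constant are illustrative choices, not part of the original text, and the year-to-hour conversion follows the 365 × 24 factor used in Eq. (5.2).

```python
HOURS_PER_YEAR = 365 * 24  # same conversion factor as in Eq. (5.2)


def failure_rate_per_hour(n_failures: int, total_years: float) -> float:
    """Point estimate of Eq. (5.1): failures divided by accumulated operating time."""
    return n_failures / (total_years * HOURS_PER_YEAR)


def mttf_years(n_failures: int, total_years: float) -> float:
    """MTTF of Eq. (5.3): inverse of the failure rate, kept in years."""
    return total_years / n_failures


# Data of the valve example: 500 failures in 20,000 accumulated years.
lam_hat = failure_rate_per_hour(500, 20_000)
mttf = mttf_years(500, 20_000)
print(f"lambda_hat = {lam_hat:.2e} per hour, MTTF = {mttf:.0f} years")
```

Keeping the MTTF in years avoids a round trip through hours and makes the 40-year result of Eq. (5.3) directly visible.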
This is a point estimator based on a sample of valves. If one is planning to use this data to represent the entire population of valves, given the limitation of facility-specific data, it is recommended to evaluate the confidence interval of the parameter. There are other ways to calculate the failure rate of an equipment, especially if the analyst has access to the times to failure of its individual components. For instance, Colombo et al. [Colombo et al., 2020] used statistical methods and machine learning algorithms to estimate the reliability performance of DHSVs used in offshore wells, and presented the MTTF of the equipment and its confidence interval.
³ Considering the sum of lifetimes of several components; naturally, it is impossible to observe such a lifetime for a single component.
5.5.1.2 Parameter uncertainty
A priori, it is impossible to know the exact time of failure of a certain component in the system under analysis. This information will be known only when the component fails, but then there is no further use for this information in the model. Therefore, the reliability engineer must investigate the behavior of other similar components and operating conditions to estimate the performance of the component being analyzed. The analyst will then face three issues:
• How similar are the components in the database to the one being analyzed?
• Are the operational conditions and the environment also similar?
• How much data is available?
So, as in the previous example, the point estimator λ̂ has a parameter uncertainty that depends on the type of equipment, the condition of use, and the amount of data. The real value of λ for the specific component is unknown. Therefore, it is possible to represent the state of knowledge regarding the numerical value of λ using an epistemic probability density function π(λ) [NASA, 2011]. Finding the epistemic pdf is a challenging part of uncertainty analysis; however, it is not in the scope of this chapter. It could involve discovering the possible lower and upper limits of the values of λ, comparing data with other databases, interviewing experts on the equipment, and consulting the designers and manufacturers of the equipment. To exemplify, we consider, for the failure rate estimated in Eq. (5.1), the variability caused by using the λ or MTTF of a sample as an estimator for the valve population. According to Cocozza-Thievent [Cocozza-Thievent, 1997] it is possible to obtain a two-sided confidence interval, at a significance level ε, for the point estimate of λ using the chi-square distribution, see Eq. (5.4).
\[ \left[ \frac{1}{2t}\, z_{\varepsilon/2,\,2(n+1)},\ \frac{1}{2t}\, z_{1-\varepsilon/2,\,2n} \right] \qquad (5.4) \]
where n is the number of occurrences; t is the total operational time; and $z_{\varepsilon,\nu}$ is the 100ε% percentile of a chi-square distribution (χ²) with ν degrees of freedom. An advantage of this approach is that it allows calculating the confidence interval even in the absence of occurrence (failure) during the interval [0, t]. Replacing the values of number of failures and total time in Eq. (5.4) and adopting a significance level ε of 5% allows computing a 95% confidence interval for the failure rate estimator. Eq. (5.5) presents the computed confidence interval.
\[ \left[ 2.61 \times 10^{-6},\ 3.11 \times 10^{-6} \right] \qquad (5.5) \]
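The interval of Eq. (5.5) can be approximated with the standard library alone. The sketch below is hedged in two ways. First, the chi-square percentile is computed with the Wilson-Hilferty approximation, which is a substitution made here for convenience (an exact quantile function such as `scipy.stats.chi2.ppf` could be used instead). Second, the degrees of freedom are assigned exactly as printed in Eq. (5.4); other references assign them differently, so this choice should be checked against the cited source. All function names are illustrative.

```python
from statistics import NormalDist

HOURS_PER_YEAR = 365 * 24


def chi2_quantile(p: float, dof: int) -> float:
    """Approximate lower-tail chi-square quantile (Wilson-Hilferty); requires dof > 0."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def failure_rate_interval(n: int, t_hours: float, eps: float = 0.05):
    """Two-sided interval for lambda, degrees of freedom as printed in Eq. (5.4)."""
    lower = chi2_quantile(eps / 2, 2 * (n + 1)) / (2 * t_hours)
    upper = chi2_quantile(1 - eps / 2, 2 * n) / (2 * t_hours)
    return lower, upper


# 500 failures in 20,000 accumulated years (the case of Eq. (5.5)).
low, high = failure_rate_interval(500, 20_000 * HOURS_PER_YEAR)
# Ten times less data with the same point estimate: 50 failures in 2,000 years.
low_small, high_small = failure_rate_interval(50, 2_000 * HOURS_PER_YEAR)
print(f"large sample: [{low:.2e}, {high:.2e}]")
print(f"small sample: [{low_small:.2e}, {high_small:.2e}]")
```

Running both cases side by side makes the effect of the amount of data explicit: the point estimate is identical, but the interval for the smaller sample is noticeably wider.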
To illustrate the importance of the amount of data, consider that the total operational time of a group of similar valves is 2,000 years and the number of failures registered is 50. This combination gives the same failure rate as the previous case, but the 95% confidence interval is now different, as indicated in Eq. (5.6).
\[ \left[ 2.16 \times 10^{-6},\ 3.69 \times 10^{-6} \right] \qquad (5.6) \]
With less data, the 95% confidence interval is wider, evidencing a higher level of uncertainty. Returning to the problem of the probability density function used to model the epistemic uncertainty, π(λ), the lognormal distribution is frequently used in risk analysis to represent epistemic uncertainty in failure rates. The motivation behind the popularity of lognormal distributions is presented in the nuclear reactor safety study WASH 1400 [USNRC, 1975]. The lognormal distribution fits some available data; its right-skewed shape allows considering high values of λ, which represent extreme environments; and it is easy to adjust with just a few limit values. The latter factor is useful for the case presented earlier, in which the 95% confidence interval of the failure rate of the valve was calculated. The lognormal distribution has two parameters (μ and σ) and is defined by Eq. (5.7).
\[ \pi(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma\,\lambda} \exp\left[ -\frac{(\ln \lambda - \mu)^2}{2\sigma^2} \right] \qquad (5.7) \]
where μ is the mean value of ln λ and σ is its standard deviation. These two parameters can be obtained through several statistical techniques. However, one of the highlighted reasons for using this distribution is that the parameters can be obtained from the limit values of a confidence interval. For example, a two-sided 95% confidence interval leads to the values of λ presented in Eq. (5.8).
\[ \lambda_{97.5\%} = \exp(\mu + 1.960\,\sigma), \qquad \lambda_{2.5\%} = \exp(\mu - 1.960\,\sigma) \qquad (5.8) \]
Using the confidence interval previously obtained, it is possible to fit a lognormal distribution with parameters given by Eq. (5.9). Fig. 5.25 presents the lognormal distribution obtained and highlights the confidence bound.
\[ \left\{ \begin{array}{l} \lambda_{97.5\%} = 3.11 \times 10^{-6} \\ \lambda_{2.5\%} = 2.61 \times 10^{-6} \end{array} \right. \ \Rightarrow\ \left\{ \begin{array}{l} \mu = -12.77 \\ \sigma = 0.044 \end{array} \right. \qquad (5.9) \]
FIGURE 5.25 Lognormal epistemic pdf for the failure rate of the valve.
This is one example of how to estimate an epistemic probability distribution for the values used as inputs in the model. It is up to the analyst to evaluate the uncertainties in each of the variables used as input. The next section will present a procedure to apply these epistemic probability distributions to perform uncertainty propagation in a Markov Chain using the Monte Carlo method.
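Inverting the two relations of Eq. (5.8) gives μ as the midpoint of the logarithms of the bounds and σ as their half-distance divided by 1.960. A minimal sketch of that fit follows; the function name `fit_lognormal` is an illustrative choice, and the bounds are the rounded values of Eq. (5.5), so the last digits may differ slightly from Eq. (5.9).

```python
import math
from statistics import NormalDist


def fit_lognormal(lower: float, upper: float, level: float = 0.95):
    """Fit (mu, sigma) of ln(lambda) from a two-sided interval, inverting Eq. (5.8)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.960 for a 95% interval
    mu = 0.5 * (math.log(lower) + math.log(upper))
    sigma = (math.log(upper) - math.log(lower)) / (2 * z)
    return mu, sigma


mu, sigma = fit_lognormal(2.61e-6, 3.11e-6)
print(f"mu = {mu:.2f}, sigma = {sigma:.3f}")
```

Because the fit only needs two numbers, it can be applied to any parameter for which a lower and an upper bound are available, which is the practical appeal mentioned in the text.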
5.5.2 Procedure to evaluate the uncertainty propagation in a Markov Chain
The two classes of uncertainty propagation methods are the method of moments and the probabilistic methods [Modarres, 2006]. The most common probabilistic method is the Monte Carlo method. It consists of generating a pseudorandom number according to a specific distribution (e.g., the epistemic probability density function) for each model parameter. Thereafter, using this set of random numbers, the model is run and the corresponding result (some output of interest of the model) is obtained. This corresponds to just one round of the Monte Carlo simulation. The procedure is repeated many times, such as 100,000 times, using different sets of random parameters, and a sample of the output is obtained. This sample is used to create a representation of the output variability, such as a histogram, and to calculate the limits of the desired confidence interval, percentiles, expected value, or any other quantity of interest [Zio, 2013]. Although there are some sophisticated variations of the Monte Carlo method, this chapter presents a straightforward Monte Carlo based procedure for propagating the uncertainty due to failure rate sampling. The first step is to define a probability distribution for the failure rate, which is the parameter of the exponentially distributed time to failure/repair in the case of MC.
FIGURE 5.26 Inverse transform method to sample random numbers from generic distribution.
Then, the parameters of the failure rate probability distribution are estimated, as seen in the previous section. Once the probability distribution is defined for each parameter of the model, it is possible to simulate the system behavior using the inverse cumulative distribution function (cdf) along with the Monte Carlo procedure [Zio, 2013]. A sample of ν sets of values is generated for uncertainty evaluation, where ν is the number of Monte Carlo simulations. In each simulation, the failure rate of each equipment is calculated by means of the inverse cdf evaluated at a random number R, as presented in Fig. 5.26. After randomly defining all failure rates, the transition matrix can be generated. Using this matrix, it is possible to calculate the outputs of the MC, such as reliability, availability, or the risk related to the system operation. As each Monte Carlo simulation generates one sample of the output, at the end of the simulation there will be a set of ν samples of possible results. The result of the uncertainty analysis can be expressed in several ways; in this chapter, the histogram and the failure probabilities as a function of time will be explored, both being presented in the next section.
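The procedure can be sketched in Python for the simplest possible chain. The example below is an assumption-laden illustration, not the Christmas tree model of this chapter: it uses a hypothetical two-state chain (working to failed, no repair), whose failure probability has the closed form 1 − exp(−λt), and samples λ from the lognormal distribution of Eq. (5.9) by the inverse transform method. The function names, the seed and the number of runs are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, mean, quantiles


def sample_lognormal(mu: float, sigma: float, r: float) -> float:
    """Inverse transform: map a uniform number r in (0, 1) to a failure rate."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(r))


def failure_probability(lam: float, t_hours: float) -> float:
    """Output of a two-state (working -> failed) chain without repair."""
    return 1.0 - math.exp(-lam * t_hours)


def propagate(mu, sigma, t_hours, n_runs=10_000, seed=1):
    """Run the model once per sampled failure rate and collect the outputs."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        r = rng.random()
        while r == 0.0:  # the inverse cdf is undefined at exactly 0
            r = rng.random()
        lam = sample_lognormal(mu, sigma, r)
        outputs.append(failure_probability(lam, t_hours))
    return outputs


samples = propagate(mu=-12.77, sigma=0.044, t_hours=8760)
cuts = quantiles(samples, n=40)    # cut points every 2.5%
p_low, p_high = cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles
p_mean = mean(samples)
print(f"mean = {p_mean:.4f}, 95% band = [{p_low:.4f}, {p_high:.4f}]")
```

For a larger chain, only `failure_probability` would change: it would build the transition matrix from the sampled rates and solve the chain, while the sampling loop and the post-processing stay the same.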
5.5.3 Example
Using the same example of the Onshore Christmas Tree (defined and developed in Sections 5.3 and 5.4), after evaluating the EL probability considering a specific value for the failure rate of each valve, it is important to analyze the impact of the variability of those failure rates on the probability of the leakage states. To perform this uncertainty analysis, the data used to estimate the failure rates must be explored, as in Table 5.5. As discussed in Section 5.5.1, it is possible to obtain a point estimate of the failure rate λ̂ and its 95% confidence interval based on the number of failures and the total time of operation of each valve. Considering the upper and lower limits of this confidence interval and using Eq. (5.4), a lognormal distribution was fit to model the uncertainty about the failure rate of each valve, as presented in the last two columns of Table 5.5. Thereafter, these lognormal distributions can be used to sample the failure rate of the corresponding valve for each simulation in the Monte Carlo simulation method [Zio, 2013].

TABLE 5.5 Failure data of onshore Christmas tree and parameters estimation.
Component  Number of  Total time  Point estimate  Lower boundary  Upper boundary  μ        σ
           failures   (years)     λ (h^-1)        λ2.5% (h^-1)    λ97.5% (h^-1)
V1         700        7991        1.00E-05        9.29E-06        1.08E-05        -11.513  0.03743
V2         650        7420        1.00E-05        9.26E-06        1.08E-05        -11.513  0.03882
V3         350        7991        5.00E-06        4.50E-06        5.54E-06        -12.207  0.05272
V4         720        8219        1.00E-05        9.30E-06        1.07E-05        -11.513  0.03691
V5         680        7763        1.00E-05        9.28E-06        1.08E-05        -11.513  0.03797
V6         400        9132        5.00E-06        4.53E-06        5.50E-06        -12.207  0.04936
V7         70         7991        1.00E-06        7.92E-07        1.25E-06        -13.821  0.11584

After running N = 1000 simulations of the Monte Carlo method, N possible probabilities of occurrence of EL are obtained, which can be presented in a histogram as shown in Fig. 5.27. Finally, this histogram can be used to estimate the expected value of the probability of occurrence of the EL and the desired confidence interval considering the uncertainty propagation. For this example, for one year of mission time, the following estimates were obtained:
• Mean = 0.0111
• Median = 0.0111
• Lower confidence bound (2.5%) = 0.0101
• Upper confidence bound (97.5%) = 0.0121

FIGURE 5.27 Failure probability histogram based on the uncertainty analysis.
The failure probability estimate using the MC without uncertainty analysis was 1.1% after 1 year of operation. One can notice that the expected value of the distribution presented in Fig. 5.27 is also 1.1%, but it is now also possible to evaluate the variability of this result. Supposing that the target level of EL probability after one year of operation is 1% and considering the results obtained with the evaluation of the uncertainty propagation, it is possible to estimate that there is a 98.4% chance that the system does not meet this requirement. In addition, by analyzing the shape of the histogram the analyst can evaluate whether he/she can rely on the result or not. A flat histogram indicates a high chance of the real behavior of the system deviating from the expected response, while a steeper curve indicates a higher chance that the real behavior follows the expected response.
Another interesting way of representing the result of a reliability or risk analysis is to show the behavior of the output as a function of time. This is the so-called risk curve, used frequently in safety analysis. After performing an uncertainty analysis, this curve can be presented with the upper and lower boundaries of the confidence interval. Instead of using a single curve, the decision maker can rely on a failure probability range. Fig. 5.28 shows the curves of the expected value, the 2.5th percentile and the 97.5th percentile for a 95% confidence interval.
FIGURE 5.28 Failure probability as a function of time considering a 95% confidence interval.
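The 98.4% figure quoted for the 1% target can be cross-checked from the reported summary statistics alone. The sketch below assumes that the histogram of Fig. 5.27 is approximately normal; this is an assumption made here for illustration and is not stated in the text, where the value was presumably obtained by counting the simulated samples above the target.

```python
from statistics import NormalDist

# Summary values reported for one year of mission time.
mean_el, low_el, high_el = 0.0111, 0.0101, 0.0121
target = 0.01  # target level of EL probability

# Standard deviation implied by the 95% band under a normal approximation.
z = NormalDist().inv_cdf(0.975)
sd = (high_el - low_el) / (2 * z)

# Probability that the EL probability exceeds the target.
p_exceed = 1.0 - NormalDist(mean_el, sd).cdf(target)
print(f"chance of not meeting the target: {p_exceed:.1%}")
```

With the raw Monte Carlo samples at hand, the same quantity would simply be the fraction of samples larger than the target, with no distributional assumption needed.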
5.6 Multiphase Markov chains and their application to availability studies
Up to this point, conventional Markovian models were used to evaluate the evolution of the state of an engineering system over time. The model states usually represent a combination of failure or functioning states of system components. An improvement in the methodology presented so far can be made by allowing the introduction of discrete events in the MC, such as component tests. This new approach is called MPMC. This section starts by presenting the basic concepts of MPMC in Section 5.6.1. Then, it gives a practical example of application, regarding the assessment of Safety Integrity Levels (SIL). Section 5.6.2 presents the SIL concepts and, finally, Section 5.6.3 presents an example of application.
FIGURE 5.29 Simple example of a CTMC.
5.6.1 Basic concepts of MPMC
Consider the simple example illustrated in Fig. 5.29 of a CTMC for a single component with two possible states: available and unavailable. This component transits from the available to the unavailable state under a constant failure rate, λ, and goes back to the operational state under a repair rate, μ. One assumption intrinsic to this model is that the failure is immediately observable once it occurs. In this case, therefore, the repair starts as soon as the component fails. However, this is not always the case in reality. Some failure modes are detectable only when the component is demanded, whether for normal operation or during test activities. These are known as hidden failures. A more realistic model of the component could therefore consider four states: (a) system available; (b) dangerous failure not detected; (c) system under test; and (d) system under repair. Only in the first state (a) is the system available to execute its function properly; the unavailable condition is divided into three states. This is especially useful when analyzing a safety system and looking to improve the test/maintenance strategy to minimize the risk related to the system unavailability. Fig. 5.30 shows the MPMC corresponding to this case.
FIGURE 5.30 MPMC for a single component.
In the MPMC of Fig. 5.30, the transitions between state 1 and state 2 and between state 4 and state 1 continue to occur at a failure and a repair rate, respectively. These are equivalent to the continuous transitions of a conventional CTMC, such as in the example of Fig. 5.29. They are identified in the diagram by continuous red arcs. The other transitions are discrete time transitions, similar to those that occur in a DTMC, as discussed in Section 5.2. For instance, the system can be tested once a month, so the transitions from state 1 to state 3 or from state 2 to state 3 are periodically determined, that is, they are discrete events. The test has a time duration, meaning that the system will stay in state 3 for a certain amount of time. After the test ends, the system can go either to state 1, if no failure is detected, or to state 4, if a failure is detected. This transition depends on the probability of the system being functioning or failed when tested, that is, on whether it has come from state 1 or 2. In both cases, the transition corresponds to a discrete event, because both occur after the test duration. When the system is in state 4, some maintenance tasks are executed and, after sufficient time, the system goes back to state 1. Therefore, as its name suggests, the MPMC can be divided into two phases, a continuous and a discrete one. Fig. 5.31 presents the typical timeline of an MPMC model, in which the component is regularly placed under test. The model mixes the concepts of CTMC and DTMC.
FIGURE 5.31 Typical timeline of a MPMC model.
For the continuous phase, the generator matrix, $G_C$, is given by Eq. (6.1).
\[ G_C = \begin{bmatrix} -\lambda & \lambda & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \mu & 0 & 0 & -\mu \end{bmatrix} \qquad (6.1) \]
During the continuous phase, the state probabilities can be obtained as in an ordinary CTMC, through the integration of Eq. (6.2), where P(t) is the state probability vector.
\[ \frac{dP(t)}{dt} = P(t)\,G_C \qquad (6.2) \]
Additionally, we should define the transition probability matrices for the discrete time transitions. At this point, it is important to highlight that there are two possible discrete time transitions in the considered model: one when the test starts and the other after the test ends. To represent the discrete transitions at the test start, a transition probability matrix, $M_{DTS}$, is created as in Eq. (6.3).
\[ M_{DTS} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (6.3) \]
The matrix $M_{DTS}$ indicates that at the test start the system will transit to state 3 (under test) whenever it is in state 1 or 2. Therefore, $p_{1,3} = 1$ and $p_{2,3} = 1$. If the system is under repair, however, it will continue under repair and, consequently, $p_{4,4} = 1$. The system will never be in state 3 right before the start of a test but, for consistency, the term $p_{3,3}$ is set to one in $M_{DTS}$, since each row of a transition probability matrix must sum to one. Note that if s and t are, respectively, the times right before and right after the test start, then Eq. (6.4) applies.
\[ P(t) = P(s)\,M_{DTS} \qquad (6.4) \]
The system will stay in state 3 or 4 for the duration of the test. After the test ends, a new discrete time transition occurs. As in the case of $M_{DTS}$, we can build a transition probability matrix for the discrete time transition that occurs after the test ends, $M_{DTE}$. Eq. (6.5) introduces this matrix.
\[ M_{DTE} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \alpha & 0 & 0 & 1-\alpha \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (6.5) \]
Beginning with the trivial terms, $p_{1,1} = 1$ and $p_{2,2} = 1$ for the same reason that $p_{3,3} = 1$ in $M_{DTS}$: to ensure the consistency of $M_{DTE}$ by making the corresponding rows sum to one. Also, the term $p_{4,4} = 1$ indicates that if the system was under repair, it will continue in this state after the test ends. The terms that deserve special attention at this point are $p_{3,1}$ and $p_{3,4}$, which indicate, respectively, the probability that the system will go back to the available state after the test and the probability that it will be placed under repair. Again, if s is the time instant right before the test start, then α is defined as in Eq. (6.6).
\[ \alpha = \Pr\left(X(s) = 1 \mid X(s) \neq 4\right) \qquad (6.6) \]
That means that α and, consequently, $p_{3,1}$ indicate the probability that the system was in the available state right before the test start, given that it was not under repair. At this point, it is important to state that perfect tests are assumed (i.e., the test will never return false negatives or false positives). Using the Bayes theorem, the term α can be computed as follows:
\[ \alpha = \Pr\left(X(s) = 1 \mid X(s) \neq 4\right) = \frac{\Pr(X(s) = 1)\,\Pr\left(X(s) \neq 4 \mid X(s) = 1\right)}{\Pr(X(s) \neq 4)} \qquad (6.7) \]
Given that $\Pr(X(s) \neq 4 \mid X(s) = 1) = 1$ and $\Pr(X(s) \neq 4) = 1 - \Pr(X(s) = 4)$, it is possible to compute α based exclusively on the terms of the state probability vector at the instant s, P(s):
\[ \alpha = \frac{\Pr(X(s) = 1)}{1 - \Pr(X(s) = 4)} \qquad (6.8) \]
At this point, it is important to remember that, for the MPMC model under analysis:
\[ P(s) = \left[ \Pr(X(s) = 1) \quad \Pr(X(s) = 2) \quad \Pr(X(s) = 3) \quad \Pr(X(s) = 4) \right] \qquad (6.9) \]
Naturally, the term p3,4 = 1 − α indicates the probability that the system will be placed under repair after a test, corresponding to the probability that the hidden failure has occurred during the continuous operation.
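One operation-and-test cycle of the four-state MPMC can be sketched by chaining the continuous phase with the two discrete transitions. This is a minimal illustration under stated assumptions: the continuous phase is integrated with a plain explicit Euler scheme (chosen here only to avoid external libraries), the state vector is held constant during the test itself, and the numerical values of λ, μ and the test interval are hypothetical. The function names are also illustrative.

```python
def vec_mat(p, m):
    """Row vector times matrix, as in P(t) = P(s) M."""
    return [sum(p[i] * m[i][j] for i in range(len(p))) for j in range(len(m[0]))]


def continuous_phase(p, lam, mu, duration, steps=1000):
    """Explicit-Euler integration of dP/dt = P G_C, Eq. (6.2)."""
    g_c = [[-lam, lam, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [mu, 0.0, 0.0, -mu]]
    h = duration / steps
    for _ in range(steps):
        dp = vec_mat(p, g_c)
        p = [pi + h * di for pi, di in zip(p, dp)]
    return p


def run_test_cycle(p, lam, mu, interval):
    """Continuous operation followed by the test-start and test-end transitions."""
    p_s = continuous_phase(p, lam, mu, interval)  # state just before the test
    alpha = p_s[0] / (1.0 - p_s[3])               # Eq. (6.8)
    m_dts = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    m_dte = [[1, 0, 0, 0], [0, 1, 0, 0], [alpha, 0, 0, 1 - alpha], [0, 0, 0, 1]]
    p_test = vec_mat(p_s, m_dts)                  # Eq. (6.4): everything moves to test
    p_after = vec_mat(p_test, m_dte)              # perfect test sorts states 1 and 4
    return p_after, alpha


# Hypothetical values: lambda = 1e-5 per hour, mu = 0.125 per hour, monthly test.
p_after, alpha = run_test_cycle([1.0, 0.0, 0.0, 0.0], lam=1e-5, mu=0.125, interval=720.0)
print(f"alpha = {alpha:.6f}, state probabilities after the test = {p_after}")
```

After the test, the probability mass that had accumulated in the hidden-failure state 2 appears in state 4 (under repair), while states 2 and 3 are empty, which is the behavior described for perfect tests.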
5.6.2 Safety integrity level (SIL)
This section presents a brief introduction to the concept of Safety Integrity Level (SIL). MPMC models can be helpful when calculating the SIL of safety instrumented systems (SIS) subjected to periodic tests. For a complete understanding of the SIL concept, it is recommended to consult IEC 61508 [IEC, 2010] and IEC 61511 [IEC, 2016], which propose a risk-based approach for the specification, design, and operation of SIS and describe how to manage the integrity level of safety instrumented functions (SIF) during the lifecycle of the systems. These standards use safety integrity as a measure of reliability and define four safety integrity levels (SILs), where SIL 1 is the lowest (least reliable) level and SIL 4 is the highest (most reliable) level [Lundteigen and Rausand, 2009]. Also, this chapter encourages the readers to think about practical examples, by adopting the OLF
TABLE 5.6 SIL requirements.
Safety Integrity Level  PFD requirements (low demand mode)  PFH requirements (high demand/continuous mode)
4                       ≥10^-5 to <10^-4                    ≥10^-9 to <10^-8
3                       ≥10^-4 to <10^-3                    ≥10^-8 to <10^-7
2                       ≥10^-3 to <10^-2                    ≥10^-7 to <10^-6
1                       ≥10^-2 to <10^-1                    ≥10^-6 to <10^-5
\[ R(t) = P(TTF > t) = 1 - P(TTF \le t) = 1 - F(t) = \int_{t}^{\infty} f(t)\,dt, \quad t \ge 0 \qquad (7.2) \]
Once a probability distribution of the TTF is identified, the reliability function and all other reliability measures become known. From the point of view of mathematics, any well-defined probability distribution that satisfies the mathematical rules for a CDF (given above) can be used to model the reliability of components. However, to be of practical use, the selected probability distribution function should reflect the expected in-service behavior of the components considered. Hence, designers and producers of components are the most suitable source of information regarding the selection of the relevant probability distributions to model the mechanisms that generate their failures [Knezevic, 1993].
7.2.3 Reliability model of a system
A system is a collection of components on which at least one measure of performance is defined. An expression that defines the state of a system as a function of the states of its components is called a system function, which is a mathematical model of the physical entity considered. Thus, the reliability function for a system, $R_S(t)$, which consists of several components, is determined by the impact of the failure of each component on the reliability performance of the system, which is graphically described by the reliability block diagram (RBD) of the system considered. For a hypothetical system that will experience a failure event when either component A fails, or components B and C fail, the RBD is shown in Fig. 7.1. The failure function of a system, $F_S(t)$, based on the axioms of probability theory, is equal to:
\[ F_S(t) = P(TTF_S \le t) = P(TTF_A \le t) + P(TTF_B \le t)\,P(TTF_C \le t) - P(TTF_A \le t)\,P(TTF_B \le t)\,P(TTF_C \le t) = F_A(t) + F_B(t)\,F_C(t) - F_A(t)\,F_B(t)\,F_C(t), \quad t \ge 0 \]
Consequently, the probability of not experiencing a failure event at the system level, during a given interval of in-service time [0, t], is quantified by the reliability function of a system, $R_S(t)$, which is defined as:
\[ R_S(t) = P(TTF_S > t) = 1 - P(TTF_S \le t) = R_A(t)\,R_B(t) + R_A(t)\,R_C(t) - R_A(t)\,R_B(t)\,R_C(t), \quad t \ge 0 \qquad (7.3) \]
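The two expressions can be checked numerically against each other. The sketch below assumes independent component failures (implicit in the product terms of the formulas) and, purely for illustration, exponentially distributed times to failure with hypothetical rates; neither the rates nor the function names come from the original text.

```python
import math


def system_reliability(r_a: float, r_b: float, r_c: float) -> float:
    """Eq. (7.3): A in series with the parallel pair (B, C)."""
    return r_a * r_b + r_a * r_c - r_a * r_b * r_c


def system_failure(f_a: float, f_b: float, f_c: float) -> float:
    """Failure function F_S(t) for the same block diagram."""
    return f_a + f_b * f_c - f_a * f_b * f_c


# Hypothetical constant failure rates (per hour) and mission time (hours).
rates = {"A": 1e-4, "B": 5e-4, "C": 5e-4}
t = 1000.0
r = {name: math.exp(-lam * t) for name, lam in rates.items()}

r_s = system_reliability(r["A"], r["B"], r["C"])
f_s = system_failure(1 - r["A"], 1 - r["B"], 1 - r["C"])
print(f"R_S = {r_s:.4f}, F_S = {f_s:.4f}, sum = {r_s + f_s:.4f}")
```

The two results add up to one, as they must, and the system reliability never exceeds that of component A, which sits in series with the rest of the diagram.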
FIG. 7.1 Reliability block diagram for a hypothetical system
FIG. 7.2 Reliability function for a hypothetical system shown in Fig. 7.1
Given that the reliability functions of the constituent components A, B, and C are known, it is possible to plot the reliability function for a system and calculate the probability of not experiencing a failure event during any interval of future time, as shown in Fig. 7.2 for a hypothetical system. The above two figures summarize the essence of the mathematical approach to the reliability modelling process. As its main concern is the prediction of the probability of a given system not experiencing a failure event during a given interval of time, it is the governing information for the safety, hazard and similar types of analyses performed at the design stage. It is necessary to point out that when a value of the reliability function is calculated, say 0.83 for a given interval of time t, it does not mean that the system under consideration will or will not experience a failure event. That is not known; what is known is that, “out of 100 systems of that type put into operation about 83 of them will not experience a failure event during that period of operation stated”. Hence, this is the maximum possible information to be obtained with a mathematical model of reliability [Knezevic, 1995]. In summary, mathematics is telling reliability and safety modellers that there is a probability function associated with the operation of each physical system. However, mathematics is saying nothing about the probability distribution of that function. From a mathematical point of view the number of possible probability functions is unlimited.
Consequently, the final statement of mathematics to reliability and safety modellers is, “I know my limitations” [Dubi, 2003].
7.3 Voyage to the ice
“The machine does not isolate man from the great problems of nature but plunges him more deeply into them.” Antoine de Saint-Exupéry⁴, Wind, Sand, and Stars, 1939.
In order to scientifically understand the physical reality of the operational process of an aircraft, from the point of view of reliability, the MIRCE Akademy⁵ sponsored British Aviatrix Polly Vacher’s unsupported solo flight around the world in a single-engine aircraft via the North and South Pole. The project was named “Voyage To The Ice” (VTTI) [Vacher, 2006] and had been planned to take place between May 2003 and March 2004, with the objective of raising awareness of a Flying Scholarship for a Disabled (FSD)⁶. The longest and most challenging leg of the whole journey was expected to be the flight between Christchurch in New Zealand and McMurdo in Antarctica, over 2068 nautical miles of inhospitable Southern Ocean. Hence, reliability wise, the major consideration was the direction of the flight between these two destinations. Both options, eastbound (New Zealand to Antarctica) and westbound (Antarctica to New Zealand), had their advantages and disadvantages. However, this decision would determine the starting date of the project and consequently all other dates and events. At the beginning of January 2002, the MIRCE Akademy allocated four students from the master diploma program to study the proposed flight plan with the aim of determining the necessary reliability and supportability issues concerned with its successful completion. They created the RBD of Polly’s aircraft, obtained all the necessary information and made reliability predictions, in the manner described earlier in the text, for both options. After reading the Akademy’s report and speaking with pilots who had flown in either, or both, directions, Polly decided to fly westbound, which meant that, in her judgment, flying from Antarctica to New Zealand was the safer option.
4 Antoine Marie Jean-Baptiste Roger, comte de Saint-Exupéry (1900-1944), a French writer, poet, aristocrat, journalist, and pioneering aviator.
5 www.mirceakademy.com
6 Knezevic, J., B2B/A+A - Polly Vacher’s Voyage To The Ice, Birmingham To Birmingham Over Arctic & Antarctic, http://www.MIRCEAkademy.com/index.php?page=applied
TABLE 7.1 Physical parameters continuously recorded by Polly during the VTTI project.
Departure | In flight | Arrival
Location | Altitude | Location
Coordinates | Winds/directions | Coordinates
Time | Distance | Time
Fuel load | Fuel consumption | Refuel quantity
Oil refill | Oil temp and pressure | Oil refill
Maintenance actions | Fuel mix | Maintenance actions
Battery charge | Cabin temperature | Battery charge
Ambient temperature | Ambient temperature | Total distance
Cabin temperature | Engine temperature | Ambient temperature
Max G-force | Engine RPM | Max G-force
Consequently, Polly planned to start the flight in the British springtime, with a route starting from Birmingham, UK, heading north towards Scotland and Norway, with the intention of over-flying the North Pole during the month of June. This timing was considered to offer the best chance of clear skies in that region. As a result of the annual orbit of the Earth around the Sun, Polly had around five months to complete the flight south towards Argentina in order to fly over
Antarctica during the summer months in the Southern Hemisphere. So, after over-flying the North Pole, the route south would take her through Canada, USA, Mexico, Guatemala, Belize, Antigua, Tobago, Trinidad, Brazil, and Argentina [Vacher, 2006]. Polly was aware that a great deal of patience would be required in waiting for the right weather window for the flight to Antarctica. From there, the journey home was well defined: through New Zealand, Australia, Indonesia, Malaysia, Thailand, India, Bhutan, Oman, Bahrain, Jordan, Egypt, Greece, Yugoslavia, Italy, and France, and back to Birmingham, UK [Vacher, 2006].
To assist the science-based research studies in reliability at the MIRCE Akademy, Polly generously agreed to record relevant in-service data during the trip. The main purpose of the research was to study the impact of environmental conditions on the reliability and supportability of the VTTI system. The data to be collected as the basis for reliability research are shown in Table 7.1.
Generally speaking, the problem for reliability engineers and managers is the variability of the internal and external drivers of reliability, in time and location. This research was planned to collect the largest possible range of data in respect of any flights made anywhere in the world by any pilot, as commercial flights over the North and South Poles are almost non-existent! Hence, the expectation was that the data collected by Polly would support the endeavor of the MIRCE Akademy to address the reliability modelling process by understanding the physical reality of the operational processes that drive the reliability performance of systems, in respect of time and location.
On 6 May 2003, at 16:22, Polly took off from Birmingham airport in her Piper Dakota PA-28-236⁷ (G-FRGN), with the thought “when will I see home again” [Vacher, 2006], flying north in the direction of Scotland. She arrived at a cloudy and cold Wick at 19:20. After five days of waiting for the weather “window”, Polly flew onward across the North Sea to Norway, through Bergen to Tromso. On 26 May 2003 she finally left Europe, ready for the Arctic flight. The flight to the first of the Ice Challenges was slow at the beginning because of a strong headwind. At some stages of the flight Polly was flying at only 98 kts (cruising speed 135 kts). Then, as she described in her diary, “My ferry tank ran dry and I switched to the left wing tank. About five minutes after changing tanks THE ENGINE STOPPED - panic - why is it stopping now? I went into automatic mode and changed onto the right tank; put the fuel pump on and the carburetor heat⁸. Heaven be blessed it started again, but from then onwards, every little noise every little whistle became a huge problem [Vacher, 2006].” Despite having to manage this operational challenge in real time, she landed successfully at Resolute on 27 May 2003.
During the following five months Polly flew through the USA, Mexico, Guatemala, Belize, the Cayman Islands, the Dominican Republic, Antigua, Trinidad and Tobago, Suriname, Brazil, and Argentina, to arrive in Ushuaia, at the southernmost tip of Argentina, on 25 October. Then the waiting for favorable weather started. On 29 November 2003, after 8 hours of flying, she landed at the British research station at Rothera (67°33’S, 68°07’W) in Antarctica, to start her flight over the South Pole [Vacher, 2006]. On 5 December at 07:00 the weather forecast was good, with overall winds of +3 kts. The first hour of the flight was good: there was a tailwind at 5000 ft and the cruising speed was 111 kts. Polly flew up the glacier and, flying over the top, found the views stunning.
Four hours into the flight the wind changed from a tailwind to a headwind and soon the ground speed decreased to 80 kts. The GPS indicated that the planned 11-hour trip would take 15 hours! Soon Polly reached the point of no return. An updated weather report was not encouraging, as the headwinds were expected to continue to increase in strength. As a captain, in charge of the VTTI system, whose function was to “safely fly solo a single-engined aircraft around the world” she made the decision to turn back! Naturally, Polly’s speed rapidly increased to 133 kts and she safely landed back at Rothera [Vacher, 2006].
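The GPS estimate quoted above can be roughly cross-checked with simple arithmetic. The sketch below assumes (this is not stated in the source) that the 11-hour plan corresponded to holding the 111 kts ground speed of the first hour over the whole leg, and that the headwind held the ground speed at a constant 80 kts; the function name is illustrative only.

```python
def hours_at_ground_speed(distance_nm: float, ground_speed_kts: float) -> float:
    """Flight time in hours for a leg flown at a constant ground speed."""
    if ground_speed_kts <= 0:
        raise ValueError("ground speed must be positive")
    return distance_nm / ground_speed_kts


# Assumption, not a figure from the source: the 11-hour plan was based on
# the 111 kts ground speed reported for the first hour of the flight.
PLANNED_HOURS = 11.0
PLANNED_SPEED_KTS = 111.0
HEADWIND_SPEED_KTS = 80.0  # ground speed quoted once the headwind set in

leg_distance_nm = PLANNED_HOURS * PLANNED_SPEED_KTS
revised_hours = hours_at_ground_speed(leg_distance_nm, HEADWIND_SPEED_KTS)

print(f"Implied leg distance: {leg_distance_nm:.0f} nm")
print(f"Time for the leg at {HEADWIND_SPEED_KTS:.0f} kts: {revised_hours:.1f} h")
```

Under these assumptions the leg takes a little over 15 hours, which is of the same order as the figure indicated by the GPS.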
7 Knezevic, J., From B to B, Polly Vacher’s Global Challenge, pp 50, MIRCE Science, 2001, Exeter, UK
8 The problem was two-fold. First: the fuel mixture being on the lean side. Second: carburettor icing. As long as Polly was using the warmer fuel from the ferry tank within the aircraft, no ice was forming in the carburettor. But once this was all used and she had to change to the fuel tanks in the wings, where the outside temperature was -20°C, the injection of such cold fuel froze any moisture in the carburettor and caused the engine to cough and splutter unless carburettor heat was continuously applied.
As the fuel for Polly’s one-way flights over Antarctica had been prepositioned several months in advance, she had no fuel for a second attempt to fly over the South Pole. The only help came from the Argentine Air Force, which on 17 December delivered 4 drums of fuel to its Antarctic base on Marambio Island, from where it was impossible to fly to McMurdo without refuelling on the way. Hence, without any other option, Polly had to abandon her Antarctic flight! However, on New Year’s Day 2004, Polly started on a rerouted flight, back up through the Americas to California and then across the Pacific to New Zealand. Thus, 14,000 miles later, on 30 January 2004, Polly landed in Auckland, in a position to continue the planned trip and honor the commitments made in this part of the world in aid of the FSD charity [Vacher, 2006].
After 357 days of circumnavigating the globe via all seven continents, covering 60,000 nautical miles and thirty countries, and spending over 500 hours in the pilot’s seat, Polly arrived, on schedule, at 12:30 at her starting point, Birmingham International Airport, but this time from the south, and “landed” in the aviation record books as:
• The first woman to fly solo in a single-engine light aircraft over the North Pole
• The first woman to fly solo in a single-engine light aircraft over Antarctica
• The first person to fly solo around the world landing on all seven continents.
Aviatrix Polly Vacher raised £400,000 for the FSD and recorded over 20,000 items of in-flight technical data on the physical reality of the flight around the world via the poles, to support the research at the MIRCE Akademy. Today, these data are a part of the Polly Vacher Collection⁹ at the Akademy’s Resource Centre.
7.3.1 Impact of VTTI on reliability modelling at the MIRCE Akademy
“Success is a lousy teacher. It seduces smart people into thinking they can’t lose.” Bill Gates¹⁰
Polly Vacher’s flight of 11 hours and 53 min, covering 1092 miles, from Rothera to Rothera on 5 December 2003, had a profound impact on the studies of reliability at the MIRCE Akademy. Although the syllabus offered by the Akademy was comparable with postgraduate programs in reliability engineering at other universities in the world, it was unacceptable to the author that not a single part of the whole body of existing knowledge was able to address the observed physical reality. If the “wind direction change” event had been predicted and the decision had still been made to preposition enough fuel for only one attempt, the author would have been happy as a scientist, but unhappy as a project manager. However, to have a science-based body of knowledge that predicts the operational behavior of a system and yet is unable even to address the wind direction was totally unacceptable to the author. The brutal truth is that all the components of Polly’s aircraft contained in the RBD were performing their expected functions, and yet the final result was a mission failure! Thus, the author asked himself, “How is it justifiable to construct a reliability block diagram for an aircraft without a single block being related to the air [Knezevic, 2017]?”
Having required a scientific approach from his students on entering the MIRCE Akademy, the author, as its president, had no option other than to suspend the studies of Reliability Engineering until the “scientific approach” was found. Hence, the MIRCE Akademy stopped admitting students to the Master and Doctoral Diploma Programs in reliability engineering from October 2004.
9 Mirceakademy.com/index.php?page=Resource-Centre
10 https://www.brainyquote.com/quotes/quotes/b/billgates122131.html
7.4 Physical meanings of mathematical reality of reliability
“For whosoever has fix’d on his Cause, before he has experimented; can hardly avoid fitting his Experiment, and his Observations, to his own Cause, which he had before imagin’d; rather than the cause of the truth of the Experiment it self.” The History of the Royal Society [Henry, 2017].
Having been exposed to the well-established educational process in which scientifically proven deterministic models are used for all engineering predictions, the author had accepted and promoted the existing probabilistic model used for predicting the reliability and safety performance of future systems [Eq. (7.3)], governed by the reliability functions of their constituent components [Eq. (7.2)], as promoted by the existing reliability and safety literature. After gaining first-hand experience from the VTTI and realizing that air, a natural physical entity essential for the flying process, was not a part of the RBD of an aircraft, the author decided to try to “understand” the mathematical understanding of the physical reality of the operational reliability and safety performance of defense, aerospace, and nuclear power systems. The main outcomes of this research are presented below.
7.4.1 Mathematical reality: quality of components production is one hundred percent
The integral that defines the reliability of a component [Eq. (7.2)] is defined for values of time greater than or equal to zero, which means that no reliability- or safety-relevant event can take place before the beginning of operation. Consequently, all theoretical probability distributions used to define component reliability must have the range [0, ∞). Knezevic [1993] warned reliability modellers that a normal probability distribution, which is defined between minus and plus infinity, should be used in reliability predictions only when the expected value is at least three times greater than the standard deviation, which means that the left tail of the distribution can be ignored for modelling purposes.
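To give a feel for the "three standard deviations" rule quoted above, the following sketch computes how much probability mass a normal distribution places on negative times for several ratios of mean to standard deviation. The function names and the numerical values of the mean and standard deviation are illustrative assumptions, not taken from the source.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def negative_time_mass(mean: float, std: float) -> float:
    """Probability that a normally distributed TTF falls below t = 0."""
    return normal_cdf(0.0, mean, std)


STD_HOURS = 1000.0  # hypothetical standard deviation of the time to failure
for ratio in (1.0, 2.0, 3.0, 4.0):
    mass = negative_time_mass(mean=ratio * STD_HOURS, std=STD_HOURS)
    print(f"mean = {ratio:.0f} x std: P(TTF < 0) = {mass:.5f}")
```

When the mean is three times the standard deviation, only about 0.1 percent of the probability mass lies at negative times, which is the part of the left tail that the rule treats as negligible.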
7.4.2 Mathematical reality: errors during system transportation, storage and installation tasks are zero percent
As a system consists of components, the system reliability function must have the same mathematical properties as those described in Section 7.4.1 for a component reliability function. Consequently, all reliability considerations of a system, as far as mathematics is concerned, start at the “birth” of the system, totally “ignoring” any physical event that could take place during the transportation, storage, and installation processes and impact the reliability of the system.
7.4.3 Mathematical reality: all components are one hundred percent independent
The statement that the reliability of a system of components connected in series is equal to the product of their individual reliabilities is valid only if there is no interaction whatsoever between them. According to mathematics, individual components exist in their own right, as if each were the only one. This practically means that no failure of any component can impact the reliability performance of any other component within a system.
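The product rule described above can be written as a short sketch. The function name and the component reliability values are hypothetical; the calculation is only meaningful under the independence assumption that this section questions.

```python
import math
from typing import Sequence


def series_reliability(component_reliabilities: Sequence[float]) -> float:
    """Reliability of a series system, ASSUMING fully independent components."""
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must lie in [0, 1], got {r}")
    # Product of the individual reliabilities; no interaction term exists.
    return math.prod(component_reliabilities)


# Hypothetical reliabilities of three components at some fixed time t.
components = [0.99, 0.98, 0.95]
system = series_reliability(components)
print(f"Series system reliability: {system:.5f}")
```

Note that nothing in the function can express one component damaging another: any such interaction would have to be modelled outside the product.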
7.4.4 Mathematical reality: zero maintenance actions (inspections, repair, cleaning, etc.)
As the integrals defining the failure and reliability functions [Eqs. (7.1) and (7.2)] of components within a system are continuous integrals over a given interval of time [0, t], exclusively related to the TTF, interruptions for inspections, testing, condition monitoring, and similar maintenance actions related to any of the constituent components are “nonexistent” as far as mathematics is concerned. Thus, no maintenance actions are incorporated into failure or reliability functions, as both cover the length of operational time to failure.
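A minimal numerical sketch of such an uninterrupted integral is given below. It assumes the standard definitions, namely that the failure function is the integral of the TTF density over [0, t] and that the reliability function is its complement; the exact form of Eqs. (7.1) and (7.2) is not reproduced here. The Weibull density and its parameter values are hypothetical choices for illustration.

```python
import math
from typing import Callable


def weibull_pdf(t: float, shape: float, scale: float) -> float:
    """Weibull density, used only as an example TTF model (shape >= 1 assumed)."""
    if t < 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def failure_function(t: float, pdf: Callable[[float], float], steps: int = 10_000) -> float:
    """F(t): trapezoidal integral of the density over the unbroken interval [0, t]."""
    if t <= 0:
        return 0.0
    h = t / steps
    total = 0.5 * (pdf(0.0) + pdf(t))
    for i in range(1, steps):
        total += pdf(i * h)
    return total * h


def reliability_function(t: float, pdf: Callable[[float], float]) -> float:
    """R(t) = 1 - F(t); no inspection or repair can enter this expression."""
    return 1.0 - failure_function(t, pdf)


SHAPE, SCALE = 2.0, 1000.0  # hypothetical Weibull parameters (scale in hours)
r_numeric = reliability_function(500.0, lambda t: weibull_pdf(t, SHAPE, SCALE))
r_closed = math.exp(-((500.0 / SCALE) ** SHAPE))  # closed-form Weibull R(t)
print(f"numeric R(500) = {r_numeric:.6f}, closed form = {r_closed:.6f}")
```

The only inputs are the density and the elapsed time t, which mirrors the observation above that maintenance actions have no place in the expression.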
7.4.5 Mathematical reality: continuous operation of the system and components
As the expressions for the reliability of any component and system are defined by a continuous random variable, namely the Time To Failure, TTF, no interruptions in the continuity of time are “allowed”. In other words, shifts, weekends, “Queen jubilees”, national days, and religiously significant days are totally non-existent from the point of view of reliability theory.
7.4.6 Mathematical reality: time counts from the “birth” of the system
As the expression for the reliability of any system has only one random variable, the time to failure of the system, TTFs, with an origin at t = 0, all of its constituent parts must refer to the same instant of time, with the range [0, t] [Eq. (7.3)]. The time to failure of a set of components connected in parallel is the time to failure of the last surviving component, measured from the origin of the system, irrespective of when the other components failed. Furthermore, there is no way of introducing into the reliability function of a system the beginning of operation of replaced components!
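The statement about parallel components can be sketched as follows. The function names and the times to failure are hypothetical; the series counterpart (the system fails at the first component failure) is the standard textbook rule and is included only for contrast.

```python
from typing import Sequence


def parallel_system_ttf(component_ttfs: Sequence[float]) -> float:
    """TTF of a parallel arrangement: the last component failure, from t = 0."""
    if not component_ttfs:
        raise ValueError("at least one component is required")
    return max(component_ttfs)


def series_system_ttf(component_ttfs: Sequence[float]) -> float:
    """TTF of a series arrangement: the first component failure, from t = 0."""
    if not component_ttfs:
        raise ValueError("at least one component is required")
    return min(component_ttfs)


# Hypothetical component times to failure in hours, all measured from the
# same origin (the "birth" of the system).
ttfs_hours = [1200.0, 450.0, 3100.0]
print("parallel system TTF:", parallel_system_ttf(ttfs_hours))
print("series system TTF:  ", series_system_ttf(ttfs_hours))
```

Every value in the list shares the origin t = 0, so a component replaced at, say, 450 hours has no way of entering the calculation with its own starting time.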
7.4.7 Mathematical reality: fixed operational scenario (load, stress, temperature, pressure, etc.)
There are systems whose operational scenarios are determined by seasonal, daily, or even hourly changing patterns, each of which generates different stresses and loads on the system and consequently impacts its reliability and safety. However, the probability distribution functions available in mathematics for modelling TTF are unable to deal with the operational variability of physical reality.
7.4.8 Mathematical reality: reliability is independent of the location in space (GPS or stellar coordinates)
In all probability distributions where the TTF is used as the relevant random variable for reliability predictions, the mathematically defined probability density function is totally divorced from the physical location of the system. Hence, systems defined by an identical reliability function will have identical reliability performance irrespective of their location in geographical or astronomical space. The reason for this is very simple: no mathematical axiom is related to physical reality in any shape or form.
7.4.9 Mathematical reality: reliability is independent of human actions
Although trains, cars, bicycles, buses, lorries, and other means of transport are driven by humans, not a single block in the reliability block diagram is related to them. Hence, human actions have no impact on the reliability function.
7.4.10 Mathematical reality: reliability is independent of maintenance actions
Well-trained maintainers are a daily feature of the operation of any system, and yet they are totally excluded from the mathematics-based modelling of reliability, as there is not a single block that represents them in the reliability block diagram.
7.4.11 Mathematical reality: reliability is independent of calendar time (seasons do not exist)
Mathematical models based on the reliability function are totally immune to calendar time, and hence to seasonal variability: time in these models is a continuous elapsed quantity, and the models do not differentiate where on the calendar time axis a given interval is located.
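The point can be illustrated with a small sketch of a mission reliability calculation. It uses the standard conditional reliability R(t + x) / R(t) for a mission of length x started at age t, with a Weibull model; the model, its parameters, and the two start dates are all hypothetical assumptions. The start date is carried along with each mission but is never an input to the calculation.

```python
import math
from datetime import date


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Weibull reliability function R(t) for elapsed operating time t >= 0."""
    return math.exp(-((t / scale) ** shape))


def mission_reliability(age: float, length: float, shape: float, scale: float) -> float:
    """Conditional reliability R(age + length) / R(age) for one mission."""
    return (weibull_reliability(age + length, shape, scale)
            / weibull_reliability(age, shape, scale))


SHAPE, SCALE = 1.5, 2000.0  # hypothetical parameters (scale in hours)

# Two hypothetical missions: same system age and mission length, started in
# opposite seasons. Tuples are (start date, age in hours, length in hours).
missions = [(date(2003, 6, 1), 500.0, 11.0), (date(2003, 12, 1), 500.0, 11.0)]

results = {
    start.isoformat(): mission_reliability(age, length, SHAPE, SCALE)
    for start, age, length in missions
}
for start, value in results.items():
    print(f"mission starting {start}: R = {value:.6f}")
```

Both missions receive exactly the same predicted reliability, because the model only sees the elapsed time and the mission length.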
7.4.12 Mathematical reality: reliability is independent of the natural environment
To the best of the author’s knowledge, not a single RBD of an aircraft contains a block that represents air, which, for a start, is a fundamental element of the flying process. Also, air is the fundamental physical medium through which an aircraft comes into contact with birds, ice, rain, lightning, wind, and many other well-known atmospheric phenomena that have a significant impact on its reliability and safety.
7.4.13 Concluding remarks regarding mathematical reality of reliability function
“Mathematics does not teach us how to think correctly.” Josephine Pasternak¹¹
None of the above factual “discoveries” by the author is a weakness of probability theory. They are just clarified mathematical views of the physical reality of the reliability of functionable systems. According to the axioms of probability¹², any probability distribution defined by a probability density function whose area under the curve is equal to 1 is perfectly suitable for use in the expressions for the failure and reliability functions, defined by Eqs. (7.1) and (7.2), as far as mathematics is concerned. Mathematics has neither the intention nor the ability to decide what the physical reality of human-created and human-managed systems is. The above-deduced “reality” of the reliability measures of systems is just a clear statement of the mathematical truth that says, “In my reality my predictions are correct.”
From the point of view of probability theory, any repeated experiment that provides different outcomes under identical conditions is a probabilistic experiment, irrespective of which mechanisms generate that behavior. Hence, probability theory is a mathematical concept that is totally unconcerned with the physical causes and mechanisms that generate failure events in the life of human-created and human-operated systems. At the same time, it is the only body of knowledge that enables predictions to be made of the occurrences of failure events throughout the life of systems used daily by humans.
11 Pasternak, J., Indefinability, An Essay on the Philosophy of Cognition, Page 118, edited by Arne F. Petersen, pp. 144, Published by Museum Tusculanum Press, University of Copenhagen, Denmark, 1993. ISBN 10: 877289531
12 Kolmogorov, A.N., Foundations of the Theory of Probability, Chelsea Publishing Company, USA, 1950.
7.5 Physical reality of reliability
Scientific truth is fundamentally different from mathematical truth. There are axioms in scientific theory too but, unlike mathematical axioms, they are related to the universe in which we exist and to its laws. The definition of scientific truth is based on the physical experiment, and is given by Dubi [2003] as, “A statement is true if and only if it can be verified in an objective scientific experiment.” For example, one of the fundamental axioms is the axiom of causality, which states that, “In our universe the cause always precedes the result”. This axiom exists and is believed to be true only because no one has ever demonstrated in an experiment that it does not hold, although, according to Dubi [2003], many scientists are still designing experiments at the atomic and subatomic scale to challenge causality. Should any of these experiments succeed, a major change will take place in what is known today to be “truth”. Hence, unlike mathematical truth, scientific truth can change through time as new experiments and observations are made.
To understand the physical reality of the in-service reliability of systems in the defense, aerospace, and nuclear power industries, the author systematically studied the reliability performance of their in-service “experiments”. Hence, in the remaining part of the text, the types, causes, and mechanisms of the failures analyzed are presented against the titles used earlier in the text to examine the mathematical reality of a reliability function.
7.5.1 Physical reality: quality of produced components and assemblies is less than 100 percent
7.5.1.1 A400M crash caused by incorrectly installed engine software¹³
On 29 May 2015 the Airbus Group revealed that incorrectly installed engine control software had caused the fatal crash of an A400M airlifter in Spain. The incorrect installation took place during the final assembly of the aircraft and led to engine failure and the resulting crash. The conclusion was based on the data extracted from the flight data recorder, which confirmed the Airbus engineers’ internal hypothesis that there had been no problem with the aircraft. France continued flying its fleet of six aircraft, while Germany, Malaysia, Turkey, and the U.K. paused flight operations.
13 MIRCE Akademy Archive - MIRCE Functionability Event 20150529
7.5.1.2 Quality control issue halted F-35 deliveries to US government¹⁴
On 11 December 2019 the Pentagon (U.S. Government) temporarily suspended deliveries of the F-35 Joint Strike Fighter for 15 days because the Defense Contract Management Agency (DCMA) had discovered “instances” of commingling of titanium and Inconel fasteners. Lockheed Martin and the U.S. government conducted engineering analyses and determined that those aircraft were safe to fly, and the Pentagon resumed accepting aircraft. The Pentagon had no indication that this was a systemic problem, as DCMA representatives are on the production floor working alongside Lockheed Martin personnel. This was not the first time a quality control issue had stopped F-35 deliveries. Corrosion was identified in several fastener holes under the fuselage panels of an F-35A conventional-takeoff-and-landing aircraft that was in maintenance at Hill Air Force Base in Utah, USA. Other previous problems included faulty insulation that disintegrated into the fifth-generation fighter’s fuel tank and an engine-rubbing problem that increased the likelihood of fire.
7.5.1.3 Japanese start-up’s rocket blew up after 2 seconds¹⁵
On 30 June 2018 the Momo-2 rocket of the Japanese start-up Interstellar Technologies lost thrust 8 sec after lift-off, reaching a maximum altitude of about 20 m before falling back to Earth about 5 m from the launch pad. The rocket burned for about 2 hours after the impact, until the fire eventually extinguished itself. A flame was seen squirting from the top of the engine immediately after lift-off. The crash set the booster and ground equipment on fire, and some parts were scattered beyond the concrete launch pad, but no one was injured. Video of the launch shows a small flame emerging from the top of the engine barely 2 sec after the vehicle leaves its support stand. The Momo-2 was out of sight when the rocket was heard losing thrust. Then the vehicle fell vertically back to Earth, motor still burning, and exploded in a fireball on impact. This is Japan’s first privately developed launch vehicle, with a length of 9.6 m and a takeoff weight of 1,000 kg. Momo is fueled with pressure-fed ethanol and liquid oxygen. The single-stage rocket is designed to carry a 20-kg payload to 120 km and provide 260 sec of microgravity flight.
7.5.1.4 Boeing 777 production-line wiring inspections after in-flight diversion¹⁶
Chafing and arcing of incorrectly installed wire bundles caused an in-flight diversion of a Boeing 777 in October 2017, during a flight from Abu Dhabi to Sydney. As the aircraft neared Adelaide, the flight crew “noticed a burning smell coming from an air vent.” The issue soon triggered onboard warnings of a forward cargo fire. The crew performed its “nonnormal” checklist, discharged the forward-cargo fire bottles, and declared an emergency. The aircraft, carrying 349 passengers and 16 crewmembers, arrived “uneventfully” at Adelaide Airport about 50 min after the incident began. The aircraft involved had been delivered in November 2013 and had accumulated 21,493 hours and 2,284 cycles at the time of the incident. A post-incident inspection found soot damage on the forward cargo compartment ceiling. A more detailed investigation traced the soot’s source to heat damage and a chafed electrical wire in a bundle running between the cargo compartment ceiling and the cabin floor above. Boeing determined that the entire wiring loom that contained the chafed wire, which powered a re-circulation fan, was “incorrectly routed, likely during aircraft manufacture, and had not been installed as per the design drawings.” Over four years in service, the misrouted wire bundle chafed on a nearby screw. This sent a current “through the passenger floor carbon-fiber beam” at body station 508. The current generated enough heat to damage 14 ceiling brackets and to cause “several areas” of the beam to chafe and delaminate. Late last year, Boeing added a production-line inspection and issued recommendations to operators following this Etihad Airways Boeing 777-300 in-flight diversion, which was the fifth incident linked to the faulty production process.
14 MIRCE Akademy Archive - MIRCE Functionability Event 20191211
15 MIRCE Akademy Archive - MIRCE Functionability Event 20180730
16 MIRCE Akademy Archive - MIRCE Functionability Event 20171000
7.5.1.5 Design errors
Most design-induced errors end up with partial or gradual corrections and changes through modifications and upgrades. However, in certain cases even the whole production run has to be recalled. The “top-ten” recalls in the auto industry are briefly described below:
• 1971 General Motors: After sudden acceleration problems caused by engine motor mounts, GM had to recall 7 million cars. The resolution involved putting in a restraining bolt to keep the engine in place.
• 1980 Ford: 21 million vehicles had to be recalled, and it was not just one model that was at fault. There was an issue with cars shifting out of parking mode and running away down the road. Ford responded to the problem by trying to issue each owner with a special warning sticker before it was eventually convinced to recall and repair.
• 1981 General Motors: 5.8 million cars were recalled in the early 80s because of a rear suspension bolt issue. The vehicles included all intermediate models that had been produced since 1978. The problem was highlighted after reports of accidents began to come through.
• Ford: 7.9 million cars were recalled in the mid-1990s because of an ignition switch that caused fires in a number of vehicles.
• Ford: 15 million vehicles were recalled because of a potentially faulty cruise control switch that had caught fire in some cars.
• Toyota: 9 million cars were recalled in 2009/2010 because of a sudden acceleration issue resulting from faulty accelerator pedals. The case was complicated because the company initially misdiagnosed the problem, so a second recall had to be rolled out after a small number of cars crashed.
• 2012 Toyota: 7.4 million cars were taken back in because of faulty electric windows. This was not a major, life-threatening issue, but it cost the company a substantial amount, although the final figure was never revealed.
• 2014 General Motors: 5.8 million cars were recalled because of an ignition switch issue, which had a tendency to cut off the engine while driving and prevented the airbag from inflating in some cars. It is estimated that the total recall cost the company just over $4 billion.
• 2014 Honda: 5.4 million cars had to be recalled because of an airbag issue relating to about 20 different models. The fact that the bags were not inflating properly made them potentially dangerous in the event of an accident.
• 2016 Volkswagen: 8.5 million vehicles needed to be recalled because of installed software that gave false results on emissions, something that is against European Union rules. It is expected to cost the company more than £12 billion to put right.
7.5.2 Physical reality: transportation, storage and installation tasks are not 100 percent error free
7.5.2.1 SpaceX explosion at launch pad¹⁷
On 1 September 2016, an explosion destroyed a Falcon 9 and its payload on the launch pad. The vehicle had been scheduled to launch the Amos-6 communications satellite on 3 September. SpaceX indicated that the anomaly occurred around the upper stage oxygen tank during propellant loading for the static fire test. The explosion triggered a blast wave that was reported up to 30 mi away, and was followed about 2 min later by further explosions that appear to have originated around the base of the strongback launch support structure.
7.5.2.2 Leonardo calls for AW169 and AW189 tail rotor inspections¹⁸
On 7 November 2018 operators of the Leonardo AW169 twin-engine medium helicopter received safety directives requiring them to check the correct installation of the tail rotor (TR) servo-actuator, following a crash that killed five people on 27 October 2018. According to the UK Civil Aviation Authority, the aircraft that crashed, G-VSKP, was registered new in July 2016 and had flown less than 300 hr up to the end of last June. The helicopter was transporting 5 people, including the pilot, and crashed just moments after takeoff from the Leicester club’s King Power Stadium following a football match. TV footage shows the aircraft turning around after takeoff, climbing out of the stadium, and drifting backward, as per the common procedure that allows a return to land in case of engine failure. Once well above the stadium, the aircraft appeared suddenly to go out of control, spinning rapidly toward the ground before crashing into a nearby parking lot. The inspections also applied to the company’s AW189 super-medium helicopter, as it features a design similar to that of the AW169. The directive, published by the European Aviation Safety Agency (EASA), says, “The incorrect installation of the TR servo-actuator, if not detected and corrected, depending on the flight condition, could possibly result in loss of control of the helicopter.” It required the checks to be carried out within five flight hours or 24 hr of the directive being issued, and requested that all inspection results be reported back to the manufacturer.
17 MIRCE Akademy Archive - MIRCE Functionability Event 20160901
18 MIRCE Akademy Archive - MIRCE Functionability Event 20181107
7.5.3 Physical reality: there are interactions between “independent” components
7.5.3.1 Power plant’s inlet cowl detached in midair on Boeing 737-700¹⁹
On 27 August 2016, during a flight of a Southwest Airlines B737-700, the inlet cowl of the left CFM56-7B engine detached in midair, causing the engine to be shut down and significantly damaging the airframe. The flight was en route from New Orleans to Orlando, Florida, and landed at Pensacola, Florida, with the fan and centrally located spinner intact after the cowl separated. There was no apparent indication that the cowl loss was associated with either a fan-blade failure or the release of a blade. Passengers reported that a loud noise accompanied the event, which occurred around 13 min after takeoff at around 31,000 ft over the Gulf of Mexico. The cowl is normally attached to the fan case by bolts and two alignment points located at the 3 and 9 o’clock positions around the inlet. Damage visible to the airframe included significant buckling of the leading-edge wing root fairing, indicative of a heavy impact from part of the inlet assembly, as well as a puncture of the fuselage skin below the window belt above the leading edge. This latter damage was likely the main cause of the cabin depressurization that occurred on separation of the inlet.
7.5.3.2 Oil system flaw caused PW1524G engine uncontained failure²⁰
In May 2014, the uncontained failure of the Pratt & Whitney PW1524G engine on Bombardier’s C Series CS100 prototype was triggered by the failure of a Teflon seal in the oil system, according to a Transportation Safety Board of Canada report. The failure of the low-pressure (LP) turbine, which occurred during engine ground runs, followed heat soaking of the oil feed tube to the No. 4 bearing at the back of the engine. The heat specifically impacted the integrity of the feed tube’s Teflon C-seal after a series of engine “hot shutdowns”. The damaged seal allowed engine oil to mix with the turbine rotor’s cooling air stream, leading to ignition of the resulting air-oil mixture in the cavity around the base of the first stage of the three-stage LP turbine. “The ensuing combustion heated the low-pressure turbine rotor to the point of failure,” says the report, which adds that the resulting disintegration of the rotor “was uncontained, and resulted in major damage to the engine, nacelle and wing.” Engine debris damaged the wing’s lower surface, wing-to-fuselage fairing, leading-edge slats and flap fairings, as well as the landing-gear door panels and strut. The system “functioned as designed” to prevent a far more serious fuel fire, despite a 38-inch, hot section of the LP turbine rotor disk penetrating the center fuel tank and wedging in the upper wing skin. At the time of the event, the center tank was almost half-full with 12,200 lb of fuel.
Following the incident, interim measures were introduced to enable flight tests to resume, including a revised cool-down procedure with an increased pre-shutdown cooling period of 20 min. The measures also added a metallic face seal, in addition to the Teflon C-seal, on the No. 4 bearing oil-feed tube mounting flange, and changed the material of the flange mounting bolts to enable higher torque on the bolts. Thermocouples were added to permit real-time monitoring of the LP turbine cavity temperature, while limiting the oil-seal temperature to 500°F. Daily post-flight oil-consumption monitoring and increased daily borescope inspections were also instituted.
For the production-standard PW1500G, Pratt & Whitney changed the design of the oil-supply tube and cooling-airflow areas to physically separate the turbine-rotor-cooling airflow from the bearing compartment, to prevent any chance of a repeat occurrence.
¹⁹ MIRCE Akademy Archive, MIRCE Functionability Event 20160827
²⁰ MIRCE Akademy Archive, MIRCE Functionability Event 20140500
7.5.3.3 Faulty equipment partly to blame for crash of AirAsia flight QZ8501²¹
The Airbus A320-200 flown by Indonesia AirAsia on flight QZ8501 crashed on 28 December 2014, killing all 162 people on board. According to the report issued by the Indonesian National Transportation Safety Committee (NTSC), a fault in the connecting circuitry of the aircraft’s rudder travel limiter (RTL) sent repeated operational warnings to the cockpit, which led the flight crew to attempt a reset of the system. This, in turn, led the aircrew to accidentally disengage the flight augmentation computer (FAC or autopilot) system, followed by what the NTSC report described as “an inability of the flight crew to control the aircraft”. The report also detailed that four repeated RTL warnings during the first hour of the flight had likely led the aircrew to disengage system circuit breakers in an attempt to reset the RTL, but this also disengaged the FAC, leading to an uncontrolled stall into the sea as the aircraft “departed from the normal flight envelope”. The actual cause of the crash was found to be a “prolonged stall condition that was beyond the capability of the crew to recover”, which resulted in the aircraft impacting the Java Sea. The report also revealed that the RTL had suffered 23 reported malfunctions over the previous year, according to the aircraft’s maintenance records.
²¹ MIRCE Akademy Archive, MIRCE Functionability Event 20141228
7.5.3.4 Ethiopian B787 fire due to thermal runaway in the lithium-metal battery²²
On 12 July 2013 a parked Ethiopian Airlines Boeing 787-8 caught fire at London Heathrow Airport. The Air Accidents Investigation Branch (AAIB) classed it as a “serious incident,” in which the fire badly damaged the crown of the fuselage just forward of the tail fin. The report states, “The fire was initiated by the uncontrolled release of stored energy from the lithium-metal battery in the aircraft’s Emergency Locator Transmitter (ELT).” The fire was most likely triggered by an external short-circuit, created by the battery wires having been crossed and trapped under the Honeywell ELT battery compartment cover plate when the ELT battery was last accessed. This “probably created a potential short-circuit current path, which could allow a rapid discharge of the battery. Root cause testing performed by the aircraft and ELT manufacturers supported this latent fault as the most likely cause of the ELT battery fire, most probably in combination with the early depletion of a single cell.” According to the AAIB, “Neither the cell-level nor battery-level safety features were able to prevent this single-cell failure, which then propagated to adjacent cells, resulting in a cascading thermal runaway, rupture of the cells and consequent release of smoke, fire and flammable electrolyte. The trapped battery wires in turn compromised the environmental seal between the battery cover-plate and the ELT, providing a path for flames and battery decomposition products to escape”. The flames “directly impinged on the composite aircraft structure, which led to resin in the composite material of the fuselage crown decomposing, providing further fuel for the fire. As a result of this, a slow-burning fire became established in the fuselage crown, which continued to propagate from the ELT location … even after the energy from the battery thermal runaway was exhausted.” The report noted that the location of the ELT in the fuselage crown made it difficult for fire fighters to locate the fire. In the event of an in-flight fire from this source, the AAIB noted, it would be “challenging” for cabin crew to locate and fight the flames.
7.5.3.5 Smoke and fumes event involving Boeing 787²³
On 17 April 2016, a B787-9 (N36962) operated by United Airlines as flight UAL870 departed Sydney for San Francisco, USA. As part of the scheduled meal service, cabin crew switched on the aft galley ovens. After the second oven was switched on, there was a short burst of smoke, which set off a fire alarm in a nearby toilet for about one minute. One of the ovens displayed a “FAILURE” message. Several cabin crew members detected a strong chemical odor and an electrical smell, as well as a blue haze. The crew immediately pulled all relevant circuit breakers and switched off all electrical sources to the aft galley. By the time the in-flight service manager (ISM), together with a relief pilot from the cockpit, arrived at the aft galley with fire extinguishers, the smoke had dissipated, but the odor persisted. As it could not be confidently ascertained that the ovens were the sole source of the problem, the captain contacted the ground-based technical operations maintenance controller (TOMC) by satellite phone. It was agreed that the safest option was to return the aircraft to Sydney. As the aircraft was well in excess of its allowed landing weight, fuel was dumped during the descent. The aircraft landed without incident in Sydney with emergency services in attendance. During the engineering inspection after landing, the suspect oven was quarantined and a fuse was replaced. After appropriate testing, the aircraft was released back to service. The manufacturer individually tested all oven components and reported that all of them worked correctly. However, an additional measurement of the oven motor current detected that the motor did not run smoothly, and its temperature was also above normal, most likely from insufficient airflow. The exact cause of the odor could not be determined.
²² MIRCE Akademy Archive, MIRCE Functionability Event 20130712
²³ MIRCE Akademy Archive, MIRCE Functionability Event 20160417
7.5.3.6 Pilots unaware of B737 MAX’s automatic stall prevention system²⁴
On 10 November 2018 Boeing issued a multioperator message (MOM) explaining that the MAX’s maneuvering characteristics augmentation system (MCAS) “commands nose-down stabilizer” in certain flight profiles using “input data and other airplane systems.” MCAS is operated by the flight control computer and is “activated without pilot input and only operates in manual, flaps-up flight.” MCAS was not part of previous 737 designs, Boeing’s MOM confirms. The system also was not covered in the MAX flight crew operations manual (FCOM) or in difference training for 737NG pilots. The message is most likely linked to the ongoing investigation into the fatal crash of a Lion Air Boeing 737 MAX 8 on 29 October 2018, which killed all 189 on board. Aviation Week has reviewed the 737 MAX-family flight crew operations manual of another large MAX-family operator. It does not reference MCAS. A multipage document issued by that airline’s flight operations department, which highlights the differences between the MAX and the 737NG, does not mention MCAS or any other changes to the autotrim system.
²⁴ MIRCE Akademy Archive, MIRCE Functionability Event 20181010
7.5.4 Physical reality: maintenance activities such as inspection, repair and cleaning have a significant impact on the reliability of a system
7.5.4.1 In-service cracks trigger Airbus A380 wing-spar inspections²⁵
After reports of cracks on in-service Airbus A380 wing outer rear spars (ORS), Airbus and EASA are developing an inspection program for them. The program, revealed in a proposed EASA airworthiness directive (AD) published on 5 July 2017, targets “the 25 oldest wing sets” in the A380 in-service fleet. Affected operators are to conduct initial “special detailed inspections” on a schedule based on the aircraft’s age. Follow-up checks should be done every 36 months. The initial inspection results would be evaluated by Airbus and EASA, which, “based on inspection findings,” may expand the program to other A380s, the proposed AD explained. Of the 25 aircraft listed for initial inspections, Emirates Airline has 9, followed by Qantas with 6, including the aircraft that suffered substantial damage during a November 2010 engine failure and was out of service for nearly 18 months. Singapore Airlines has 4, while two aircraft once operated by Singapore are in storage with Afa Press UK Ltd. as the listed owner. The remaining airframes are with Air France (2), Lufthansa, and Portuguese charter carrier Hi Fly. The initial program is in response to “occurrences” of ORS cracks on in-service aircraft, EASA explained, but the AD does not say how many aircraft have turned up with cracks.
7.5.4.2 ANA grounded Boeing 787s for Rolls-Royce engine inspections²⁶
On 28 August 2016 All Nippon Airways (ANA) had six of its Boeing 787s out of action as it continued inspections prompted by concerns about turbine blade erosion in the fleet’s Rolls-Royce Trent 1000 engines. The airline intended to progressively inspect its B787 engines and replace turbine blades. The six grounded aircraft were part of this process, which caused the carrier to cancel several domestic flights. At that point ANA had replaced turbine blades on 17 engines, out of the total of 100 engines on its 50 787-8s and -9s. The carrier said it had “identified that multiple engines need to be serviced”.
7.5.4.3 Chemical residue causes in-flight shutdown on A380²⁷
In May 2017, an Airbus A380 operated by Qantas Airways departed Los Angeles destined for Melbourne. The crew turned back 2 hr into the flight after hearing a loud bang followed by an unusual vibration and what turned out to be a false fire warning. After an uneventful landing, the initial inspection found no breach of the No. 4 engine casing and minor damage to the right flap due to exiting debris. A subsequent analysis found fatigue cracking due to internally corroded low-pressure turbine blades, which had resulted in blade debris and downstream damage through the engine. The corrosion was attributed to a chemical residue left in the hollow blades after a July 2015 cleaning operation. In response to the occurrence, the manufacturer modified its blade-cleaning instructions to include best practices for the removal of process solutions and chemical residues. The revised procedures, which include flushing of aerofoil cavities and modifying the orientation and support of the blades while cleaning, were adopted at all applicable Trent 900 Stage 2 low-pressure turbine blade maintenance facilities. An internal manufacturer safety alert was also distributed to raise awareness of the issue and its potential impact on other engine types.
²⁵ MIRCE Akademy Archive, MIRCE Functionability Event 20170705
²⁶ MIRCE Akademy Archive, MIRCE Functionability Event 20160828
²⁷ MIRCE Akademy Archive, MIRCE Functionability Event 20170500
7.5.5 Physical reality: neither all systems nor all components operate continuously
7.5.5.1 Airbus A320 was flying with a failed actuator on minimum equipment list²⁸
On 22 March 2014 at 18:28:32, an Airbus A320-232 (G-EUUE) took off from London Heathrow International Airport to perform flight BA870 for British Airways. The takeoff and the flight were uneventful until 19:24:32 (the aircraft was cruising at FL370 at a speed of 250 knots at that time), when the crew received “Right Aileron Fault & Elevator Aileron Computer (ELAC1) fault” messages. The fault caused the right aileron to lock in a position 8.8 deg. up from neutral, later moving as high as 15.9 deg. of the maximum 25 deg. of motion. Despite this, the aircraft remained in normal control mode with the autopilot engaged, and the captain was able to perform a normal landing at Liszt Ferenc International Airport in Budapest, Hungary, at 20:35. A day before the incident, mechanics had deactivated the aircraft’s blue hydraulic system connected to one of the two servo controls for the right aileron. This followed a string of three failure notifications during flights on 19 March and 21 March, which pertained to ELAC2 and its related servo controls. Based on the approved minimum equipment list, British Airways had 10 days to fix the problem. Meanwhile, the right aileron could only be commanded by ELAC1 and one hydraulic system. Mechanics later replaced the captain’s sidestick, which was considered to be the root cause of the problem, as well as ELAC1. However, the Transportation Safety Bureau of Hungary (TSB) said, “As no faults could be found with either component during post incident testing, one or both could have been responsible.”
²⁸ MIRCE Akademy Archive, MIRCE Functionability Event 20140322
7.5.6 Physical reality: Components and a system have different “times”
7.5.6.1 ANA to replace turbine blades on RR Trent 1000 engines on B787 fleet²⁹
After identifying problems related to corrosion and cracking, on 1 September 2016 the Japanese airline group All Nippon Airways (ANA) confirmed that the turbine blades on the Rolls-Royce Trent 1000 engines powering its fleet of B787 aircraft would be replaced. Fitting all 50 of its aircraft of this type with engines equipped with new blades was expected to take up to three years. Although only five of the engines were in need of repair at the time, the company decided to repair the entire fleet of 100 Trent 1000s as a safety measure. The decision was prompted by three blade-related engine failures in 2016, which resulted in ANA cancelling 18 domestic flights in the preceding week due to engine issues. Following ANA’s decision, Air New Zealand, another carrier operating Trent 1000-powered 787s, said it had put “proactive systems” in place across its fleet of seven of the aircraft to monitor any potential turbine problems.
7.5.6.2 International Space Station electrical issue delays SpaceX launch³⁰
The planned 1 May 2019 launch of a SpaceX cargo ship to the International Space Station (ISS) was delayed due to a problem with the station’s electrical system. The problem, which posed no immediate concern to the station or its six-member crew, involved a Main Bus Switching Unit (MBSU), which distributes electrical power to two of the station’s eight channels. Electrical power generated by the station’s solar arrays is fed to all station systems through these power channels. One of these units had failed in a manner that could not be recovered, so the station effectively lost one-quarter of its power. It is possible to move loads around and keep payloads operating, but the station loses redundancy. Among the systems lacking backup power were the station’s robot arm and mobile base, which are needed to capture SpaceX’s Dragon cargo ship and berth it to the docking port. The launch was tentatively rescheduled for 3 May, pending a successful robotic change-out of the failed MBSU on 2 May. In the past the ISS had two failures of this particular box, one of which was repaired in orbit. This one was probably not repairable in orbit, as it was a lifetime issue.
²⁹ MIRCE Akademy Archive, MIRCE Functionability Event 20160901
³⁰ MIRCE Akademy Archive, MIRCE Functionability Event 20190501
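The redundancy arithmetic behind the ISS event can be sketched in a few lines of Python. This is only an illustration: the eight power channels and the two channels per MBSU are taken from the description above, while the implied total of four MBSUs, the even sharing of power between channels, and all names in the code are assumptions made for this sketch.

```python
from fractions import Fraction

# Figures taken from the event description.
TOTAL_CHANNELS = 8       # power channels on the station
CHANNELS_PER_MBSU = 2    # channels fed by one Main Bus Switching Unit

# Assumption for this sketch: every channel is fed by exactly one MBSU
# and all channels carry an equal share of the power.
mbsu_count = TOTAL_CHANNELS // CHANNELS_PER_MBSU


def fraction_of_power_lost(failed_units: int) -> Fraction:
    """Share of the power channels lost when `failed_units` MBSUs fail."""
    if not 0 <= failed_units <= mbsu_count:
        raise ValueError("failed_units must be between 0 and mbsu_count")
    return Fraction(failed_units * CHANNELS_PER_MBSU, TOTAL_CHANNELS)


lost = fraction_of_power_lost(1)
print(f"{mbsu_count} MBSUs in total; one failed unit removes {lost} of the channels")
```

Under these assumptions a single failed unit removes two of the eight channels, which matches the one-quarter figure quoted in the event description.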
7.5.7 Physical reality: Variable operation scenarios (load, stress, temperature, pressure, etc.)
7.5.7.1 Aeroflot Superjet 100 (RA-89098) crashed in Moscow³¹
On 5 May 2019 a Superjet 100, operated by Aeroflot as flight SU1492, took off from runway 24C at Moscow Sheremetyevo airport (SVO/UUEE) at 18:04L (15:04Z). The crew stopped the climb at about FL100 and initially declared a loss of radio communication. Later the crew declared an emergency via transponder codes and returned to Sheremetyevo for an emergency landing. According to radar tracks, the first approach was discontinued; the airplane made a 360° turn and approached Sheremetyevo again to land on Runway 24C. Weather at the time was not a factor in the landing, although unconfirmed reports suggested that a lightning strike might have been involved in the accident. According to CCTV cameras, the airplane bounced on the runway during landing and, when it hit the runway again, caught fire. During the deceleration, the Superjet 100 burst into flames, veered to the left off the runway and came to a stop on the grass adjacent to the runway after making a 180° turn. While the aircraft burned, an evacuation started from the L1 and R1 doors via emergency slides, but 41 people were confirmed dead, including 2 children.
7.5.7.2 Hard landing of Wings Air ATR 72-600 in Indonesia³²
On the night of 25 December 2016 a Wings Air ATR 72-600 suffered a hard-landing accident at Achmad Yani International Airport in Semarang, Indonesia. Flight 1896 from Bandung, Indonesia, with 68 passengers and four crew members on board, was attempting to land after an instrument approach to Runway 13 in light rain and relatively light winds. The aircraft touched down hard and bounced, with a second touchdown also resulting in a bounce. Despite an attempted go-around by the captain after the second bounce, the aircraft touched down hard again, collapsing the right-side main landing gear and breaking about 10 in. off each blade of the six-bladed propeller as the aircraft swerved off the right side of the runway. Air traffic controllers noticed that the aircraft was “tilted to the right” during the landing roll and called out rescue and fire-fighting services. However, those crews could not approach the aircraft, because the pilots had not shut down the engines. “While waiting for the assistance, the pilot kept the engines running to provide the lighting system on in the cabin,” the airline said, adding that the tower then radioed the pilots to shut down. Passengers evacuated approximately 10 min after the aircraft stopped. Wings Air’s standard operating procedures for an emergency evacuation on the ground called for the pilots to shut down the engines after notifying air traffic control, and to turn on cabin lighting (which would be powered by the battery).
³¹ MIRCE Akademy Archive, MIRCE Functionability Event 20190505
³² MIRCE Akademy Archive, MIRCE Functionability Event 20161225
7.5.7.3 Gear retracted landing of Emirates B777 at Dubai³³
On 3 August 2016 an Emirates Airlines Boeing 777-300, on flight EK-521 from India with 282 passengers and 18 crew, was on final approach to Dubai’s runway 12L when a go-around was attempted after the first ground contact. However, the aircraft did not climb; after the gear was retracted, it touched down on the runway, the right wing caught fire, the right-hand engine separated, and the aircraft burst into flames. All occupants evacuated via slides; 13 passengers received minor injuries (10 were taken to hospitals and three were treated at the airport). The aircraft burned down completely. A fire fighter attending to the aircraft lost his life. The airline reported that both the captain and the first officer had accumulated more than 7000 flying hours. The aircraft involved was equipped with Trent 800 engines and had been delivered to the airline in March 2003.
7.5.7.4 Weather scrubs SpaceShipTwo glide flight test³⁴
On 2 November 2016, Virgin Galactic called off the first planned glide flight test of its second SpaceShipTwo suborbital spacecraft because of high winds in the skies above its California test site. The plan was to release the spaceplane from its WhiteKnightTwo carrier aircraft during a flight from the Mojave Air and Space Port in California. The flight should have been the first in a series to test the flying characteristics of the vehicle before the start of powered test flights with SpaceShipTwo’s hybrid rocket motor. The tests would examine how it glided in varying conditions, such as whether or not it was carrying a full load of payload and propellant. The testing was designed to demonstrate how the vehicle would perform as it returns from space, after the feather system is retracted and the vehicle becomes a glider and lands on the runway like an airplane.
7.5.7.5 Airbus A319 safely landed after windscreen burst³⁵
On 5 May 2018 a Sichuan Airlines A319-100, en route from Chongqing to Lhasa in China, experienced a windscreen burst in the cockpit and diverted to Chengdu, where it landed safely. The crew noticed that a crack had appeared in the inner right windscreen. At that time, the electronic centralized aircraft monitor (ECAM) issued an ice warning for the right windscreen. The crew immediately requested permission to descend and return. The windscreen blowout began with a crack appearing while the aircraft was flying at 32,000 ft at Mach 0.74. When the window burst, the pilot near the broken window was slightly injured. A cabin attendant was slightly hurt during the descent, according to the Civil Aviation Administration of China (CAAC), which sets stringent qualification requirements for flight crews operating services to high-altitude locations such as Tibet. The crew, handling the situation according to procedures, immediately descended, reduced speed and donned oxygen masks. Radio contact was impossible because of noise, so the crew set the transponder to 7700 (the emergency code). At the same time, oxygen masks deployed in the cabin, and cabin attendants made announcements and handled the situation. After a check for an overweight landing, the aircraft landed safely. The aircraft entered service on 26 July 2011 and had flown 19,912.25 hr and 12,920 cycles. The most recent maintenance A check was done on 4 April 2018 and the most recent C check on 9 March 2017.
³³ MIRCE Akademy Archive, MIRCE Functionability Event 20160803
³⁴ MIRCE Akademy Archive, MIRCE Functionability Event 20161002
³⁵ MIRCE Akademy Archive, MIRCE Functionability Event 20180505
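The service history quoted for the A319 (calendar age, flight hours and flight cycles) describes the same airframe on three different “clocks”, echoing the point of Section 7.5.6. The short Python sketch below, offered only as an illustration, converts between them. The dates and totals are those quoted in the event description; the choice of the event date as the end of the interval and all variable names are assumptions of this sketch.

```python
from datetime import date

# Figures quoted in the event description.
entry_into_service = date(2011, 7, 26)
event_date = date(2018, 5, 5)   # assumption: totals are counted up to the event
flight_hours = 19_912.25
flight_cycles = 12_920

# Calendar time between entry into service and the event.
calendar_days = (event_date - entry_into_service).days

# The same usage expressed as rates between the three clocks.
hours_per_cycle = flight_hours / flight_cycles
hours_per_day = flight_hours / calendar_days
cycles_per_day = flight_cycles / calendar_days

print(f"calendar days in service: {calendar_days}")
print(f"average flight hours per cycle: {hours_per_cycle:.2f}")
print(f"average flight hours per calendar day: {hours_per_day:.2f}")
print(f"average cycles per calendar day: {cycles_per_day:.2f}")
```

Which clock matters depends on the failure mechanism considered: pressurization loads on a windscreen accumulate per cycle, whereas other degradation processes may follow flight hours or calendar time.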
7.5.8 Physical reality: Reliability is dependent on the location in space defined by GPS coordinates
7.5.8.1 Cold weather operations³⁶
On his first trip through Anchorage, a pilot learned the value of proper equipment when operating in extreme temperature conditions. He arrived at midnight on 12 May 2018, in the middle of a snowstorm, and basically just drained the water, closed the plane up and went to the hotel. At 10 a.m., two days later, it was time to leave and the plane looked like a white popsicle under about 3 in. of snow. Temperatures in Anchorage are usually relatively mild compared with other locations in Alaska. That day the temperature was -9°F. After deicing the aircraft and going through all the preflight checks, he made the cabin ready for departure. All was set, and when the passengers showed up, he loaded, closed, started and taxied on a crystal clear, but frigid, day. After takeoff, at an altitude of about 100 ft, he went to trim and the switch failed to move the trim at all. Reaching for the manual wheel revealed that it was completely frozen. He circled the field to get the landing weight down and kept the speed at a value that was comfortable for the takeoff trim that was set. The landing was uneventful. Back at the ramp, he offloaded the passengers, and a huge “Herman Nelson” heat generator was brought over and started, with the exhaust hose placed upward in the rear compartment. It took about half an hour for the trim to break free. It was a relatively easy fix, as the correct equipment was available. In other locations, when warmth is needed in frigid conditions, the only solution is “to put the plane in a hangar and wait . . . for a long time, really long time!”
7.5.8.2 GPS sensors data for forecasting dangerous solar storms³⁷
Today’s worldwide web of power and data links is vulnerable to extreme space-weather events. The U.S. milspace-sensor network was designed to help the Air Force examine the effects that space weather may have on spacecraft operations. The U.S. government released its environmental information, collected with national-security spacecraft, on 22 February 2017. The released GPS historical dataset is likely to be of value to scientists studying how Earth’s magnetic field interacts with the solar wind, and to engineers developing radiation-hardened avionics to extend the total ionising dose spacecraft can withstand over a service life of 15 years or more. The radiation sensors are carried on the nation’s GPS satellites, which operate in mid Earth orbit, where the radiation trapped by the planet’s magnetic field (the Van Allen belts) is most intense. The charged particles there can cause havoc with the micro-circuitry that makes spacecraft computers and other avionics operate. The sensors measure and record the energy and intensity of electrons, protons and other charged particles in six orbital planes about 12,600 mi above the surface. The network records 92 measurements per day. As more and more satellites use solar-electric propulsion to reach geostationary orbit, their avionics will spend more time in the high-radiation regions of mid Earth orbit as they progress upward. It is quite possible that the technology for refuelling and maintaining operational spacecraft will increase the demand for longer avionics service life in space. The data may also help space-weather forecasters predict much more serious solar storms, like the Carrington Event³⁸ that took place in 1859 and disabled the U.S. telegraph system. Unquestionably, a similar event today could be detrimental to the world’s tightly interconnected global communication and data networks.
³⁶ MIRCE Akademy Archive, MIRCE Functionability Event 20180512
³⁷ MIRCE Akademy Archive, MIRCE Functionability Event 20170222
7.5.8.3 SpaceX delays launch due to weather³⁹
On 9 January 2017 bad weather in California prompted SpaceX to delay its planned return to flight until 14 January at the earliest. The company had planned to resume lift-offs after finishing its investigation into the spectacular explosion of a Falcon 9 rocket in September 2016 (the rocket and its $195 million payload were destroyed, and Launch Complex 40 at Cape Canaveral was heavily damaged). Like much of the country, California was being pounded that weekend by extreme weather with rain and gusty winds, according to the National Weather Service. Some areas were expected to receive 10 or more inches of rain over the weekend. The delay came two days after the Federal Aviation Administration re-authorized SpaceX’s Commercial Space Transportation License, allowing it to resume launches. SpaceX launches had been suspended since the explosion.
7.5.8.4 Passengers stranded after Delta flights grounded worldwide⁴⁰
On 8 August 2016 tens of thousands of passengers were stranded after Delta Air Lines flights were grounded around the globe due to a system outage. As for the cause of the problem, Delta pointed to an overnight power outage in its hometown of Atlanta, which "impacted the Delta computer systems and operations worldwide, resulting in flight delays". Delta said that systems were back online by 8:40 a.m. ET, but warned that disruptions would continue amid a "limited" resumption of departures. By 1:30 p.m. ET, the airline had cancelled 451 of its 6,000 daily flights. It remained to be seen how large a portion of the carrier’s daily schedule would ultimately be cancelled by the end of the day.
³⁸ Named after the British astronomer Richard C. Carrington, who observed the coronal mass ejection that triggered it during solar cycle 10 (1855-1867)
³⁹ MIRCE Akademy Archive, MIRCE Functionability Event 20170109
⁴⁰ MIRCE Akademy Archive, MIRCE Functionability Event 20160808
7.5.9
Physical reality: Reliability is dependent on humans
7.5.9.1 Damage to Embraer business jet due to deviations from standard operation procedure41 On 22 February 2019, a chartered Belgium-registered, Embraer EMB-500 departed from Kortrijk-Wevelgem Airport (EBKT), Belgium, at 07.38 hr on an IFR flight plan to Berlin-Schonefeld Airport (EDDB) with three people on board. The aircraft was severely damaged on the final approach to Runway 07L at EDDB, when the left wing had suddenly dropped and touched the runway during the flare as the aircraft crossed the threshold. Subsequently, the airplane rolled right, the right main landing gear hit hard and collapsed, and the aircraft slid along the runway toward the right runway edge where it came to a stop 447 meters from the threshold beyond the right runway edge marking but still on the asphalt area. There was no fire. Both pilots and the passenger were uninjured, but the accident brought attention the EMB-500’s deice system and training of pilots. The causes of the accident, according to German air safety investigators, were, “The crew conducted the approach under known icing conditions and did not activate the wing and horizontal stabilizer deice system, which was contrary to the Standard Operating Procedures. The aircraft entered an abnormal flight attitude during the flare phase and crashed due to ice accretion on wings and horizontal stabilizer and infringement of the required approach speed.” A major contributing factor was the crew’s “insufficient knowledge of the connection between the ice protection system and the stall warning protection system (SWPS).” 7.5.9.2 Catering track damage ramifications on Qantas A380 turn back42 On 29 August 2018, passenger-door seal damage caused by a catering truck created an unnerving onboard noise that led a Qantas Airbus A380 to return to Sydney 2 hr into a scheduled flight to the U.S., according to the Australian Transport Safety Board (ATSB) report. 
The A380 conducted a routine departure from Sydney Airport on a scheduled flight to Dallas/Fort Worth International Airport. As it passed FL250, a loud noise was detected coming from a door on the upper deck. The crew determined the door was closed and locked correctly and not at risk of opening. However, “passenger discomfort” combined with “the unknown nature of the issue” convinced the flight crew to return to Sydney. The aircraft dumped fuel and landed safely. Post-flight inspection found damage to a seal retainer and seal along the underside of an upper-deck passenger door, which was caused by a catering truck that serviced the aircraft prior to the flight. The flight was rescheduled for later the same day, but flight crew duty-time limitations forced the airline to cancel that flight. Ramp accidents and incidents continue to be a costly problem for airlines, causing around $10-12 billion annually in aircraft damage, injuries and related costs.
⁴¹ MIRCE Akademy Archive - MIRCE Functionability Event 20190222
⁴² MIRCE Akademy Archive - MIRCE Functionability Event 20180829
7.5.9.3 Human error behind AirAsia diversion⁴³
On 4 October 2016, while the initial coordinates of an AirAsia Airbus A330-300 were being programmed, a captain’s data-entry error led to a myriad of navigation errors and an eventual diversion. The incident began when the captain entered incorrect coordinates into the Air Data and Inertial Reference System (ADIRS). The longitude was incorrectly entered as 01519.8 east (15 deg. 19.8 min E. Long.) instead of 15109.8 east (151 deg. 9.8 min E. Long.). As a result, the aircraft’s systems placed it near Cape Town, South Africa, instead of at Sydney Airport’s International Terminal Gate 54. The magnitude of this error adversely affected the aircraft’s navigation functions, global positioning system (GPS) receivers and some electronic centralised aircraft monitoring alerts. The flight crew did not realize it had a problem until a series of warnings upon takeoff en route to Kuala Lumpur. The crew then attempted to follow the course assigned by air traffic control, including a right turn. But the aircraft, operating on autopilot and guided by the erroneous starting coordinates, turned left instead, crossing the departure path of a parallel runway. After nearly an hour of fruitless troubleshooting, the crew diverted to Melbourne Airport, as the weather at Sydney had deteriorated. The incident’s cause was clear: the mistyped longitude triggered a series of events that led the flight crew to believe the aircraft had malfunctioning avionics. The extensive post-incident troubleshooting concluded that the only problems were erroneous human data entry and missed clues that would have highlighted the problem.
7.5.9.4 Difficulties with fume investigations of Ryanair’s Boeing 737⁴⁴
On 1 September 2014, a Ryanair flight crew on a B737-800 reported an “electrical smell” after landing with a newly installed auxiliary power unit (APU) activated; the new unit had replaced one removed due to “hot-section distress”. Maintenance crews could not find any problems with the APU, but asked flight crews to monitor the system. Between 3 and 18 September 2014 there were several separate reports of odors on the flight deck, with descriptions ranging from “slight smell” to “cheesy smell” to “seriously obnoxious smell”, particularly during descents with the engines idling. Maintainers investigated the various components of the bleed-air delivery system after every incident, and ultimately replaced numerous components, including both engines and the APU. Reports from the engine overhaul facility found that: “Oil leakage may have been present, but not to an extent that it would cause significant oil smell in cabin complaints,” according to the Air Accident Investigation Unit (AAIU). The most likely source of the odors was oil in the air conditioning system ductwork from the faulty APU that was installed on 1 September 2014. On 18 September the aircraft’s captain “became aware of an unusual smell” as the aircraft descended through 20,000 ft for an approach to London Stansted Airport. The captain and first officer, who did not notice a smell, donned oxygen masks, declared an emergency and landed at London Stansted. After the aircraft was taken out of service following the incident, maintenance actions included performing the oil-contamination removal task. The aircraft was put back into service seven days later and did not experience any further odor events. AAIU investigators found that the most likely cause of the numerous reports was an internal oil leak in the APU. The leak, which the AAIU found was caused by a faulty bearing repair during APU maintenance, likely contaminated the ductwork in the bleed-air system primarily feeding the flight deck.
⁴³ MIRCE Akademy Archive - MIRCE Functionability Event 20161004
⁴⁴ MIRCE Akademy Archive - MIRCE Functionability Event 20140918
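The size of the data-entry error described in 7.5.9.3 can be checked with a short calculation. The sketch below assumes the entries follow a DDDMM.M layout (three digits of degrees, then minutes), which is inferred from the two values quoted in the text; the function names and the one-degree tolerance are hypothetical and do not describe the actual ADIRS logic.

```python
def parse_longitude(entry: str) -> float:
    """Convert a DDDMM.M string, such as '15109.8', to decimal degrees east."""
    degrees = int(entry[:3])       # first three digits: whole degrees
    minutes = float(entry[3:])     # remainder: minutes and decimal minutes
    if degrees > 180 or minutes >= 60.0:
        raise ValueError(f"not a valid longitude entry: {entry!r}")
    return degrees + minutes / 60.0


def is_plausible(entry: str, reference_deg: float, tolerance_deg: float = 1.0) -> bool:
    """Hypothetical cross-check: accept an entry only if it lies near a known position."""
    return abs(parse_longitude(entry) - reference_deg) <= tolerance_deg


intended = parse_longitude("15109.8")   # 151 deg 9.8 min E, the gate position
typed = parse_longitude("01519.8")      # 15 deg 19.8 min E, as entered
error_deg = intended - typed            # difference in degrees of longitude

print(f"intended {intended:.3f} E, typed {typed:.3f} E, error {error_deg:.1f} deg")
print("typed entry plausible near gate:", is_plausible("01519.8", intended))
```

Both strings are well-formed longitudes, so a format check alone cannot catch the slip; only a comparison against an independent position, such as the known gate coordinates, exposes the error of roughly 136 degrees.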
7.5.9.5 Ground crew “sucked” into an Air India aircraft engine⁴⁵
On 14 December 2015, a member of the Air India ground crew was “sucked into” an aircraft engine and killed. The technician, who worked for Air India, died while working on a plane that was due to fly from Mumbai to Hyderabad. The plane was “pushing back” from the gate to begin taxiing to the runway when the accident happened.
7.5.9.6 Tug caused Southwest nose gear snap on B737-300⁴⁶
On 4 August 2016, excessive speed by a tug driver caused the nose gear of a Southwest Airlines Boeing 737-300 to collapse as it was pushed back from the gate at Baltimore-Washington International Airport. None of the 135 people on board were injured, but the aircraft was “substantially” damaged when the nose gear collapsed in the forward direction, damaging the gear structure, the nose gear well and the forward bulkhead. With help from an airport surveillance video, it was calculated that the tug was pushing the aircraft back at approximately 7 mph, while the airline’s general operating manual specifies that pushback be carried out at walking speed. According to the pilots, the aircraft bounced several times during the pushback before the gear collapsed and the nose fell. The tug driver said he had tried to slow down the pushback, having started too fast, but applying the tug brakes did not slow down the aircraft. Instead, the braking caused it to “start to rock and bounce,” he said. “As I finally got the (tug) to slow up, the plane then had too much momentum and pulled away from me and the tow bar pulled the nose gear off the plane.”
⁴⁵ MIRCE Akademy Archive - MIRCE Functionability Event 20151214
⁴⁶ MIRCE Akademy Archive - MIRCE Functionability Event 20160804
7.5.9.7 Smoke event involving Airbus A380⁴⁷
On 15 May 2016, a Qantas Airways Airbus A380 (VH-OQD) was en route from Sydney, New South Wales, to Dallas-Fort Worth, USA, when, approximately two hours prior to arrival, a passenger alerted the cabin crew to the presence of smoke in the cabin. The cabin crew then initiated the basic fire drill procedure. Two of the cabin crew proceeded to the source of the smoke with fire extinguishers. At the same time, the customer services manager (CSM) made an all-stations emergency call on the aircraft interphone to alert the flight crew and other cabin crew to the presence of smoke. The cabin crew located the source of the smoke at seat 19F, on the upper deck. The crew removed the seat cushions and covers from the seat while the CSM turned off the power to the center column of the seats. When the seat was further dismantled, the crew found a crushed personal electronic device (PED), containing a lithium battery, wedged tightly in the seat mechanism. By that time, the PED was no longer emitting smoke, but a strong acrid smell remained in the cabin. The crew then maneuvered the seat, freed the PED and placed it in a jug of water, which was then put in a metal box and monitored for the remainder of the flight.
7.5.9.8 Near loss of A330 due to positioning of captain’s personal camera⁴⁸
A report published on 4 September 2015 by the Military Aviation Authority describes how an Airbus A330-200 Voyager multi-role tanker transport came close to being lost with all 198 passengers and crew on board. The event took place during a trooping flight to Afghanistan on 9 February 2014 at 33,000 ft over the Black Sea. The captain was alone on the flight deck as the co-pilot took a break. During this time, the captain took 28 photos of the flight deck using his personal digital camera before placing it between the captain’s seat armrest and the left-hand side-stick controller. One minute before the incident, the captain moved his seat forward, jamming the camera between the armrest and the side-stick; this forced the side-stick fully forward and initiated a pitch-down command. The stick command disconnected the autopilot and sent the aircraft into a steep dive, losing 4,400 ft in 27 sec. With no co-pilot in the right-hand seat, the command could not be countermanded. The aircraft’s on-board self-protection systems overrode the stick input, with pitch-down protection activated 3 sec after the pitch-down command was given, while high-speed protection was triggered 13 sec after the event started as the aircraft passed through 330 kt. With the flight control system idling the engines, the aircraft recovered from the dive to level flight. The report states that the camera became free from the side-stick and armrest after 33 sec. During the action in the cockpit, passengers and crew in the cabin were thrown to the ceiling, with 24 passengers sustaining injuries during the dive, along with all seven of the cabin crew. Most of the injuries occurred as the individuals hit the ceiling and overhead fittings or were struck by loose objects. The flight was diverted to Incirlik air base in Turkey, where it landed safely. Although the event caused damage to a number of fixtures and fittings inside the cabin, there was no damage to the cockpit and no structural damage to the aircraft.
⁴⁷ MIRCE Akademy Archive - MIRCE Functionability Event 20160515
⁴⁸ MIRCE Akademy Archive - MIRCE Functionability Event 20150904
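To put the Voyager dive in 7.5.9.8 into more familiar units, the figures quoted above (4,400 ft lost in 27 sec) can be converted to an average descent rate. This is only a worked calculation on the numbers in the text; it gives the mean rate over the whole dive, not the peak rate, which is not stated.

```python
ALTITUDE_LOST_FT = 4400   # from the report summary above
DIVE_DURATION_S = 27      # from the report summary above

avg_rate_ft_per_s = ALTITUDE_LOST_FT / DIVE_DURATION_S
avg_rate_ft_per_min = avg_rate_ft_per_s * 60

# Seconds after the pitch-down command, as quoted in the text.
timeline_s = {
    "pitch-down protection activated": 3,
    "high-speed protection triggered": 13,
    "camera freed from side-stick": 33,
}

print(f"average descent rate: {avg_rate_ft_per_min:.0f} ft/min")
for event, t in sorted(timeline_s.items(), key=lambda item: item[1]):
    print(f"t+{t:>2} s  {event}")
```

The average works out to nearly 9,800 ft/min, which is consistent with unrestrained occupants being thrown to the ceiling.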
7.5.9.9 Confusion over power setting key factor in Emirates crash⁴⁹
On 3 August 2016, an Emirates Boeing 777-300, operating as Flight 521, slid down the runway, burst into flames and was completely destroyed. Twenty-one passengers, one cabin crew member and one pilot suffered minor injuries, while one flight attendant was seriously injured. A firefighter died when the center fuel tank exploded 8 min after the failed landing. The report, released by the General Civil Aviation Authority (GCAA) of the United Arab Emirates, says the pilots tried a go-around following a long landing, but moved the thrust levers from the idle position to full forward only 3 sec before impact on the runway. The 34-year-old captain was the pilot flying, with 7,457 total flight hours and 5,128 hr on the aircraft type. The 777-300 was configured for landing with flaps set at 30 and an approach speed of 152 kt selected, as it neared runway 12L. There was a wind shear warning in place for all runways and Dubai air traffic control cleared the flight to land with wind reported from 340 deg. at 11 kt. As the aircraft descended through 1,100 ft at 152 kt, the wind direction started to change from a headwind component of 8 kt to a tailwind. The autopilot was disconnected at 920 ft, but the autothrottle remained engaged. The tailwind component increased to 16 kt. The pilot flying flared the aircraft at 35 ft and 159 kt, and the autothrottle transitioned to idle. During the flare and 5 sec before eventual touchdown, the wind changed back to a headwind. Wheel sensors indicated the right main landing gear touched the ground at 12:37 a.m. local time, already 1,100 meters from the threshold and at a speed of 162 kt. The left gear made contact with the runway 3 sec later, but the nose gear remained airborne. The aircraft’s runway awareness advisory system warned the crew about the long landing, following which the decision to go around was made. After lift-off, the flap lever was moved to the 20 position and the landing gear was selected to the up position. The aircraft was cleared by air traffic control for a straight runway heading and a climb to 4,000 ft. The 777 then climbed to a maximum of 85 ft and an indicated airspeed (IAS) of 134 kt. According to the report, the aircraft began sinking back toward the runway and the first officer called out, “Check speed.” Three seconds before impact the thrust levers were moved to full forward. One second before the aircraft hit the ground, with the gear in the process of retracting, the engines started to respond. The report concluded, “The aircraft was in a rapidly changing and dynamic flight environment. The initial touchdown and transition of the aircraft from air to ground mode, followed by the lift off and the changes in the aircraft configuration in the attempted go-around, involved operational modes, logics and inhibits of a number of systems, including the autothrottle, the air/ground system, the weather radar and the GPWS.”
⁴⁹ MIRCE Akademy Archive - MIRCE Functionability Event 20160803
7.5.9.10 USAF spreads blame for fatal WC-130H crash⁵⁰
According to an Aircraft Accident Investigation Board (AAIB) report of 12 November 2018, less than 2 min after taking off from Savannah/Hilton Head International Airport in Georgia, the Puerto Rico Air National Guard pilot caused the WC-130H aircraft to stall and crash by commanding a leftward yaw while already banking left at low speed, despite the failure of the outboard engine on the left wing, killing all eight crewmembers. The U.S. Air Force investigation report cited a pilot’s mistakes, a maintenance crew’s failures and an overall “culture and climate of complacency” as causal factors. The board singled out the rudder input as the primary cause of the accident since it led the WC-130 into a “skid,” slowing its speed until the left wing stalled at an altitude of 900 ft above sea level. The WC-130’s flight manual advises crews to avoid banking toward the side of an inoperative engine, as this requires an increase in airspeed to stay above the minimum control speed. Although the rudder input was the primary cause, the 52-page AAIB report documented a long list of errors and deficiencies that led to the fatal crash on what should have been an uneventful flight to retire the 53-year-old aircraft in the Arizona desert. The AAIB report highlights the conditions at Muñiz Air National Guard Base in Carolina, Puerto Rico, which at the time of the crash was still recovering from the devastation of Hurricane Maria in September 2017. The report criticized the 156th Airlift Wing for a “lack of initiative or urgency to repair, replace, or fix the structural damage to several buildings from Hurricane Maria.” The inadequate facilities required the wing to fly the WC-130H to another Air National Guard base in Savannah in early April to fix a faulty fuel cell. As another crew ferried the WC-130H to Savannah, the outboard engine on the left wing, a Rolls-Royce T56-A-15 turboprop, malfunctioned, with its revolutions per minute (RPM) dropping to 96%. The report finds several mistakes made by the maintenance crew dispatched to Savannah to fix the engine problem on 24 April. The crew skipped the first step of the maintenance test procedure, the AAIB report says. The manual requires the crew to plug a precision tachometer into the engine to measure RPM during an engine ground test. All but one of the precision tachometers were broken, and the only functioning device was being used by another crew elsewhere. Thus, the maintenance crew borrowed a precision tachometer from the host unit in Savannah, but it was a different model that did not fit the WC-130H’s engine without an adapter. Rather than search for an available adapter, the maintenance crew decided to skip that step of the procedure, relying on the aircraft’s less precise, built-in engine tachometer to measure the RPM, the accident report says. The T56 is designed to operate at 100% RPM, but the crew concluded the problem was fixed after observing a tachometer reading of 99% during a second engine test on the ground, the AAIB found. In fact, an inspection of the aircraft’s data recorder showed that the engine achieved an RPM of only 96.8% during the test, indicating the problem may never have been fixed.
⁵⁰ MIRCE Akademy Archive - MIRCE Functionability Event 20181112
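The tachometer episode in 7.5.9.10 illustrates why a pass/fail decision should allow for the uncertainty of the measuring instrument. The sketch below is illustrative only. The 99% indicated and 96.8% recorded values come from the text, whereas the acceptance limit and the gauge uncertainty are hypothetical placeholders and are not taken from the T56 maintenance manual.

```python
def passes_with_guard_band(reading_pct: float,
                           min_acceptable_pct: float,
                           instrument_uncertainty_pct: float) -> bool:
    """Accept only if the reading clears the limit even in the worst case."""
    return reading_pct - instrument_uncertainty_pct >= min_acceptable_pct


INDICATED_RPM_PCT = 99.0     # built-in tachometer reading (from the text)
RECORDED_RPM_PCT = 96.8      # data recorder value for the same test (from the text)
MIN_ACCEPTABLE_PCT = 98.0    # hypothetical acceptance limit, for illustration only
GAUGE_UNCERTAINTY_PCT = 2.0  # hypothetical uncertainty of the built-in gauge

naive_pass = INDICATED_RPM_PCT >= MIN_ACCEPTABLE_PCT
guarded_pass = passes_with_guard_band(
    INDICATED_RPM_PCT, MIN_ACCEPTABLE_PCT, GAUGE_UNCERTAINTY_PCT
)
indication_error = INDICATED_RPM_PCT - RECORDED_RPM_PCT

print("naive acceptance:", naive_pass)
print("guard-banded acceptance:", guarded_pass)
print(f"indicated minus recorded: {indication_error:.1f} percentage points")
```

Under these assumed limits, the coarse reading passes a naive comparison but fails once the instrument uncertainty is subtracted, which is the situation the precision tachometer step is intended to prevent.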
7.5.10 Physical reality: Maintenance induced failures
The following maintenance-induced failures and their consequences in commercial aviation were reported, among many others, by the National Transportation Safety Board of the USA and the Civil Aviation Authority of the UK, as published on 12 August 2002:
• 25 May 2002: China Airlines B747-200 experienced a structural failure at top of climb to cruise altitude, resulting in a crash into the Taiwan Strait, due to the use of a steel doubler, which is prohibited by the structural repair manuals, while repairing a previous tail strike. Toll: 225 killed.
• 24 August 2001: Air Transat A330. Improper engine repair caused a leak from a cracked fuel line, resulting in dual-engine flameout at cruise over the Atlantic. Aircraft glided 135 miles to an emergency landing in the Azores. No serious injuries.
• 26 April 2001: Emery Worldwide Airlines DC-8-71F. Left main landing gear would not extend for landing. Cause was failure of maintenance to install the correct hydraulic landing gear extension component and the failure of inspection to comply with post-maintenance test procedures. No injuries.
• 20 March 2001: Lufthansa A320. Cross-connected pins reversed the polarity of the captain’s side-stick. Post-maintenance functional checks failed to detect the crossed connection. Aircraft ended up in a 21° left bank, almost hitting the ground. Co-pilot switched his side-stick to priority and recovered the aircraft. No injuries.
• 16 February 2000: Emery Worldwide Airlines DC-8-71F. Crashed attempting to return to Rancho Cordova, California. Cause was an improperly installed right elevator control. Toll: 3 crew killed.
• 31 January 2000: Alaska Airlines MD-83. Crashed in the Pacific Ocean near Port Hueneme due to loss of the horizontal stabilizer, caused by the maintainers’ failure to lubricate the jackscrew assembly that controls pitch trim. Toll: all 88 aboard killed.
• 21 January 1998: Continental Express ATR-42. Fire in right engine during landing, due to improper overhaul of lugholes in the fuel/oil heat exchanger. No serious injuries.
• 27 September 1997: Continental Airlines B737. Separation of aileron bus cable forced the crew to return to the airport shortly after takeoff. Separation was caused by wear in the cable and inadequate inspection of it. No serious injuries.
• 18 March 1997: Continental Airlines DC-9-32. Failure of maintenance personnel to perform a proper inspection of the combustion chamber outer case, allowing a detectable crack to grow to a length at which the case ruptured, causing uncontained failure of the right engine. No injuries.
• 17 July 1996: TWA Flight 800, B747. Fuel/air explosion due to inadequate maintenance on an aging fleet and noncompliant parts. Toll: all 230 passengers and crew killed.
• 6 July 1996: Delta Air Lines MD-88. Uncontained engine failure on takeoff due to inadequate parts cleaning, drying, processing and handling. Toll: two passengers killed, two passengers seriously injured.
• 8 June 1995: ValuJet Airlines DC-9-32. Maintenance technicians failed to perform a proper inspection of the 7th stage high compression disk, allowing a detectable crack to grow to a length at which it ruptured. Toll: 1 crew seriously injured.
• 12 February 1995: British Midland B737-400. Oil pressure lost on both engines. Covers had not been replaced after a borescope inspection the previous night, resulting in loss of almost all oil from both engines during flight. Diverted and landed safely. No injuries.
• 1 March 1994: Northwest Airlines B747. At Narita, the lower forward engine cowling dragged along the runway. During maintenance, the No. 1 pylon diagonal brace primary retainer had been removed but not reinstalled. No injuries.
• August 1993: Excalibur Airways A320. Un-commanded roll in first flight after flap change. Returned to land safely at Gatwick. Lack of adequate briefing on status of spoilers (in maintenance mode) during shift change. Locked spoiler not detected during standard pilot functional checks. No injuries.
• 11 September 1991: Continental Express Airlines EMB-120. Horizontal stabilizer separated from the fuselage during flight because maintenance personnel failed to install 47 screw fasteners. Toll: all 14 passengers and crew killed.
• 21 August 1990: United Airlines B737. Flashlight left by maintenance was sandwiched between the cargo floor and the landing gear retract/extend linkage, causing the crew to make a gear-up landing. No injuries.
• 22 July 1990: USAir B737. A fuel pump control failure due to improper machining. No injuries.
• June 1990: British Airways BAC 1-11. Captain sucked halfway out of windscreen, which blew out under effects of cabin pressure, as 84 of 90 securing bolts were smaller than the specified diameter. Toll: 1 serious injury.
• 12 August 1985: Japan Air Lines B-747SR. Improper repair of aft pressure bulkhead led to sudden decompression in flight that damaged hydraulic systems and vertical fin. Aircraft struck Mt. Ogura. Toll: 520 passengers and crew killed; four surviving passengers injured.
7.5.11 Physical reality: Reliability is dependent on natural environment
7.5.11.1 Hailstorm-damaged Boeing 787 returns to China⁵¹
On 29 July 2015, an American Airlines Boeing 787 was climbing out of Beijing, China, when it encountered a hailstorm that left the 3-month-old airplane somewhat beaten up. Flight 88 from Beijing to Dallas/Fort Worth Airport (DFW) was about 20 min out of Beijing and climbing above 26,000 ft when it began descending. It landed back at Beijing less than 45 min after takeoff. The composite fuselage, one of the things that separate the Boeing 787 from most other airplanes, took no apparent damage from the hailstorm. The radome, the nose cone that protects the radar and other avionics on the airplane’s front tip, was hammered. It was replaced in Beijing with a spare radome that American flew over to the Beijing airport. Maintenance crews also covered some small punctures on the wing’s underside with speed tape, a strong, thin aluminum tape. The airplane was flown to Tokyo’s Narita International Airport, where maintenance personnel replaced the side windscreens on the left and right sides. Those windscreens’ outer panels had cracked on their front edges and bottoms, but the inner panels were not damaged, and the integrity of the windows was maintained throughout. Then, when the airplane returned to American’s repair facilities at Dallas/Fort Worth, the major inspections started. In all, 44 panels were removed and shipped to American’s composite repair shop at its Tulsa maintenance base for repairs and repainting. The large autoclave there enables many of them to be repaired at one time. Some curved aluminum pieces that form the wing’s leading edge were also being replaced.
7.5.11.2 Unfavourable winds delay test flight of NASA’s low-density supersonic demonstrator⁵²
On 12 June 2014, NASA suspended efforts at the U.S. Navy’s Pacific Missile Range Facility to test launch a disk-shaped craft for the demonstration of technologies intended to greatly increase the payload mass that can be landed on the Martian surface, due to “two weeks of uncooperative wind conditions”. The announcement followed half a dozen attempts since 3 June to launch the rocket-powered Supersonic Inflatable Aerodynamic Decelerator from a high-altitude balloon. The NASA team had studied wind data in the region from 2012-13 that suggested early June was favorable for the test flight. However, the weather pattern in the Northern Hemisphere changed in 2014, leading to a longer winter and unfavorable winds in the region. The test flight represents a major milestone for the $200 million, five-year initiative managed by NASA’s Space Technology Mission Directorate.
⁵¹ MIRCE Akademy Archive - MIRCE Functionability Event 20150729
⁵² MIRCE Akademy Archive - MIRCE Functionability Event 20140612
7.5.11.3 Rat on plane forces Air India flight to return to Mumbai⁵³
On 31 December 2015, an Air India aircraft flying to London was forced to return to Mumbai after passengers reported spotting a rat on board. Though the rat was not found, the pilot returned to Mumbai keeping passenger safety in mind, Air India said in a statement. A separate aircraft later flew the passengers to London. The aircraft was to be fumigated and checked before being returned to service. Maintenance workers checked that the rat had not damaged equipment or chewed any wires, and the plane was certified to be rodent-free.
7.5.11.4 Elevator malfunctions in MD-83’s rejected takeoff⁵⁴
On 8 March 2017, Ameristar Jet Charter pilots attempting to take off from Runway 23L at Willow Run Airport in Ypsilanti, Michigan, were not able to lift the nose of the aircraft at the 152-kt takeoff speed due to a jammed right elevator. Based on a preliminary report by the NTSB, flight data recorder (FDR) information showed that the pilots continued accelerating for 5 sec with no pitch change, until reaching a speed of 166 kt, before initiating a rejected takeoff procedure. The aircraft travelled 1,000 ft past the end of the runway, coming to rest in a field where the 109 passengers and seven crewmembers evacuated using escape slides. One passenger received a minor injury. The NTSB said the forward right slide did not deploy correctly. A strong headwind with a right crosswind component was blowing at the time of takeoff, from 260 deg. at 35 kt, gusting to 50 kt. According to investigators, a post-accident examination revealed that the cockpit controls moved normally; however, upon inspecting the elevator assembly on the tail, investigators found the right side to be jammed. The cause was a bent linkage to a control tab on the trailing edge of the right elevator, which prevented the elevator from moving to the nose-up position. The left side of the split elevator functioned normally. Data from previous flights showed both elevators operating normally. One possible cause the NTSB will investigate is whether strong winds may have damaged the elevator while the aircraft was parked after arriving on 6 March in Ypsilanti. According to Weather Underground, winds were gusting to 25 kt on 6 March, to 35 kt on 7 March and to 50 kt on 8 March.
⁵³ MIRCE Akademy Archive - MIRCE Functionability Event 20151231
⁵⁴ MIRCE Akademy Archive - MIRCE Functionability Event 20170308
7.5.11.5 Lightning strikes caused power cut on national grid in UK⁵⁵
The National Grid in the UK suffered a power failure on 9 August 2019. The outage left 1.1 million customers without power for between 15 and 50 minutes. Problems on the railways were mainly blamed on one particular type of train, of which there were around 60 in use, reacting unexpectedly to the outage; half of them failed to restart and required an engineer to attend. Other “critical facilities” hit by the power cut included Ipswich hospital and Newcastle airport. National Grid faced an investigation by Ofgem over this event; the regulator has the power to fine firms up to 10% of UK turnover. The failures knocked out the Hornsea off-shore wind farm, off the Yorkshire coast, owned by the Danish company Oersted, as well as the Little Barford gas power station in Bedfordshire, owned by Germany’s electric utility company RWE, resulting in the loss of 1,378MW. That was more than the 1,000MW being kept in reserve by National Grid at that time, a level designed to cover the loss of the single biggest power generator to the grid. The preliminary report blamed an “extremely rare and unexpected” outage at two power stations caused by one lightning strike at 4.52pm that day. That resulted in a combined power loss to the network that was greater than the backup capacity held in case of emergency. The report said the system automatically turned off 5% of Britain’s electricity demand to protect the other 95%, a situation that it said had not happened in over a decade. The National Grid also admitted that the government, the regulator and the media were not made aware of what had happened as quickly as they should have been, a delay it described as “impacted by the availability of key personnel given it was 5pm on a Friday evening”. The business department was not updated until 5.40pm and Ofgem until 5.50pm, nearly an hour after the initial event.
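The reserve arithmetic behind the National Grid event can be made explicit. The sketch below uses only the two figures quoted above (1,378MW lost, 1,000MW held in reserve); the function name is hypothetical, and the calculation ignores system dynamics such as frequency response and inertia, which the text does not describe.

```python
def reserve_shortfall_mw(lost_generation_mw: float, reserve_mw: float) -> float:
    """Return the generation deficit left after the held reserve is used up."""
    return max(0.0, lost_generation_mw - reserve_mw)


LOST_MW = 1378.0      # combined loss of the two generators (from the text)
RESERVE_MW = 1000.0   # reserve sized for the single largest loss (from the text)

shortfall = reserve_shortfall_mw(LOST_MW, RESERVE_MW)
coverage = RESERVE_MW / LOST_MW

print(f"shortfall: {shortfall:.0f} MW")
print(f"reserve covered {coverage:.1%} of the lost generation")
```

A reserve sized for the loss of one generator covered a little under three quarters of the simultaneous loss of two, and the remaining deficit is what the automatic disconnection of demand had to absorb.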
7.5.11.6 Plastic sandwich bag caused retirement of Williams F1 car in Melbourne⁵⁶
Brake failures in F1 are rare, especially early in a race, but Sergey Sirotkin’s F1 debut in Australia, on 25 March 2018, was just five laps old when he ran out of brakes and rolled to a stop up the escape road at Turn 13. Understandably, the Williams team was keen to find out what had happened. The result was, “a plastic sandwich bag that went into the rear-right brake duct caused massive overheating, which caused massive temperature spikes destroying the brakes and total loss of the brake pedal”. After the “forensics analysis”, the residue of what looked like a melted plastic bag was found to have completely blocked the right-rear brake duct. Temperatures went through the roof, the brakes eventually caught fire, and catastrophic failure followed. The sensors were lost progressively as they burned, and the calliper seal probably failed in the end, because there was a fluid leak and the pedal went to the floor.
⁵⁵ MIRCE Akademy Archive - MIRCE Functionability Event 20190809
⁵⁶ MIRCE Akademy Archive - MIRCE Functionability Event 20180325
7.5.11.7 A Burst of asteroid activities in Europe57 According to the European Space Agency (ESA) expected a rare convergence of asteroid-related activities in Europe. They estimated that around 10 September 2019, there would have been 878 asteroids in the ‘risk list’. This ESA catalogue brought together all asteroids known of having a ‘non-zero’ chance of impacting Earth in the next 100 years, meaning that an impact, however unlikely, cannot be ruled out. An impact by even a small asteroid could cause serious destruction to inhabited areas. This is why the ESA, together with international partners, are taking action to search for asteroids, develop technology that could deflect them in future and collaborate at the international level to support mitigation measures. Thus, planetary defense and other experts are meeting in three locations to coordinate humanity’s efforts to defend us from hazardous space rocks. Such intense levels of international scientific collaboration are driven in part by the fact that an asteroid impact could cause devastating effects on Earth. But this is also a testament to the fact that we are at a point in human history where we can do something about risky asteroids. The flurry of upcoming meetings will cover vital topics in planetary defense, including the planned, first-ever test of asteroid deflection, coordination and communication of asteroid warnings and how to ensure the most effective emergency response on the ground. With all the work being done, the planet has never been so prepared for the unlikely but very real threat of an asteroid impact. 7.5.11.8 Northeast Airlines cancelled 1,900 U.S. flights due to storm On 26 January 2015 air travel to New York is being slashed as carriers scrap thousands of U.S. flights to keep planes, crew and passengers out of the path of a blizzard threatening the Northeast with as much as 2 ft of snow. 
New York’s three airports, LaGuardia, Kennedy and New Jersey’s Newark Liberty, were feeling the brunt of the schedule changes. Airlines eliminated about half of Monday’s arrivals at the trio of hubs, which make up the busiest U.S. travel market, while departures were cut by more than a third. Preliminary cancellations in the face of foul weather help carriers in part by relocating aircraft to unaffected airports. That positions airlines to resume service faster once flight conditions improve.
⁵⁷ MIRCE Akademy Archive - MIRCE Functionability Event 20190900
7.5.11.9 Impact of bird strikes on aircraft reliability⁵⁸
The U.S. Department of Agriculture, through an interagency agreement with the Federal Aviation Administration (FAA), compiles a database of all reported bird/wildlife strikes to U.S. civil aircraft and to foreign carriers experiencing strikes in the USA. Over 87,000 strike reports from over 1,650 airports were compiled for 1990-2008. The FAA estimates that this represents only about 20% of the strikes that have occurred. The following historical examples of strikes from 1905-1989 and examples from the database from 1990-2008 are presented to show the serious impact that strikes by birds or other wildlife can have on aircraft, an impact that should surely be reflected in the aircraft's reliability function. Selected examples of bird strikes are presented below⁵⁹:
• 4 October 1960, a Lockheed Electra turbo-prop ingested European starlings into all four engines during takeoff from Boston Logan Airport (Massachusetts). The plane crashed into Boston Harbor, killing 62 people. Following this accident, the FAA initiated action to develop minimum bird ingestion standards for turbine-powered engines.
• 26 February 1973, on departure from Atlanta’s Peachtree-Dekalb Airport (Georgia), a Lear 24 jet struck a flock of brown-headed cowbirds attracted to a nearby trash disposal area. Engine failure resulted. The aircraft crashed, killing seven people and seriously injuring one person on the ground. This incident prompted the FAA to develop guidelines for the location of solid waste disposal facilities on or near airports.
• 12 November 1975, on departure roll from John F. Kennedy International Airport (New York), the pilot of a DC-10 aborted takeoff after ingesting gulls into one engine. The plane ran off the runway and caught fire as a result of engine fire and overheated brakes. The resultant fire destroyed the aircraft. All 138 people on board were evacuated safely.
• 25 July 1978, a Convair 580 departing Kalamazoo Airport (Michigan, USA) ingested one American kestrel into an engine on take-off. The aircraft crashed in a nearby field, injuring three of the 43 passengers.
• 1980, a Royal Air Force Nimrod aircraft lost control and crashed after ingesting a number of birds into multiple engines at Kinross, Scotland.
• 18 June 1983, during the landing process at Clifford, Texas, the pilot of a Bellanca 1730 saw two “buzzards” on final approach. He added power and maneuvered to avoid them, then continued the approach, which resulted in a landing beyond the intended point. As the middle of the runway was higher than either end, the pilot was unable to see a large canine moving toward the landing area until the aircraft was halfway down the runway. A go-around was initiated, but the lowered landing gear hit some treetops, causing the pilot to lose control. The aircraft came to rest about 250 yards from the initial tree impact after flying through additional trees. The aircraft suffered substantial damage, and two people in the aircraft were seriously injured.
• September 1987, a U.S. Air Force B1-B lost control and crashed after an American white pelican struck the wing root area and damaged the hydraulic system. The aircraft was on a low-level, high-speed training mission in Colorado, USA. Only three of the six occupants survived this negative functionability event.
• 5 November 1990, during takeoff at Michiana Regional Airport (Indiana), a BA-31 flew through a flock of mourning doves. Several birds were ingested in both engines, and take-off was aborted. Both engines were destroyed. Cost of repairs was $1 million.
• 30 December 1991, a Citation 550, taking off from Angelina County Airport (Texas), struck a turkey vulture. The strike caused major damage to engine number 1, and the resulting shrapnel caused minor damage to the wing and fuselage. Cost of repairs was $550,000.
• 3 December 1993, a Cessna 550 struck a flock of geese during the initial climb out of Du Page County Airport (Illinois). The pilot heard a loud bang, and the aircraft yawed to the left and right. Instruments showed loss of power to engine number 2 and a substantial fuel leak on the left side. An emergency was declared, and the aircraft landed at Midway Airport. The cost to repair two engines was $800,000, and the aircraft spent 3 months in the repair shop.
• 3 June 1995, an Air France Concorde, while landing at John F. Kennedy International Airport (New York), ingested one or two Canada geese into engine number 3. The engine suffered an uncontained failure, and its shrapnel destroyed engine number 4 and cut several hydraulic lines and control cables. The pilot landed safely, but the runway was closed for several hours. The repair cost was around $7 million.
• 22 September 1995, an Airborne Warning and Control System aircraft (known as AWACS) crashed, killing all 24 on board. The cause of the accident was ingestion of four Canada geese into engines 1 and 2 during take-off from Elmendorf Air Force Base (Alaska).
• 14 July 1996, a NATO E-3 AWACS aircraft struck a flock of birds during takeoff at Aktion Airport in Greece. The crew aborted the takeoff and the aircraft overran the runway. The aircraft was not repaired; none of the crew was seriously injured.
• 15 July 1996, a Belgian Air Force Lockheed C-130 struck a large flock of starlings during approach to Eindhoven, Netherlands and crashed short of the runway. All four members of the crew and 30 of the 37 passengers were killed.
• 5 October 1996, a Boeing-727 departing Washington DC Reagan National Airport struck a flock of gulls just after takeoff, ingesting at least one bird. One engine began to vibrate and was shut down. As a burning smell entered the cockpit, the pilot declared an emergency, and the aircraft, carrying 52 passengers, landed at Washington Reagan National. Several engine blades were damaged.
• 7 January 1997, an MD-80 aircraft struck over 400 blackbirds just after takeoff from Dallas-Fort Worth International Airport (Texas). Almost every part of the plane was hit. The pilot declared an emergency and safely landed. Substantial damage was found on various parts of the aircraft, and engine number 1 had to be replaced. The runway was closed for 1 hour. The birds had been attracted to an unharvested wheat field close to the airport.
• 9 January 1998, while climbing through 3,000 feet following takeoff from Houston Intercontinental Airport (Texas), a Boeing-727 struck a flock of snow geese, with three to five birds ingested into one engine. The affected engine lost all power and was destroyed. The radome was torn from the aircraft and the leading edges of both wings were damaged. The Pitot tube for the first officer was torn off. After declaring an emergency, the flight returned safely to Houston with major damage to the aircraft.
• 22 February 1999, a Boeing-757 departing Cincinnati/Northern Kentucky International Airport was forced to return and make an emergency landing after hitting a large flock of starlings. Both engines and one wing received extensive damage. Around 400 dead starlings were found on the runway area.
• 7 February 2000, a DC-10-30 belonging to an American-owned cargo company ingested a fruit bat into one engine at 250 feet Above Ground Level (AGL) while departing from Subic Bay, Philippines. The aircraft returned to the airport safely. Five damaged fan blades had to be replaced, keeping the aircraft out of service for 3 days. Total repair and related costs exceeded $3 million.
• 21 January 2001, an MD-11 departing Portland International Airport (Oregon) ingested a herring gull into engine number 3 during the takeoff run. The engine stall blew off the nose cowl, which was sucked back into the engine and shredded. The engine had an uncontained failure. The pilot aborted take-off and safely brought the aircraft, carrying 217 passengers, to a stop with two blown tires.
• 9 March 2002, a Canadair RJ 200 at Dulles International Airport, Washington DC, struck two wild turkeys during the takeoff roll. One of them shattered the windshield, spraying the cockpit with glass fragments and remains.
• 19 October 2002, a B767 departing Logan International Airport in Boston encountered a flock of over 20 double-crested cormorants. At least one cormorant was ingested into engine number 2. There were immediate indications of engine surging followed by compression stall and smoke from the engine. The engine was shut down. An overweight landing with one engine was made without incident. The nose cowl was dented and punctured. There was significant fan blade damage with abnormal engine vibration. One fan blade was found on the runway. The aircraft was towed to the ramp. Hydraulic lines were leaking, and several bolts were sheared off inside the engine. Many pieces fell out when the cowling was opened. The aircraft spent 3 days in the repair shop and the total repair bill was $1.7 million.
• 8 January 2003, a Bombardier de Havilland Dash 8 collided with a flock of lesser scaup ducks at 1,300 feet AGL on approach to Rogue Valley International Airport (Oregon). At least one bird penetrated the cabin and hit the pilot, who turned control over to the first officer for landing. Emergency power switched on when the birds penetrated the radome and damaged the DC power system and instrument systems.
• 4 September 2003, a Fokker 100 struck a flock of at least five Canada geese over the runway shortly after take-off at LaGuardia Airport (New York), ingesting one or two geese into engine number 2. The pilot was unable to shut the engine down with the fuel cutoff lever, so the fire handle was pulled and the engine finally shut down. The flight was diverted to nearby JFK International Airport, where a landing was made. A depression with a maximum depth of 10 cm was found on the right side of the nose behind the radome. Impact marks were found on the right wing. A fan blade separated from the disk and penetrated the fuselage. Several fan blades were deformed. Holes were found in the engine cowling. Bird remains were recovered and identified by the Wildlife Services.
• 17 February 2004, a Boeing 757 during a takeoff run from Portland International Airport (Oregon) hit five mallards and returned with one engine out. At least one bird was ingested, and parts of five birds were collected from the runway. As the damaged engine was beyond repair, a new one was fitted at a cost of $2.5 million, keeping the aircraft out of service for 3 days.
• 15 April 2004, an Airbus 319 climbing out of Portland International Airport (Oregon) ingested a great blue heron into engine number 2, causing extensive damage. The pilot shut the engine down as a precaution and made an emergency landing. The runway was closed for 38 minutes for cleaning. The engine and nose cowl were replaced at a cost of $388,000, keeping the aircraft in the repair shop for 72 hours.
• 14 June 2004, a great horned owl struck a Boeing 737 during a nighttime landing roll at Greater Pittsburgh International Airport. The bird severed a cable in the front main gear and disabled the steering system, causing the aircraft to run off the runway and become stuck in mud. Passengers were bused to the terminal. The repair team replaced 2 nose wheels, 2 main wheels and brakes, keeping the aircraft out of service for 24 hours at a total cost of $20,000.
• 16 September 2004, departing Chicago O’Hare (Illinois), an MD-80 hit several double-crested cormorants at 3,000 feet AGL and 4 mi from the airport. Engine number 1 caught fire and failed, sending metal debris to the ground in a Chicago neighborhood. The aircraft made an emergency landing back at O’Hare with no injuries to any of the 107 passengers.
• 24 October 2004, a Boeing 767 departing Chicago O’Hare (Illinois) hit a flock of birds during the take-off run. A compressor stall caused the engine to flame out. The fire department got calls from local residents who reported seeing flames coming from the plane. The pilot dumped approximately 11,000 gallons of fuel over Lake Michigan before returning to land.
The nature of aircraft damage from bird strikes that is significant enough to create a high risk to continued safe flight differs according to the size of the aircraft.
⁵⁸ Knezevic, J., Bird Strike as a Mechanism of the Motion in MIRCE Mechanics, pp 167-173, Journal of Applied Engineering Science, No 3, Vol 12, 2014, Belgrade, Serbia
⁵⁹ Dolbeer, R. A., Birds and aircraft: fighting for airspace in crowded skies, pp 37-43, Proceedings of 19th Vertebrate Pest Conference, University of California, Davis, California, USA, 2000.
7.5.12 Closing remarks regarding physical reality of reliability
The set of physically observed and documented facts presented above seriously calls into question the accuracy of the reliability predictions currently provided through the reliability function, Eq. 7.3, even for the time to the first failure of a system. Moreover, because Eq. 7.3 covers only the time to the first failure, it says nothing about the subsequent physically observable phenomena during the in-service operation of systems, which are totally “nonexistent” as far as the reliability function of a system is concerned.
7.6 Mathematical versus physical reality of reliability
Based on the information provided thus far, it is possible to conclude that there are clear differences between the mathematical reality of reliability and the physical reality of reliability observed through the reliability-related events described in the text. The major points of difference between them are presented in Table 7.2.

TABLE 7.2 Comparison between mathematical and physical reality of reliability
| Mathematical reality | Physical reality |
|---|---|
| Quality of produced components and assemblies is one hundred percent | Quality of produced components and assemblies is less than one hundred percent |
| Errors during system transportation, storage and installation tasks are zero percent | Errors during system transportation, storage and installation tasks are greater than zero percent |
| There are no interactions between “independent” components | There are substantial interactions between “independent” components |
| Maintenance activities (inspections, repair, cleaning, etc.) do not exist | Maintenance activities (inspections, repair, cleaning, etc.) do exist |
| System and all components operate continuously (24/7) | Neither system nor all components operate continuously (24/7) |
| First observable failure is a failure of a system | First observable failure is not necessarily the failure of a system |
| Components and a system have the same “times” | Components and a system have different “times” |
| Fixed operation scenario (load, stress, temperature, pressure, etc.) | Variable operation scenario (load, stress, temperature, pressure, etc.) |
| Reliability is independent of the location in space defined by GPS coordinates | Reliability is dependent on the location in space defined by GPS coordinates |
| Reliability is independent of humans | Reliability is dependent on humans |
| Reliability is independent of maintainers | Reliability is dependent on maintainers |
| Reliability is independent of calendar time | Reliability is dependent on calendar time |
| Reliability is independent of environment | Reliability is dependent on environment |

7.7 Closing Question
The main objective of this text was to expose the reliability and safety community to the mathematical and physical realities of the reliability function, with the aim of focusing their attention on the following question: “What is the body of knowledge on which reliability and safety modelling should be based, in order for the predictions made to be confirmed by reliability measures obtained in operationally defined physical reality?” (Varde et al., 2019).

Acknowledgement
The author wishes to acknowledge that the majority of the information regarding the reliability and safety events presented in this text originated from the Aviation Weekly⁶⁰.
Also, the author wishes to acknowledge the contribution of the numerous students at Exeter University (1986-1999) and the MIRCE Akademy (1999-current) towards his endeavor to understand the physical mechanisms that cause occurrences of failure events, through the collection and analysis of information related to the in-service behavior of functionable systems. This is the only way towards the creation of a modelling method that would provide predictions of reliability and safety which are confirmed by measurements obtained in the operationally defined physical reality, as is achievable in mechanical, electrical, chemical, nuclear and other branches of engineering.
⁶⁰ www.aviationweekly.com
References
Dubi, A., 2003. System Engineering Science. MIRCE Science, Exeter, UK, p. 168.
Henry, J., 2017. Knowledge is Power. Icon Books Ltd, London, UK, p. 214.
Knezevic, J., 1993. Reliability, Maintainability and Supportability, A probabilistic approach. McGraw Hill Book Company, London, UK, p. 294.
Knezevic, J., 1995. Modelling System Reliability - Way Ahead, CODERM Newsletter. Ministry of Defence, UK 12 (2), 8–9 June.
Knezevic, J., 2017. The Origin of MIRCE Science. MIRCE Science, Exeter, UK, p. 232. ISBN 978-1-904848-06-6.
Vacher, P., 2006. Wings around the World. Grub Street, London, UK, p. 160. ISBN 1-904943543.
Varde, P.V., Prakash, R.V., Joshi, N.S., 2019. Risk Based Technologies, MIRCE Science Based Operational Risk Assessment. Springer Nature Singapore Pte. Ltd., Singapore, pp. 223–258.
Abstract
According to Knezevic, the purpose of the existence of any functionable system is to do work. The work is done when the expected functionality (function, performance, and attributes) is delivered through time. However, experience teaches us that the work expected to be done is frequently beset by failures, some of which have safety consequences for the users, the natural environment and human communities. Thus, from the late 1950s, reliability models based on a reliability function have been used to predict the impact of design decisions on in-service reliability and safety before finalizing the design. As the accuracy of these predictions is fundamental for the formulation of failure management policies, the author has studied the physical properties that future systems must possess in accordance with the mathematical view of reality firmly embedded in their reliability block diagrams. The results of the study are presented in the first part of the text. These findings are tested through scientific studies of a large number of physically observed failures generated by operation, maintenance and support processes of defence, aerospace, and nuclear power systems. The results obtained, presented in the second part of the text, show significant discrepancies between the mathematical reality of reliability, based on the axioms of probability embedded in the reliability function, and the physical reality observed through the scientific studies of numerous in-service reliability and safety related events. Thus, the main objective of this text was to expose the reliability and safety community to the mathematical and physical realities of the reliability function, with the aim of focusing their attention on the following question: “What is the body of knowledge on which reliability and safety modelling should be based, in order for predictions made to be confirmed by reliability measures obtained in operationally defined physical reality?”
Keywords
Mathematical reality of reliability modelling; Observed failure events; Physical reality of reliability modelling; Reliability function
Chapter 8
Optimum staggered testing strategy for 1- and 2-out-of-3 redundant safety instrumented systems
Sun-Keun Seo ᵃ and Won Young Yun ᵇ
ᵃ Department of Industrial and Management Systems Engineering, Dong-A University, Busan, Korea
ᵇ Department of Industrial Engineering, Pusan National University, Busan, Korea
8.1 Introduction
Since the International Electrotechnical Commission (IEC) issued IEC 61508 in 2000, specifying the requirements for the functional safety of safety instrumented systems (SIS), functional safety of SIS has been required in many industry sectors. In addition to IEC 61508, which is the basic standard on functional safety, sector-specific standards have been enacted for, among others, the process industry, the car industry (ISO 26262, 2011), machinery products, railway vehicles, medical devices, and nuclear power plants. IEC 61508 categorizes the operating modes of SIS into low-demand and high-demand modes. In IEC 61508 (2010 version), if the demand rate is greater than one per year, the operation is classified as a high-demand operating mode, which also includes continuous operation. Otherwise, the operation is classified as a low-demand operating mode. IEC 61508 uses the average probability of failure on demand (PFD) and the probability of dangerous failure per hour (PFH) as system safety measures, instead of the instantaneous availability, unreliability, and steady-state availability commonly used in the reliability area. PFD is similar to unavailability, and PFH is similar to the rate of occurrence of failures (ROCOF) in repairable systems (Seo, 2012). Because PFD is more widely used for SIS in IEC 61508, we focus on PFD as the safety measure.
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00012-X Copyright © 2021 Elsevier Inc. All rights reserved.
In general, failures of units related to functional safety (SIS) have serious consequences for the system, and redundant units are added to reduce the failure probability of safety units on demand. Additionally, a proof test (inspection) is performed periodically to find hidden failures of redundant units. The failures of the SIS are classified as dangerous detected (DD) failures if they are detected by the built-in self-diagnosis test, or as dangerous undetected (DU) failures if they are detected only by the proof test. We try to find the optimal testing method of proof tests for hidden failures of SIS with redundant units. When we test (inspect) SIS periodically, we can test all units together at the same time points or test them at different time points. Testing all units at the same time points with the same cycle may be inefficient for finding hidden failed units, so we consider other testing methods to minimize the system PFD. If we consider staggered testing methods, in which different testing time points are assigned to the redundant units, we can reduce the system PFD in cases with short testing times (Contini et al., 2013; Liu and Rausand, 2013). This chapter deals with staggered testing methods for SIS with redundant structures.
In order to show the performance of staggered testing methods, Green (1972) considered a uniformly staggered testing method with equally divided test intervals, in which all units have an equal testing interval but the first test time points of the units differ and are equally spaced. He obtained the system unavailability for cases in which the failure distribution is exponential and the redundant structures of SIS are parallel and 2-out-of-3 structures. Rouvroye and Wiegerinck (2006) also considered a uniformly staggered testing method for SIS with a 1-out-of-2 structure and derived the PFD by using a continuous-time Markov chain.
Vaurio (1980, 2011) proposed a method to estimate the approximate unavailability of the uniformly staggered testing method for SIS with M-out-of-N structures (N = 2, 3, 4), and also considered common failure modes. For the optimal testing method in staggered testing models, Green (1972) showed, based on the approximate unavailability, that the optimal staggered time point in the case of a 1-out-of-2 structure should be set to half of the test interval. Liu (2014) considered the 1-out-of-2 structure with different failure rates and test intervals, and investigated in detail, under various conditions, the optimality of the staggered test method that sets the midpoint of the test interval as the starting test time point. Part 6 of IEC 61508 (2010) provides approximate formulas for the PFD of SIS with 1, 2, and 3 units in cases with an equal failure rate and an equal proof test interval for all units. In this chapter, we review the optimality of existing testing schemes and obtain the optimal staggered testing in SIS with 1- and 2-out-of-3 structures. Additionally, we consider SIS with 3 different units and study the optimal staggered test methods in various grouping cases.
This chapter is organized as follows. Section 8.2 summarizes the results of previous studies on PFD when periodic tests are performed in the proof test of SIS with redundant units. Sections 8.3 and 8.4 consider SIS with 1-out-of-3 and 2-out-of-3 structures, respectively, and find the optimal staggered testing schemes. Finally, Section 8.5 summarizes the results found in this chapter.
8.2 PFD of redundant safety systems
Part 6 of IEC 61508 provides approximate formulas for the reliability metrics of several safety systems. The first edition of IEC 61508 (2000) provides approximate expressions for PFD and PFH for one-channel, 1-out-of-2, modified 1-out-of-2, 2-out-of-2, and 2-out-of-3 structures, and the second edition of IEC 61508 (2010) adds the result for the 1-out-of-3 structure. In addition, analytical formulas (Oliveira and Abramovitch, 2010; Jahanian, 2015), fault trees, Markov models, and Petri net models are proposed in Part 6 of the second edition of IEC 61508 (2010) to calculate the PFD of various structures. The formulas proposed in the standard are based on the analytical formulas. In general, failures in international standards for functional safety are classified into two types, safe failures and dangerous failures, and the latter are further classified into dangerous detected (DD) and dangerous undetected (DU) failures. The primary failure mode is the DU failure, which is found only on demand or by proof tests. It is also assumed that SIS is inspected periodically with an interval $\tau$, that all potential failures are found, and that SIS is repaired or replaced so that it is restored to as good as new. When SIS is tested periodically with period $\tau$, the PFD of SIS is the average unavailability and is defined as follows (IEC 61508):
$$\mathrm{PFD} = \frac{\mathrm{MDT}(\tau)}{\tau} = \frac{\int_0^{\tau} q_S(t)\,dt}{\tau} \tag{8.1}$$
where $\mathrm{MDT}(\tau)$ is the expected downtime in the interval $(0, \tau)$ and $q_S(t)$ is the unreliability function of the system. When the undetected and detected dangerous failure rates of a unit (channel) are constants, $\lambda_{DU}$ and $\lambda_{DD}$, respectively, PFD in IEC 61508 is approximately defined as follows:
$$\mathrm{PFD} = (\lambda_{DU} + \lambda_{DD})\,t_{CE}, \qquad t_{CE} = \frac{\lambda_{DU}}{\lambda_{DU} + \lambda_{DD}}\left(\frac{\tau}{2} + \mathrm{MRT}\right) + \frac{\lambda_{DD}}{\lambda_{DU} + \lambda_{DD}}\,\mathrm{MTTR}, \tag{8.2}$$
where $t_{CE}$ is the channel equivalent mean downtime, MRT is the mean repair time, and MTTR is the mean time to restoration. Generally, MRT and MTTR are considerably smaller than $\tau$, so only the undetected dangerous failure rate is considered, and it is reasonable to define PFD as $\tau\lambda/2$ by ignoring MRT and MTTR (from now on, $\lambda_{DU}$ is abbreviated as $\lambda$).
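As a minimal sketch of Eq. (8.2), the following Python functions evaluate the channel equivalent mean downtime and the single-channel PFD. The function names and the numerical values in the usage note are illustrative assumptions and are not taken from IEC 61508 or from the chapter.

```python
def channel_equivalent_downtime(lambda_du, lambda_dd, tau, mrt, mttr):
    """t_CE of Eq. (8.2): failure-rate-weighted mean downtime of one channel."""
    lambda_d = lambda_du + lambda_dd
    return (lambda_du / lambda_d) * (tau / 2.0 + mrt) + (lambda_dd / lambda_d) * mttr


def pfd_single_channel(lambda_du, lambda_dd, tau, mrt=0.0, mttr=0.0):
    """Approximate PFD of one channel, Eq. (8.2)."""
    t_ce = channel_equivalent_downtime(lambda_du, lambda_dd, tau, mrt, mttr)
    return (lambda_du + lambda_dd) * t_ce


def pfd_single_channel_simplified(lambda_du, tau):
    """Simplified form tau * lambda / 2, ignoring MRT and MTTR."""
    return lambda_du * tau / 2.0
```

With the hypothetical values lambda_du = 1e-6 per hour and tau = 8760 hours, both functions give the same PFD when lambda_dd, MRT and MTTR are set to zero, which is the simplification used in the rest of the chapter.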
If we test all units simultaneously at specified test time points, the PFD of SIS with an M-out-of-N structure can be obtained as follows (Rausand and Høyland, 2004; Seo, 2012; Rausand, 2014):
$$\mathrm{PFD}_{MooN} = \binom{N}{N-M+1}\frac{(\lambda\tau)^{N-M+1}}{N-M+2} \tag{8.3}$$
When $M = 1$, the PFD in the parallel case is $(\lambda\tau)^N/(N+1)$.
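Eq. (8.3) can be evaluated directly. The short Python sketch below is an illustration only; the function name and argument order are assumptions made for this example.

```python
from math import comb


def pfd_moon_simultaneous(m, n, lam, tau):
    """Approximate PFD of an M-out-of-N structure, Eq. (8.3), when all
    units are proof-tested at the same time points with interval tau."""
    k = n - m + 1  # number of unit failures that fails the structure
    return comb(n, k) * (lam * tau) ** k / (k + 1)
```

For example, with lam * tau = 0.1 the 2-out-of-3 structure gives 3 * 0.01 / 3 = 0.01, and the 1-out-of-3 structure gives 0.001 / 4, in line with the parallel-case expression above.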
8.2.1 PFD in staggered testing
The M-out-of-N redundant structure is most commonly used to improve system reliability, and we apply staggered testing to SIS with such redundant structures. As a simple case, we assume that all units of SIS are identical (same failure rate) and have an equal testing interval $\tau$. Under uniformly staggered testing, the first testing time points of the $N$ units are equally spaced at $0, \tau/N, 2\tau/N, \ldots, (N-1)\tau/N$. The PFD under the uniformly staggered testing scheme in a parallel (1-out-of-N) structure, and its ratio to the PFD under proof tests with equal testing time points, are given by (Green, 1972)
$$\mathrm{PFD}_{1ooN} = \frac{N!\,(N+3)\,(\lambda\tau)^N}{4N^N(N+1)}, \qquad R_{1ooN} = \frac{N!\,(N+3)}{4N^N} \tag{8.4}$$
Therefore, the PFD of SIS with a 1-out-of-2 structure is $5(\lambda\tau)^2/24$, which is 5/8 of the PFD under equal testing time points, and the PFD of SIS with a 1-out-of-3 structure is $(\lambda\tau)^3/12$, which is 1/3 of the PFD under equal testing time points. In cases with 3 or 4 units, the PFDs of M-out-of-N redundant structures with equally staggered testing time points are given as follows (Green, 1972; Vaurio, 2011):
$$\mathrm{PFD}_{2oo3} = \frac{2(\lambda\tau)^2}{3}, \qquad \mathrm{PFD}_{2oo4} = \frac{3(\lambda\tau)^3}{8}, \qquad \mathrm{PFD}_{3oo4} = \frac{11(\lambda\tau)^2}{8} \tag{8.5}$$
8.2.2 Optimal staggered testing points: 1-out-of-2 structure
PFD can be lowered by setting the starting points of the testing intervals of the units to differ from each other through the staggered testing introduced in the previous section. In addition, it is necessary to set the optimal starting point of the testing interval to minimize the PFD and lower it further. Most existing studies on the optimal staggered test times of SIS considered the 1-out-of-2 structure.
FIGURE 8.1 Staggered testing scheme: 1-out-of-2 structure
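The reduction in PFD obtained from uniform staggering, Eq. (8.4), can be checked numerically for N = 2 and N = 3. The sketch below is an illustration under stated assumptions: it uses the linear unreliability approximation $q_i(t) = \lambda \times$ (time since the last test of unit $i$), a simple midpoint rule, and hypothetical function names.

```python
from math import factorial


def staggered_parallel_pfd_numeric(n, lam, tau, steps=12000):
    """Average of the product of unit unreliabilities over one test interval
    for a 1-out-of-n structure, with unit i first tested at i * tau / n."""
    dt = tau / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt  # midpoint of the k-th sub-interval
        prod = 1.0
        for i in range(n):
            # time elapsed since the most recent test of unit i
            prod *= lam * ((t - i * tau / n) % tau)
        total += prod * dt
    return total / tau


def staggered_parallel_pfd_formula(n, lam, tau):
    """Closed form of Eq. (8.4) for uniformly staggered testing."""
    return factorial(n) * (n + 3) * (lam * tau) ** n / (4 * n ** n * (n + 1))
```

Calling both functions with n = 2 and n = 3 reproduces the values $5(\lambda\tau)^2/24$ and $(\lambda\tau)^3/12$ quoted in Section 8.2.1 to within the integration error.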
If the testing interval of both units is $\tau$ and the times to failure of the two units follow exponential distributions with failure rates $\lambda_1$ and $\lambda_2$, the starting test points of units 1 and 2 are 0 and $t_a = c_a\tau$ $(0 < c_a < 1)$, respectively, as shown in Fig. 8.1. From Eq. (8.1), the PFD of this structure is obtained as follows (Rausand and Høyland, 2004):
$$\mathrm{PFD}_{1oo2} = 1 - \frac{1 - e^{-\lambda_1\tau}}{\lambda_1\tau} - \frac{1 - e^{-\lambda_2\tau}}{\lambda_2\tau} + \frac{1}{(\lambda_1 + \lambda_2)\tau}\left[e^{-\lambda_1 c_a\tau}\left(1 - e^{-\lambda_2\tau}\right) + e^{-\lambda_2\tau(1 - c_a)}\left(1 - e^{-\lambda_1\tau}\right)\right] \tag{8.6}$$
Only the last term depends on $c_a$. The optimal $c_a^*$ minimizing Eq. (8.6) is given by
$$c_a^* = \frac{1}{(\lambda_1 + \lambda_2)\tau}\left[\ln\frac{\lambda_1\left(1 - e^{-\lambda_2\tau}\right)}{\lambda_2\left(1 - e^{-\lambda_1\tau}\right)} + \lambda_2\tau\right] \tag{8.7}$$
In Eq. (8.7), $c_a^* \approx 1/2$ (Liu, 2014), and if the failure rates of the two units are equal ($\lambda_1 = \lambda_2$), $c_a^*$ is exactly 1/2. If we approximate the unreliabilities of the two components by $\lambda_1 t$ and $\lambda_2 t$, as in IEC 61508, the PFD is $\lambda_1\lambda_2\left[\frac{\tau^2}{3} + \frac{\tau^2}{2}\left(c_a^2 - c_a\right)\right]$, $c_a^*$ becomes 1/2, and the PFD is finally $5\lambda_1\lambda_2\tau^2/24$. Recently, Liu (2014) found that even when the testing intervals differ ($\tau_1 = m\tau_2$, where $m$ is a positive integer), the optimal $c_a^*$ is exactly 1/2 if all units have an equal failure rate, and approximately 1/2 if the failure rates differ. In the latter case, using the Taylor-series approximation of the unit unreliabilities, as in IEC 61508, $c_a^*$ is exactly 1/2.
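The behaviour of the optimal offset can be illustrated numerically. The following Python sketch assumes exponential times to failure and averages the exact product of unit unreliabilities over one test interval, with unit 1 tested at 0 and unit 2 at $c_a\tau$; it then evaluates the closed-form offset of Eq. (8.7). The function names and the failure-rate values mentioned in the usage note are assumptions for illustration.

```python
from math import exp, log


def pfd_1oo2_staggered(lam1, lam2, tau, c_a):
    """Average over one test interval of q1(t) * q2(t) for a 1-out-of-2
    structure with exponential units; unit 2 is tested at c_a * tau."""
    term1 = (1.0 - exp(-lam1 * tau)) / (lam1 * tau)
    term2 = (1.0 - exp(-lam2 * tau)) / (lam2 * tau)
    cross = (exp(-lam1 * c_a * tau) * (1.0 - exp(-lam2 * tau))
             + exp(-lam2 * tau * (1.0 - c_a)) * (1.0 - exp(-lam1 * tau)))
    return 1.0 - term1 - term2 + cross / ((lam1 + lam2) * tau)


def optimal_offset_1oo2(lam1, lam2, tau):
    """Offset c_a* of Eq. (8.7) that minimizes the staggered 1oo2 PFD."""
    ratio = (lam1 * (1.0 - exp(-lam2 * tau))) / (lam2 * (1.0 - exp(-lam1 * tau)))
    return (log(ratio) + lam2 * tau) / ((lam1 + lam2) * tau)
```

With equal failure rates the function returns exactly one half; with the hypothetical rates 2e-5 and 1e-5 per hour and tau = 1000 hours it returns a value slightly below one half, consistent with the statement that $c_a^* \approx 1/2$ for unequal failure rates.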
8.3 Staggered testing in 1-out-of-3 structure
In this section, we consider SIS with a 1-out-of-3 structure and apply the staggered testing method to reduce PFD. In particular, we extend Liu’s (2014) approach for the 1-out-of-2 structure to the 1-out-of-3 structure covered in IEC 61508, and we also deal with the staggered testing problem of SIS with different failure rates. Although IEC 61508 focuses on cases in which the failure rates of the components are the same, the failure rates of redundant units, such as shut-down valves, often differ depending on environmental conditions, even when identical units perform the same safety function (Liu, 2014). This section first deals with the case in which the testing interval is the same for all units, and the next section deals with the 2-out-of-3 structure with different testing intervals. When we consider staggered testing under an equal testing interval, three different testing time points can be assigned to the three units (Section 8.3.1). As a simpler case, two different testing time points can be assigned to the three units, so that two units share the same testing time points (Section 8.3.2).
FIGURE 8.2 Staggered testing scheme for 1- and 2-out-of-3 structure: 3 groups
In particular, the grouping method of dividing N units into two groups was proposed by Yun and Seo (2016), who studied a simple numerical example of a special case of the 2-out-of-3 structure. In this section, we also consider various grouping cases.

8.3.1 Case with three different testing time points
The failure rates of the three units are $\lambda_1 \ge \lambda_2 \ge \lambda_3$, the staggered testing consists of three groups, each group consists of one unit, and units 1, 2, and 3 start testing at time points 0, $t_a = c_a\tau$, and $t_a + t_b = (c_a + c_b)\tau$ ($0 < c_b < 1$, $c_a + c_b < 1$), respectively. The testing intervals of the three units are all equal to $\tau$, as shown in Fig. 8.2. The approximate unreliability of each unit $q_i(t)$, $i = 1, 2, 3$, can be expressed as Eq. (8.8) because $\lambda\tau \le 0.1$ in practical cases.
\[
\begin{aligned}
q_1(t) &= \lambda_1 t, && 0 \le t < \tau \\
q_2(t) &= \begin{cases} \lambda_2 (t + \tau - t_a), & 0 \le t < t_a \\ \lambda_2 (t - t_a), & t_a \le t < \tau \end{cases} \\
q_3(t) &= \begin{cases} \lambda_3 (t + \tau - t_a - t_b), & 0 \le t < t_a + t_b \\ \lambda_3 (t - t_a - t_b), & t_a + t_b \le t < \tau \end{cases}
\end{aligned} \tag{8.8}
\]
It is clear that the unit with the highest failure rate should be tested first (Hirsch, 1971). Let $\lambda_2 = h_2\lambda_1$, $\lambda_3 = h_3\lambda_1$, $0 < h_3 \le h_2 \le 1$. Then the unreliability function of the 1-out-of-3 structure is given by
\[
q_S(t) = q_1(t)\,q_2(t)\,q_3(t) =
\begin{cases}
h_2 h_3 \lambda_1^3\, t\,(t + \tau - t_a)(t + \tau - t_a - t_b), & 0 \le t < t_a \\
h_2 h_3 \lambda_1^3\, t\,(t - t_a)(t + \tau - t_a - t_b), & t_a \le t < t_a + t_b \\
h_2 h_3 \lambda_1^3\, t\,(t - t_a)(t - t_a - t_b), & t_a + t_b \le t < \tau
\end{cases} \tag{8.9}
\]
From Eqs. (8.1) and (8.9), the PFD of the 1-out-of-3 structure is:
\[
PFD = \frac{1}{\tau}\int_0^{\tau} q_S(t)\,dt = \frac{h_2 h_3 (\lambda_1\tau)^3}{6}\left[2\left(-c_a^3 + 3c_a^2 + c_b^3 - 2c_a - c_b\right) + 3c_a c_b\left(1 - c_a + c_b\right) + 1.5\right]
\]
FIGURE 8.3 PFD shape in 1-out-of-3 structure with three different groups ($\lambda_1\tau = 0.1$, $h_2 = 0.5$, $h_3 = 0.1$; axes: $c_a$, $c_b$, PFD)
If we take the first derivatives with respect to $c_a$ and $c_b$ and set them to 0, the following simultaneous equations are obtained.
\[
\frac{\partial PFD}{\partial c_a} = \frac{h_2 h_3 (\lambda_1\tau)^3}{6}\left[-6c_a\left(c_a + c_b - 2\right) + \left(3c_b^2 + 3c_b - 4\right)\right] = 0
\]
\[
\frac{\partial PFD}{\partial c_b} = \frac{h_2 h_3 (\lambda_1\tau)^3}{6}\left[-3c_a\left(c_a - 2c_b - 1\right) + 2\left(3c_b^2 - 1\right)\right] = 0
\]
Several solutions satisfy these equations simultaneously, but the solution that minimizes the PFD (the Hessian matrix is positive definite) is $c_a^* = 1/3$, $c_b^* = 1/3$, and the minimum value of the PFD is
\[
PFD^* = \frac{h_2 h_3 (\lambda_1\tau)^3}{12} \tag{8.10}
\]
Fig. 8.3 shows that the PFD is a convex function. That is, regardless of whether the failure rates of the units are the same or not, the uniformly staggered testing scheme is optimal, just as it is in the case of the 1-out-of-2 structure.
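The optimum in Eq. (8.10) can be checked numerically with a coarse grid search over the feasible region $c_a > 0$, $c_b > 0$, $c_a + c_b < 1$. The sketch below is only an illustration: the function names and the grid resolution are assumptions, and the parameter values are those quoted for Fig. 8.3.

```python
def pfd_1oo3(ca, cb, lam1_tau, h2, h3):
    """PFD of the 1-out-of-3 structure with three test groups (linearized model)."""
    bracket = (2 * (-ca**3 + 3 * ca**2 + cb**3 - 2 * ca - cb)
               + 3 * ca * cb * (1 - ca + cb) + 1.5)
    return h2 * h3 * lam1_tau**3 * bracket / 6


def grid_minimum(lam1_tau, h2, h3, steps=60):
    """Brute-force search on the grid ca = i/steps, cb = j/steps with ca + cb < 1."""
    best = None
    for i in range(1, steps):
        for j in range(1, steps - i):
            ca, cb = i / steps, j / steps
            value = pfd_1oo3(ca, cb, lam1_tau, h2, h3)
            if best is None or value < best[0]:
                best = (value, ca, cb)
    return best


# Parameter values of Fig. 8.3: lam1*tau = 0.1, h2 = 0.5, h3 = 0.1.
best_pfd, best_ca, best_cb = grid_minimum(0.1, 0.5, 0.1)
```

Because `steps` is a multiple of 3, the point (1/3, 1/3) lies on the grid, so the search can land exactly on the uniformly staggered scheme.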
FIGURE 8.4 Staggered testing schemes for 1- and 2-out-of-3 structures: 2 groups. (a) {1, 2} and {3}; (b) {1} and {2, 3}

8.3.2 Case with two different testing time points
In the previous section, we assigned different testing time points to the three units. In this section, we consider two different testing time points as a simpler case and apply the staggered testing schemes shown in Fig. 8.4 by combining two units into one group. That is, Fig. 8.4a shows the case where the two units with high failure rates are combined into a group, and Fig. 8.4b shows the case where the two units with low failure rates are combined into a group.
When units 1 and 2, which have high failure rates, are formed into a group and tested first, the unreliability function of the 1-out-of-3 structure (Fig. 8.4a) is given as follows:
\[
q_S(t) =
\begin{cases}
h_2 h_3 \lambda_1^3\, t^2 (t + \tau - t_a), & 0 \le t < t_a \\
h_2 h_3 \lambda_1^3\, t^2 (t - t_a), & t_a \le t < \tau
\end{cases} \tag{8.11}
\]
From Eqs. (8.1) and (8.11), the PFD of the 1-out-of-3 structure is given by
\[
PFD = \frac{1}{\tau}\int_0^{\tau} q_S(t)\,dt = \frac{h_2 h_3 (\lambda_1\tau)^3}{6}\left(2c_a^3 - 2c_a + 1.5\right)
\]
By setting the first derivative with respect to $c_a$ to 0, we find the solution $c_a^* = 1/\sqrt{3}$, which is optimal because the function is convex. Unlike in the case of the 1-out-of-2 structure, however, the uniformly staggered testing scheme is not optimal. From this optimal solution, the minimum PFD is given by
\[
PFD^* = \frac{h_2 h_3 (\lambda_1\tau)^3\left(27 - 8\sqrt{3}\right)}{108} \tag{8.12}
\]
As shown in Fig. 8.4b, when units 2 and 3 are combined into a group and tested at the same time points, the optimal $c_a^*$ can be obtained similarly; the optimal value is $1 - 1/\sqrt{3}$, and the PFD is equal to Eq. (8.12). Thus, the optimal $c_a^*$ differs between the two cases, but the optimal value of the PFD is the same regardless of the grouping method. When Eqs. (8.10) and (8.12) are compared, the former is better than the latter in terms of PFD because the ratio is $1 : (27 - 8\sqrt{3})/9 \approx 1 : 1.46$, but the reverse is true in terms of the number of testing time points (3:2). Thus, we should trade off the testing cost against the system availability to select the best staggered testing scheme.
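The comparison between Eqs. (8.10) and (8.12) can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the polynomial for grouping (b) is derived here from Eq. (8.8) in the same way as the one for grouping (a), since the text only says it "can be obtained similarly".

```python
import math


def pfd_1oo3_group_a(ca, lam1_tau, h2, h3):
    """Grouping {1, 2} and {3}: PFD from Eqs. (8.1) and (8.11)."""
    return h2 * h3 * lam1_tau**3 * (2 * ca**3 - 2 * ca + 1.5) / 6


def pfd_1oo3_group_b(ca, lam1_tau, h2, h3):
    """Grouping {1} and {2, 3}: polynomial derived here (assumption, not quoted)."""
    return h2 * h3 * lam1_tau**3 * (-2 * ca**3 + 6 * ca**2 - 4 * ca + 1.5) / 6


x, h2, h3 = 0.1, 0.5, 0.1            # illustrative values
pfd_a = pfd_1oo3_group_a(1 / math.sqrt(3), x, h2, h3)
pfd_b = pfd_1oo3_group_b(1 - 1 / math.sqrt(3), x, h2, h3)
pfd_three_groups = h2 * h3 * x**3 / 12          # Eq. (8.10)
ratio = pfd_a / pfd_three_groups                # about 1.46
```

Both groupings reach the same minimum PFD, and the ratio to the three-group optimum matches $(27 - 8\sqrt{3})/9$.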
8.4 Staggered testing in 2-out-of-3 structure

8.4.1 Case with three different testing time points
From the unreliability $q_i(t)$, $i = 1, 2, 3$, of the three units in Section 8.3.1, the system unreliability of the 2-out-of-3 structure is given as
\[
q_S(t) = q_1(t)q_2(t) + q_1(t)q_3(t) + q_2(t)q_3(t) - 2q_1(t)q_2(t)q_3(t) \tag{8.13}
\]
Thus, the PFD can be obtained from Eqs. (8.1) and (8.13). Since the optimal $c_a$ and $c_b$ cannot be obtained in closed form, unlike in the case of the 1-out-of-3 structure, we must find the optimal solutions numerically, and we consider some special cases in this section.
(1) Case with equal failure rate
First, when the failure rates are the same ($\lambda_1 = \lambda_2 = \lambda_3 = \lambda$), from Eqs. (8.8) and (8.13), the PFD of the 2-out-of-3 structure is given by
\[
PFD = (\lambda\tau)^2\left(c_a^2 + c_b^2 + c_a c_b - c_a - c_b + 1\right) + \frac{(\lambda\tau)^3}{6}\left[2\left(2c_a^3 + 3c_a^2 c_b - 3c_a c_b^2 - 2c_b^3\right) - 6\left(2c_a^2 + c_a c_b\right) + 4\left(2c_a + c_b\right) - 3\right]
\]
Then, by setting the first partial derivatives of the PFD with respect to $c_a$ and $c_b$ to 0, we find the optimal values $c_a^* = 1/3$ and $c_b^* = 1/3$ under $\lambda_1\tau < 1$ (refer to Fig. 8.5), so the uniformly staggered testing method is optimal. The optimal PFD is obtained as follows.
\[
PFD^* = \frac{(\lambda\tau)^2 (4 - \lambda\tau)}{6} \tag{8.14}
\]
(2) The equally staggered testing intervals ($c_a = c_b$)
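When the failure rates of the units are different, the PFD under $c_a = c_b$ can be obtained from Eq. (8.13) as follows.
\[
PFD = \frac{(\lambda_1\tau)^2}{6}\left[3c_a^2\left(h_2 + 4h_3 + h_2 h_3\right) - 3c_a\left(h_2 + 2h_3 + h_2 h_3\right) + 2\left(h_2 + h_3 + h_2 h_3\right)\right] - \frac{h_2 h_3 (\lambda_1\tau)^3}{2}\left(6c_a^2 - 4c_a + 1\right)
\]
By setting the first derivative of the PFD to 0, the optimal $c_a^*$ is obtained as follows.
\[
c_a^* = \frac{h_2 h_3 (1 - 4\lambda_1\tau) + h_2 + 2h_3}{2\left[h_2 h_3 (1 - 6\lambda_1\tau) + h_2 + 4h_3\right]} = \frac{h_2 h_3 (1 - 4\lambda_1\tau) + h_2 + 2h_3}{2\left[h_2 h_3 (1 - 4\lambda_1\tau) + h_2 + 2h_3 + 2h_3 (1 - h_2\lambda_1\tau)\right]} \tag{8.15}
\]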
When $h_2\lambda_1\tau < 1$, $c_a^* < 1/2$. In particular, when the failure rates are equal ($h_2 = h_3 = 1$), $c_a^* = 1/3$ and the uniformly staggered testing method is optimal (refer to Fig. 8.6).
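Eq. (8.15) and the PFD expression under $c_a = c_b$ are easy to evaluate directly. The following Python sketch (function names are illustrative assumptions) reproduces the three-group column of Table 8.1 for $\lambda_1\tau = 0.1$.

```python
def optimal_ca_equal_stagger(lam1_tau, h2, h3):
    """Optimal c_a* under c_a = c_b, Eq. (8.15)."""
    numerator = h2 * h3 * (1 - 4 * lam1_tau) + h2 + 2 * h3
    denominator = 2 * (h2 * h3 * (1 - 6 * lam1_tau) + h2 + 4 * h3)
    return numerator / denominator


def pfd_2oo3_equal_stagger(ca, lam1_tau, h2, h3):
    """PFD of the 2-out-of-3 structure under c_a = c_b (linearized model)."""
    second_order = lam1_tau**2 / 6 * (
        3 * ca**2 * (h2 + 4 * h3 + h2 * h3)
        - 3 * ca * (h2 + 2 * h3 + h2 * h3)
        + 2 * (h2 + h3 + h2 * h3)
    )
    third_order = h2 * h3 * lam1_tau**3 * (6 * ca**2 - 4 * ca + 1) / 2
    return second_order - third_order


results = {}
for h2, h3 in [(1, 1), (1, 0.5), (0.5, 0.5), (0.5, 0.1)]:
    ca = optimal_ca_equal_stagger(0.1, h2, h3)
    results[(h2, h3)] = (ca, pfd_2oo3_equal_stagger(ca, 0.1, h2, h3))
```

For example, $(h_2, h_3) = (1, 0.5)$ gives $c_a^* \approx 0.35938$ and a PFD of about 0.004350, and equal failure rates give $c_a^* = 1/3$ with PFD 0.0065, which agrees with Eq. (8.14).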
FIGURE 8.5 PFD shape in 2-out-of-3 structure with three different groups ($\lambda_1\tau = 0.1$, $h_2 = 1$, $h_3 = 1$; axes: $c_a$, $c_b$, PFD)

8.4.2 Case with two different testing time points
First, when the failure rates of all units are the same and units 1 and 2 are tested early as a group in the staggered testing method (Fig. 8.4a), the PFD is given by
\[
PFD = \frac{(\lambda\tau)^2}{6}\left[\lambda\tau\left(-4c_a^3 + 4c_a - 3\right) + 6c_a(c_a - 1) + 6\right]
\]
By differentiating with respect to $c_a$, the optimal value can be obtained as follows.
\[
c_a^* = \frac{3 - \sqrt{12(\lambda\tau)^2 - 18(\lambda\tau) + 9}}{6(\lambda\tau)} \tag{8.16}
\]
Fig. 8.7 shows the PFD curve in this case. Fig. 8.8a shows the value of $c_a^*$ as a function of $\lambda\tau$ for the grouping {1, 2} and {3}. This value is close to, but slightly smaller than, 1/2. When unit 1 is tested at time 0 and units 2 and 3 are tested late as a group (grouping {1} and {2, 3}), the optimal $c_a^*$ is as follows.
\[
c_a^* = 1 - \frac{3 - \sqrt{12(\lambda\tau)^2 - 18(\lambda\tau) + 9}}{6(\lambda\tau)} \tag{8.17}
\]
FIGURE 8.6 PFD curve in 2-out-of-3 structure ($c_a = c_b$).
Then, Fig. 8.8b shows that the optimal $c_a^*$ is close to, but slightly larger than, 1/2. If we let $c_a$ be approximately 1/2, the minimum PFD is given by
\[
PFD^* \approx \frac{(\lambda\tau)^2 (3 - \lambda\tau)}{4} \tag{8.18}
\]
When the failure rates of the units are different and units 1 and 2, which have high failure rates, are formed into a group and tested early, the PFD is given by
\[
PFD = \frac{(\lambda_1\tau)^2}{6}\left[\lambda_1\tau h_2 h_3\left(-4c_a^3 + 4c_a - 3\right) + 3h_3 (h_2 + 1)\,c_a (c_a - 1) + 2\left(h_2 h_3 + h_2 + h_3\right)\right] \tag{8.19}
\]
FIGURE 8.7 PFD curve in 2-out-of-3 structure with 2 groups (1+2:3)
By setting the first derivative with respect to $c_a$ to 0, the optimal $c_a^*$ can be obtained as follows (refer to Fig. 8.9):
\[
c_a^* = \frac{3(h_2 + 1) - \sqrt{3h_2^2\left[16(\lambda_1\tau)^2 - 12(\lambda_1\tau) + 3\right] - 18h_2\left[2(\lambda_1\tau) - 1\right] + 9}}{12h_2(\lambda_1\tau)} \tag{8.20}
\]
From Eq. (8.20), we see that the optimal $c_a^*$ does not depend on $h_3$. As a special case, we assume that units 1 and 2 have the same failure rate ($h_2 = 1$) and are combined into a group. In this case, the PFD can be expressed in the following simplified form.
\[
PFD = \frac{(\lambda_1\tau)^2}{6}\left[\lambda_1\tau h_3\left(-4c_a^3 + 4c_a - 3\right) + 6h_3 c_a (c_a - 1) + 2(2h_3 + 1)\right]
\]
The optimal $c_a^*$ is given by
\[
c_a^* = \frac{3 - \sqrt{12(\lambda_1\tau)^2 - 18(\lambda_1\tau) + 9}}{6(\lambda_1\tau)} \tag{8.21}
\]
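A short Python sketch can confirm that Eq. (8.20) collapses to Eq. (8.21) when $h_2 = 1$ and that the optimum is a stationary point of Eq. (8.19). The function names and the finite-difference step are illustrative assumptions.

```python
import math


def optimal_ca_group_high(lam1_tau, h2):
    """Optimal c_a* when units 1 and 2 form the early group, Eq. (8.20)."""
    x = lam1_tau
    disc = 3 * h2**2 * (16 * x**2 - 12 * x + 3) - 18 * h2 * (2 * x - 1) + 9
    return (3 * (h2 + 1) - math.sqrt(disc)) / (12 * h2 * x)


def optimal_ca_h2_equal_one(lam1_tau):
    """Special case h2 = 1, Eq. (8.21)."""
    x = lam1_tau
    return (3 - math.sqrt(12 * x**2 - 18 * x + 9)) / (6 * x)


def pfd_group_high(ca, lam1_tau, h2, h3):
    """PFD of Eq. (8.19)."""
    x = lam1_tau
    return x**2 / 6 * (x * h2 * h3 * (-4 * ca**3 + 4 * ca - 3)
                       + 3 * h3 * (h2 + 1) * ca * (ca - 1)
                       + 2 * (h2 * h3 + h2 + h3))


ca_general = optimal_ca_group_high(0.1, 1.0)
ca_special = optimal_ca_h2_equal_one(0.1)

# Central finite difference of Eq. (8.19) at the optimum (should be close to zero).
eps = 1e-6
slope = (pfd_group_high(ca_general + eps, 0.1, 1.0, 0.5)
         - pfd_group_high(ca_general - eps, 0.1, 1.0, 0.5)) / (2 * eps)
```

At $\lambda_1\tau = 0.1$ both formulas give a value of about 0.49075, slightly below 1/2.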
FIGURE 8.8 2-out-of-3 structure: $c_a^*$ for case with two testing groups (panels a and b)

FIGURE 8.9 PFD curve in 2-out-of-3 structure with 2 groups (1: 2+3)
As another special case, we consider the case in which units 2 and 3, which have low failure rates, are combined into a group and the group is tested late. The PFD and the optimal $c_a^*$ can be obtained as
\[
PFD = \frac{(\lambda_1\tau)^2}{6}\left[\lambda_1\tau h_2 h_3\left(4c_a^3 - 12c_a^2 + 8c_a - 3\right) + 3(h_2 + h_3)\,c_a (c_a - 1) + 2\left(h_2 h_3 + h_2 + h_3\right)\right] \tag{8.22}
\]
\[
c_a^* = 1 - \frac{3(h_2 + h_3) - \sqrt{3h_2^2\left[16(h_3\lambda_1\tau)^2 - 12(h_3\lambda_1\tau) + 3\right] - 18h_2 h_3\left[2(h_3\lambda_1\tau) - 1\right] + 9h_3^2}}{12h_2(h_3\lambda_1\tau)} \tag{8.23}
\]
Here, the difference between Eqs. (8.19) and (8.22) for the same value of $c_a$ is given by
\[
\Delta = \frac{(\lambda_1\tau)^2}{6}\,h_2 c_a (1 - c_a)\left[4\lambda_1\tau h_3 (2c_a - 1) + 3(1 - h_3)\right] \tag{8.24}
\]
Since the value of $c_a$ is close to 1/2, the difference in Eq. (8.24) is likely to be positive, so in the 2-out-of-3 structure it is more advantageous to group the two units with low failure rates, unlike in the 1-out-of-3 structure. We investigate this trend numerically in the next section.
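The identity behind Eq. (8.24) can be verified numerically by subtracting Eq. (8.22) from Eq. (8.19). The Python sketch below does this for one illustrative parameter set; the function names are assumptions introduced here.

```python
def pfd_group_high(ca, x, h2, h3):
    """Eq. (8.19): units 1 and 2 grouped and tested early (x = lam1 * tau)."""
    return x**2 / 6 * (x * h2 * h3 * (-4 * ca**3 + 4 * ca - 3)
                       + 3 * h3 * (h2 + 1) * ca * (ca - 1)
                       + 2 * (h2 * h3 + h2 + h3))


def pfd_group_low(ca, x, h2, h3):
    """Eq. (8.22): units 2 and 3 grouped and tested late."""
    return x**2 / 6 * (x * h2 * h3 * (4 * ca**3 - 12 * ca**2 + 8 * ca - 3)
                       + 3 * (h2 + h3) * ca * (ca - 1)
                       + 2 * (h2 * h3 + h2 + h3))


def pfd_difference(ca, x, h2, h3):
    """Eq. (8.24): Eq. (8.19) minus Eq. (8.22) at the same c_a."""
    return x**2 / 6 * h2 * ca * (1 - ca) * (4 * x * h3 * (2 * ca - 1) + 3 * (1 - h3))


ca, x, h2, h3 = 0.5, 0.1, 1.0, 0.5   # illustrative values
direct = pfd_group_high(ca, x, h2, h3) - pfd_group_low(ca, x, h2, h3)
closed_form = pfd_difference(ca, x, h2, h3)
```

At $c_a = 1/2$ the first term in the bracket vanishes, so the sign of the difference is governed by $3(1 - h_3)$, which is nonnegative whenever $h_3 \le 1$.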
TABLE 8.1 The optimal $c_a^*$ and PFD* in cases with different values of $h_2$ and $h_3$

Grouping       {1, 2} and {3}          {1} and {2, 3}          {1}, {2}, and {3}
(h2, h3)       ca*        PFD*         ca*        PFD*         ca* = cb*    PFD*
(1, 1)         0.49075    0.007249     0.50925    0.007249     1/3          0.006500
(1, 0.5)       0.49075    0.005291     0.50595    0.004666     0.35938      0.004350
(0.5, 0.5)     0.49405    0.003167     0.50438    0.002854     0.31731      0.002733
(0.5, 0.1)     0.49405    0.001967     0.50141    0.001404     0.39674      0.001418
8.4.3 Numerical examples
Table 8.1 summarizes the optimal values of $c_a$ and the PFD for different values of $(h_2, h_3)$ with $\lambda_1\tau = 0.1$. The table shows that it is better to form a group from the two units with low failure rates. Fig. 8.10 shows the difference in Eq. (8.24) for various values of $h_3$ from 0.01 to 1 under $h_2 = 1$. The optimal $c_a^*$ in the two grouping cases is close to 1/2: when the two units with low failure rates are grouped, it is slightly greater than 0.5, whereas when the two units with high failure rates are grouped, it is slightly smaller than 0.5. Table 8.1 shows that the different grouping methods do not produce considerable gaps in the PFD, and grouping can save testing cost. When $\lambda_3$ is relatively small, the PFD of the grouping case is even less than that of the non-grouping case under the restriction $c_a = c_b$ (refer to the figures in the red line of the difference).

FIGURE 8.10 Difference of PFD for type of two group formations
8.5 Conclusions
In this chapter, we considered staggered testing problems in order to determine the optimal testing time points in SIS with redundant units. In particular, we studied staggered testing schemes that reduce the system PFD by assigning different testing time points to the safety units. If testing times are assumed to be negligible, the PFD defined in IEC 61508 can be reduced by shifting the testing periods and applying different testing time points to the redundant units in SIS. In existing papers on staggered testing, the redundant units have the same failure rates, equal testing intervals, and the same patterns in testing cycles. Thus, we tried to improve the testing scheme to reduce the PFD of SIS in this chapter. As an exception, the optimality of staggered testing in SIS with a 1-out-of-2 structure and different failure rates has been investigated in detail by Liu (2014). This chapter studied the optimality of staggered testing in 1- and 2-out-of-3 structures, which are the main target redundant systems of IEC 61508. We reviewed the optimality of current simple testing schemes and found the optimal staggered testing time points. As a practical and simple scheme to decrease the number of testing time points, we considered combining two units into a test group in SIS with three units and investigated promising testing time points to minimize the PFD. We studied some numerical examples to check the system PFD of different staggered test schemes. For further study, we can consider different testing intervals for SIS with three units and also study cost models for staggered testing schemes. As more practical cases, we can also consider non-negligible test times when optimizing the staggered testing methods to minimize the PFD.
References
Contini, S., Copelli, S., Raboni, M., Torretta, V., Cattaneo, C.S., Rota, R., 2013. IEC 61508: Effect of test policy on the probability of failure on demand of safety instrumented systems. Chem. Eng. Trans. 33, 487–492.
Green, A.E., Bourne, A.J., 1972. Reliability Technology. Wiley, New Jersey (USA).
Hirsch, M., 1971. Setting test intervals and allowable bypass times as a function of protection system goals. IEEE Trans. Nucl. Sci. 18, 488–494.
IEC 61508, 2000. Functional Safety of Electrical/Electronic/Programmable Electronic (E/E/PE) Safety Related Systems, 1st edition. IEC, Switzerland. Part 1–7.
IEC 61508, 2010. Functional Safety of Electrical/Electronic/Programmable Electronic (E/E/PE) Safety Related Systems, 2nd edition. IEC, Switzerland. Part 1–7.
ISO 26262, 2011. Road Vehicles-Functional Safety. ISO, Switzerland. Part 1–10.
Jahanian, H., 2015. Generalizing PFD formulas of IEC 61508 for KooN configurations. ISA Transactions 55, 168–174.
Liu, L., 2014. Optimal staggered testing strategies for heterogeneously redundant safety systems. Reliab. Eng. Syst. Safe. 126, 65–71.
Liu, Y.L., Rausand, M., 2013. Reliability effects of testing strategies on safety-instrumented systems in different demand modes. Reliab. Eng. Syst. Safe. 119, 235–243.
Oliveira, L.F., Abramovitch, R.N., 2010. Extension of ISA TR84.00.02 PFD equations to KooN architectures. Reliability Engineering and System Safety 95, 707–715.
Rausand, M., Høyland, A., 2004. System Reliability: Models, Statistical Methods and Applications, 2nd ed. Wiley.
Rausand, M., 2014. Reliability of Safety-Critical Systems. Wiley.
Rouvroye, J.L., Wiegerinck, J.A.M., 2006. Minimizing costs while meeting safety requirements: modeling deterministic (imperfect) staggered tests using standard Markov models for SIL calculations. ISA Transactions 45, 611–621.
Seo, S.-K., 2012. On reliability performance of safety instrumented systems with common cause failures in IEC 61508 standard. IE Interfaces 25, 405–415 (in Korean).
Vaurio, J.K., 1980. Availability of redundant safety systems with common-mode and undetected failures. Nucl. Eng. Des. 58, 415–424.
Vaurio, J.K., 2011. Unavailability equations of k-out-of-n systems. Reliab. Eng. Syst. Safe. 96, 350–352.
Yun, W.Y., Seo, S.K., 2016. Optimum staggered testing for redundant safety systems. IEICE Technical Committee on Reliability Conference 19–25.
Non-Print Items
Abstract
In redundant safety instrumented systems (SIS), staggered testing is more effective than simultaneous or consecutive testing in reducing the unavailability, or average probability of failure on demand (PFD), of the SIS. In this chapter, the existing models are reviewed, SIS with 1- and 2-out-of-3 structures are analyzed, and the optimal staggered testing schemes are found. As a practical case, the optimal staggered testing method is investigated for SIS in which two of the three units are formed into a testing group with equal testing time points. We also investigate the optimality of the grouping method. Finally, we study numerical examples.
Keywords
Functional safety; High-demand operating mode; Markov chain; Optimal staggered testing; Safety instrument system
Chapter 9
Modified failure modes and effects analysis model for critical and complex repairable systems
Garima Sharma and Rajiv Nandan Rai
Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00016-7 Copyright © 2021 Elsevier Inc. All rights reserved.

9.1 Introduction
Failure modes and effects analysis (FMEA) is a systematic, iterative documentation process performed to identify basic faults at the part level and determine their effects at higher levels of assembly. It is a bottom-up approach and provides thorough insight into the system's strengths and weaknesses by considering various failure modes. The failure modes are ranked according to the combined influence of severity, detection, and probability of occurrence. In other words, FMEA analyzes the potential failure modes to determine their effects on the system and classifies them based on their severity and criticality. The benefit of FMEA is the identification of problems in the reliability design of the system, which must be eliminated or whose effects should be minimized by design modification or tradeoffs. Information and knowledge gained by performing the FMEA can also be used as a basis for troubleshooting activities, maintenance, manual development, and the design of effective built-in test techniques. The FMEA can be both qualitative and quantitative (MIL_STD_1629A, 1984). Generally, industries prioritize the failure modes (FMs) using the risk priority number (RPN), which is the product of the severity (S) of the failure, the probability of occurrence of the failure (O), and detection (D) before the failure occurs. All three are assigned a value on a scale of 1 to 10. A high RPN represents a top priority for improvement. The Reliability Analysis Center [CRTA–FMECA] (Borgovini et al., 1993) identifies the RPN calculation as an alternate method for the criticality analysis of MIL-STD-1629A.
Over the past few decades, different methods of FMEA have been used effectively for nonrepairable systems (Ebeling, 2004), but they fall short in addressing the issues of repairable systems (Rigdon and Basu, 2000) as a reliability and systems engineering tool to identify and mitigate risks arising from frequent failures. Maintenance of complex systems such as aero engines or nuclear power plants has been a challenging task (Sharma and Rai, 2019) for maintenance engineers, as high skill and expertise are required to accomplish these tasks proficiently. Also, the subsystems of such systems are interrelated, and each is needed for the proper functioning of the entire system. On failure, these subsystems are generally subjected to imperfect repairs, which means that the repair actions bring the component to a state between the new state and the state prior to failure.
In this chapter, we will learn to conduct FMEA and estimate the RPN for critical and complex repairable systems that are subjected to imperfect repair, using the fuzzy analytic hierarchy process (fuzzy-AHP). The basics of repairable systems, imperfect repair, fuzzy-AHP, and RPN are explained in the subsequent sections. A brief description of the steps of the methodology is given as follows:
Step 1: Identify the important decision criteria that could be considered as failure modes for a repairable system. For example, one of the most important criteria that can affect the quality of corrective maintenance (CM)/repair could be the SKILL of the repairmen.
Step 2: Model these failure modes through the Generalized Renewal Process (GRP) (Tanwar et al., 2014) and estimate the repair effectiveness index (REI) for each criterion.
Step 3: Determine the importance/priority weights of all selected decision criteria through pairwise comparisons with the help of fuzzy-AHP (Chang, 1996a). This is a technique for converting the vagueness of human perceptions and multiple decision-making capability into a mathematical model. The fuzzy-AHP is the fuzzy extension of AHP to efficiently handle the fuzziness of the data involved in the decision.
Step 4: Introduce the REIs for all the criteria and the priority weights into the FMEA model.
Conventional FMEA (Ebeling, 2004) estimates the RPN as the product of S, O, and D. In the existing practice of carrying out FMEA, efforts to reduce the RPN rely mainly on incorporating detection measures. Reducing O requires advanced reliability improvement measures such as design modifications, which are time consuming and expensive and affect component availability. Hence, this chapter presents a methodology to estimate the RPN by incorporating the effect of the REI and the priority weights, extending the concept presented in Rai and Bolia (2015). The chapter also provides remedial measures for improving the RPN. The potential of the FMEA model is explained through a case study of the space station's environmental control and life support system (ECLSS).
9.2 Repairable Systems and Imperfect Repair
A system is defined as repairable if it is repaired and not replaced when a failure occurs. A system after repair can be assumed to be restored to (1) an 'as good as new' (AGAN) or (2) an 'as bad as old' (ABAO) condition (Brown and Proschan, 1983). The AGAN and ABAO conditions can be easily modeled through the perfect renewal process (PRP) and the nonhomogeneous Poisson process (NHPP), respectively. However, many maintenance activities may not realistically result in either of these two extreme situations but in a more complicated intermediate one. That is, when the system is maintained correctively or preventively, its failure rate is restored to somewhere between the AGAN and ABAO conditions. To model this intermediate condition, the GRP model (Kijima, 1989) provides a third parameter, called the REI and denoted by $q$ (Yanez et al., 2002), along with the shape and scale parameters, such that: $q = 0$ means the component is brought to the AGAN condition; $q = 1$ means the component is restored to the ABAO condition; and values of $q$ in the interval $0 < q < 1$ represent an after-repair state in which the condition of the system is better than old but worse than new. Therefore, $q$ can be physically interpreted as an index representing the effectiveness and quality of repair. We use GRP for the present methodology since it provides the most practical and realistic approach to dealing with repairable systems.
The concept of GRP and the maximum likelihood estimators (MLEs) for estimation of the model parameters have been discussed at length (Rai and Bolia, 2014; Rai and Sharma, 2017; Bolia and Rai, 2013). The same idea forms part of the methodology of this chapter and is therefore reproduced in this section for ready reference. Kijima and Sumita (1986) developed an imperfect repair model by utilizing the concept of the virtual age ($V_n$). Suppose $V_n$ is the estimated age of the system immediately after the $n$th repair.
If $V_n = y$, then the system has a time to the $(n + 1)$th failure, $X_{n+1}$, which is distributed according to the following cumulative distribution function (cdf):
\[
F(x \mid V_n = y) = \frac{F(x + y) - F(y)}{1 - F(y)} \tag{9.1}
\]
where $F(x)$ is the cdf of a system or component. It is assumed that the $n$th repair would only compensate for the damage accumulated during the time between the $(n - 1)$th and the $n$th failure. Hence, the virtual age of the system after the $n$th repair is:
\[
V_n = V_{n-1} + qX_n \tag{9.2}
\]
where $q$ is the REI, $V_0 = 0$, and $X_n$ is the time between the $(n - 1)$th and the $n$th failure. The values of $q$ fall in the interval $0 \le q \le 1$, representing the various after-repair states explained earlier. In practice, however, the $n$th repair action could reduce all the damage accumulated up to the $n$th failure, yielding the Kijima model II (KII) for the virtual age:
\[
V_n = q\left(V_{n-1} + X_n\right) \tag{9.3}
\]
Note: In this chapter we consider the KI model for our methodology. Let $t_i$ be the observed time to the $i$th failure, let $x_i$ be the observed interarrival time between the $(i - 1)$th and $i$th failures, and let $a$ and $b$ represent the scale and shape parameters, respectively. The following pdf and cdf can then be derived (Yanez et al., 2002):
\[
f(t_i \mid t_{i-1}) = ab\left(V_{i-1} + x_i\right)^{b-1} \exp\left\{a\left[\left(V_{i-1}\right)^b - \left(V_{i-1} + x_i\right)^b\right]\right\} \tag{9.4}
\]
\[
F(t_i \mid t_{i-1}) = 1 - \exp\left\{a\left[\left(V_{i-1}\right)^b - \left(V_{i-1} + x_i\right)^b\right]\right\} \tag{9.5}
\]
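It is then possible to derive the MLEs for the parameters $a$, $b$, and $q$ from the data. The likelihood function (Yanez et al., 2002) is given in Eq. (9.6).
\[
L = \prod_{i=1}^{n} ab\left(V_{i-1} + x_i\right)^{b-1} \exp\left\{a\left[\left(V_{i-1}\right)^b - \left(V_{i-1} + x_i\right)^b\right]\right\} \tag{9.6}
\]
Taking the logarithm of both sides,
\[
\ln L = n\ln b + n\ln a + (b - 1)\sum_{i=1}^{n} \ln\left(V_{i-1} + x_i\right) + a\sum_{i=1}^{n}\left[\left(V_{i-1}\right)^b - \left(V_{i-1} + x_i\right)^b\right] \tag{9.7}
\]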
The logarithm of the likelihood function (Eq. (9.7)) is differentiated with respect to each of the three parameters $a$, $b$, and $q$, and after equating each derivative to zero, a system of three equations in three unknowns is obtained. To obtain the values of $a$, $b$, and $q$, the nonlinear Eqs. (9.8)–(9.10) shown below must be solved. Alternatively, the model parameters can be obtained by maximizing the log-likelihood function in Eq. (9.7) directly.
\[
\frac{\partial \ln L}{\partial b} = \frac{n}{b} + a\sum_{i=1}^{n}\left[\left(V_{i-1}\right)^b \ln\left(V_{i-1}\right) - \left(V_{i-1} + x_i\right)^b \ln\left(V_{i-1} + x_i\right)\right] + \sum_{i=1}^{n} \ln\left(V_{i-1} + x_i\right) = 0 \tag{9.8}
\]
\[
\frac{\partial \ln L}{\partial a} = \frac{n}{a} + \sum_{i=1}^{n}\left[\left(V_{i-1}\right)^b - \left(V_{i-1} + x_i\right)^b\right] = 0 \tag{9.9}
\]
\[
\frac{\partial \ln L}{\partial q} = (b - 1)\sum_{i=1}^{n} \frac{\sum_{j=1}^{i-1} x_j}{q\sum_{j=1}^{i-1} x_j + x_i} + a\sum_{i=1}^{n}\left[b\left(q\sum_{j=1}^{i-1} x_j\right)^{b-1}\left(\sum_{j=1}^{i-1} x_j\right) - b\left(q\sum_{j=1}^{i-1} x_j + x_i\right)^{b-1}\left(\sum_{j=1}^{i-1} x_j\right)\right] = 0 \tag{9.10}
\]
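For readers who prefer to maximize Eq. (9.7) numerically rather than solve Eqs. (9.8)–(9.10), the log-likelihood under the KI model can be coded in a few lines. The following Python sketch is illustrative only: the function names and the sample interarrival times are assumptions, and any general-purpose optimizer could be applied to the returned value.

```python
import math


def virtual_ages_kijima1(interarrivals, q):
    """Virtual ages V_0, V_1, ..., V_n under Kijima model I, Eq. (9.2)."""
    ages = [0.0]
    for x in interarrivals:
        ages.append(ages[-1] + q * x)
    return ages


def grp_log_likelihood(interarrivals, a, b, q):
    """Log-likelihood of Eq. (9.7) for scale a, shape b, and REI q."""
    ages = virtual_ages_kijima1(interarrivals, q)
    n = len(interarrivals)
    total = n * math.log(b) + n * math.log(a)
    for i, x in enumerate(interarrivals):
        v_prev = ages[i]                      # V_{i-1} for the i-th failure
        total += (b - 1) * math.log(v_prev + x)
        total += a * (v_prev**b - (v_prev + x)**b)
    return total


# Hypothetical interarrival times between failures (arbitrary time units).
sample = [10.0, 20.0, 30.0]
ages_half = virtual_ages_kijima1(sample, 0.5)
loglik_exponential = grp_log_likelihood(sample, 0.05, 1.0, 0.0)
```

With $b = 1$ and $q = 0$ the model reduces to a renewal process with exponential interarrival times, so the log-likelihood equals $n\ln a - a\sum x_i$, which gives a simple sanity check.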
9.3 Fuzzy AHP
Fuzzy-AHP is a multicriteria decision-making method (Tang and Lin, 2011) that utilizes the characteristics of both the AHP method and fuzzy set theory. AHP provides the weights of hierarchically ordered criteria using pairwise comparisons (Celik et al., 2009), and fuzzy logic removes the fuzziness from the pairwise weights assigned by the decision-makers for the final selection among the given alternatives. Fuzzy set theory uses the concept of a membership function, which ranges from 0 to 1. The membership function value of any criterion can differ according to the decision-maker's perception of the same context. To remove the fuzziness from the process, fuzzy logic uses three types of membership functions: monotonic, triangular, and trapezoidal (Wang et al., 2008; Chang, 1996a). For the present methodology, the triangular membership function is used, in which the three base points of the triangle are taken from the triangular fuzzy conversion scale. Thus, the combination of these two methods produces efficient and acceptable results. The extent analysis method can be used to estimate the final weights of the criteria, as explained by Chang (1996a) and Zhu, Jing, and Chang (1999). In the following subsection, the fuzzy extent analysis method is explained for the ready reference of the reader.

9.3.1 Fuzzy extent analysis method for weights calculation
The extent analysis method (Zhu et al., 1999) is used to estimate the final weights of the criteria and subcriteria. A detailed stepwise calculation of the final weights is shown as an example to explain the concept of fuzzy-AHP. Let $Z = \{z_1, z_2, \ldots, z_n\}$ be an article set and $U = \{u_1, u_2, \ldots, u_m\}$ be an objective set. Each article is taken in turn and extent analysis is performed for each objective $g_i$. The $m$ extent analysis values for each article can then be obtained, with the following notation:
\[
M_{g_i}^{1}, M_{g_i}^{2}, M_{g_i}^{3}, \ldots, M_{g_i}^{m}, \quad i = 1, 2, \ldots, n
\]
where all $M_{g_i}^{j}$ ($j = 1, 2, \ldots, m$) are triangular fuzzy numbers. The steps of the extent analysis method for fuzzy-AHP are as follows.
Step 1: Construction of the fuzzy-AHP comparison matrix
TABLE 9.1 Triangular fuzzy conversion scale (Lee et al., 2013)

Linguistic fuzzy scale            Triangular fuzzy number scale    Triangular fuzzy number reciprocal scale
Equally important                 (1, 1, 1)                        (1, 1, 1)
Weakly more important             (2/3, 1, 3/2)                    (2/3, 1, 3/2)
Fairly strongly more important    (3/2, 2, 5/2)                    (2/5, 1/2, 2/3)
Very strongly more important      (5/2, 3, 7/2)                    (2/7, 1/3, 2/5)
Absolutely more important         (7/2, 4, 9/2)                    (2/9, 1/4, 2/7)
The linguistic fuzzy scale is used to construct the matrix. The standard scale for triangular fuzzy conversion is shown in Table 9.1. Using triangular fuzzy numbers, the fuzzy pairwise comparison evaluation matrix $\tilde{A} = (\tilde{a}_{ij})_{n \times m}$ is constructed, where $\tilde{a}_{ij} = (l_{ij}, m_{ij}, u_{ij})$ is the relative importance of the $i$th element over the $j$th element in the pairwise comparison and $l_{ij}$, $m_{ij}$, and $u_{ij}$ are the lower, medium, and upper bound values of $\tilde{a}_{ij}$, respectively. The entries also satisfy the reciprocal relations, consistent with the reciprocal scale in Table 9.1:
\[
l_{ij} = \frac{1}{u_{ji}}, \quad m_{ij} = \frac{1}{m_{ji}}, \quad u_{ij} = \frac{1}{l_{ji}}
\]
Step 2: Calculation of the value of the fuzzy synthetic extent $S_i$ with respect to the $i$th criterion
The formula for $S_i$ is:
\[
S_i = \sum_{j=1}^{m} M_{g_i}^{j} \otimes \left[\sum_{i=1}^{n}\sum_{j=1}^{m} M_{g_i}^{j}\right]^{-1} \tag{9.11}
\]
Here $S_i$ is defined as the fuzzy synthetic extent and $\otimes$ denotes the multiplication of fuzzy numbers. To obtain $\sum_{j=1}^{m} M_{g_i}^{j}$, the fuzzy addition operation is performed on the $m$ extent analysis values of a particular matrix such that
\[
\sum_{j=1}^{m} M_{g_i}^{j} = \left(\sum_{j=1}^{m} l_j, \; \sum_{j=1}^{m} m_j, \; \sum_{j=1}^{m} u_j\right) \tag{9.12}
\]
To calculate the term $\left[\sum_{i=1}^{n}\sum_{j=1}^{m} M_{g_i}^{j}\right]^{-1}$, first find:
\[
\sum_{i=1}^{n}\sum_{j=1}^{m} M_{g_i}^{j} = \left(\sum_{i=1}^{n} l_i, \; \sum_{i=1}^{n} m_i, \; \sum_{i=1}^{n} u_i\right) \tag{9.13}
\]
The inverse of the above expression gives
\[
\left[\sum_{i=1}^{n}\sum_{j=1}^{m} M_{g_i}^{j}\right]^{-1} = \left(\frac{1}{\sum_{i=1}^{n} u_i}, \; \frac{1}{\sum_{i=1}^{n} m_i}, \; \frac{1}{\sum_{i=1}^{n} l_i}\right) \tag{9.14}
\]
Then, the value of the fuzzy synthetic extent with respect to the $i$th article follows from Eq. (9.11).
Step 3: Estimation of the sets of Weight Values of the Fuzzy- AHP To find the approximations for the sets of weight values under each criterion, it is essential to consider a principle of comparison for fuzzy numbers, which is as given below: The degree of possibility of M1 = (l1 , m1 , u1 ) ≥ M2 = (l2 , m2 , u2 ) is defined as V (M1 ≥ M2 ) = Sup x ≥y min μM1 (x), μM2 (x) V (M1 ≥ M2 ) = 1 · if · m1 ≥ m2 V (M2 ≥ M1 ) = hgt (M1 ∩ M2 ) = μM1 (d)
(9.15)
Where, d is the ordinate of the highest intersection point D (Chang, 1996a) (Fig. 9.1) between μM1 (d) and μM2 (d). Also, the above equation can be equivalently expressed as follows: V (M2 ≥ M1 ) = hgt (M1 ∩ M2 ) = μM1 (d) ⎧ 1, if m2 ≥ m1 ⎪ ⎨ if l1 ≥ u2 = 0, ⎪ ⎩ l1−u2 , otherwise (m2−u2)−(m1−l1)
(9.16)
The following Fig. 9.1 illustrates the above equation To compare M1 and M2 , we need both the values of V (M1 ≥ M2 ) and V (M2 ≥ M1 ) Step 4: Calculation of the Sets of Weight values of the Fuzzy- AHP (Chang, 1996b)
252
Safety and reliability modeling and its applications
FIGURE 9.1
The intersection between M1 and M2 .
The degree of possibility for a convex fuzzy number M to be greater than k convex fuzzy numbers $M_i$ (i = 1, 2, ..., k) can be defined by
$$V(M \ge M_1, M_2, \ldots, M_k) = V[(M \ge M_1) \text{ and } (M \ge M_2) \text{ and } \ldots \text{ and } (M \ge M_k)] = \min V(M \ge M_i), \quad i = 1, 2, \ldots, k \tag{9.17}$$
Assume that
$$d'(A_i) = \min V(S_i \ge S_k), \quad k = 1, 2, \ldots, n;\ k \ne i \tag{9.18}$$
Then the weight vector is given by
$$W' = \left(d'(A_1),\ d'(A_2),\ \ldots,\ d'(A_n)\right)^{T} \tag{9.19}$$
where $A_i$ (i = 1, ..., n) are n elements. Through normalization, the normalized weight vector is
$$W = \left(d(A_1),\ d(A_2),\ \ldots,\ d(A_n)\right)^{T} \tag{9.20}$$
where W is a nonfuzzy number.
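The extent analysis steps in Eqs. (9.13)–(9.20) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the 3 × 3 comparison matrix are assumptions made for the example and are not taken from the chapter. Each triangular fuzzy number is stored as a tuple (l, m, u).

```python
def fuzzy_synthetic_extents(matrix):
    """Return S_i for every row of a matrix of triangular fuzzy numbers (l, m, u)."""
    # Fuzzy addition of each row: component-wise sums of l, m and u.
    row_sums = [tuple(sum(cell[k] for cell in row) for k in range(3)) for row in matrix]
    # Eq. (9.13): sum over all rows.
    total_l, total_m, total_u = (sum(rs[k] for rs in row_sums) for k in range(3))
    # Eq. (9.14): the inverse is (1/sum u, 1/sum m, 1/sum l), so l is divided by total_u.
    return [(l / total_u, m / total_m, u / total_l) for (l, m, u) in row_sums]


def degree_of_possibility(m_a, m_b):
    """V(m_a >= m_b) following Eq. (9.16), with m_a in the role of M2 and m_b of M1."""
    l_a, mid_a, u_a = m_a
    l_b, mid_b, u_b = m_b
    if mid_a >= mid_b:
        return 1.0
    if l_b >= u_a:
        return 0.0
    return (l_b - u_a) / ((mid_a - u_a) - (mid_b - l_b))


def extent_weights(matrix):
    """Normalized weight vector W of Eq. (9.20)."""
    s = fuzzy_synthetic_extents(matrix)
    n = len(s)
    # Eq. (9.18): minimum degree of possibility against every other element.
    d_prime = [min(degree_of_possibility(s[i], s[k]) for k in range(n) if k != i)
               for i in range(n)]
    total = sum(d_prime)
    return [d / total for d in d_prime]


def reciprocal(tfn):
    l, m, u = tfn
    return (1 / u, 1 / m, 1 / l)


# Hypothetical judgements, chosen only to exercise the functions.
EQUAL = (1.0, 1.0, 1.0)
WEAK = (2 / 3, 1.0, 3 / 2)
FAIR = (3 / 2, 2.0, 5 / 2)
matrix = [
    [EQUAL, WEAK, FAIR],
    [reciprocal(WEAK), EQUAL, WEAK],
    [reciprocal(FAIR), reciprocal(WEAK), EQUAL],
]
weights = extent_weights(matrix)
print([round(w, 3) for w in weights])
```

Note that with strongly one-sided judgements the method can assign a weight of exactly zero to a criterion, because the 0 branch of Eq. (9.16) is then selected.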
9.4 Estimation of RPN
As explained earlier, FMEA is performed by identifying the failure modes, finding their causes and consequences, estimating their probabilities of occurrence, and finally determining corrective actions or preventive measures. The conventional RPN is a quantitative measure that combines the probability of occurrence of a failure mode (O) with its severity ranking (S) and detection ranking (D) (Rai and Bolia, 2015). RPN is given by:
$$\mathrm{RPN} = O \times S \times D \quad (1 \le \mathrm{RPN} \le 1000) \tag{9.21}$$
The conventional approach to estimating O, S, and D and calculating the RPN is explained above. The RPN method based on the REI (q) is presented next.
From Eq. (9.2), the following equation can be obtained for the KI model:
$$V_i = q \sum_{j=1}^{i} x_j \tag{9.22}$$
Assuming GRP, the conditional probability of occurrence can be written as:
$$F(V_i \mid V_{i-1}) = 1 - \exp\left[a\left\{(V_{i-1})^{b} - (V_{i-1} + x_i)^{b}\right\}\right] \tag{9.23}$$
Substituting Eq. (9.22) into Eq. (9.23), the following equation is obtained:
$$F(V_i \mid V_{i-1}) = 1 - \exp\left[a\left\{\left(q\sum_{j=1}^{i-1} x_j\right)^{b} - \left(\left(q\sum_{j=1}^{i-1} x_j\right) + x_i\right)^{b}\right\}\right] \tag{9.24}$$
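Eqs. (9.22)–(9.24) can be checked numerically with a short Python sketch. The function names and the numerical values of a, b, and the inter-failure times are assumptions chosen only for illustration; a and b are taken here to be the scale and shape parameters introduced earlier in the chapter.

```python
import math


def virtual_age_kijima1(q, intervals):
    """Eq. (9.22): V_i = q * (sum of the inter-failure times x_1 .. x_i)."""
    return q * sum(intervals)


def conditional_failure_probability(a, b, q, previous_intervals, x_i):
    """Eq. (9.24): probability of failure within x_i, given the virtual age V_(i-1)."""
    v_prev = virtual_age_kijima1(q, previous_intervals)
    return 1.0 - math.exp(a * (v_prev ** b - (v_prev + x_i) ** b))


# Hypothetical parameters: scale a, shape b > 1 (a deteriorating system).
a, b = 0.01, 1.5
history = [100.0, 50.0]   # assumed earlier inter-failure times
x_next = 30.0             # assumed length of the next interval

for q in (0.0, 0.5, 1.0):
    f = conditional_failure_probability(a, b, q, history, x_next)
    print(f"q = {q:.1f}  F = {f:.4f}")
```

For q = 0 the expression reduces to the ordinary Weibull form 1 − exp(−a x^b), and for b > 1 the conditional probability grows with q, which is the dependence on q that the text relies on.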
From the above equation it is observed that the failure probability $F(V_i \mid V_{i-1})$ is directly related to q. Recall from Eq. (9.21) that the RPN is given by O × S × D, where O is a measure of the probability of occurrence of the failures. The issue with using O is that it does not incorporate the effect of the REI when the failure dynamics are governed by GRP. Therefore, the presented approach replaces O with a measure of the conditional probability of occurrence, thus incorporating the effect of q. The methodology is explained further in Eqs. (9.25)–(9.27):
$$\mathrm{RPN}_{\mathrm{System}} = \sum_{i=1}^{p} \mathrm{RPN}_i \tag{9.25}$$
where:
p: number of selected criteria treated as failure modes (i = 1, 2, ..., p)
RPN_System: risk priority number of the system due to maintenance
RPN_i: risk priority number of the system due to the ith criterion
Now
$$\mathrm{RPN}_i = q_{FM(i)} \times W_{FM(i)} \times S_{FM(i)} \times D_{FM(i)} \tag{9.26}$$
where:
q_FM(i): repair effectiveness index due to the ith criterion (FM)
W_FM(i): importance weight of the ith criterion (FM), estimated through fuzzy-AHP
S_FM(i): severity level due to the ith criterion (FM)
D_FM(i): detection level of failures due to the ith criterion (FM)
Substituting Eq. (9.26) into Eq. (9.25), the final expression for RPN_System is:
$$\mathrm{RPN}_{\mathrm{System}} = \sum_{i=1}^{p} q_{FM(i)} \times W_{FM(i)} \times S_{FM(i)} \times D_{FM(i)} \tag{9.27}$$
The weights can be obtained through fuzzy-AHP as explained in Section 9.3, and the values of q can be obtained as explained in Section 9.2. The severity and detection values can be obtained as explained in Sections 9.4.1 and 9.4.2.
9.4.1 Classification of severity
Each failure mode can be assigned a severity rating on a scale of 1–10. Failures can be classified (Ebeling, 2004; Rai and Bolia, 2015) into one of the following four categories.
Category I: catastrophic. Significant system failure occurs that can result in injury, loss of life, or major damage.
Category II: critical. Complete loss of system occurs. Performance is unacceptable.
Category III: marginal. System is degraded, with partial loss in performance.
Category IV: negligible. Minor failure occurs with no effect on acceptable system performance.
The assignment of numbers is subjective and is based on the consequence of the failure. Depending on which of the above four categories a failure falls in, it is rated on a scale of 1 to 10. Category I can be rated as 10, whereas Category IV can be rated as 1. The ratings assigned to Categories II and III vary from 9 to 2.
9.4.2 Detection
The level of detection depends on the ability of the sensors to detect the cause of the failure mode and prevent it from occurring. It is also described on a 10-point scale: bad detection can be rated as 10 and good detection as 1 (Rai and Bolia, 2015).
9.5 Case study¹
The environmental control and life support system (ECLSS) of an orbital space station (OSS) is selected as a case study to illustrate the presented methodology. Maintaining a complex system such as the OSS ECLSS is a challenging task for modern maintenance engineers, as high skill and expertise are required to accomplish it proficiently. The ECLSS of an OSS is a critical system that includes several complex subsystems, such as atmosphere management, water management, food production, waste management, and crew safety. Moreover, the subsystems are interrelated, and the proper functioning of the entire system depends on them. On failure, ECLSS subsystems are generally subjected to imperfect repairs, which means that the repair actions bring the subsystems to a state between the as-new state and the state just prior to failure.
¹ The paper has been published in Springer International Publishing AG, part of Springer Nature 2019, R. L. Boring (Ed.): AHFE 2018, AISC 778, pp. 77–87, 2019. https://doi.org/10.1007/978-3319-94391-6_8
The first step is to identify the important decision criteria relevant to the current maintenance of the OSS ECLSS; these could be (1) skill, (2) environment, (3) procedure, and (4) resources.
Skills are required to undertake a number of processes, including inspection, servicing, troubleshooting, removal, installation, rigging, testing, and repair, during maintenance of the OSS ECLSS.
The environmental conditions inside the OSS are adverse because of microgravity. Since the crew cannot bear weight on their feet, many health problems arise in the long term: bones and muscles weaken, and other changes also take place within the body. This adversely affects the working conditions for executing maintenance tasks on the OSS ECLSS.
Adherence to procedure helps ensure that the crew is properly trained and that each workplace has the necessary equipment and other resources to perform the job. Approved written procedures must be followed by the crew for all maintenance and repair activities.
Maintenance resources are needed to facilitate the successful completion of the maintenance task. Resources are the crew's most important requirement for getting the given work done. Generally, the resource requirement is dictated by the features of the environmental factors and the actions of the crew for OSS ECLSS maintenance.
Now, considering these four criteria, the fuzzy-AHP weights are estimated with the help of fuzzy extent analysis as explained in Section 9.3.1, and the obtained weights are:
$$\begin{bmatrix} W_{FM(\mathrm{Skill})} \\ W_{FM(\mathrm{Environment})} \\ W_{FM(\mathrm{Procedure})} \\ W_{FM(\mathrm{Resources})} \end{bmatrix} = \begin{bmatrix} 0.17 \\ 0.38 \\ 0.06 \\ 0.38 \end{bmatrix} \tag{9.28}$$
The RPN_System [Eq. (9.25)] in the case of the ECLSS can be written as:
$$\mathrm{RPN}_{\mathrm{ECLSS(M)}} = \mathrm{RPN}_{\mathrm{Skill}} + \mathrm{RPN}_{\mathrm{Environment}} + \mathrm{RPN}_{\mathrm{Procedure}} + \mathrm{RPN}_{\mathrm{Resources}}$$
where:
RPN_ECLSS(M): risk priority number of the ECLSS due to maintenance
RPN_Skill: risk priority number of the ECLSS due to skill
RPN_Environment: risk priority number of the ECLSS due to environment
RPN_Procedure: risk priority number of the ECLSS due to procedure
RPN_Resources: risk priority number of the ECLSS due to resources
Thus, from Eq. (9.26),
$$\mathrm{RPN}_{\mathrm{Skill}} = q_{FM(\mathrm{Skill})} \times W_{FM(\mathrm{Skill})} \times S_{FM(\mathrm{Skill})} \times D_{FM(\mathrm{Skill})}$$
$$\mathrm{RPN}_{\mathrm{Environment}} = q_{FM(\mathrm{Environment})} \times W_{FM(\mathrm{Environment})} \times S_{FM(\mathrm{Environment})} \times D_{FM(\mathrm{Environment})}$$
$$\mathrm{RPN}_{\mathrm{Procedure}} = q_{FM(\mathrm{Procedure})} \times W_{FM(\mathrm{Procedure})} \times S_{FM(\mathrm{Procedure})} \times D_{FM(\mathrm{Procedure})}$$
$$\mathrm{RPN}_{\mathrm{Resources}} = q_{FM(\mathrm{Resources})} \times W_{FM(\mathrm{Resources})} \times S_{FM(\mathrm{Resources})} \times D_{FM(\mathrm{Resources})}$$
where:
q_FM(Skill), q_FM(Environment), q_FM(Procedure), and q_FM(Resources): repair effectiveness indices due to skill, environment, procedure, and resources
W_FM(Skill), W_FM(Environment), W_FM(Procedure), and W_FM(Resources): importance weights of the four criteria, estimated through fuzzy-AHP
S_FM(Skill), S_FM(Environment), S_FM(Procedure), and S_FM(Resources): severity levels due to skill, environment, procedure, and resources
D_FM(Skill), D_FM(Environment), D_FM(Procedure), and D_FM(Resources): detection levels of failures due to skill, environment, procedure, and resources
Thus the final expression for the RPN of the ECLSS due to maintenance is:
$$\begin{aligned} \mathrm{RPN}_{\mathrm{ECLSS(M)}} = {} & [q_{FM(\mathrm{Skill})} \times W_{FM(\mathrm{Skill})} \times S_{FM(\mathrm{Skill})} \times D_{FM(\mathrm{Skill})}] \\ & + [q_{FM(\mathrm{Environment})} \times W_{FM(\mathrm{Environment})} \times S_{FM(\mathrm{Environment})} \times D_{FM(\mathrm{Environment})}] \\ & + [q_{FM(\mathrm{Procedure})} \times W_{FM(\mathrm{Procedure})} \times S_{FM(\mathrm{Procedure})} \times D_{FM(\mathrm{Procedure})}] \\ & + [q_{FM(\mathrm{Resources})} \times W_{FM(\mathrm{Resources})} \times S_{FM(\mathrm{Resources})} \times D_{FM(\mathrm{Resources})}] \end{aligned}$$
The weights of the four criteria shown in Eq. (9.28) are also scaled on a numeric scale of 1–10, along similar lines to the scaling of the probability of occurrence (O) (Ebeling, 2004; Rai and Bolia, 2015). The scale adopted for the weights in the RPN estimation is as follows:
For 0 ≤ W ≤ 0.2, the values assigned are from 1 to 6, and
for 0.2 ≤ W ≤ 0.3, the values assigned are from 7 to 10.
Based on Eq. (9.28) and the scale explained above, the following values are assigned to the weights of the four criteria:
$$\begin{bmatrix} W_{FM(\mathrm{Skill})} \\ W_{FM(\mathrm{Environment})} \\ W_{FM(\mathrm{Procedure})} \\ W_{FM(\mathrm{Resources})} \end{bmatrix} = \begin{bmatrix} 6 \\ 10 \\ 2 \\ 10 \end{bmatrix}$$
The severity values designated to the four criteria, as explained in Section 9.4.1, are as follows (T denotes the transpose of a matrix):
$$\left[S_{FM(\mathrm{Skill})},\ S_{FM(\mathrm{Environment})},\ S_{FM(\mathrm{Procedure})},\ S_{FM(\mathrm{Resources})}\right]^{T} = [5,\ 8,\ 6,\ 10]^{T}$$
The detection values assigned to the four criteria, as explained in Section 9.4.2, are as follows:
$$\left[D_{FM(\mathrm{Skill})},\ D_{FM(\mathrm{Environment})},\ D_{FM(\mathrm{Procedure})},\ D_{FM(\mathrm{Resources})}\right]^{T} = [5,\ 4,\ 6,\ 2]^{T}$$
FIGURE 9.2
Graphs between REI (q) and RPN.
The values of RPN as a function of the corresponding q obtained for all the criteria are:
$$\mathrm{RPN}_{\mathrm{Skill}} = 6 \times 5 \times 5 \times q_{FM(\mathrm{Skill})} = 150\, q_{FM(\mathrm{Skill})}$$
$$\mathrm{RPN}_{\mathrm{Environment}} = 10 \times 8 \times 4 \times q_{FM(\mathrm{Environment})} = 320\, q_{FM(\mathrm{Environment})}$$
$$\mathrm{RPN}_{\mathrm{Procedure}} = 2 \times 6 \times 6 \times q_{FM(\mathrm{Procedure})} = 72\, q_{FM(\mathrm{Procedure})}$$
$$\mathrm{RPN}_{\mathrm{Resources}} = 10 \times 10 \times 2 \times q_{FM(\mathrm{Resources})} = 200\, q_{FM(\mathrm{Resources})}$$
As explained in Section 9.2, the value of q varies from 0 to 1. Hence the sensitivity graph of RPN_i is plotted for different values of q in Fig. 9.2. The final value of RPN_ECLSS(M) can be obtained using Eq. (9.25). Fig. 9.2 shows that as q increases from 0 to 1, the RPN also increases. Thus, to keep the RPN low, the value of REI (q) should be as low as possible.
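The case-study arithmetic can be reproduced with a small Python sketch of Eqs. (9.26) and (9.27). The scaled weight, severity, and detection values are the factors appearing in the products above; the function and variable names, and the q values used in the sweep, are assumptions for illustration.

```python
# (scaled weight W, severity S, detection D) for each criterion, as used in the products above.
ECLSS_CRITERIA = {
    "Skill": (6, 5, 5),
    "Environment": (10, 8, 4),
    "Procedure": (2, 6, 6),
    "Resources": (10, 10, 2),
}


def rpn_criterion(q, w, s, d):
    """Eq. (9.26): RPN_i = q * W * S * D."""
    return q * w * s * d


def rpn_system(q_by_criterion, criteria=ECLSS_CRITERIA):
    """Eq. (9.27): sum of RPN_i over all criteria."""
    return sum(rpn_criterion(q_by_criterion[name], *wsd) for name, wsd in criteria.items())


# Slope of each RPN_i line in the sensitivity plot (the value of RPN_i at q = 1).
coefficients = {name: rpn_criterion(1.0, *wsd) for name, wsd in ECLSS_CRITERIA.items()}
print(coefficients)

# Sensitivity sweep with the same hypothetical q applied to every criterion.
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q, rpn_system({name: q for name in ECLSS_CRITERIA}))
```

Because every RPN_i is linear in q, the criterion with the largest product W × S × D (environment, in this case) dominates the system RPN for equal values of q.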
9.5.1 Remedial measures
It is observed from Fig. 9.2 that, to achieve a low RPN, the value of REI (q) for the four selected criteria (Sharma and Rai, 2019), i.e., skill, environment, procedure, and resources, needs to be kept as low as possible for OSS ECLSS maintenance.
The skill of the crew needs to be enhanced. Self-management, communication and interpersonal skills, problem-solving, the ability to work consistently safely and rigorously, and adherence to OSS regulations need to be inculcated in the crew members. To possess the skill required for a specified maintenance task, the crew should be trained accordingly. Training is an extremely useful tool that can put the crew in a position to carry out the maintenance and repair tasks of the OSS ECLSS correctly, effectively, and meticulously.
The crew members inside the OSS should be fully aware of the right standard operating procedures and should follow servicing packages properly while carrying out maintenance tasks on the ECLSS. Following the right procedure is the standard approach to identifying the knowledge, skills, and attitudes necessary to perform each task in a given job. Adherence to procedure helps ensure that each crew member is properly trained and that each workplace has the necessary equipment and other resources to perform the job. Unworkable or ambiguous procedures are among the most common reasons for procedural violations.
The resources available should be adequate for both preventive and corrective maintenance tasks. The ability of a crew member inside the OSS to complete an ECLSS maintenance activity may be greatly affected by the non-availability of resources. The performance of an activity may be further affected if the available resources are of low quality or inadequate for the activity. Therefore, forward planning to locate, acquire, and store resources is essential to complete a job effectively, correctly, and efficiently. It is also essential to maintain the available resources properly. Moreover, arrangements must be made to acquire resources, particularly spare parts, in time to achieve high availability of the ECLSS equipment (both storage and recycling).
In view of the foregoing, it is reiterated that, if the RPN is to be kept low so that the risk associated with the maintenance of the OSS ECLSS is kept to a bare minimum, then the REI (q) values associated with skill, environment, procedure, and resources are to be kept as close to zero as possible.
9.6 Conclusion and Future Scope
Failure modes and effects analysis presently finds limited use in the maintenance of complex and critical repairable systems as a reliability engineering tool to identify and reduce failure risks. The conventional RPN is a quantitative measure that combines the probability of occurrence of a failure mode with its severity ranking. This chapter modifies the conventional FMEA model by introducing the REI (q) and priority weights for all the criteria. Hence the proposed modified algorithm incorporates the effect of the weights obtained through fuzzy-AHP and of the REI (q) while estimating the RPN. Thus, to keep the RPN low, the value of REI (q) should be as low as possible. Adequate remedial measures for improvement can be taken according to the estimated RPN value so that the risk associated with the maintenance and repair of repairable systems can be minimized.
Future work may consider the REI as a function of the various factors on which repair effectiveness depends. A linear or non-linear relationship can be established between these factors, and the weights can be obtained with the help of advanced MCDM tools and techniques such as fuzzy TOPSIS, artificial neural networks, etc., for more precise estimation of the RPN.
Exercise
Aero engines used in military aircraft are designed for high thrust to enable sudden climbs and to sustain high "G" loads during maneuvers. They are also designed to prevent surge and stall due to the back-pressure resulting from the firing of rockets and missiles, which creates considerable turbulence in front of the aero engine. Aero engines are thus subjected to very high aerodynamic and thermal stresses and hence to frequent failures. The failure times (in hours) of such aero engines, with a time between overhauls of 550 hours, are given below:
0.5, 12, 13, 15, 17, 31, 38, 39, 42, 47, 52, 67, 98, 101, 125, 133, 144, 166, 167, 177, 179, 189.46, 198, 206, 211, 212, 226, 264, 267, 269, 273, 293, 298, 321, 344.42, 354, 361, 366, 383, 387, 390, 401, 408, 425, 443, 461, 475, 507, 520, 542, 544, 545, 547, 548, and 549.
Estimate the scale and shape parameters and the repair effectiveness index.
Identify important decision criteria relevant to the current maintenance practice of the aero engines that could be considered as failure modes for aero engine maintenance and that can lead to failures of the system.
Model these failure modes through the generalized renewal process (GRP) (Tanwar et al., 2014) and estimate the repair effectiveness index (REI) for each criterion. (Failure times for the various failure modes can be assumed and segregated from the failures given for the complete system in the exercise.)
Determine the importance weights of all selected decision criteria through pairwise comparisons with the help of the fuzzy analytic hierarchy process (fuzzy AHP) as explained in the chapter.
Introduce the REI for all the criteria and the priority weights in the FMEA model. Estimate the RPN for all criteria and then for the complete aero engine.
Discuss the results and suggest remedial measures.
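As a possible starting point for the first task, the sketch below fits the scale a, shape b, and REI q by a coarse grid search over the likelihood implied by Eqs. (9.22) and (9.23). Everything here is an assumption rather than the chapter's own procedure: the listed times are treated as cumulative failure times of a single failure-terminated system, the density is obtained by differentiating Eq. (9.23) with respect to x_i, the grids are arbitrary, and the function names are hypothetical.

```python
import math

FAILURE_TIMES = [
    0.5, 12, 13, 15, 17, 31, 38, 39, 42, 47, 52, 67, 98, 101, 125, 133, 144,
    166, 167, 177, 179, 189.46, 198, 206, 211, 212, 226, 264, 267, 269, 273,
    293, 298, 321, 344.42, 354, 361, 366, 383, 387, 390, 401, 408, 425, 443,
    461, 475, 507, 520, 542, 544, 545, 547, 548, 549,
]


def profile_scale(times, b, q):
    """Closed-form maximizer of the likelihood in a, for fixed shape b and REI q."""
    total, prev = 0.0, 0.0
    for t in times:
        v_prev = q * prev                       # Eq. (9.22): virtual age after the last repair
        total += (v_prev + (t - prev)) ** b - v_prev ** b
        prev = t
    return len(times) / total


def log_likelihood(times, a, b, q):
    """Sum of log densities obtained by differentiating Eq. (9.23) in x_i."""
    ll, prev = 0.0, 0.0
    for t in times:
        v_prev = q * prev
        age = v_prev + (t - prev)               # virtual age at the moment of failure
        ll += (math.log(a) + math.log(b) + (b - 1.0) * math.log(age)
               + a * (v_prev ** b - age ** b))
        prev = t
    return ll


def fit_grp(times, b_grid, q_grid):
    """Return (log-likelihood, a, b, q) for the best grid point."""
    best = None
    for b in b_grid:
        for q in q_grid:
            a = profile_scale(times, b, q)
            ll = log_likelihood(times, a, b, q)
            if best is None or ll > best[0]:
                best = (ll, a, b, q)
    return best


b_grid = [0.5 + 0.1 * k for k in range(16)]     # assumed search range for the shape
q_grid = [0.1 * k for k in range(11)]           # q between 0 and 1
ll, a_hat, b_hat, q_hat = fit_grp(FAILURE_TIMES, b_grid, q_grid)
print(f"a = {a_hat:.5f}, b = {b_hat:.2f}, q = {q_hat:.1f}, logL = {ll:.2f}")
```

A finer grid or a numerical optimizer would be needed for a serious estimate, and the result should be checked with a goodness-of-fit test such as that of Rai and Sharma (2017).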
References
Bolia, N., Rai, R.N., 2013. Reliability based methodologies for optimal maintenance policies in military aviation. Int. J. Perform. Eng. 9 (3), 295–303.
Borgovini, R., Pemberton, S., Rossi, M., 1993. Failure mode, effects, and criticality analysis (FMECA). Reliab. Anal. Center Griffiss AFB NY.
Brown, M., Proschan, F., 1983. Imperfect repair. J. Appl. Probab. 20 (4), 851–859.
Celik, M., Er, I.D., Ozok, A.F., 2009. Application of fuzzy extended AHP methodology on shipping registry selection: the case of Turkish maritime industry. Expert Syst. Appl. 36 (1), 190–198.
Chang, D.-Y., 1996a. Applications of the extent analysis method on fuzzy AHP. Eur. J. Oper. Res. 95 (3), 649–655.
Chang, D.-Y., 1996b. Applications of the extent analysis method on fuzzy AHP. Eur. J. Oper. Res. 95 (3), 649–655.
Ebeling, C.E., 2004. An Introduction to Reliability and Maintainability Engineering. Tata McGraw-Hill Education.
Kijima, M., 1989. Some results for repairable systems with general repair. J. Appl. Probab. 26 (1), 89–102.
Kijima, M., Sumita, U., 1986. A useful generalization of renewal theory: counting processes governed by non-negative Markovian increments. J. Appl. Probab. 23 (1), 71–88.
Lee, S.K., Mogi, G., Hui, K.S., 2013. A fuzzy analytic hierarchy process (AHP)/data envelopment analysis (DEA) hybrid model for efficiently allocating energy R&D resources: in the case of energy technologies against high oil prices. Renew. Sustain. Energy Rev. 21, 347–355. doi:10.1016/j.rser.2012.12.067.
MIL_STD_1629A, US, 1984. Procedures for Performing Failure Mode Effects and Criticality Analysis. November.
Rai, R.N., Bolia, N., 2015. Modified FMEA model with repair effectiveness factor using generalized renewal process. SRESA J. Life Cycle Reliab. Saf. Eng. 4, 36–46.
Rai, R.N., Sharma, G., 2017. Goodness-of-fit test for generalised renewal process. Int. J. Reliab. Safe. 11 (1–2), 116–131.
Rai, R.N., Bolia, N., 2014. Availability based optimal maintenance policies in military aviation. Int. J. Perform. Eng. 10 (6), 641–648.
Rigdon, S.E., Basu, A.P., 2000. Statistical Methods for the Reliability of Repairable Systems. Wiley, New York.
Sharma, G., Rai, R.N., 2019. Reliability modeling and analysis of environmental control and life support systems of space stations: a literature survey. Acta Astronautica 155 (February), 238–246. doi:10.1016/j.actaastro.2018.12.010.
Tang, Y.-C., Lin, T.W., 2011. Application of the fuzzy analytic hierarchy process to the lead-free equipment selection decision. Int. J. Bus. Syst. Res. 5 (1), 35–56.
Tanwar, M., Rai, R.N., Bolia, N., 2014. Imperfect repair modeling using Kijima type generalized renewal process. Reliab. Eng. Syst. Safe. 124, 24–31.
Wang, Y.-M., Luo, Y., Hua, Z., 2008. On the extent analysis method for fuzzy AHP and its applications. Eur. J. Oper. Res. 186 (2), 735–747.
Yanez, M., Joglar, F., Modarres, M., 2002. Generalized renewal process for analysis of repairable systems with limited failure experience. Reliab. Eng. Syst. Safe. 77 (2), 167–180.
Zhu, K.-J., Jing, Y., Chang, D.-Y., 1999. A discussion on extent analysis method and applications of fuzzy AHP. Eur. J. Oper. Res. 116 (2), 450–456.
Chapter 10
Methodology to select human reliability analysis technique for repairable systems
Garima Sharma and Rajiv Nandan Rai
Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00007-6
Copyright © 2021 Elsevier Inc. All rights reserved.
10.1 Introduction
In modern industrial systems, the prevention of hazardous accidents has always been given due importance. In high-risk industries, such as nuclear power plants and the aerospace industry, it has been observed that 70%–80% of accidents take place due to human errors (Oliveira et al., 2017; Jang et al., 2016), and the remainder are due to technical issues (Di Pasquale et al., 2013). Hence, the predictive analysis of human errors has received much attention in the literature through the development of various human reliability analysis (HRA) techniques (Jung et al., 2001), which are applicable to high-risk industries in particular and to other industries in general.
Human reliability analysis (Dhillon, 2007) is used to examine the intrinsic risk of human behavior or events introducing errors into the operation and maintenance of critical and complex repairable systems (Rai and Sharma, 2017; Rai and Bolia, 2014; Tanwar et al., 2014). In other words, HRA helps to quantify the probability of human error for a specified task. It may provide guidance in classifying vulnerabilities within a task, and can provide directions on how to improve the reliability of a given task with the help of performance shaping factors (PSFs). For a better understanding of the basics of HRA, readers can refer to Stamatelatos et al. (2011). Using the basic concepts of HRA, different methodologies have been developed to meet a wider spectrum of industrial needs. Selecting the appropriate methodology, or combination of techniques, yields a quality human reliability assessment, which is a key element in developing effective strategies for understanding and dealing with risks caused by human errors (Dhillon, 2014). The development of HRA techniques has been divided into three generations:
First Generation: The HRA techniques developed in the initial phase (from the early 1960s) (Di Pasquale et al., 2018), such as the technique for human error rate prediction (THERP), absolute probability judgement (APJ), the human error assessment and reduction technique (HEART), and justified human error data information (JHEDI), are categorized as first generation HRA techniques. These techniques are influenced by the viewpoint of probabilistic safety assessment (PSA) and treat the human as a mechanical component. The methods mainly concentrate on quantifying an action in terms of success/failure without paying much attention to the causes and reasons of human behavior (Di Pasquale et al., 2013). Thus, these methods are often criticized for not considering the impact of the environment and other organizational factors on the human mind (Hollnagel, 1998; Sträter et al., 2004; Mosleh and Chang, 2004).
Second Generation: To overcome the drawbacks of first generation HRA techniques, methods such as the cognitive reliability and error analysis method (CREAM), simplified plant analysis risk human reliability assessment (SPAR-H), and connectionism assessment of human reliability (CAHR) were developed (in the 1990s) (Bell and Holroyd, 2009) by introducing a new category of error called "cognitive error" to capture the effects of human behavior in assigned tasks. These methods depend on empirical data as the key ingredient for model development and validation. Such data are hardly available, and this is the main drawback of second generation HRA methods (Griffith and Mahadevan, 2011).
Third Generation: The shortcomings of second generation methods have led to the development of new techniques that mainly focus on improving existing HRA techniques. Nuclear action reliability assessment (NARA) is the only technique so far that comes under the third generation of HRA techniques (Di Pasquale et al., 2013). This method is an advanced version of HEART developed specially for nuclear power plants.
Although a variety of HRA methodologies are available to analyze human error events, determining the most appropriate HRA method to provide the most useful results can depend on industry-specific cultures and requirements (Bell and Holroyd, 2009). This chapter discusses a hybrid methodology based on the fuzzy analytic hierarchy process (fuzzy-AHP) and artificial neural networks (ANNs) for selecting the best HRA model for the selected case study. Before this, the following subsection presents a brief review of some well-known HRA methods available in the literature.
10.1.1 Various HRA Techniques
There are many qualitative and quantitative methods available in the literature for assessment of human reliability. A list of some important HRA methods with their brief summary (Bell and Holroyd, 2009) is provided in Table 10.1.
TABLE 10.1 Brief summary of HRA methods available in the literature (Bell and Holroyd, 2009)

Method: THERP
Full form: Technique for human error rate prediction
Domain: Nuclear with wider applications
Remarks: THERP is a comprehensive HRA methodology that can be used as a screening or a detailed analysis (Stamatelatos et al., 2011; Swain and Guttmann, 1983; Boring, 2012). Unlike many of the quantification methodologies, THERP provides guidance on most steps in the HRA process, including task analysis (e.g., documentation reviews and walk/talk-throughs), error representation, and quantification of human error probabilities (HEPs).

Method: ASEP
Full form: Accident sequence evaluation programme
Domain: Nuclear
Remarks: ASEP is an "Abbreviated and slightly modified version of THERP. ASEP comprises of pre-accident screening with nominal human reliability analysis, and post-accident screening and nominal human reliability analysis facilities. ASEP provides a shorter route to human reliability analysis than THERP by requiring less training to use the tool, less expertise for screening estimates, and less time to complete the analysis." (M.H.C and H.A.P, 2008).

Method: HEART
Full form: Human error assessment and reduction technique
Domain: Generic
Remarks: HEART is a generic method for quantifying the risk involved because of human error. It is applicable to any industry where human reliability plays an important role.

Method: SPAR-H
Full form: Simplified plant analysis risk human reliability assessment
Domain: Nuclear with wider application
Remarks: SPAR-H is another method that has been developed for nuclear plant applications (Gertman et al., 2005; Whaley et al., 2012) and can be used for specific NASA applications. SPAR-H can be used as both a screening method and a detailed analysis method (Petruni et al., 2019). It allocates human actions to two general task categories: action or diagnosis. Action tasks include operating equipment, conducting calibration or testing, etc.; diagnosis tasks consist of relying on information and experience to understand the existing circumstances, scheduling and prioritizing events, and defining the appropriate sequence of actions.

Method: ATHEANA
Full form: A technique for human error analysis
Domain: Nuclear with wider application
Remarks: ATHEANA is used to obtain qualitative and quantitative results. The principle of the method is that plant (or working) conditions and other influences increase the probability of human errors.

Method: CREAM
Full form: Cognitive reliability and error analysis method
Domain: Nuclear with wider application
Remarks: CREAM is developed for cognitive error analysis and is based on the contextual control model (Stamatelatos et al., 2011). It can be used as a screening or a detailed analysis. CREAM provides a list of fifteen basic cognitive tasks and their definitions to frame the cognitive error modeling. It can be used predictively and retrospectively.

Method: APJ
Full form: Absolute probability judgement
Domain: Generic
Remarks: APJ is also known as direct numerical estimation, based on the quantification of HEPs. It is an expert-judgement-based approach that uses the experts' beliefs for the assessment of HEPs.

Method: SLIM-MAUD
Full form: Success likelihood index methodology, multi-attribute utility decomposition
Domain: Nuclear with wider application
Remarks: SLIM-MAUD assumes that the probability of an error happening in a specific condition hinges on the mutual effects of a relatively small set of PSFs, and that experts can judge the impact of each PSF on the reliability of the task.

Method: HRMS
Full form: Human reliability management system
Domain: Nuclear
Remarks: "HRMS is a fully-computerized HRA system that contains a human error identification (HEI) module, which is used by the assessor on a previously prepared and computerized task analysis" (Kirwan, 1994).

Method: JHEDI
Full form: Justified human error data information
Domain: Nuclear
Remarks: HRMS and JHEDI are similar in carrying out task and error analysis and PSF quantification, but JHEDI includes a less detailed assessment than HRMS.

Method: CAHR
Full form: Connectionism assessment of human reliability
Domain: Generic
Remarks: CAHR is a generic underlying model that combines event analysis and assessment to use past experience as the basis for HRA. It analyzes operational disturbances occurring due to inadequate human activities or organizational factors.

Method: CESA
Full form: Commission errors search and assessment
Domain: Nuclear
Remarks: CESA is an importance-screening-based method. In it, the identification process prioritizes the systems, and the scenarios are prioritized based on the magnitude of their contribution to the core harm frequency.

Method: NARA
Full form: Nuclear action reliability assessment
Domain: Nuclear
Remarks: NARA is developed for nuclear plant applications but can be used in specific types of National Aeronautics & Space Administration (NASA) applications. It can be used as a detailed analysis method and does not provide an explicit method for screening. It is similar to HEART but specially developed for long time-scale events.
TABLE 10.2 Triangular fuzzy conversion scale (Lee et al., 2013)

Linguistic fuzzy scale            Triangular fuzzy number scale    Triangular fuzzy number reciprocal scale
Equally important                 (1, 1, 1)                        (1, 1, 1)
Weakly more important             (2/3, 1, 3/2)                    (2/3, 1, 3/2)
Fairly strongly more important    (3/2, 2, 5/2)                    (2/5, 1/2, 2/3)
Very strongly more important      (5/2, 3, 7/2)                    (2/7, 1/3, 2/5)
Absolutely more important         (7/2, 4, 9/2)                    (2/9, 1/4, 2/7)
10.2 Selection of the best HRA technique for a particular case
To choose the best HRA method, it is very important to understand the requirements of the analysis, which differ from system to system. For this purpose, the first step is to decide the optimization criteria and subcriteria, which should represent the requirements of the analysis. Alternative HRA methods available in the literature are then chosen according to the selected criteria. After selecting the criteria and the alternative HRA methods, the best HRA method can be established with the help of tools and techniques such as fuzzy-AHP (Chang, 1996) and ANN (Pratihar, 2013). Fuzzy-AHP is a very popular and effective method for estimating the weights of given criteria and subcriteria, and ANN is a tool for obtaining optimized results from given alternatives, which helps in ranking them. The basic steps for fuzzy-AHP weight calculation and ANN formation are explained in the subsequent sections.
10.2.1 Fuzzy Analytical Hierarchical Process
The fuzzy analytic hierarchy process is a multicriteria decision-making method (Tang and Lin, 2011) that utilizes the characteristics of both the AHP method and fuzzy set theory. The analytical hierarchical process (AHP) provides the weights of hierarchically ordered criteria using pairwise comparisons (Celik et al., 2009), and fuzzy logic handles the fuzziness in the pairwise weights decided by the decision-makers. Fuzzy set theory uses the concept of a membership function, which ranges from 0 to 1. The membership function value of any criterion can differ according to the decision-maker's perception of the same context. To deal with this fuzziness, fuzzy logic offers three types of membership functions: monotonic, triangular, and trapezoidal (Chang, 1996; Wang et al., 2008). Here the triangular membership function is used, in which the three base points of the triangle are taken from the triangular fuzzy conversion scale displayed in Table 10.2. Thus the combination of these two methods (fuzzy and AHP) produces efficient and acceptable results.
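The triangular membership function and the conversion scale of Table 10.2 can be sketched in Python as follows. The dictionary keys, function names, and the three-criterion example are assumptions made for illustration; exact fractions are used so that the reciprocal values can be compared directly with the table.

```python
from fractions import Fraction as Fr

# Triangular fuzzy numbers (l, m, u) from Table 10.2; the short keys are assumed labels.
SCALE = {
    "equally": (Fr(1), Fr(1), Fr(1)),
    "weakly": (Fr(2, 3), Fr(1), Fr(3, 2)),
    "fairly strongly": (Fr(3, 2), Fr(2), Fr(5, 2)),
    "very strongly": (Fr(5, 2), Fr(3), Fr(7, 2)),
    "absolutely": (Fr(7, 2), Fr(4), Fr(9, 2)),
}


def reciprocal(tfn):
    """Reciprocal of a triangular fuzzy number: (1/u, 1/m, 1/l)."""
    l, m, u = tfn
    return (1 / u, 1 / m, 1 / l)


def membership(x, tfn):
    """Triangular membership value of x, between 0 and 1."""
    l, m, u = tfn
    if x < l or x > u:
        return Fr(0)
    if x <= m:
        return Fr(1) if m == l else (x - l) / (m - l)
    return (u - x) / (u - m)


def comparison_matrix(n, judgements):
    """Build an n x n fuzzy pairwise matrix from {(i, j): label} given for i < j."""
    matrix = [[SCALE["equally"]] * n for _ in range(n)]
    for (i, j), label in judgements.items():
        matrix[i][j] = SCALE[label]
        matrix[j][i] = reciprocal(SCALE[label])
    return matrix


# Hypothetical judgements for three criteria.
A = comparison_matrix(3, {(0, 1): "weakly", (0, 2): "fairly strongly", (1, 2): "weakly"})
print(A[2][0])                                      # reciprocal of "fairly strongly"
print(membership(Fr(7, 4), SCALE["fairly strongly"]))
```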
Methodology to Select Human Reliability Analysis Technique Chapter | 10
The fuzzy extent analysis method (Chang, 1996; Zhu et al., 1999) is used to estimate the final weights of criteria and sub-criteria. The stepwise methodology for estimating the final weights (Chang, 1996) is outlined below.
Let $X = \{x_1, x_2, \ldots, x_n\}$ be an article set and $U = \{u_1, u_2, \ldots, u_m\}$ be an objective set (Chang, 1996). Each article is taken in turn and extent analysis is performed for each objective $g_i$. The $m$ extent analysis values for each article are thus obtained and denoted by
$$M_{g_i}^{1}, M_{g_i}^{2}, \ldots, M_{g_i}^{m}, \quad i = 1, 2, \ldots, n,$$
where all $M_{g_i}^{j}$ ($j = 1, 2, \ldots, m$) are triangular fuzzy numbers. The steps of the extent analysis method for fuzzy-AHP are as follows.
Step 1: Construction of the fuzzy AHP comparison matrix
The linguistic fuzzy scale is used to construct the matrix. The standard scale for triangular fuzzy conversion is shown in Table 10.2. Using triangular fuzzy numbers, the pairwise comparison evaluation matrix $\tilde{A} = (\tilde{a}_{ij})_{n \times m}$ is constructed, where $\tilde{a}_{ij} = (l_{ij}, m_{ij}, u_{ij})$ is the relative importance of the $i$th element over the $j$th element in the pairwise comparison, and $l_{ij}$, $m_{ij}$, and $u_{ij}$ are the lower, middle, and upper values of $\tilde{a}_{ij}$, respectively. The entries satisfy the reciprocal condition
$$l_{ij} = \frac{1}{u_{ji}}, \quad m_{ij} = \frac{1}{m_{ji}}, \quad u_{ij} = \frac{1}{l_{ji}},$$
so that, for example, the reciprocal of (1.50, 2, 2.5) is (0.40, 0.50, 0.667), as used in the Appendix.
Step 2: Calculation of the value of the fuzzy synthetic extent $S_i$ with respect to the $i$th criterion
The formula for $S_i$ is
$$S_i = \sum_{j=1}^{m} M_{g_i}^{j} \otimes \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1} \tag{10.1}$$
Here $S_i$ is the fuzzy synthetic extent and $\otimes$ denotes the multiplication of triangular fuzzy numbers. To obtain $\sum_{j=1}^{m} M_{g_i}^{j}$, the fuzzy addition of the $m$ extent analysis values in row $i$ of the matrix is performed:
$$\sum_{j=1}^{m} M_{g_i}^{j} = \left( \sum_{j=1}^{m} l_j, \; \sum_{j=1}^{m} m_j, \; \sum_{j=1}^{m} u_j \right) \tag{10.2}$$
The remaining factor in Eq. (10.1) is the inverse of the double sum, $\left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1}$.
FIGURE 10.1 The intersection between M1 and M2
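To evaluate the inverse term, first compute the fuzzy sum over all entries of the matrix:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} = \left( \sum_{i=1}^{n} l_i, \; \sum_{i=1}^{n} m_i, \; \sum_{i=1}^{n} u_i \right) \tag{10.3}$$
where $l_i$, $m_i$, and $u_i$ are the row sums obtained from Eq. (10.2). The inverse of this vector is
$$\left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1} = \left( \frac{1}{\sum_{i=1}^{n} u_i}, \; \frac{1}{\sum_{i=1}^{n} m_i}, \; \frac{1}{\sum_{i=1}^{n} l_i} \right) \tag{10.4}$$
The value of the fuzzy synthetic extent with respect to the $i$th article then follows from Eq. (10.1):
$$S_i = \sum_{j=1}^{m} M_{g_i}^{j} \otimes \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} M_{g_i}^{j} \right]^{-1}$$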
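As a worked illustration (added here, not part of the original derivation), Eqs. (10.2) to (10.4) can be applied to the Adequacy (AD) row of the between-criteria comparison matrix given in the Appendix. The totals below are computed from those Appendix entries and the result is rounded to three decimals.

```latex
\begin{align*}
\sum_{j=1}^{4} M_{g_{AD}}^{j}
  &= (1,1,1) \oplus (1.50,2,2.5) \oplus (0.667,1,1.50) \oplus (0.667,1,1.50)
   = (3.834,\ 5,\ 6.5) \\
\sum_{i=1}^{4} \sum_{j=1}^{4} M_{g_i}^{j}
  &= (13.802,\ 17,\ 21.334) \\
S_{AD}
  &= (3.834,\ 5,\ 6.5) \otimes
     \left( \frac{1}{21.334},\ \frac{1}{17},\ \frac{1}{13.802} \right)
   \approx (0.180,\ 0.294,\ 0.471)
\end{align*}
```

Note that the lower value is divided by the largest total and the upper value by the smallest, so the ordering lower ≤ middle ≤ upper is preserved in $S_{AD}$.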
Step 3: Estimation of the sets of weight values of the fuzzy AHP
To find approximations for the sets of weight values under each criterion, a principle of comparison for fuzzy numbers is needed (Chang, 1996). The degree of possibility of $M_1 = (l_1, m_1, u_1) \geq M_2 = (l_2, m_2, u_2)$ is defined as
$$V(M_1 \geq M_2) = \sup_{x \geq y} \left[ \min\left( \mu_{M_1}(x), \mu_{M_2}(y) \right) \right]$$
with
$$V(M_1 \geq M_2) = 1 \ \text{if} \ m_1 \geq m_2, \qquad V(M_2 \geq M_1) = \operatorname{hgt}(M_1 \cap M_2) = \mu_{M_1}(d) \tag{10.5}$$
where $d$ is the ordinate of the highest intersection point $D$ between $\mu_{M_1}$ and $\mu_{M_2}$, as presented in Fig. 10.1 (Chang, 1996). The above equation can be equivalently expressed as
$$V(M_2 \geq M_1) = \operatorname{hgt}(M_1 \cap M_2) = \mu_{M_1}(d) =
\begin{cases}
1, & \text{if } m_2 \geq m_1 \\
0, & \text{if } l_1 \geq u_2 \\
\dfrac{l_1 - u_2}{(m_2 - u_2) - (m_1 - l_1)}, & \text{otherwise}
\end{cases} \tag{10.6}$$
Fig. 10.1 illustrates Eq. (10.6). To compare $M_1$ and $M_2$, both $V(M_1 \geq M_2)$ and $V(M_2 \geq M_1)$ are needed.
Step 4: Calculation of the sets of weight values of the fuzzy AHP (Chang, 1996)
The degree of possibility for a convex fuzzy number $M$ to be greater than $k$ convex fuzzy numbers $M_i$ ($i = 1, 2, \ldots, k$) is
$$V(M \geq M_1, \ldots, M_k) = V[(M \geq M_1) \text{ and } (M \geq M_2) \text{ and } \ldots \text{ and } (M \geq M_k)] = \min_{i} V(M \geq M_i), \quad i = 1, 2, \ldots, k \tag{10.7}$$
Assume that
$$d'(A_i) = \min_{k} V(S_i \geq S_k), \quad k = 1, 2, \ldots, n; \ k \neq i \tag{10.8}$$
Then the weight vector is given by
$$W' = \left( d'(A_1), d'(A_2), \ldots, d'(A_n) \right)^{T} \tag{10.9}$$
where $A_i$ ($i = 1, \ldots, n$) are the $n$ elements. Through normalization, that is, $d(A_i) = d'(A_i) / \sum_{k=1}^{n} d'(A_k)$, the normalized weight vector is
$$W = \left( d(A_1), d(A_2), \ldots, d(A_n) \right)^{T} \tag{10.10}$$
where $W$ is a non-fuzzy number.
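Steps 2 to 4 above can be sketched as a short Python routine. This is a minimal sketch under stated assumptions, not the authors' code: the function names (`synthetic_extents`, `possibility`, `extent_analysis_weights`) are introduced here for illustration, and the input is the between-criteria comparison matrix (AD, CS, ET, EC) copied from the Appendix.

```python
from typing import List, Sequence, Tuple

TFN = Tuple[float, float, float]  # triangular fuzzy number (l, m, u)


def synthetic_extents(matrix: Sequence[Sequence[TFN]]) -> List[TFN]:
    """Fuzzy synthetic extent S_i of every row, Eqs. (10.1)-(10.4)."""
    # Eq. (10.2): component-wise fuzzy sum of each row.
    rows = [tuple(sum(cell[k] for cell in row) for k in range(3)) for row in matrix]
    # Eq. (10.3): component-wise sum over all rows.
    tot_l, tot_m, tot_u = (sum(r[k] for r in rows) for k in range(3))
    # Eqs. (10.4) and (10.1): multiply by (1/sum u, 1/sum m, 1/sum l).
    return [(l / tot_u, m / tot_m, u / tot_l) for l, m, u in rows]


def possibility(m2: TFN, m1: TFN) -> float:
    """Degree of possibility V(M2 >= M1), Eq. (10.6)."""
    l1, mid1, _ = m1
    _, mid2, u2 = m2
    if mid2 >= mid1:
        return 1.0
    if l1 >= u2:
        return 0.0
    return (l1 - u2) / ((mid2 - u2) - (mid1 - l1))


def extent_analysis_weights(matrix: Sequence[Sequence[TFN]]) -> List[float]:
    """Normalized weight vector W of Eq. (10.10)."""
    s = synthetic_extents(matrix)
    n = len(s)
    # Eq. (10.8): d'(A_i) = min over k != i of V(S_i >= S_k).
    d = [min(possibility(s[i], s[k]) for k in range(n) if k != i) for i in range(n)]
    total = sum(d)
    return [x / total for x in d]  # normalization, Eqs. (10.9)-(10.10)


# Between-criteria comparison matrix from the Appendix (rows/columns: AD, CS, ET, EC).
ONE = (1.0, 1.0, 1.0)
EQ = (0.667, 1.0, 1.50)
HI = (1.50, 2.0, 2.5)
LO = (0.40, 0.50, 0.667)
CRITERIA_MATRIX = [
    [ONE, HI, EQ, EQ],   # AD
    [LO, ONE, LO, EQ],   # CS
    [EQ, HI, ONE, ONE],  # ET
    [EQ, EQ, ONE, ONE],  # EC
]

if __name__ == "__main__":
    for name, w in zip(["AD", "CS", "ET", "EC"], extent_analysis_weights(CRITERIA_MATRIX)):
        print(f"{name}: {w:.4f}")
```

Applied to this matrix, the sketch gives weights of approximately 0.317, 0.131, 0.317, and 0.235, which agrees with the normalized criteria weights listed in the Appendix.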
10.2.2 Neural network modeling for selection and ranking of alternatives
The weights of criteria (and sub-criteria) obtained through fuzzy-AHP can be utilized to model the neural network. The principle of the neural network, an artificial replication of the human brain, was developed by McCulloch and Pitts in 1943 (Pratihar, 2013). Artificial neural networks consist of three kinds of layers, namely an input layer, hidden layer(s), and an output layer, and are also known as multilayer neural networks (Rajpal et al., 2006). The number of hidden layers can be increased depending on the complexity of the problem. Each layer has a number of neurons, and the layers are interconnected by means of weights provided by the analyst. The output of one layer becomes the input of the next layer through a transfer function. Transfer functions map the weighted input of a neuron to its output and, when nonlinear (such as the sigmoid), introduce nonlinearity into the network. Every neuron shares its information with other neurons to give the final output (Kumar and Roy, 2010). Here a feed-forward multilayer neural network is used to provide the final ranking of the HRA models, and the weights estimated through fuzzy-AHP are utilized in the ANN model. The steps (Pratihar, 2013) to obtain the final results using the ANN model are explained below.
Consider an ANN with one input layer (I), hidden layer(s) (H), and an output layer (O), and let the number of neurons in the input, hidden, and output layers be denoted by $M$, $N$, and $P$, respectively. The input and output of the first neuron in each layer are denoted by $II_1$, $HI_1$, $OI_1$ and $IO_1$, $HO_1$, $OO_1$, respectively. For the $i$th, $j$th, and $k$th neurons of the input, hidden, and output layers, $i = 1, 2, \ldots, M$; $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, P$. The steps for the development of the ANN model are as follows.
Step 1: Calculation of the output of the input layer $IO_i$ (assuming a linear transfer function)
The input and output of each input neuron are
$$II_i = IO_i = \frac{1}{\text{number of alternatives chosen}}$$
Step 2: Calculation of the input of the hidden layer(s) ($HI_j$)
The hidden layer represents the criteria selected for the analysis. Suppose there are $n$ criteria $C_1, C_2, \ldots, C_n$, and let the weights of the selected criteria estimated through fuzzy-AHP be denoted by $w_{C_1}, w_{C_2}, \ldots, w_{C_n}$. The input of the hidden layer is
$$HI_j = \sum_{i=1}^{M} IO_i \times w_{C_j} + b \times w_b \tag{10.11}$$
where $b$ is the value of the bias neuron and $w_b$ is the weight of the bias neuron.
Step 3: Calculation of the output of the hidden layer $HO_j$ (assuming a sigmoid transfer function)
$$HO_j = \frac{1}{1 + e^{-HI_j}} \tag{10.12}$$
Note: There can be more than one hidden layer, depending on the criteria and their sub-criteria. In that case, the output of the first hidden layer multiplied by the sub-criteria weights, plus the bias term [as in Eq. (10.11)], is the input of the second hidden layer.
Step 4: Input of the output layer $OI_k$
The output neurons denote the alternatives (i.e., the HRA methods), and the weight connecting criterion $j$ to alternative $k$ is denoted by $w_{jk}$. For example, $w_{11}$ denotes the weight connecting criterion 1 to alternative 1. The input of the output layer is
$$OI_k = \sum_{j=1}^{N} HO_j \times w_{jk} + b \times w_b \tag{10.13}$$
where the sum runs over the $N$ neurons of the last hidden layer ($N = n$ when the criteria form the only hidden layer).
Step 5: Estimation of the output of the output layer $OO_k$ (assuming a sigmoid transfer function)
$$OO_k = \frac{1}{1 + e^{-OI_k}} \tag{10.14}$$
The final values of $OO_k$ provide the final weights of the alternatives and can be used for ranking the HRA methods. The HRA method with the highest rank can be chosen as the most suitable method for a particular case. To illustrate the methodology, a case study of a space station's environmental control and life support system (ECLSS) is presented in the next section.
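As a small numerical illustration of Eqs. (10.11) and (10.12) (added here), consider the first hidden neuron of a network with $M = 4$ input neurons, using the Adequacy weight $w_{C_1} = 0.3173$ reported in the Appendix. The bias term $b \times w_b$ is set to zero purely as an assumption for this illustration, since its value is not reported in the chapter.

```latex
\begin{align*}
IO_i &= \tfrac{1}{4} = 0.25, \qquad i = 1, \ldots, 4 \\
HI_1 &= \sum_{i=1}^{4} IO_i \times w_{C_1} + b \times w_b
      = 4 \times 0.25 \times 0.3173 + 0 = 0.3173 \\
HO_1 &= \frac{1}{1 + e^{-0.3173}} \approx 0.579
\end{align*}
```

Because the input-layer outputs sum to one, the input of each first-hidden-layer neuron reduces to its criterion weight plus the bias term.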
10.3 Case study of space station¹
Maintenance of the ECLSS of orbital space stations (OSS) is emerging as a multifaceted and technologically intensive sector (Sharma and Rai, 2019). This level of intricacy has led to a work environment in which the human-machine interface and human reliability are critical factors of performance, especially for maintenance-related tasks. Various methodologies for performing risk assessment considering human factors have been presented in the literature, but they are frequently tailored for the aviation, nuclear, automotive, and process industries (Yang et al., 2012). A methodology for assessing a suitable HRA technique for OSS ECLSS maintenance, considering the techniques proposed for other industries, is presented here. The selection of alternative HRA methods can be based on a variety of factors, including:
1) how people act and react in space stations,
2) expectations based on NASA standards,
3) factors that influence the occurrence of human errors due to tasks, tools, environment, workplace, support, training, and procedure,
4) type and availability of data,
5) how the space station views risk and reliability,
6) types of emergencies, contingencies, and routine tasks, and
7) complexity of tasks.
Based on the requirements of the analysis, the following four criteria are taken into consideration:
1) Adequacy: This criterion covers the application scope of the HRA techniques.
2) Costs: This criterion is set because of time and financial limitations.
3) Effectiveness: This criterion addresses the knowledge background, technical support, and complexity of the HRA application.
4) Efficacy: This criterion is decided with consideration of the accuracy of the outputs.
For each selected criterion, three sub-criteria and their significance are explained in Table 10.3.
¹ The paper has been published by Springer International Publishing AG, part of Springer Nature 2019, in R. L. Boring (Ed.): AHFE 2018, AISC 778, pp. 128–137, 2019. https://doi.org/10.1007/978-3-319-94391-6_13
TABLE 10.3 Details of criteria and sub-criteria
Criteria | Sub-criteria
Adequacy | AD1: Applicability. Covers the domain of application, process phases, non-routine tasks, and working conditions.
Adequacy | AD2: Historical database. Covers the available data for various tasks.
Adequacy | AD3: Critical areas/detailed task analysis. Covers special tasks such as extravehicular activity (EVA), docking, maintenance and repair tasks, descent, ascent, etc.
Cost | CS1: Direct cost. Covers the cost of licenses, material, documents of a new tool, etc.
Cost | CS2: Time for data collection and analysis.
Cost | CS3: Frequency of required application.
Effectiveness | ET1: Complexity of the method.
Effectiveness | ET2: Education, skill, training, and experience.
Effectiveness | ET3: Type of material support. Covers the requirement of datasheets, software, worksheets, and records.
Efficacy | EC1: Qualitative and quantitative outputs.
Efficacy | EC2: Clarity of results.
Efficacy | EC3: Level of output details obtained.
The four HRA methods considered for evaluation are THERP, CREAM, NARA, and SPAR-H. At present, NASA is the largest organization working on the International Space Station (ISS) project, and according to the PRA guide published by NASA (Stamatelatos et al., 2011), the four selected HRA models are more suitable candidates for aerospace applications such as space shuttles and space stations than other HRA models. In addition, the selected methods come from different generations, which provides a more holistic view of the analysis. The AHP hierarchical model is shown in Fig. 10.2.
10.3.1 Fuzzy AHP Weights Estimation
The fuzzy-AHP extent analysis method explained in Section 10.2.1 is used to estimate the final weights of criteria and sub-criteria. The final normalized weights of the sub-criteria with respect to the criteria, and of the alternatives with respect to the sub-criteria, are given in Table 10.4 and Table 10.5, respectively. The comparison matrices used to estimate the weights are provided in the Appendix to this chapter. The obtained weights are fed into the neural network model to achieve the final ranking of the alternatives, i.e., the HRA models.
FIGURE 10.2 Schematic view of criteria, subcriteria, and alternatives
TABLE 10.4 Final normalized weights of sub-criteria w.r.t. criteria
Sub-criteria | Criteria | Normalized weights
[AD1, AD2, AD3] | AD | [0.4692, 0.0615, 0.4692]
[CS1, CS2, CS3] | CS | [0.4337, 0.3628, 0.2035]
[ET1, ET2, ET3] | ET | [0.5628, 0.3227, 0.1144]
[EC1, EC2, EC3] | EC | [0.4507, 0.2256, 0.3237]
10.3.2 ANN Model for ECLSS
In the ANN model for ECLSS, four layers are considered: an input layer with four neurons, an output layer with four neurons, and two hidden layers with 4 and 12 neurons, respectively. The first hidden layer represents the main criteria and the second hidden layer represents the sub-criteria. The connectivity of the neurons is shown in Fig. 10.3. The weights estimated in the previous section with the help of fuzzy-AHP are used as the connecting weights between the neurons. The output of each hidden and output neuron is estimated with the log-sigmoid transfer function. Each layer is described below.
TABLE 10.5 Final normalized weights of alternatives w.r.t. sub-criteria
Between alternatives | Sub-criteria | Normalized weights
[THERP, CREAM, NARA, SPAR-H] | AD1 | [0.3164, 0.1854, 0.1141, 0.3842]
[THERP, CREAM, NARA, SPAR-H] | AD2 | [0.3289, 0.1689, 0.0718, 0.4304]
[THERP, CREAM, NARA, SPAR-H] | AD3 | [0.3219, 0.1942, 0.0575, 0.4264]
[THERP, CREAM, NARA, SPAR-H] | CS1 | [0.1531, 0.233, 0.19, 0.4239]
[THERP, CREAM, NARA, SPAR-H] | CS2 | [0.2814, 0.0718, 0.2163, 0.4305]
[THERP, CREAM, NARA, SPAR-H] | CS3 | [0.0718, 0.1689, 0.3289, 0.4304]
[THERP, CREAM, NARA, SPAR-H] | ET1 | [0.0216, 0.0216, 0.4784, 0.4784]
[THERP, CREAM, NARA, SPAR-H] | ET2 | [0.1141, 0.1854, 0.3164, 0.3842]
[THERP, CREAM, NARA, SPAR-H] | ET3 | [0.3888, 0.0592, 0.1132, 0.4389]
[THERP, CREAM, NARA, SPAR-H] | EC1 | [0.2346, 0.3173, 0.1308, 0.3173]
[THERP, CREAM, NARA, SPAR-H] | EC2 | [0.2814, 0.0718, 0.2163, 0.4305]
[THERP, CREAM, NARA, SPAR-H] | EC3 | [0.3468, 0.3468, 0.0618, 0.2446]
FIGURE 10.3 Proposed neural network model
Input layer: Four neurons in the input layer represent the four HRA models under investigation i.e. THERP, CREAM, NARA and SPAR-H. In this layer equal weights are given to each neuron (1/ no. of alternatives). The logic behind allocating equal weights (with bias value equal to 1) is the consideration that all models are of equal rank. A linear transfer function is taken for the output of input layer. This output goes to the first hidden layer as input of the layer.
TABLE 10.6 Final ranking of models
HRA model | Final normalized weights | Ranking
THERP | 0.2514 | Rank 2
CREAM | 0.243 | Rank 4
NARA | 0.2464 | Rank 3
SPAR-H | 0.2592 | Rank 1
First hidden layer: The neurons of the first hidden layer represent the four main criteria considered for the comparison between the models. The outputs of the input layer, combined with the criteria weights, enter the first hidden layer, which generates its output using the log-sigmoid transfer function. This output becomes the input of the second hidden layer.
Second hidden layer: This layer has 12 neurons, which represent the sub-criteria of the respective main criteria. Each neuron of the first hidden layer (main criterion) is connected to its respective sub-criteria neurons in the second hidden layer. The output of the first hidden layer, combined with the respective sub-criteria weights, enters the second hidden layer. The output of this layer is produced using the log-sigmoid function and becomes the input of the output layer.
Output layer: The output layer neurons again represent the four HRA models. All 12 neurons (i.e., sub-criteria) of the second hidden layer are connected to each output layer neuron. The input calculated with Eq. (10.13) enters the output layer, which produces the final output using the log-sigmoid transfer function. The output of each neuron is taken as the final weight of the respective HRA method, and the model with the maximum weight is considered the best HRA model for the OSS maintenance tasks.
The final ranking obtained from the neural network model is tabulated in Table 10.6 and illustrated in Fig. 10.4, which shows the weight distribution of each model. The SPAR-H model is assigned the highest weight (0.2592) in the analysis. Although SPAR-H was developed primarily for nuclear power plants, its application in aerospace settings can be valuable. The model is easy to use and understand. Its material support, such as datasheets and worksheets, makes the model user-friendly and increases its effectiveness for the particular task. The details available in the historical data make it economical in terms of data collection and repeated use. The model also covers different tasks such as extravehicular activities (EVA) and docking. Since it does not require very high skill and training, its direct cost is moderately low. It is weaker in providing detailed task output, but the clarity of the output obtained is considered very high. These features render the model the most suitable for OSS ECLSS maintenance tasks.
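The layer-by-layer description above can be sketched as a forward pass in Python. This is a sketch under stated assumptions rather than the authors' model: the weights are copied from Table 10.4, Table 10.5, and the between-criteria matrix in the Appendix, but the bias term $b \times w_b$ is not reported in the chapter and is set to zero here, and dividing the outputs by their sum to obtain normalized scores is also an assumption. The names `rank_alternatives` and `BIAS_TERM` are introduced only for illustration.

```python
import math

ALTERNATIVES = ["THERP", "CREAM", "NARA", "SPAR-H"]

# Criteria weights from the between-criteria matrix in the Appendix.
CRITERIA_W = {"AD": 0.3173, "CS": 0.1308, "ET": 0.3173, "EC": 0.2346}

# Table 10.4: sub-criteria weights within each criterion.
SUB_W = {
    "AD": [0.4692, 0.0615, 0.4692],
    "CS": [0.4337, 0.3628, 0.2035],
    "ET": [0.5628, 0.3227, 0.1144],
    "EC": [0.4507, 0.2256, 0.3237],
}

# Table 10.5: weights of [THERP, CREAM, NARA, SPAR-H] for each sub-criterion,
# in the order AD1..AD3, CS1..CS3, ET1..ET3, EC1..EC3.
ALT_W = [
    [0.3164, 0.1854, 0.1141, 0.3842],
    [0.3289, 0.1689, 0.0718, 0.4304],
    [0.3219, 0.1942, 0.0575, 0.4264],
    [0.1531, 0.233, 0.19, 0.4239],
    [0.2814, 0.0718, 0.2163, 0.4305],
    [0.0718, 0.1689, 0.3289, 0.4304],
    [0.0216, 0.0216, 0.4784, 0.4784],
    [0.1141, 0.1854, 0.3164, 0.3842],
    [0.3888, 0.0592, 0.1132, 0.4389],
    [0.2346, 0.3173, 0.1308, 0.3173],
    [0.2814, 0.0718, 0.2163, 0.4305],
    [0.3468, 0.3468, 0.0618, 0.2446],
]

BIAS_TERM = 0.0  # assumption: b * w_b is not reported in the chapter


def sigmoid(x: float) -> float:
    """Log-sigmoid transfer function, Eqs. (10.12) and (10.14)."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_alternatives(bias_term: float = BIAS_TERM):
    """Forward pass 4-4-12-4; returns (name, normalized score), best first."""
    n_alt = len(ALTERNATIVES)
    io = [1.0 / n_alt] * n_alt  # input layer, linear transfer (Step 1)
    ho2 = []  # outputs of the 12 sub-criteria neurons
    for crit, w_c in CRITERIA_W.items():
        # First hidden layer: Eq. (10.11) followed by Eq. (10.12).
        ho1 = sigmoid(sum(x * w_c for x in io) + bias_term)
        # Second hidden layer: each criterion feeds only its own sub-criteria.
        for w_s in SUB_W[crit]:
            ho2.append(sigmoid(ho1 * w_s + bias_term))
    # Output layer: Eq. (10.13) followed by Eq. (10.14).
    oo = [
        sigmoid(sum(h * row[k] for h, row in zip(ho2, ALT_W)) + bias_term)
        for k in range(n_alt)
    ]
    total = sum(oo)  # assumption: scores normalized to sum to one
    scores = {name: o / total for name, o in zip(ALTERNATIVES, oo)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for rank, (name, score) in enumerate(rank_alternatives(), start=1):
        print(f"Rank {rank}: {name} ({score:.4f})")
```

With the bias term set to zero, this sketch orders the alternatives as SPAR-H, THERP, NARA, CREAM, the same order as Table 10.6. The numerical scores are not expected to match the table, since the bias values used by the authors are not given.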
FIGURE 10.4 Weight distribution of each model
THERP is placed at Rank 2 (0.2514) in the analysis. Since THERP is a task-dominant model, it is quite applicable to various aerospace programs. It is also one of the older models used by NASA for its various space programs, owing to the availability of the required software and records. The efficacy level of this model is much better than that of the other three models. It provides both qualitative and quantitative outputs with good clarity and appropriate detail. Its available historical database equips the analyst adequately to analyze the critical tasks related to maintenance and repair. The time consumed for data collection is low and the clarity of the outputs is good. The need for superior skill and advanced training is a serious limitation of this model, as it leads to an enormous direct cost. Another limitation compared with SPAR-H is its complexity, which makes it uneconomical to use repeatedly for the required application. As a whole, this model is considered less effective than SPAR-H for OSS ECLSS maintenance tasks.
NARA is ranked third (0.2464) in the analysis. NARA was fundamentally developed for nuclear plants but can be applied to space settings. The model is easy to use and understand and does not require sophisticated skill and training. The time consumed for data collection and analysis is low, and the model provides clarity in the results. The model is not very expensive to use. However, the unavailability of historical data and of critical task analysis renders the model weaker than SPAR-H and THERP. The model is also unable to provide results with exhaustive details.
The analysis places CREAM (0.243) last among the four models. CREAM is based on human cognition. It provides both qualitative and quantitative outputs adequately. Since it has a proper historical database, it can be applied to different task analyses of space programs. Its direct cost is low, but the cost of repeated application and the cost of data collection and analysis make the model uneconomical. The model is not found to be as effective as the other three models. CREAM is a generic model that can be applied in every field, but it lacks clarity in results and specific output for the particular OSS ECLSS maintenance task. The results show that the combination of THERP and SPAR-H can offer the most effective results for OSS ECLSS maintenance.
10.4 Conclusion and future scope
The chapter briefly explains different methods available in the literature for HRA in various industrial domains. Several methods are available, but identifying the best HRA method for a particular domain is a challenging task. In view of this, the chapter presents a hybrid methodology that uses fuzzy AHP and ANN theory to choose a suitable HRA method for a particular case. By identifying the best HRA method with the help of the presented methodology, one can achieve more effective and realistic results for complex and critical repairable systems. Future work may consider combinations of other advanced MCDM techniques for weight estimation. The efficacy of the methodology could also be enhanced by using a hybrid of advanced MCDM and machine learning tools, such as the adaptive neuro-fuzzy inference system (ANFIS), for alternative selection problems rather than a simple ANN structure.
Exercise
Select a case study of a system from aviation, railways, nuclear, or any other critical sector. Choose different HRA methods available in the literature that are relevant to the analysis. Then decide the optimization criteria and sub-criteria, which should represent the requirements of the analysis. After selecting the criteria and the HRA methods, establish the best HRA method for the case by following the steps for fuzzy-AHP weight calculation and ANN formation.
Appendix
Comparison matrices used for estimation of weights in Section 10.3.1

Between criteria
 | AD | CS | ET | EC | Normalized weights
AD | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | (0.667,1,1.50) | 0.3173
CS | (0.40,0.50,0.667) | (1,1,1) | (0.40,0.50,0.667) | (0.667,1,1.50) | 0.1308
ET | (0.667,1,1.50) | (1.50,2,2.5) | (1,1,1) | (1,1,1) | 0.3173
EC | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.2346

Between sub-criteria w.r.t. criteria
w.r.t. AD | AD1 | AD2 | AD3 | Normalized weights
AD1 | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.4692
AD2 | (0.40,0.50,0.667) | (1,1,1) | (0.40,0.50,0.667) | 0.0615
AD3 | (0.667,1,1.50) | (1.50,2,2.5) | (1,1,1) | 0.4692

w.r.t. CS | CS1 | CS2 | CS3 | Normalized weights
CS1 | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.4337
CS2 | (0.40,0.50,0.667) | (1,1,1) | (1.50,2,2.5) | 0.3628
CS3 | (0.667,1,1.50) | (0.40,0.50,0.667) | (1,1,1) | 0.2035

w.r.t. ET | ET1 | ET2 | ET3 | Normalized weights
ET1 | (1,1,1) | (0.667,1,1.50) | (1.50,2,2.5) | 0.5628
ET2 | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.3227
ET3 | (0.40,0.50,0.667) | (1,1,1) | (1,1,1) | 0.1144

w.r.t. EC | EC1 | EC2 | EC3 | Normalized weights
EC1 | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.4507
EC2 | (0.40,0.50,0.667) | (1,1,1) | (0.667,1,1.50) | 0.2256
EC3 | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | 0.3237

Between alternatives w.r.t. sub-criteria
w.r.t. AD1 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (0.667,1,1.50) | (1.50,2,2.5) | (0.667,1,1.50) | 0.3164
CREAM | (0.667,1,1.50) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.1854
NARA | (0.40,0.50,0.667) | (0.667,1,1.50) | (1,1,1) | (0.40,0.50,0.667) | 0.1141
SPAR-H | (0.667,1,1.50) | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | 0.3842

w.r.t. AD2 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (0.667,1,1.50) | (1.50,2,2.5) | (1,1,1) | 0.3289
CREAM | (0.667,1,1.50) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.1689
NARA | (0.40,0.50,0.667) | (0.667,1,1.50) | (1,1,1) | (0.40,0.50,0.667) | 0.0718
SPAR-H | (1,1,1) | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | 0.4304

w.r.t. AD3 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | (1,1,1) | 0.3219
CREAM | (0.40,0.50,0.667) | (1,1,1) | (1.50,2,2.5) | (0.40,0.50,0.667) | 0.1942
NARA | (0.667,1,1.50) | (0.40,0.50,0.667) | (1,1,1) | (0.40,0.50,0.667) | 0.0575
SPAR-H | (1,1,1) | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | 0.4264

w.r.t. CS1 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.1531
CREAM | (1,1,1) | (1,1,1) | (0.667,1,1.50) | (0.667,1,1.50) | 0.233
NARA | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | (0.40,0.50,0.667) | 0.19
SPAR-H | (1.50,2,2.5) | (0.667,1,1.50) | (1.50,2,2.5) | (1,1,1) | 0.4239

w.r.t. CS2 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.2814
CREAM | (0.40,0.50,0.667) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.0718
NARA | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.2163
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | (1,1,1) | 0.4305

w.r.t. CS3 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | (0.40,0.50,0.667) | 0.0718
CREAM | (0.667,1,1.50) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.1689
NARA | (1.50,2,2.5) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.3289
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | (1,1,1) | 0.4304

w.r.t. ET1 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1,1,1) | (0.40,0.50,0.667) | (0.40,0.50,0.667) | 0.0216
CREAM | (1,1,1) | (1,1,1) | (0.40,0.50,0.667) | (0.40,0.50,0.667) | 0.0216
NARA | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | (0.667,1,1.50) | 0.4784
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (0.667,1,1.50) | (1,1,1) | 0.4784

w.r.t. ET2 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | (0.40,0.50,0.667) | 0.1141
CREAM | (0.667,1,1.50) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.1854
NARA | (1.50,2,2.5) | (0.667,1,1.50) | (1,1,1) | (0.667,1,1.50) | 0.3164
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (0.667,1,1.50) | (1,1,1) | 0.3842

w.r.t. ET3 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1.50,2,2.5) | (1.50,2,2.5) | (0.40,0.50,0.667) | 0.3888
CREAM | (0.40,0.50,0.667) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.0592
NARA | (0.40,0.50,0.667) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.1132
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | (1,1,1) | 0.4389

w.r.t. EC1 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | 0.2346
CREAM | (0.667,1,1.50) | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.3173
NARA | (0.667,1,1.50) | (0.40,0.50,0.667) | (1,1,1) | (0.40,0.50,0.667) | 0.1308
SPAR-H | (1,1,1) | (0.667,1,1.50) | (1.50,2,2.5) | (1,1,1) | 0.3173

w.r.t. EC2 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.2814
CREAM | (0.40,0.50,0.667) | (1,1,1) | (0.667,1,1.50) | (0.40,0.50,0.667) | 0.0718
NARA | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.2163
SPAR-H | (1.50,2,2.5) | (1.50,2,2.5) | (1,1,1) | (1,1,1) | 0.4305

w.r.t. EC3 | THERP | CREAM | NARA | SPAR-H | Normalized weights
THERP | (1,1,1) | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.3468
CREAM | (1,1,1) | (1,1,1) | (1.50,2,2.5) | (0.667,1,1.50) | 0.3468
NARA | (0.40,0.50,0.667) | (0.40,0.50,0.667) | (1,1,1) | (1,1,1) | 0.0618
SPAR-H | (0.667,1,1.50) | (0.667,1,1.50) | (1,1,1) | (1,1,1) | 0.2446
References
Bell, J., Holroyd, J., 2009. Review of human reliability assessment methods. Health & Safety Laboratory 78.
Boring, R.L., 2012. Fifty Years of THERP and Human Reliability Analysis. Idaho National Laboratory (INL).
Celik, M., Er, I.D., Ozok, A.F., 2009. Application of fuzzy extended AHP methodology on shipping registry selection: the case of Turkish maritime industry. Expert Syst. Appl. 36 (1), 190–198.
Chang, D.-Y., 1996. Applications of the extent analysis method on fuzzy AHP. Eur. J. Oper. Res. 95 (3), 649–655.
Dhillon, Balbir S., 2007. Human Reliability and Error in Transportation Systems, 1st ed. Springer Science & Business Media, Springer-Verlag London.
Dhillon, Balbir S., 2014. Human Reliability, Error, and Human Factors in Power Generation. Springer.
Di Pasquale, V., Iannone, R., Miranda, S., Riemma, S., 2013. An overview of human reliability analysis techniques in manufacturing operations. Operations Management 221–240.
Di Pasquale, V., Miranda, S., Neumann, W.P., Setayesh, A., 2018. Human reliability in manual assembly systems: a systematic literature review. IFAC-PapersOnLine 51 (11), 675–680.
Everdij, M.H.C., Blom, H.A.P., 2008. Safety Methods Database. http://www.nlr.nl/documents/flyers/SATdb.pdf.
Gertman, David, Blackman, Harold, Marble, Julie, Byers, James, Smith, Curtis, 2005. The SPAR-H human reliability analysis method. US Nuclear Regulatory Commission 230, 35.
Griffith, C.D., Mahadevan, S., 2011. Inclusion of fatigue effects in human reliability analysis. Reliab. Eng. Syst. Safe. 96 (11), 1437–1447.
Hollnagel, Erik, 1998. Cognitive Reliability and Error Analysis Method (CREAM). Elsevier.
Jang, I., Kim, A.R., Jung, W., Seong, P.H., 2016. Study on a new framework of human reliability analysis to evaluate soft control execution error in advanced MCRs of NPPs. Ann. Nuclear Energy 91, 92–104.
Jung, Won D., Yoon, Wan C., Kim, J.W., 2001. Structured information analysis for human reliability analysis of emergency tasks in nuclear power plants. Reliab. Eng. Syst. Safe. 71 (1), 21–32.
Kirwan, B., 1994. A Guide to Practical Human Reliability Assessment. CRC Press.
Kumar, J., Roy, N., 2010. A hybrid method for vendor selection using neural network. Int. J. Comput. Appl. 11 (12), 35–40.
Lee, S.K., Mogi, G., Hui, K.S., 2013. A fuzzy analytic hierarchy process (AHP)/data envelopment analysis (DEA) hybrid model for efficiently allocating energy R&D resources: in the case of energy technologies against high oil prices. Renew. Sustain. Energy Rev. 21, 347–355. doi:10.1016/j.rser.2012.12.067.
Mosleh, Ali, Chang, Y.H., 2004. Model-based human reliability analysis: prospects and requirements. Reliab. Eng. Syst. Safe. 83 (2), 241–253.
Oliveira, L.N. de, Santos, I.J.A., Carvalho, P.V.R., 2017. A review of the evolution of human reliability analysis methods at nuclear industry.
Petruni, A., Giagloglou, E., Douglas, E., Geng, J., Leva, M.C., Demichela, M., 2019. Applying analytic hierarchy process (AHP) to choose a human factors technique: choosing the suitable human reliability analysis technique for the automotive industry. Safe. Sci. 119, 229–239.
Pratihar, Dilip Kumar, 2013. Soft Computing: Fundamentals and Applications. Alpha Science International, Ltd.
Rai, R.N., Sharma, G., 2017. Goodness-of-fit test for generalised renewal process. Int. J. Reliab. Safe. 11 (1–2), 116–131.
Rai, R.N., Nomesh, B., 2014. Availability based optimal maintenance policies in military aviation. Int. J. Perform. Eng. 10 (6), 641–648.
Rajpal, P.S., Shishodia, K.S., Sekhon, G.S., 2006. An artificial neural network for modeling reliability, availability and maintainability of a repairable system. Reliab. Eng. Syst. Safe. 91 (7), 809–819.
Sharma, G., Rai, R.N., 2019. Reliability modeling and analysis of environmental control and life support systems of space stations: a literature survey. Acta Astronautica 155 (February), 238–246. doi:10.1016/j.actaastro.2018.12.010.
Stamatelatos, M., Dezfuli, H., Apostolakis, G., Everline, C., Guarro, S., Mathias, D., Mosleh, A., Paulos, T., Riha, D., Smith, C., 2011. Probabilistic risk assessment procedures guide for NASA managers and practitioners.
Sträter, O., Dang, V., Kaufer, B., Daniels, A., 2004. On the way to assess errors of commission. Reliab. Eng. Syst. Safe. 83 (2), 129–138.
Swain, A.D., Guttmann, H.E., 1983. Handbook of Human-Reliability Analysis with Emphasis on Nuclear Power Plant Applications. Final Report. Sandia National Labs.
Tang, Y.C., Lin, T.W., 2011. Application of the fuzzy analytic hierarchy process to the lead-free equipment selection decision. Int. J. Bus. Syst. Res. 5 (1), 35–56.
Tanwar, M., Rai, R.N., Bolia, N., 2014. Imperfect repair modeling using Kijima type generalized renewal process. Reliab. Eng. Syst. Safe. 124, 24–31.
Wang, Y.-M., Luo, Y., Hua, Z., 2008. On the extent analysis method for fuzzy AHP and its applications. Eur. J. Oper. Res. 186 (2), 735–747.
Whaley, A.M., Kelly, D.L., Boring, R.L., Galyean, W.J., 2012. SPAR-H Step-by-Step Guidance. Idaho National Laboratory (INL).
Yang, Y., Liu, W., Kang, R., Zheng, W., 2012. The model framework of human reliability for complicated spaceflight mission. In: Proceedings of the IEEE 2012 Prognostics and System Health Management Conference (PHM-2012 Beijing). IEEE, pp. 1–5.
Zhu, K.-J., Jing, Y., Chang, D.-Y., 1999. A discussion on extent analysis method and applications of fuzzy AHP. Eur. J. Oper. Res. 116 (2), 450–456.
Abstract
This chapter presents a methodology that uses the fuzzy analytic hierarchy process (fuzzy-AHP) and artificial neural network (ANN) theory to select a suitable human reliability analysis (HRA) methodology for the maintenance of repairable systems. The methodology can be extended to industries dealing with complex and critical repairable systems. It consists of two modules: module 1 uses fuzzy-AHP-based pairwise comparisons of criteria to estimate the weights, and module 2 feeds the resulting fuzzy-AHP decision matrix into an ANN model to rank all selected HRA methodologies. The results yield the best HRA model for repairable systems maintenance, with scores that allow the performance of each HRA model to be compared. The methodology is explained using a space station as a case.
Keywords: Artificial neural network (ANN); Fuzzy-AHP; Human reliability analysis (HRA); Repairable systems
Chapter 11
Operation risk assessment of the main-fan installations of mines in gas and nongas conditions
G.I. Grozovskiy (a), G.D. Zadavin (b) and S.S. Parfenychev (c)
(a) Deputy Director General on Science, Doctor of Engineering, Professor, OJSC «Scientific Technical Centre (STC) «Industrial Safety», Moscow, Russia.
(b) Adviser to the Director General, Candidate of Engineering Science, OJSC «Scientific Technical Centre (STC) «Industrial Safety», Moscow, Russia.
(c) Junior Researcher, OJSC «Scientific Technical Centre (STC) «Industrial Safety». Master’s degree student of the Moscow Aviation Institute, Faculty № 3 «Control Systems, Informatics and Power Engineering», Department 307 «Digital Technologies and Information Systems», Moscow, Russia.
11.1 Introduction
The development of different approaches to risk assessment, ever-increasing knowledge, and experience with the practical application of new results show that success in this area is possible only if “designing for safety”, with careful consideration of all its specific details, is treated as a separate and independent field of science.
One of the main tasks in ensuring the safety of mining industry employees is to provide uninterrupted ventilation of underground mines using the main fan installation (MFI). Safety rules and regulations require that redundant units back up the main fan installations, but in reality some nongas mines have operated without redundant ventilators, as permitted by existing and past safety rules and regulations for nongas mines. Operating a main fan installation without a redundant unit held in reserve for a failure of the working unit can create explosive and fire-hazardous conditions in the mine, and the air can become unsuitable for breathing and potentially poisonous for the miners. On the other hand, in some cases, when no toxic gases are released from the rocks and natural ventilation is available, the main fan unit can be operated without a redundant unit at an acceptable level of risk for the mineworkers.
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00002-7 Copyright © 2021 Elsevier Inc. All rights reserved.
The use of main-fan installations without a redundant unit can be justified by assessing the risk of an accident when the fan unit fails. The largest number of failures in ventilation systems occurs in the ventilation unit itself. This is because the fan is the only element of the ventilation system that contains rotating parts, which can be exposed to significant loads and fail [1]. The failure of a ventilation system without a redundant unit is potentially hazardous because toxic gases may be released from the rocks and toxic emissions may come from vehicles with internal combustion engines, and as a consequence the mine air becomes unsuitable for the workers to breathe.
According to the failure tree and event tree (Fig. 11.1), the overall failure probability of the mine ventilation without a redundant installation, $P_V$, is
$$P_V = P_1 + P_2 + P_3, \quad \text{1/year}$$
where:
$P_1$ – failure likelihood of the MFI (main fan installation) system;
$P_2$ – failure likelihood due to a long-term power blackout;
$P_3$ – fire likelihood in the building of the MFI.
FIGURE 11.1 Fault tree for ventilation.
FIGURE 11.2 Event tree for poisoning.

Electric motor failure probability:
$$P_{EM} = 1 - (1 - P_2)(1 - P_3) = 1 - (1 - 0.29)(1 - 0.0039) = 0.29$$
Electrical equipment failure probability:
$$P_{EQ} = 1 - (1 - P_{EM})(1 - P_1) = 1 - (1 - 0.29)(1 - 0.0205) = 0.437$$
Mechanical part failure probability:
$$P_{MP} = 1 - (1 - P_4)(1 - P_5) = 1 - (1 - 0083)(1 - 0.0266) = 0.326$$
Ventilation system (MFI) failure probability:
$$P_{MFI} = 1 - (1 - P_{EQ})(1 - P_{MP}) = 1 - (1 - 0.437)(1 - 0.326) = 0.62, \quad \text{1/year}$$
Power blackout failure probability:
$$P_2 = p_1 \times p_2 \times p_3, \quad \text{1/year}$$
where:
$p_1$ – probability of failure of the power supplied from external sources;
$p_2$ – probability of delayed restoration of the external power supply;
$p_3$ – probability of failure of the automatic power supply.
$$P_2 = (10^{-3} + 10^{-3}) \times 10^{-1} \times 10^{-1} = 2 \times 10^{-5}, \quad \text{1/year}$$
Probability of fire ignition in the building:
$$P_3 = (p_1 + p_2 + p_3 + p_4) \times p_5, \quad \text{1/year}$$
where:
$p_1$ – probability of violation of the operation rules for electrical equipment;
$p_2$ – probability of careless handling of fire;
$p_3$ – probability of spontaneous combustion of combustible materials;
$p_4$ – probability of a lightning discharge;
$p_5$ – probability of failure of the fire extinguishing equipment.
$$P_3 = (10^{-3} + 10^{-3} + 10^{-4} + 10^{-6}) \times 10^{-3} = 2.2 \times 10^{-6}, \quad \text{1/year}$$
And then
$$P_V = 6.2 \times 10^{-1} + 2 \times 10^{-5} + 2.2 \times 10^{-6} = 6.2 \times 10^{-1}, \quad \text{1/year}$$
These probabilities are obtained under the condition that the other ventilation system operates normally.
The probability of poisoning is determined according to the failure and event trees (Fig. 11.1 and Fig. 11.2). The probability of event 3.2, “accident, possible poisoning of mine workers”, is:
$$P_o = P_V \times P_{TI} \times P_{SR} \times P_{NE} \times P_A, \quad \text{1/year}$$
where:
$P_V$ – the mine ventilation failure probability, 1/year;
$P_{TI}$ – the probability of staff error when transmitting information about the work of the GTG;
$P_{SR}$ – the probability of failure of the self-rescuer;
$P_{NE}$ – the probability of untimely evacuation due to staff error, 10−11;
$P_A$ – the probability of death in a mine atmosphere unsuitable for breathing,
$$P_A = \frac{5}{16 \times 12} = 0.026 = 2.6 \times 10^{-2},$$
where 5 is the number of victims and 12 is the number of people who risked their health during the liquidation of accidents over the 16 years up to that moment;
$P_{CP}$ – the probability of combustion products from the building of the main ventilation unit getting into the mine workings, provided that the second main ventilation unit is operated with suction ventilation, $10^{-2}$, 1/year.
$$P_o = 0.62 \times 1 \times 10^{-2} \times 1 \times 10^{-1} \times 2.6 \times 10^{-2} \times 10^{-2} = 3.2 \times 10^{-9}, \quad \text{1/year}$$
The poisoning probability when working with a redundant unit is $P_{RU} = 2 \times 10^{-9}$, 1/year.
According to Table 11.1, the poisoning risk due to insufficient ventilation for a main fan installation without a redundant unit ($P_o$) is assessed as extremely improbable (E). According to Table 11.2, the severity of the potential hazard in the underground mine in the case without a redundant unit is assessed as insignificant (2). To assess the poisoning risk and determine its acceptability, we use the data in Table 11.3, taking into account the estimates of the probability of poisoning and the severity of the consequences. The poisoning risk is taken in accordance with Table 11.3, which for probability level E and severity level 2 gives an acceptable risk (Ra).

TABLE 11.1 A qualitative assessment of the probability of the event

| Level designation | Level name | Description | Numerical estimation of the probability, 1/year |
|---|---|---|---|
| A | High probability of repeated events | Will happen in most cases | More than $10^{-3}$ |
| B | Possible | Possibly will happen in most cases | From $10^{-4}$ to $10^{-3}$ |
| C | Probable | Can happen | From $10^{-5}$ to $10^{-4}$ |
| D | Improbable | May occur, but not expected | From $10^{-6}$ to $10^{-5}$ |
| E | Extremely improbable | It will happen under exceptional circumstances | Less than $10^{-6}$ |
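The fault-tree arithmetic above combines independent failure causes with the rule 1 − Π(1 − p_i) and then adds the rare blackout and fire terms. The following Python lines are a minimal sketch of that rule, not the authors' software; the function name `any_failure` and the variable names are illustrative assumptions, and only values that are quoted in the text are used.

```python
from math import prod


def any_failure(probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Values quoted in the text, 1/year.
p_eq = 0.437  # electrical equipment failure
p_mp = 0.326  # mechanical part failure
p_mfi = any_failure([p_eq, p_mp])  # main fan installation failure

p_blackout = 2e-5  # long-term power blackout
p_fire = 2.2e-6    # fire in the MFI building
p_v = p_mfi + p_blackout + p_fire  # sum of the three causes, as in the text

# Probability of death in an unbreathable atmosphere: 5 victims,
# 12 people at risk, 16 years of observation.
p_a = 5 / (16 * 12)

print(f"P_MFI = {p_mfi:.2f}, P_V = {p_v:.2f}, P_A = {p_a:.3f}")
```

With these inputs the blackout and fire terms are several orders of magnitude smaller than the fan failure term, so $P_V$ is dominated by $P_{MFI}$.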
TABLE 11.2 Qualitative assessment of the consequences of the event

| Level designation | Level name (consequences) | Influence on people | Influence on the environment |
|---|---|---|---|
| 5 | Catastrophic | Numerous accidents | Extreme environmental damage |
| 4 | Significant | Nonrecoverable total disability, single accidents | Significant environmental damage |
| 3 | Moderate | Significant injuries or health damage, such as loss of business days | Moderate environmental damage |
| 2 | Insignificant | Minor injuries or damage to health | Local environmental damage |
| 1 | Minor | Minor damage to health | Minimal environmental damage |
11.2 The role of ventilation system failures in assessing the risk of flammable gas explosion
An explosive and fire-hazardous environment can form in the work area if there is no ventilation while flammable gases are gradually emitted, or because hydrogen sulfide is released from underground water. This can lead to an emergency if the gas environment is not monitored. The cause may be a malfunction of the devices that monitor the gaseous medium containing combustible gas/hydrogen sulfide, or the “human factor”.
Requirements for the composition of mine air: the oxygen content of the air in places where people are working or may be located must be at least 20 %. The content of carbon dioxide in the mine air should not exceed 0.5 % at work sites, 0.75 % in workings with a common outgoing stream of the mine, and 1 % when working on the blockage. The content of toxic gases in existing underground workings should not exceed the permissible concentration (PC): for hydrogen sulfide (H2S), 0.00071 % (10 mg/m3). The level of toxic gases and the dust content of the air supplied through the air supply shafts and main workings of the mine should not exceed 30 % of the established PC [1]. The amount of air supplied to a face where blasting operations are performed must be such that the resulting toxic products are diluted to the PC before the workers are admitted. In workings where hydrogen sulfide (H2S) is present, the minimum speed of air movement should be 0.5 m/s.
TABLE 11.3 Risk assessment matrix

| The level of events probability | Minor (1) | Insignificant (2) | Moderate (3) | Significant (4) | Catastrophic (5) |
|---|---|---|---|---|---|
| A (highly probable) | Rp | Ru | Ru | Ru | Ru |
| B (possible) | Ra | Rp | Rp | Ru | Ru |
| C (probable) | Ra | Ra | Rp | Rp | Ru |
| D (improbable) | Ra | Ra | Ra | Rp | Rp |
| E (extremely improbable) | Ra | Ra | Ra | Ra | Rp |

Columns: assessment of the consequences of events. Symbols in the table: Ra – acceptable risk; Rp – practically possible risk; Ru – unacceptable risk.
The hydrogen sulfide content ranges from 0.001 % to 0.00072 %, and on this basis a gas regime has been introduced at the mine for conditions of hydrogen sulfide release. The explosion hazard of the mine atmosphere is calculated from the sum of the combustible gases methane (CH4), carbon monoxide (CO), and hydrogen (H2) mixed with oxygen. In emergencies, the explosion hazard of the mine atmosphere is calculated in the following order. The total content of combustible gases in the mine atmosphere is calculated using the formula:
$$C_{CG} = C_{CO} + C_{CH_4} + C_{H_2} \tag{11.1}$$
where:
$C_{CO}$ – concentration of carbon monoxide in the mine air, in percent;
$C_{CH_4}$ – concentration of methane in the mine air, in percent;
$C_{H_2}$ – concentration of hydrogen in the mine air, in percent.
The proportions of CO, CH4, and H2 in the mixture are calculated using the formulas:
$$P_{CO} = C_{CO}/C_{CG}, \quad P_{CH_4} = C_{CH_4}/C_{CG}, \quad P_{H_2} = C_{H_2}/C_{CG} \tag{11.2}$$
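Formulas (11.1) and (11.2) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `combustible_fractions` is hypothetical, and the concentrations passed to it are invented for illustration and are not measurements from the mine. The three fractions sum to one by construction.

```python
def combustible_fractions(c_co, c_ch4, c_h2):
    """Return the total combustible-gas content (11.1) and the gas shares (11.2).

    All concentrations are given in percent of the mine air.
    """
    c_cg = c_co + c_ch4 + c_h2
    if c_cg <= 0:
        raise ValueError("no combustible gases present")
    shares = {"CO": c_co / c_cg, "CH4": c_ch4 / c_cg, "H2": c_h2 / c_cg}
    return c_cg, shares


# Hypothetical concentrations in percent, for illustration only.
c_cg, shares = combustible_fractions(c_co=0.5, c_ch4=1.5, c_h2=0.5)
print(c_cg, shares)
```

The total and the shares are the inputs needed to place the atmosphere point on the explosiveness triangle discussed in the text.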
The condition $P_{CO} + P_{CH_4} + P_{H_2} = 1$ must be met. The explosion hazard of the mine atmosphere is determined using the explosiveness triangles. If the plotted point lies inside the explosiveness triangle corresponding to the gas mixture calculated according to formula (11.2), then the mine atmosphere is explosive. The diagram shows the point (X) that characterizes the explosiveness of the mine atmosphere at the specified values of combustible gases. Under these conditions, the mine atmosphere is not explosive [5]. Here C is the total concentration of combustible gases in the mine atmosphere [5]. The triangle shows that an explosion hazard can arise when the percentage of combustible gases in the mine atmosphere increases while the oxygen content decreases [5]. From this example, an explosion in the absence of combustible gases is rated as “extremely improbable”. In emergencies, the calculation data are determined by the laboratory at the accident site.
The occurrence of an emergency with an unsuitable mine atmosphere and an explosive and fire-hazardous situation, and its further adverse development, depend on certain probabilistic factors:
• The probability $P_H$ of finding electrical equipment (as an ignition source) in the area where hydrogen sulfide appears
• The probability $P_M$ of formation of an explosive and fire-hazardous environment, which depends on the emission or appearance of hydrogen sulfide in the working area
• The probability $P_{EQ}$ of a fire source from electrical equipment
• The probability $P_W$ of ventilation failure in the working area
• The probability $P_{AC}$ of failure of the automatic control system for hydrogen sulfide in the atmosphere of mine workings
• The probability $P_D$ of failure of portable devices for monitoring hydrogen sulfide in the atmosphere
• The operator failure probability $P_{OF}$ (the “human factor” when working with mine atmosphere monitoring devices)
• The self-rescuer failure probability $P_{SR}$
Then the probability of a fire or explosion of hydrogen sulfide in the underground development is, in general:
$$P = P_H \times P_M \times P_{EQ} \times P_W \times P_{SR} \times [P_D + P_{OF}] \tag{11.3}$$
The failure tree model for a fire (or explosion) of hydrogen sulfide is shown in Fig. 11.3. The event tree is shown in Fig. 11.4. The probability of a fire source from electrical equipment, $P_{EQ}$, consists of the probability of a fire source from electrical circuits, $P_{EC}$, and the probability of a fire source from the mechanical components of electric motors, $P_{FS}$. The probability of a fire source from electrical circuits was determined by taking into account the rate of fire-dangerous failures that occur in electric current sources and in the electric consumers connected to them by cables. The probability of a fire source from the mechanical components of electric motors, $P_{FS}$, was determined by taking into account the failure rate of the systems that ensure regular operation of the motor and prevent fire-dangerous failures (overheating, sparks, flames). Then:
$$P_{EQ} = P_{EC} + P_{FS}$$
The failure rate of the gas monitoring device, $\lambda_{GM}$, which signals that the gas concentration has been reached [3]:
$$\lambda_{GM} = \lambda_{SF} + \lambda_{UF} + \lambda_{AD} \approx 0.088 + 0.175 + 0.088 \approx 0.35 \; \text{(1/year)}$$
where:
$\lambda_{SF} = 1 \times 10^{-5}$ (1/hour) $\times$ 8760 (hour/year) $= 8.76 \times 10^{-2}$ (1/year) – the sensor failure rate;
$\lambda_{UF} = 2 \times 10^{-5}$ (1/hour) $\times$ 8760 (hour/year) $= 1.75 \times 10^{-1}$ (1/year) – the signal processing unit failure rate;
$\lambda_{AD} = 1 \times 10^{-5}$ (1/hour) $\times$ 8760 (hour/year) $= 8.76 \times 10^{-2}$ (1/year) – the alarm device failure rate.
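The unit conversion and the sum of component failure rates for the gas monitoring device can be reproduced as follows. This is a short illustrative sketch; the dictionary keys and variable names are assumptions, while the hourly rates are those quoted in the text.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours

# Hourly failure rates quoted in the text for the gas monitoring device.
rates_per_hour = {
    "sensor": 1e-5,
    "signal processing unit": 2e-5,
    "alarm device": 1e-5,
}

# Convert each rate to 1/year and add them up: the device fails
# if any one of its components fails.
rates_per_year = {name: rate * HOURS_PER_YEAR for name, rate in rates_per_hour.items()}
lambda_gm = sum(rates_per_year.values())

for name, rate in rates_per_year.items():
    print(f"{name}: {rate:.4f} 1/year")
print(f"lambda_GM = {lambda_gm:.2f} 1/year")
```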
FIGURE 11.3 A fault tree for the gas explosion.
FIGURE 11.4 Poisoning and explosion events tree.
Then the probability of failure of a portable gas monitoring device that signals the presence of gases, $P_{GMD}$, is:
$$P_{GMD} = 1 - e^{-\lambda_{GM}} \approx 0.3 \; \text{(1/year)}$$
The probability of failure of the hydrogen sulfide monitoring device in combination with operator failure, $[P_D + P_{OF}]$, is determined as:
$$[P_D + P_{OF}] = P_{AH} \times [P_{GMD} + P_{OF}]$$
(taking into account the additional requirements for equipping workings with devices that monitor the presence and concentration of hydrogen sulfide in the atmosphere), where $P_{AH}$ is the probability of failure of the automated hydrogen sulfide monitoring device installed in the middle part of the production section, which signals and disconnects electrical equipment when the concentration of flammable gases is reached. (When additional requirements for monitoring the gas situation are established, the probability of failure of automated control systems is 0.01, 1/year.)
Then, taking into account the above, including the additional measures to equip workings with devices for monitoring the presence of hydrogen sulfide, the probability of a hydrogen sulfide fire or explosion in the underground development takes the form:
$$P_{OF} = P_M \times P_W \times P_{SR} \times (P_D + P_{OF})$$
$$P_1 = P_H \times P_M \times [P_{EC} + P_{FS}] \times P_W \times P_W \times P_{AC} \times [P_D + P_{OF}] \tag{11.4}$$
Here $P_1$ is the probability of a hydrogen sulfide fire or explosion for underground mining operations when the additional measures to equip workings with devices for monitoring the presence and concentration of hydrogen sulfide in the atmosphere are taken into account.
$$P_2 = P_H \times P_M \times [P_{EC} + P_{FS}] \times P_W \times (P_{GMD} + P_{OF}) \tag{11.5}$$
Here $P_2$ is the probability of a hydrogen sulfide fire or explosion for underground mining operations without such equipment, that is, without the additional measures for installing devices that monitor the presence and concentration of hydrogen sulfide in the atmosphere.
11.3 Analysis of the occurrence and development of accidents
The probability model of a fire source from electrical equipment, $P_{EQ}$, is represented in the failure tree as follows.
1. The probability of a fire source from electrical circuits and static electricity, $P_{EC}$, which in the model is determined by taking into account the rate of fire-dangerous failures that occur in current sources and in the power consumers connected to them via a cable system:
$$P_{EC} = P_{FH} + P_{FE}$$
According to the electrical equipment failure tree:
$$P_{FE} = [P_{FC} + P_{FP} + P_C] \times P_{EPS} \tag{11.6}$$
where:
$P_{FE}$ – the probability of failure of electrical equipment;
$P_{FC}$ – the probability of failure of current sources;
$P_{FP}$ – the probability of failure of power consumers;
$P_C$ – the probability of cable system failure;
$P_{EPS}$ – the probability of failure of the electrical equipment protection system;
$P_{FH}$ – the probability of a fire hazard from static electricity.
2. The probability of a fire source from mechanical units in an electric motor, which is determined in the model by taking into account the failure rate of the units that ensure regular operation and prevent fire-dangerous failures of the electric motor (overheating, sparks, and flames):
$$P_{MU} = P_B + P_{CF} \tag{11.7}$$
where:
$P_B$ – the probability of failure and possible destruction of an electric motor bearing;
$P_{CF}$ – the probability of electric motor failure due to a cooling failure.
The probability of a fire source from electrical equipment, in accordance with (11.6), (11.7), and Fig. 11.3, is:
$$P_{EQ} = P_{EC} + P_{FS} = \{[P_{FC} + P_{FP} + P_C] \times P_{EPS}\} + [P_B + P_{CF}] \tag{11.8}$$
The causes of fires are related to possible damage to electrical equipment, as well as to improper maintenance. Electrical equipment is itself an object of increased danger, including as a possible initiator of an explosion and fire situation. Electrical appliances can cause a fire (explosion) if they are not properly maintained (sparking, short circuits, broken contacts and the resulting heating, etc.). A fire hazard also arises from failure of the lubrication system (bearings) or of the cooling of electrical machines, which can lead to unacceptable overheating and destruction of the electric motor as a whole or of an individual component. In most of the electrical equipment used in the mine, the cooling system is integrated into the design and is quite simple, efficient, and reliable, so the probability of electric motor failure due to a cooling failure is taken as:
$$P_{CF} \approx 0$$
Let’s consider a simplified model of explosion- and fire-hazardous electrical equipment as a source of ignition of flammable gases in the mine atmosphere. The fire hazard of electrical equipment is characterized by the following manifestations:
- Sparks
- The ability to form molten metal particles at the moment of a short circuit
- The ability of cables and wires to overheat in emergencies to the ignition temperature of flammable gases
In accordance with (11.6), the probability of a fire source from electrical circuits is taken conservatively (a source of fire initiation arises from a spark or overheating) in the event of a fault in energetically stressed electrical equipment units:
- Failure in the system of current sources (transformer, generator)
- Failure in the system of current consumers (electric motor, etc.)
- Failure in the cable system (short circuit of power cables, heating due to poor contact)
Fire safety also depends on the electrical conductivity of the environment. Under certain conditions, static electricity charges accumulate, and their potential difference can exceed the breakdown voltage and cause spark discharges [3]. Static electricity accumulates through friction between insulators or of dielectrics against metal. Sparks generated by static electricity can ignite a flammable mixture of gases, vapors, and dust with air. Ignition (explosion) in an atmosphere containing hydrogen sulfide occurs only under certain conditions:
1. Accumulation of an electric charge large enough for the resulting spark to initiate the ignition of an explosive mixture
2. The presence of an explosive mixture within the limits of explosive concentrations
The discharge of electricity usually occurs on sharp edges and protrusions. Discharges of accumulated electricity can be of two types: corona and spark. Spark discharges, which carry a lot of energy, are the dangerous ones.
At a potential difference of 3 kV, a spark discharge can ignite almost all combustible gases, and at 5 kV, most types of combustible dust. Depending on the environmental conditions, the probability of this event is estimated at $10^{-2}$ to $10^{-5}$ (1/year). For the probability of a fire source from static electricity, $P_{FH} = 10^{-2} \div 10^{-5}$ (1/year), we will take:
$$P_{FH} = 10^{-3} \; \text{(1/year)}$$
The probability of an event (11.8) for electrical equipment and components of electrical machines is determined using the following formula:
$$P = 1 - e^{-\lambda t} \tag{11.9}$$
where:
$P$ – the element failure probability;
$\lambda$ – the failure rate of the item;
$t$ – time.
Using known data on the failure rates of electrical equipment elements and mechanical components of electrical machines, we will carry out a quantitative assessment of the failure probability of the elements that can be a source of ignition and lead to a fire or explosion of combustible gases in the mine atmosphere. A year has 365 days of 24 hours, a total of 8760 hours/year. Then, in accordance with [1] and [2] (the numerical values of the failure rates are average values):
$\lambda_{AF} = 0.359 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 3.1 \times 10^{-3}$ (1/year) – the alternator failure rate.
The alternator failure probability:
$$P_{AF} = 1 - e^{-\lambda_{AF}} \approx 0.003 \; \text{(1/year)}$$
$\lambda_{PT} = 1.04 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 9.1 \times 10^{-3}$ (1/year) – the power transformer failure rate.
The power transformer failure probability:
$$P_{PT} = P_{FC} = 1 - e^{-\lambda_{PT}} \approx 0.009 \; \text{(1/year)}$$
$\lambda_{AM} = 5.24 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 4.6 \times 10^{-2}$ (1/year) – the failure rate of an asynchronous motor.
The asynchronous motor failure probability:
$$P_{AM} = P_{FP} = 1 - e^{-\lambda_{AM}} \approx 0.045 \; \text{(1/year)}$$
$\lambda_{CBS} = 0.475 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 4.16 \times 10^{-3}$ (1/year) – the cable system failure rate.
The cable system failure probability:
$$P_{CBS} = 1 - e^{-\lambda_{CBS}} \approx 0.0042 \; \text{(1/year)}$$
Based on these results, we estimate the electrical equipment failure probability from the failure rates of the main current sources, consumers, and power supply cables used at the mine, on average, as:
$$[P_{FC} + P_{FP} + P_C] \approx 0.06 \; \text{(1/year)}$$
The protection system for electrical equipment is based on relay protection devices.
$\lambda_{APS} = 4.1 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 3.6 \times 10^{-2}$ (1/year) – the automatic protection system failure rate.
The probability of failure in the electrical equipment protection system:
$$P_{EPS} = P_{APS} = 1 - e^{-\lambda_{APS}} \approx 0.036 \; \text{(1/year)}$$
According to (11.6), this gives the probability of failure of electrical equipment and the probability of a fire source from electrical circuits and static electricity, $P_{EC}$. Let us conservatively take the coefficient that accounts for the share of fire-dangerous failures as almost equal to one [4]. We can then assume that the probability of a failure of electrical equipment in the mine that can be a source of fire from electrical circuits and static electricity, $P_{EC}$, in the conservative case is:
$$P_{EC} \approx P_{FE} \approx 0.002 \; \text{(1/year)}$$
A source of ignition (overheating, sparks, and open flames) arising from the mechanical components of electrical machines is characterized by a long duration of exposure and a high power of the ignition source. In the fire model, the occurrence of a fire (explosion) source from mechanical components of electrical machines is determined by taking into account the failure rate of the components that ensure regular operation and prevent fire-dangerous failures. The probability of an event (11.8) for the units of an electric machine is determined using formula (11.9). Then, according to [1,2]:
$\lambda_B = 1.8 \times 10^{-6}$ (1/hour) $\times$ 8760 (hour/year) $= 1.58 \times 10^{-2}$ (1/year) – the bearings failure rate.
$$P_B = 1 - e^{-\lambda_B} \approx 0.015 \; \text{(1/year)}$$
The mine has a sufficient number of electric motors, and not every mechanical failure leads to a source of fire ignition; nevertheless, for the conservative case we take the coefficient for the share of fire-dangerous failures [4] equal to one.
The probability of a fire ignition source from the mechanical components of an electric motor can then be taken as:
$$P_{FS} \approx 0.015 \; \text{(1/year)}$$
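The element probabilities above all follow from formula (11.9) with t = 1 year. The sketch below recomputes them from the hourly rates quoted in the text; the helper name `annual_failure_probability` and the variable names are illustrative assumptions. Recomputed values may differ from the rounded figures in the text in the last digit.

```python
from math import exp

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def annual_failure_probability(rate_per_hour, years=1.0):
    """P = 1 - exp(-lambda * t), with lambda converted from 1/hour to 1/year."""
    rate_per_year = rate_per_hour * HOURS_PER_YEAR
    return 1.0 - exp(-rate_per_year * years)


# Hourly failure rates quoted in the text.
p_fc = annual_failure_probability(1.04e-6)    # power transformer (current source)
p_fp = annual_failure_probability(5.24e-6)    # asynchronous motor (power consumer)
p_c = annual_failure_probability(0.475e-6)    # cable system
p_eps = annual_failure_probability(4.1e-6)    # automatic protection system
p_b = annual_failure_probability(1.8e-6)      # electric motor bearings

# Eq. (11.6): an equipment fault becomes a fire source only if the
# protection system also fails.
p_fe = (p_fc + p_fp + p_c) * p_eps

print(f"P_FC = {p_fc:.4f}, P_FP = {p_fp:.4f}, P_C = {p_c:.4f}")
print(f"sum = {p_fc + p_fp + p_c:.3f}, P_EPS = {p_eps:.4f}, P_FE = {p_fe:.4f}")
print(f"P_B = {p_b:.4f}")
```

For small rates, 1 − e^(−λ) is close to λ itself, which is why the probabilities are nearly equal to the annual failure rates.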
For the probability of a fire source from electrical equipment, $P_{EQ}$ (given that $P_{EC} \approx 0.001$), according to the failure tree model, we get:
$$P_{EQ} \approx 0.016 \; \text{(1/year)}$$
11.4 Analysis of the probability of explosion of flammable gases/hydrogen sulfide at the mine from electrical equipment
Because the number of types of electrical equipment used at the mine is quite large (according to the list of fixed assets of industrial equipment, the quantity of equipment N is about 100 units), the probability $P_{EQ}$ of a fire source from “average” electrical equipment is assumed according to the model used. Eq. (11.4), extended to include the number of pieces of electrical equipment and the compensating measures for equipping workings with devices that monitor the presence of flammable gases/hydrogen sulfide, gives the probability $P_1$ of a fire or explosion of hydrogen sulfide in an underground mine. The probability $P_H$ in the event of a fire is conservatively assumed to be $P_H = 0.7$. The probability of event C4.2, fire (explosion), is determined from the expression:
$$P_1 = P_H \times P_M \times [P_{EC} + P_{FS}] \times N \times P_W \times P_W \times P_{AC} \times [P_D + P_{OF}] \times [P_D + P_{OF}]$$
$$= 0.7 \times 10^{-2} \times 0.016 \times N \times 0.62 \times 0.62 \times 0.3 \times (0.3 + 10^{-2}) \times (0.3 + 10^{-2}) = 1.2 \times 10^{-4} \; \text{(1/year)}$$
According to Eq. (11.5), for underground mining operations, taking into account the amount of electrical equipment and without equipment for monitoring the presence and concentration of hydrogen sulfide in the mine atmosphere, the probability $P_2$ of a fire or explosion of hydrogen sulfide is:
$$P_2 = P_H \times P_M \times [P_{EC} + P_{FS}] \times P_W \times (P_{GMD} + P_{OF})$$
$$= 0.7 \times 10^{-2} \times 0.016 \times 100 \times 0.62 \times 0.62 \times 0.31 \times 0.31 = 4.1 \times 10^{-4} \; \text{(1/year)}$$
A qualitative assessment of the probability of a fire or explosion of combustible gas/hydrogen sulfide, based on the number of pieces of electrical equipment and without devices for monitoring combustible gases and hydrogen sulfide at the mine, made in accordance with Table 11.4, shows that for underground mine workings in which hydrogen sulfide is present, $P_2$ is possible (B).
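The two numerical substitutions can be checked with a short calculation. The sketch below is an assumed reading of the printed expressions: $P_H \times P_M$ is taken as $0.7 \times 10^{-2}$, N as 100, and the monitoring term as $(0.3 + 10^{-2})$ appearing twice, because this reading reproduces both printed results. The variable names, and the label `p_ac` for the extra factor 0.3 in the expression for $P_1$, are assumptions made for illustration.

```python
# Values taken from the numerical substitutions in the text.
p_h = 0.7      # electrical equipment present in the hydrogen sulfide area
p_m = 1e-2     # explosive and fire-hazardous environment forms
p_eq = 0.016   # fire source from "average" electrical equipment, P_EC + P_FS
n = 100        # number of pieces of electrical equipment
p_w = 0.62     # ventilation failure in the working area
p_gmd = 0.3    # portable gas monitoring device failure
p_of = 1e-2    # operator failure
p_ac = 0.3     # assumed label for the additional factor in the P_1 substitution

common = p_h * p_m * p_eq * n * p_w * p_w
monitoring = p_gmd + p_of

p2 = common * monitoring * monitoring          # without compensating measures
p1 = common * p_ac * monitoring * monitoring   # with compensating measures

print(f"P_1 = {p1:.1e} 1/year, P_2 = {p2:.1e} 1/year")
```

Both values fall in the range from $10^{-4}$ to $10^{-3}$ 1/year, which corresponds to level B in the qualitative probability table.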
A second qualitative assessment of the probability of a fire or explosion of combustible gas/hydrogen sulfide is based on the number of pieces of electrical equipment and takes into account the compensatory measures: portable devices for monitoring combustible gas and hydrogen sulfide, as well as automatic devices for monitoring combustible gases and hydrogen sulfide, installed in the middle part of the mine workings and simultaneously disconnecting electrical equipment. According to this qualitative assessment, for underground mine workings whose atmosphere contains hydrogen sulfide, $P_1$ is possible (B).

TABLE 11.4 Qualitative assessment of event probability

| Level designation | Level name | Description | Numerical estimation of the probability, 1/year |
|---|---|---|---|
| A | High probability of repeated events | Will happen in most cases | More than $10^{-3}$ |
| B | Possible | Possibly will happen in most cases | From $10^{-4}$ to $10^{-3}$ |
| C | Probable | Can happen | From $10^{-5}$ to $10^{-4}$ |
| D | Improbable | May occur, but not expected | From $10^{-6}$ to $10^{-5}$ |
| E | Extremely improbable | It will happen under exceptional circumstances | Less than $10^{-6}$ |

For a qualitative assessment of the risk of fire (explosion) of hydrogen sulfide from electrical equipment, we estimate the possible severity of the consequences in accordance with Table 11.5. To assess the risk and determine its acceptability, we use the data in Table 11.6, taking into account the estimates of the probability of fire (explosion) and of its consequences. The risk of fire or explosion of hydrogen sulfide during the operation of electrical equipment at the mine is estimated in accordance with Table 11.6. For the case of fire or explosion, according to Table 11.6 (probability (C) probable, severity (3) moderate), the risk is practically possible (Rp). This level of risk requires the adoption of compensatory measures that can ensure safety from fire (explosion). For the conservative case of fire (explosion) (probability (D) improbable, severity (3) moderate), we get an acceptable risk (Ra). The results show that the compensatory measures are sufficient.
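The qualitative assessment combines two lookups: a probability level from the numerical ranges of Table 11.4 and a risk class from the matrix of Table 11.6. The Python sketch below shows one way to encode them. The function names are hypothetical, and the treatment of values that fall exactly on a range boundary is an assumption, since the tables do not specify it.

```python
def probability_level(p_per_year):
    """Map an annual probability to a level A..E (ranges of Table 11.4).

    Boundary values are assigned to the higher level; this is an assumption.
    """
    if p_per_year > 1e-3:
        return "A"
    if p_per_year >= 1e-4:
        return "B"
    if p_per_year >= 1e-5:
        return "C"
    if p_per_year >= 1e-6:
        return "D"
    return "E"


# Rows: probability level; columns: severity 1 (minor) .. 5 (catastrophic).
RISK_MATRIX = {
    "A": ["Rp", "Ru", "Ru", "Ru", "Ru"],
    "B": ["Ra", "Rp", "Rp", "Ru", "Ru"],
    "C": ["Ra", "Ra", "Rp", "Rp", "Ru"],
    "D": ["Ra", "Ra", "Ra", "Rp", "Rp"],
    "E": ["Ra", "Ra", "Ra", "Ra", "Rp"],
}


def risk_class(level, severity):
    """Return Ra (acceptable), Rp (practically possible) or Ru (unacceptable)."""
    return RISK_MATRIX[level][severity - 1]


print(probability_level(4.1e-4), probability_level(3.2e-9))
print(risk_class("C", 3), risk_class("D", 3))
```

Keeping the matrix as data rather than as nested conditions makes it straightforward to compare against the printed table row by row.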
TABLE 11.5 Qualitative assessment of the consequences of events

| Level designation | Level name (consequences) | Influence on people | Influence on the environment |
|---|---|---|---|
| 5 | Catastrophic | Numerous accidents | Extreme environmental damage |
| 4 | Significant | Nonrecoverable total disability, single accidents | Significant environmental damage |
| 3 | Moderate | Significant injuries or health damage, such as loss of business days | Moderate environmental damage |
| 2 | Insignificant | Minor injuries or damage to health | Local environmental damage |
| 1 | Minor | Minor damage to health | Minimal environmental damage |

11.5 The risk analysis results
To ensure an acceptable level of risk of hydrogen sulfide fire (explosion) in underground working areas under a gas regime when operating electrical equipment of normal mine design, general-purpose electrical equipment, and electrical appliances, it is necessary to take the special measures required for conducting mining operations under a gas regime in an underground mine, as well as to carry out additional measures. Based on the analysis and risk assessment, the author justifies the permissibility of using electrical equipment of normal mine design, general-purpose electrical equipment, and electrical devices, provided that additional safety requirements are established and compensatory measures are carried out to reduce the risk to an acceptable level [5]. A set of measures and requirements is given, and it is shown that the proposed compensatory measures and additional safety requirements provide an acceptable level of risk when deviating from the requirements of industrial safety in the operation of electrical equipment of normal mine design, general-purpose equipment, and electrical appliances. To maintain an acceptable level of risk of fire (explosion) during the operation of electrical equipment at the mine, measures are presented that are designed to reduce the probability and severity of the consequences of accidents, as well as measures for regular operation, all aimed at achieving an acceptable level of risk (Ra). The criteria for choosing compensatory measures are determined by the amount of acceptable risk (Ra). The criterion for the sufficiency of the established measures was to reduce the indicators of the degree of emergency
302
Assessing the consequences of events The level of events probability
Minor 1
Insignificant 2
Moderate 3
Significant 4
Catastrophic 5
А (Highly probable).
Rp
Ru
Ru
Ru
Ru
B (possible)
Ra
Rp
Rp
Ru
Ru
C (probable)
Ra
Ra
Rp
Rp
Ru
D (improbable)
Ra
Ra
Ra
Rp
Rp
E (extremely improbable)
Ra
Ra
Ra
Ra
Rp
Symbols in the table: Ra – acceptable risk; Rp - practically possible risk; Ru – unacceptable risk.
Safety and reliability modeling and its applications
TABLE 11.6 The risk assessment matrix
Operation risk assessment of the main-fan installations Chapter | 11
303
risk to the level of acceptable risk. In order to reduce the likelihood of a failure in the ventilation of the mine, and to reduce the consequences of possible accidents, it is necessary to follow the following safety measures: 1. Fulfillment of the maintenance regulations of the main fan installation of mine in accordance with the requirements of the “Safety Rules for Mining and Solid Mineral Processing” approved by the order of Rostekhnadzor dated 12/11/2013. No. 599. 2. Equipment of the premises of the main fan installation with a fire alarm and video surveillance cameras, providing: • The clarity and uniqueness of the image of the main fan installation • Information output in the "real-time" mode to the control room of the mine 3. Daily check of automatic control and video surveillance systems. 4. Carry out the replacement of the working (operating) electric motor with a backup one, in accordance with the developed regulations. 5. Provide the building of the main fan installation with fire extinguishing measures. The main fan installation of the oil station should be equipped with an automatic fire extinguishing installation, in accordance with the project. 6. Conduct an examination of the industrial safety of the main fan installation. 7. Ensuring the supply of fresh air to mine workings where machines with internal combustion engines operate in an amount sufficient to maintain the oxygen content in the mine atmosphere of at least 20%, while the specific air consumption should be three m3/min per HP of diesel equipment used. 8. Provide the personnel of the mine, including self-rescuers, during normal operation in accordance with the requirements of regulatory documents.
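The risk matrix of Table 11.6 lends itself to a simple lookup. The following is a minimal sketch, not part of the original assessment or of the "Modos" tool: the names `RISK_MATRIX` and `risk_class` are illustrative assumptions, and the cell values are copied from Table 11.6.

```python
# Risk classes from Table 11.6: Ra (acceptable), Rp (practically possible),
# Ru (unacceptable). Rows are probability levels A..E; list positions 0..4
# correspond to consequence levels 1..5 of Table 11.5.
RISK_MATRIX = {
    "A": ["Rp", "Ru", "Ru", "Ru", "Ru"],
    "B": ["Ra", "Rp", "Rp", "Ru", "Ru"],
    "C": ["Ra", "Ra", "Rp", "Rp", "Ru"],
    "D": ["Ra", "Ra", "Ra", "Rp", "Rp"],
    "E": ["Ra", "Ra", "Ra", "Ra", "Rp"],
}


def risk_class(probability_level: str, consequence_level: int) -> str:
    """Return the risk class for a probability level (A-E) and a consequence level (1-5)."""
    if probability_level not in RISK_MATRIX:
        raise ValueError("probability level must be one of A, B, C, D, E")
    if not 1 <= consequence_level <= 5:
        raise ValueError("consequence level must be an integer from 1 to 5")
    return RISK_MATRIX[probability_level][consequence_level - 1]
```

For instance, an extremely improbable event (E) with insignificant consequences (2) falls in the acceptable class Ra under this encoding.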
11.6 Conclusion
For underground mine workings with a mine atmosphere in which no toxic gases are emitted, the risk of poisoning workers when the main fan installation has no redundant unit is, according to Table 11.3, acceptable (Ra): the probability is extremely improbable (E) and the consequences are insignificant (2). Ensuring an acceptable level of risk requires taking the measures necessary to maintain the required level of reliability of the main fan installation. In the process of risk assessment, the software and hardware tool "Modos" was used.
References
[1] Kuzmina, O.V., Yanchiy, S.V., 2016. Reliability analysis of ventilation systems using the failure tree method. In: Modern scientific research: current issues, achievements and innovations. Collection of articles of the International scientific and practical conference. Publisher: "Science and Education" (In Russian).
[2] Federal rules and regulations in the sphere of industrial safety "General requirements for justifying the safety of a hazardous production facility", approved by the order of Rostekhnadzor of July 15, 2013, No. 306 (In Russian).
[3] Federal rules and regulations in the sphere of industrial safety "Safety rules for mining operations and processing of solid minerals", approved by the order of Rostekhnadzor of December 11, 2013, No. 599 (In Russian).
[4] "Methodological bases for conducting a hazard analysis and risk assessment of accidents at hazardous production facilities", approved by the order of Rostekhnadzor of May 13, 2015, No. 188 (In Russian).
[5] Federal norms and rules in the sphere of industrial safety "Instruction for localization and elimination of consequences of accidents at hazardous production facilities where mining operations are conducted", approved by the order of Rostekhnadzor of October 31, 2016, No. 44480 (In Russian).
Abstract
This article describes the principal types of hazards when mines are operated equipped solely with the main ventilator, i.e., installations without a redundant (secondary) ventilator unit. The areas affected by the possible failure of the mine airing system are determined and described. Based on these findings, this article proposes measures to reduce the probability of the main fan failure and, consequently, the probability of potential accidents.
Keywords
Carbon monoxide; hydrogen sulphide; main fan; main fan failure; risk assessment; toxic gases
Chapter 12
Generalized renewal processes
Paulo R.A. Firmino a, Cícero C.F. de Oliveira b and Cláudio T. Cristino c
a Center for Science and Technology, Federal University of Cariri, Juazeiro do Norte-CE, Brazil. b Federal Institute of Education, Science and Technology of Ceará, Crato-CE, Brazil. c Department of Statistics & Informatics, Federal Rural University of Pernambuco, Recife-PE, Brazil
Main Acronyms
CDF: Cumulative Distribution Function
GRP: Generalized Renewal Processes
GuGRP: Gumbel-based GRP
HPP: Homogeneous Poisson Process
LL: Log-likelihood
MLE: Maximum Likelihood Estimation
NHPP: Non-Homogeneous Poisson Processes
PDF: Probability Density Function
RP: Renewal Processes
UGRP: Uniform-based GRP
WGRP: Weibull-based GRP
Main Symbols
$\alpha$: WGRP scale parameter
$\hat{\alpha}$: $\alpha$ estimator
$\hat{\alpha}_{obs}$: $\alpha$ estimate, i.e., an instance of $\hat{\alpha}$ in the light of the data
$\beta$: WGRP shape parameter
$\hat{\beta}$: $\beta$ estimator
$\hat{\beta}_{obs}$: $\beta$ estimate, i.e., an instance of $\hat{\beta}$ in the light of the data
$c_y$: Coefficients of the mixed Kijima virtual age model
$\hat{c}_y$: $c_y$ estimators
$\hat{c}_{y\,obs}$: $c_y$ estimates, i.e., an instance of $\hat{c}_y$ in the light of the data
$\eta$: Significance level adopted for hypothesis testing
$F(\cdot)$: Cumulative probability distribution function
$h_{T_i}(\cdot)$: Hazard function of the variable $T_i$
$I^{-1}(\alpha, \beta, q)$: Covariance matrix of WGRP with respect to $(\hat{\alpha}, \hat{\beta}, \hat{q})$
$n$: Sample size
$q$: GRP rejuvenation parameter
$\hat{q}$: $q$ estimator
$\hat{q}_{obs}$: $q$ estimate, i.e., an instance of $\hat{q}$ in the light of the data
$T_i$: Cumulative random time until the $i$th intervention occurs
$t_i$: Instance of $T_i$
$\theta$: Mean value of $W_i$
$v_i$: Virtual age of the system at the $i$th intervention
$W_i$: Exponentially distributed random variable
$w_i$: Instance of $W_i$
$X_i$: Time between the $(i-1)$th and the $i$th interventions
$x_i$: Instance of $X_i$
$Y_i$: Type of the intervention related to $X_i$
$y_i$: Instance of $Y_i$

Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00014-3 Copyright © 2021 Elsevier Inc. All rights reserved.
12.1 Introduction
Generalized renewal processes (GRP) are a flexible formalism for modeling, forecasting, and evaluating repairable systems, based on the concept of virtual age. Virtual age functions operate on the real age of the system under study by means of a rejuvenation parameter. Mathematically, GRP thus allow one to model the time until events of interest occur. The formalism has been developed in the context of reliability engineering, where the events of interest are technological or process disturbances, for example, a machine failure, a human error, a software bug, a chemical power plant leak, a security system invasion, and so on. However, the term 'event of interest' can also be extended to an extreme environmental phenomenon (e.g., rare precipitation, temperature, wind speed or tidal height, a tsunami, a storm, a severe drought), an extraordinary social situation (e.g., a pandemic, a civil conflict, a political coup, a terrorist attack), a remarkable economic circumstance (e.g., a recession, an anomalous exchange rate variation, a suspicious bank account withdrawal), and so on. The study of undesirable events is prioritized here, as usual in areas like safety and reliability engineering.

In turn, the term 'time' can also be relaxed to contemplate any unit of measure (e.g., meters, seconds, kilograms, cubic meters). Thus, one could be interested in studying the area investigated (in m²) until a leak is detected in a pipeline, the quantity of blood examined (in milliliters) until a leukemic cell is identified, or the time to failure of a component (in hours). In any case, the occurrence of the event of interest might reflect a stepwise update of the condition of the system under study. Under GRP, such occurrences can thus be understood as interventions that can alter the age of the repairable system under consideration, leading to the virtual age concept. The development of this theory allows one to infer the condition of the system as well as the time until the next interventions, given the previous intervention times and types. Therefore, based on the previous event times, GRP aim to summarize the underlying stochastic process and then to forecast when new interventions might occur.

A stochastic process, say $Z = \{Z_i, i \in I\}$, is a collection of random variables [40]. In the context of GRP, $i$ can be understood as a count index. In this way, let $Y_i$ reflect the occurrence of the $i$th event of interest. $Y_i$ might also represent a random variable involving several competing events (e.g., corrective and preventive interventions). Then, $Z_i$ might represent the cumulative time until $Y_i$ (say $T_i$) or the time between the realizations of $Y_{i-1}$ and $Y_i$ (say $X_i$). Alternatively, one could also study the counting process $N_t$, where $t$ is a time index and $N_t$ is the number of occurrences of the event of interest until $t$ (that is, the number of indices $i$ for which $T_i$ lies in $[0, t]$ [12]).

GRP extend renewal processes (RP) and nonhomogeneous Poisson processes (NHPP). It is worthwhile to mention that homogeneous Poisson processes (HPP) are a special case of RP in which $X_i$ follows an exponential distribution. In the reliability engineering context, RP (and thus HPP) reflect the cases where the $Y_i$ are interventions that bring the system to an 'as good as new' condition. In turn, in NHPP the interventions bring the system to an 'as bad as old' condition. GRP allow different scenarios of rejuvenation between and beyond these two. Specifically, GRP usually append a rejuvenation parameter, say $q$, to the parameter set of the probability distribution that underlies $T_1$. Under this reasoning, the concept of virtual age was introduced in [23]. For the sake of illustration, if $q = 0$, GRP reduce to an RP; if $q = 1$, one has an NHPP.

GRP have been explored by a number of authors. Recently, Bakay and Shklyaev [2] introduced asymptotic formulas regarding probabilities underlying GRP and proved limit theorems. Koutsellis et al. [26] studied numerical simplifications to allow the inference of the expected number of failures of locomotive braking grids via GRP. Zhang et al. [52] presented the so-called quantified GRP to deal with the challenge of modeling the effect of maintenance activities on aircraft auxiliary power units. Wu and Scarf [46, 47] compared the performance of a number of alternatives for modeling failure processes, including GRP, in the context of series systems composed of multiple components. Ferreira et al. [14] proposed a GRP model able to distinguish the impact of different intervention types. Oliveira et al. [35] and Rai and Sharma [39] discussed GRP goodness-of-fit tests. Yang et al. [50] studied a GRP model that considers the product usage rate to forecast the number of failures of products sold. Cristino et al. [11] modeled and forecast economic recessions via GRP. More details regarding GRP are presented as follows. Specifically, probability functions, mean, variance, random generator function, hypothesis tests, and forecasting are summarized.
12.2 The GRP models
Let $y_i$, $t_i$, and $x_i$ be instances of $Y_i$, $T_i$, and $X_i$, respectively. Thus, $t_i$ is the observed cumulative time at which the event $y_i$ has in fact occurred, and $x_i = t_i - t_{i-1}$ ($i = 1, 2, \ldots$). The time until the first intervention ($T_1 = X_1$) is sometimes questionable, mainly when the information system involved has modest experience in data set maintenance or when the system start-up time ($T_0$) is uncertain, so caution must be taken in this regard. Case studies inspired by a data set from the literature illustrate such a situation [14].

The virtual age after $i$ occurrences of the events of interest, say $v_i$, can be defined as a function of the performance data of the system, in terms of both the realized times between occurrences, $x = (x_1, \cdots, x_i)$, and the respective nature of such events, say $y = (y_1, \cdots, y_i)$ (e.g., whether planned or unplanned interventions on a production system), besides the rejuvenation parameter $q$. Thus, $v_i = v((x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i) \mid q)$ [14]. In summary, Ferreira et al. [14] suggest that the level of restoration imposed on the system by the occurrence of each event might depend on the respective event type, resulting in the mixed Kijima-based virtual age model [21,22]:

$$v_i = c_{y_i}\,(v_{i-1} + q\,x_i) + (1 - c_{y_i})\,q\,(v_{i-1} + x_i), \tag{12.1}$$

where $c_{y_i} \in [0, 1]$ and $q \in (-\infty, +\infty)$. Thus, Eq. (12.1) is a linear combination such that $c_{y_i} = 1$ ($c_{y_i} = 0$) leads to the Kijima Type I (Kijima Type II) model. In the Kijima Type I model, the occurrence of each event only impacts the time elapsed since the previous occurrence. In the Kijima Type II model, the impact is over the whole history of the system, summarized in the vectors $(x, y)$. Thus, Kijima et al. [22] and Ferreira et al. [14] highlight that the cumulative time up to the $i$th occurrence, $T_i$, is distributed according to the following GRP cumulative distribution function (CDF):

$$F_{T_i}(x + v_{i-1} \mid v_{i-1}) = F_{T_1}(x + v_{i-1} \mid v_{i-1}) = \frac{F_{T_1}(x + v_{i-1}) - F_{T_1}(v_{i-1})}{1 - F_{T_1}(v_{i-1})}. \tag{12.2}$$
From the first equality in Eq. (12.2), it is easy to see that GRP are based on the supposition that the pair formed by the time until the first intervention and the virtual age is sufficient to determine the family of distributions that models the times between interventions. Thus, it is supposed that the times between interventions ($X_i$) share the same underlying distribution family and that any dependency among them is incorporated in the model via the virtual age function. Further, the uncertainty over $T_i$ is fully modeled once the virtual age at the $(i-1)$th intervention and the distribution of $T_1$ are known.
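The recursion in Eq. (12.1) can be sketched in a few lines. This is an illustrative sketch only: the function name `virtual_ages`, the argument layout (one weight $c_{y_i}$ per intervention), and the default $v_0 = 0$ are assumptions made here, not an interface from the cited works.

```python
from typing import List, Sequence


def virtual_ages(x: Sequence[float], c: Sequence[float], q: float,
                 v0: float = 0.0) -> List[float]:
    """Virtual ages v_1..v_n from the mixed Kijima model of Eq. (12.1).

    x[i] is the time between interventions and c[i] in [0, 1] is the weight
    c_{y_i} attached to the type of the corresponding intervention.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    ages = []
    v_prev = v0
    for x_i, c_i in zip(x, c):
        kijima_type_1 = v_prev + q * x_i    # only the last sojourn is rejuvenated
        kijima_type_2 = q * (v_prev + x_i)  # the whole history is rejuvenated
        v_prev = c_i * kijima_type_1 + (1.0 - c_i) * kijima_type_2
        ages.append(v_prev)
    return ages
```

With `q = 0` every virtual age is zero (renewal, "as good as new"), while `q = 1` returns the cumulative times (minimal intervention, "as bad as old"), whatever the weights.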
From Eq. (12.2) and assuming its respective probability density function (PDF), $f_{T_i}(x + v_{i-1} \mid v_{i-1})$, one obtains the GRP hazard function:

$$h_{T_i}(x + v_{i-1} \mid v_{i-1}) = \frac{f_{T_1}(x + v_{i-1} \mid v_{i-1})}{1 - F_{T_1}(x + v_{i-1} \mid v_{i-1})}. \tag{12.3}$$
In a general context, the hazard function represents the instantaneous rate of interventions on the system (e.g., corrective or preventive actions) as time evolves. It is arguably the main GRP equation for a general understanding of the condition of the system under study. In fact, the higher $h_{T_i}(\cdot)$, the shorter the time $x$ between consecutive interventions tends to be, reflecting system deterioration when the events of interest are undesirable circumstances (e.g., failures, deaths, errors, bugs, attacks, catastrophes, and so on). The reasoning for system improvement is analogous, with a time-decreasing $h_{T_i}(\cdot)$. In turn, stable systems lead to constant values of $h_{T_i}(\cdot)$. The resulting CDF and hazard functions for $T_i$ in Eqs. (12.2) and (12.3) encapsulate the previous performance of the system (in terms of both technology and intervention process) by means of the virtual age $v_{i-1}$.
12.2.1 Maximum likelihood estimation of the GRP model
Among the classical frameworks for statistical inference, maximum likelihood estimation (MLE) has been highlighted due to its asymptotic properties. In MLE, the parameter estimates are the ones that maximize the joint PDF of the variables under study in the light of the data [5]. Considering $y = \{y_1, y_2, \ldots, y_n\}$ and $x = \{x_1, x_2, \ldots, x_n\}$ as the sample of observed types and times between interventions, one has the following joint PDF:

$$f(x \mid \vartheta) = f(x_1 + v_0 \mid v_0, \vartheta) \cdot f(x_2 + v_1 \mid v_1, \vartheta) \cdots f(x_n + v_{n-1} \mid v_{n-1}, \vartheta), \tag{12.4}$$

where $\vartheta = \{\theta_1, \theta_2, \ldots, \theta_p\}$ is a $p$-dimensional parameter set and $\ell = \ln f(x \mid \vartheta)$ is the respective log-likelihood (LL) function. One of the main challenges in GRP exercises is to estimate $\vartheta$. It must be emphasized that each paired history $((x_1, y_1), \ldots, (x_i, y_i))$ is encapsulated by the respective virtual age $v_i$.
12.2.2 The asymptotic confidence intervals of the GRP parameters
Under the regularity conditions mentioned by authors such as Cramér [8, Section 33], Meeker and Escobar [30, Appendix B], and Cordeiro [7, Subsection 4.1.3], and assuming that the variables in $X$ are iid and that $\hat{\vartheta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_p)$ is a consistent solution of the first-order condition of the log-likelihood function, $\partial \ell(\vartheta) / \partial \vartheta = 0$ (the maximum likelihood estimators), where $\vartheta = (\theta_1, \theta_2, \ldots, \theta_p)$, the following convergence in distribution holds:

$$\sqrt{n}\,(\hat{\vartheta} - \vartheta) \xrightarrow{D} N_p\big(0, I_1^{-1}(\vartheta)\big),$$

that is to say, as $n$ increases, the distribution of $\hat{\vartheta}$ approaches a $p$-dimensional normal distribution with mean $\vartheta$ and covariance matrix $I^{-1}(\vartheta) = n^{-1} I_1^{-1}(\vartheta)$, where $I(\vartheta) = n I_1(\vartheta)$ is the Fisher information matrix of the sample and $I_1(\vartheta)$ is the information of a single observation, allowing one to compute confidence intervals for the elements in $\vartheta$. Thus, the maximum likelihood estimators, $\hat{\vartheta}$, approximately follow a normal distribution with mean $\vartheta$ and covariance matrix $I^{-1}(\vartheta)$, that is, $\hat{\vartheta} \sim N\big(\vartheta, I^{-1}(\vartheta)\big)$, and the $100 \cdot (1 - \eta)\%$ confidence interval of each element of $\vartheta$ can be approximated by:

$$\hat{\theta}_{i\,obs} \pm z_{\eta/2} \sqrt{I^{-1}_{ii}(\vartheta)_{obs}}, \tag{12.5}$$

where $i = 1, 2, \ldots, p$, $z_{\eta/2}$ is the $\eta/2$ quantile of the standard normal distribution, and $I^{-1}(\vartheta)_{obs}$ is the instance of $I^{-1}(\vartheta)$ in the light of the data $(x, y)$.
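As a hedged illustration of Eq. (12.5), the sketch below computes the interval for one parameter from its estimate and the corresponding diagonal element of the observed inverse information matrix. The function name `wald_interval` and its arguments are assumptions introduced here; the upper quantile $z_{1-\eta/2}$ is used so that the half-width is positive, which yields the same interval as the $\pm z_{\eta/2}$ form.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wald_interval(estimate: float, variance: float,
                  eta: float = 0.05) -> Tuple[float, float]:
    """Asymptotic 100*(1 - eta)% confidence interval in the spirit of Eq. (12.5).

    `variance` stands for the i-th diagonal element of the observed inverse
    Fisher information matrix, i.e. the estimated variance of the estimator.
    """
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - eta / 2.0)  # standard normal quantile
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate + half_width
```

For example, an estimate of 2.0 with estimated variance 0.04 and $\eta = 0.05$ gives an interval of roughly (1.608, 2.392).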
12.2.3 Inverse function of the GRP model
For the sake of simulation, each realization $x = (x_1, \cdots, x_i, \cdots, x_n)$ can be computed via the inverse transform method. This method makes use of the equation $u_i = R_{T_i}(x_i + v_{i-1} \mid v_{i-1})$, where $R_{T_i}(x_i + v_{i-1} \mid v_{i-1}) = 1 - F_{T_i}(x_i + v_{i-1} \mid v_{i-1})$ is the reliability (survival) function underlying the GRP model and $u_i$ is an instance of $U_i \sim \text{Uniform}[0, 1]$. Since $1 - U_i$ is also Uniform$[0, 1]$, $u_i$ can equivalently be equated to the CDF. Thus, from Eq. (12.2), one has the GRP realization:

$$F^{-1}(u_i) = \inf\{x_i \in \mathbb{R} \mid u_i \le F_{T_i}(x_i + v_{i-1} \mid v_{i-1})\}, \quad 0 \le u_i \le 1. \tag{12.6}$$

In what follows, the GRP formalism is customized to the uniform, Weibull, and Gumbel probability distributions.
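When no closed-form inverse is available, Eqs. (12.2) and (12.6) can be combined numerically. The sketch below is one possible approach under stated assumptions: the baseline CDF of $T_1$ is passed as a plain callable, the function names are hypothetical, and bisection is a technique chosen here for illustration rather than one prescribed by the chapter.

```python
import math
from typing import Callable


def grp_conditional_cdf(cdf_t1: Callable[[float], float],
                        x: float, v_prev: float) -> float:
    """Eq. (12.2): CDF of the time x to the next intervention given virtual age v_prev."""
    cdf_at_v = cdf_t1(v_prev)
    return (cdf_t1(x + v_prev) - cdf_at_v) / (1.0 - cdf_at_v)


def grp_inverse(cdf_t1: Callable[[float], float], u: float, v_prev: float,
                tol: float = 1e-10) -> float:
    """Solve F(x + v_prev | v_prev) = u for x >= 0 by bisection (cf. Eq. (12.6))."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must satisfy 0 <= u < 1")
    lower, upper = 0.0, 1.0
    # Grow the bracket until the conditional CDF reaches u.
    while grp_conditional_cdf(cdf_t1, upper, v_prev) < u:
        upper *= 2.0
    # Shrink the bracket around the smallest x with CDF >= u.
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if grp_conditional_cdf(cdf_t1, mid, v_prev) < u:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)
```

A quick sanity check is the exponential baseline: it is memoryless, so the conditional CDF does not depend on the virtual age and the inverse at $u = 0.5$ is the usual median.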
12.3 The UGRP modeling
The simplest GRP model is the one based on the uniform distribution, say UGRP, introduced by Oliveira et al. [35].

12.3.1 The continuous uniform model
Let $T_i$ be a uniform random variable over the interval $[a, b]$. Then the CDF of $T_i$ is given by:

$$F_{T_i}(t \mid a, b) = \begin{cases} 0, & t < a, \\ \dfrac{t - a}{b - a}, & a \le t \le b, \\ 1, & t > b. \end{cases} \tag{12.7}$$

12.3.2 The UGRP model
The UGRP CDF, considering Eqs. (12.2) and (12.7), can be given by

$$F_{T_i}(x + v_{i-1} \mid v_{i-1}, a, b) = \frac{F_{T_1}(x + v_{i-1}) - F_{T_1}(v_{i-1})}{1 - F_{T_1}(v_{i-1})} = \frac{\dfrac{x + v_{i-1} - a}{b - a} - \dfrac{v_{i-1} - a}{b - a}}{1 - \dfrac{v_{i-1} - a}{b - a}} = \begin{cases} 1, & x + v_{i-1} > b, \\ \dfrac{x}{b - v_{i-1}}, & a \le x + v_{i-1} \le b, \\ 0, & \text{otherwise.} \end{cases} \tag{12.8}$$

The respective UGRP PDF is:

$$f_{T_i}(x + v_{i-1} \mid v_{i-1}, a, b) = \begin{cases} \dfrac{1}{b - v_{i-1}}, & a < x + v_{i-1} < b, \\ 0, & \text{otherwise.} \end{cases}$$

In turn, the UGRP hazard function is given by:

$$h_{T_i}(x + v_{i-1} \mid v_{i-1}, a, b) = \frac{f_{T_1}(x + v_{i-1} \mid v_{i-1}, a, b)}{1 - F_{T_1}(x + v_{i-1} \mid v_{i-1}, a, b)} = \frac{1}{b - (x + v_{i-1})}. \tag{12.9}$$

From Eq. (12.9), one can see that the greater the argument $(x + v_{i-1})$, the greater the hazard, indicating that UGRP is only adequate to fit deteriorating systems. Thus, it is unable to model stable or improving systems. According to Eqs. (12.6) and (12.8), the UGRP inverse function is given by

$$F^{-1}(u_i) = x_i = u_i \times (b - v_{i-1}); \quad i = 2, 3, \ldots, \tag{12.10}$$

where $u_i$ is a random number from the interval $[0, 1]$, $v_{i-1}$ is computed according to Eq. (12.1), and $x_1$ comes from a Uniform$(a, b)$ distribution.
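Eq. (12.10) translates directly into a small simulator. The sketch below makes several simplifying assumptions that are not in the source: the virtual age follows the Kijima Type I special case of Eq. (12.1) ($c_{y_i} = 1$ for every intervention), $v_0 = 0$, and $0 \le q \le 1$ so that the virtual age never exceeds $b$. The function name `simulate_ugrp` is hypothetical.

```python
import random
from typing import List, Optional


def simulate_ugrp(n: int, a: float, b: float, q: float,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Simulate n times between interventions from a UGRP via Eq. (12.10).

    Assumes Kijima Type I virtual ages (v_i = v_{i-1} + q * x_i), v_0 = 0,
    n >= 1, and 0 <= q <= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    times = [rng.uniform(a, b)]   # x_1 comes from Uniform(a, b)
    v_prev = q * times[0]         # v_1 = v_0 + q * x_1 with v_0 = 0
    for _ in range(1, n):
        x_i = rng.random() * (b - v_prev)  # Eq. (12.10)
        times.append(x_i)
        v_prev += q * x_i                  # Kijima Type I update
    return times
```

Because each $x_i$ is at most $b - v_{i-1}$, the quantity $x_i + v_{i-1}$ never exceeds $b$ in this sketch, in line with the bounded support of the uniform model.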
12.4 The WGRP modeling
The Weibull-based GRP (WGRP) is the most used GRP model. It has been applied in a number of works (e.g., [14, 18, 34–36, 49]).

12.4.1 The Weibull model
Let $X = (X_1, X_2, \cdots, X_n)$ be a random vector in which $X_i$ follows a Weibull distribution ($i = 1, 2, \cdots, n$) [43] with parameters $(\alpha, \beta)$. The CDF of $X_i$ is given by

$$F_{X_i}(x_i \mid \alpha, \beta) = \begin{cases} 1 - \exp\left[-\left(\dfrac{x_i}{\alpha}\right)^{\beta}\right], & x_i \ge 0, \\ 0, & x_i < 0, \end{cases} \tag{12.11}$$

where $\alpha > 0$ and $\beta > 0$ are the scale and shape parameters, respectively. In particular, the $k$th moment of $X_i$ is given by

$$E(X_i^k) = \alpha^k\,\Gamma\left(1 + \frac{k}{\beta}\right),$$

where $\Gamma(\cdot)$ is the Gamma function

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt; \quad z > 0.$$

The shape of the Weibull PDF depends on the value of $\beta$ [13, p. 82]. For example, for $\beta < 1$ the PDF is similar in form to the exponential model; for $\beta = 1$ the PDF is equal to the exponential model; for $1 < \beta < 3$ the PDF is somewhat skewed; and for large values of $\beta$ (from $\beta \ge 3$) the PDF approaches a normal model.
12.4.2 The WGRP model
Initially, Smith and Leadbetter [41] proposed an iterative solution to the renewal equation in cases where the times between interventions follow a Weibull model. From there, the Weibull model in Eq. (12.11) has been instrumental in modeling the times between interventions via the GRP formalism. For example:
• $\beta = 1$ and independent variables in $X$ lead to an HPP, based on an exponential model (with constant hazard function);
• $\beta > 0$ ($\beta \ne 1$) and $q = 1$ lead to a Weibull-based NHPP (minimal intervention), reflecting interventions that lead the system to an "as bad as old" condition, and thus dependent variables in $X$;
• $\beta > 0$ ($\beta \ne 1$) and $q = 0$ lead to a Weibull-based RP (perfect intervention), reflecting interventions that lead the system to an "as good as new" condition, and thus iid Weibull variables in $X$; and
• $\beta > 0$ ($\beta \ne 1$) and $q \in \mathbb{R}$, $q \notin \{0, 1\}$ (generic interventions), reflecting interventions that lead the system to conditions other than "as bad as old" and "as good as new", and thus dependent variables in $X$, demanding the usage of the virtual age functions.
Therefore, WGRP extends the HPP, RP, and NHPP models. In the literature, it has been usual to fit the WGRP in the analysis of repairable systems, as its parameters are very flexible and allow physical interpretation. Some examples can be found in [14, 35, 36, 49].
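Based on [49], we have the following WGRP CDF in terms of $T_i$ at the point $(x + v_{i-1})$:

$$\begin{aligned} F_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) &= \frac{F_{T_1}(x + v_{i-1}) - F_{T_1}(v_{i-1})}{1 - F_{T_1}(v_{i-1})} \\ &= \frac{1 - \exp\left[-\left(\frac{x + v_{i-1}}{\alpha}\right)^{\beta}\right] - 1 + \exp\left[-\left(\frac{v_{i-1}}{\alpha}\right)^{\beta}\right]}{1 - 1 + \exp\left[-\left(\frac{v_{i-1}}{\alpha}\right)^{\beta}\right]} \\ &= \frac{-\exp\left[-\left(\frac{x + v_{i-1}}{\alpha}\right)^{\beta}\right] + \exp\left[-\left(\frac{v_{i-1}}{\alpha}\right)^{\beta}\right]}{\exp\left[-\left(\frac{v_{i-1}}{\alpha}\right)^{\beta}\right]} \\ &= \begin{cases} 1 - \exp\left[\left(\dfrac{v_{i-1}}{\alpha}\right)^{\beta} - \left(\dfrac{x + v_{i-1}}{\alpha}\right)^{\beta}\right], & (x + v_{i-1}) \ge 0, \\ 0, & (x + v_{i-1}) < 0, \end{cases} \end{aligned} \tag{12.12}$$

where $v_i = v_i(q, x_1, y_1, x_2, y_2, \ldots, x_i, y_i)$ is given by Eq. (12.1) and $\alpha, \beta > 0$. One should note that $T_1 = X_1$. Again, it must be emphasized that, under the GRP formalism, the uncertainty regarding $[T_i \mid v_{i-1}]$ is the same as that underlying $[T_1 \mid v_{i-1}]$. The respective WGRP PDF is:

$$f_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) = \frac{\beta}{\alpha}\left(\frac{x + v_{i-1}}{\alpha}\right)^{\beta - 1} \exp\left[\left(\frac{v_{i-1}}{\alpha}\right)^{\beta} - \left(\frac{x + v_{i-1}}{\alpha}\right)^{\beta}\right] \tag{12.13}$$

for $(x + v_{i-1}) \ge 0$. In Eq. (12.13), $\alpha > 0$, $\beta > 0$, $q \in (-\infty, +\infty)$, and $x > 0$. In turn, the WGRP hazard function is given by:

$$h_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) = \frac{f_{T_1}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)}{1 - F_{T_1}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)} = \frac{\beta}{\alpha^{\beta}}\,(x + v_{i-1})^{\beta - 1}. \tag{12.14}$$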
Thus, from Eq. (12.14), we can see that $\beta = 1$ leads to a constant hazard function, reflecting stable systems. In turn, $\beta < 1$ ($\beta > 1$) makes $h_{T_i}(\cdot)$ a time-decreasing (increasing) function, characterizing improving (deteriorating) systems. Authors like Ferreira et al. [14] have discussed the meaning of $q$ depending on $\beta$. In fact, if $\beta = 1$, $q$ is irrelevant, while if $\beta < 1$, it is desirable that $q$ increases, thereby raising the argument of $h_{T_i}(\cdot)$, which is time-decreasing. Similarly, if $\beta > 1$, it is desirable that $q \to 0$. In practical terms, $\beta$ is considered to reflect the underlying condition of the system itself, while $q$ summarizes the quality of the intervention process. Therefore, $\beta < 1$ together with $q \to 0$ merely signals room for improvement in the intervention process, since $q > 0$ is desirable in this case.
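The role of $\beta$ described above can be checked numerically with a direct transcription of Eq. (12.14). This is a minimal sketch; the names `wgrp_hazard` and `system_trend` are illustrative assumptions, and the argument $x + v_{i-1}$ is taken to be strictly positive.

```python
def wgrp_hazard(x: float, v_prev: float, alpha: float, beta: float) -> float:
    """WGRP hazard of Eq. (12.14): (beta / alpha**beta) * (x + v_prev)**(beta - 1).

    Assumes alpha > 0, beta > 0, and x + v_prev > 0.
    """
    return (beta / alpha ** beta) * (x + v_prev) ** (beta - 1.0)


def system_trend(beta: float) -> str:
    """Qualitative reading of the shape parameter, as discussed in the text."""
    if beta < 1.0:
        return "improving"      # time-decreasing hazard
    if beta > 1.0:
        return "deteriorating"  # time-increasing hazard
    return "stable"             # constant hazard
```

With `beta = 1` the hazard equals `1 / alpha` regardless of `x` and `v_prev`, so neither the elapsed time nor the virtual age (and hence `q`) affects it.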
According to Eqs. (12.6) and (12.12), the inverse WGRP function is given by

$$F^{-1}(u_i) = x_i = \alpha\left[\left(\frac{v_{i-1}}{\alpha}\right)^{\beta} - \ln(u_i)\right]^{1/\beta} - v_{i-1}; \quad i = 1, 2, \ldots, \tag{12.15}$$

where $u_i$ is a random number from the interval $[0, 1]$, $v_{i-1}$ is computed according to Eq. (12.1), and $v_0 = 0$.
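A WGRP trajectory can be simulated by alternating Eq. (12.15) with the virtual age update of Eq. (12.1). The sketch below is illustrative only: the function names are hypothetical, a single weight `c` is applied to all interventions for brevity, and `q >= 0` is assumed so that the virtual age stays non-negative and the fractional powers remain real.

```python
import math
import random
from typing import List, Optional


def wgrp_inverse(u: float, v_prev: float, alpha: float, beta: float) -> float:
    """Eq. (12.15): time to the next intervention for u in (0, 1] and v_prev >= 0."""
    return alpha * ((v_prev / alpha) ** beta - math.log(u)) ** (1.0 / beta) - v_prev


def simulate_wgrp(n: int, alpha: float, beta: float, q: float, c: float = 1.0,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Simulate n times between interventions from a WGRP with v_0 = 0.

    Assumes q >= 0 and one common weight c in [0, 1] for every intervention.
    """
    rng = rng or random.Random()
    times = []
    v_prev = 0.0
    for _ in range(n):
        u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
        x_i = wgrp_inverse(u, v_prev, alpha, beta)
        times.append(x_i)
        # Mixed Kijima update of Eq. (12.1)
        v_prev = c * (v_prev + q * x_i) + (1.0 - c) * q * (v_prev + x_i)
    return times
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing scenarios that differ only in `q`.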
12.4.3 Maximum likelihood estimation of the WGRP model
From the WGRP PDF in Eq. (12.13), one has the following joint PDF to be optimized according to the values of $(\alpha, \beta, q, c_y)$:

$$\begin{aligned} f(x, y \mid \alpha, \beta, q, c_y) &= f(x_1 + v_0 \mid v_0, \alpha, \beta) \cdot f(x_2 + v_1 \mid v_1, \alpha, \beta) \cdots f(x_n + v_{n-1} \mid v_{n-1}, \alpha, \beta) \\ &= \frac{\beta^n}{\alpha^{n\beta}} \left[\prod_{i=1}^{n} (x_i + v_{i-1})^{\beta - 1}\right] \exp\left\{-\frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} \left[(x_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}\right]\right\}, \end{aligned} \tag{12.16}$$

where $x = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ is a sample of times between interventions and $c_y$ is the vector of weights corresponding to the intervention types under study. Let $\ell = \ln f(x, y \mid \alpha, \beta, q, c_y)$ be the LL function related to Eq. (12.16):

$$\ell = n\left[\ln(\beta) - \beta \ln(\alpha)\right] + (\beta - 1) \sum_{i=1}^{n} \ln(x_i + v_{i-1}) - \frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} \left[(x_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}\right]. \tag{12.17}$$

In the light of the data $(x, y)$, setting the derivative of Eq. (12.17) with respect to each WGRP parameter equal to zero leads to a system of nonlinear equations whose solution yields the MLE estimates, say $(\hat{\alpha}_{obs}, \hat{\beta}_{obs}, \hat{q}_{obs}, \hat{c}_{y\,obs})$, for $(\alpha, \beta, q, c_y)$. WGRP researchers have worked with algorithms dedicated to obtaining approximate solutions to this optimization problem, but exact closed-form solutions remain challenging.
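The log-likelihood of Eq. (12.17) is straightforward to evaluate once the virtual ages are updated alongside the sums. The sketch below only evaluates $\ell$ for given parameter values; it does not perform the maximization, which would require a numerical optimizer. The function name and argument layout are assumptions, with `c[i]` standing for the weight $c_{y_i}$ of the $i$th intervention, $v_0 = 0$, and $q \ge 0$.

```python
import math
from typing import Sequence


def wgrp_log_likelihood(x: Sequence[float], c: Sequence[float],
                        alpha: float, beta: float, q: float) -> float:
    """WGRP log-likelihood of Eq. (12.17), with virtual ages from Eq. (12.1).

    Assumes positive times x, weights c in [0, 1], alpha > 0, beta > 0, q >= 0.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    n = len(x)
    v_prev = 0.0
    sum_log = 0.0   # sum of ln(x_i + v_{i-1})
    sum_pow = 0.0   # sum of (x_i + v_{i-1})**beta - v_{i-1}**beta
    for x_i, c_i in zip(x, c):
        age = x_i + v_prev
        sum_log += math.log(age)
        sum_pow += age ** beta - v_prev ** beta
        v_prev = c_i * (v_prev + q * x_i) + (1.0 - c_i) * q * (v_prev + x_i)
    return (n * (math.log(beta) - beta * math.log(alpha))
            + (beta - 1.0) * sum_log
            - sum_pow / alpha ** beta)
```

Two special cases offer a check: with `beta = 1` the value reduces to the exponential log-likelihood whatever `q` is, and with `q = 0` it reduces to the sum of ordinary Weibull log-densities.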
12.4.4 From WGRP to HPP
In this section, both the Weibull power transformation (Subsection 12.4.4.1) and the WGRP power transformation (Subsection 12.4.4.2) are briefly reviewed [36]. The idea is to clarify how to convert a realization from a WGRP with parameters $\alpha$, $\beta$, $q$, and $c_y$ into a sequence of instances of iid exponential variables with mean $\theta = \alpha^{\beta}$. In this way, let $X = \{X_i\}_{i=1}^{n}$ be the sequence of random times between point interventions to be performed in a given repairable system. It must be highlighted that point interventions require a negligible time to be performed. In turn, let $T = \{T_i\}_{i=1}^{n}$ be the cumulative times to intervene, that is, $T_i = \sum_{j=1}^{i} X_j$. It is usually assumed that $T_0 = 0$, though caution must be taken in some circumstances.
12.4.4.1 The simple power transformation
Many distributions have been developed from Eq. (12.11), such as the Exponentiated Weibull, the Modified Weibull, the Inverse Weibull, and the Weibull-based stochastic point processes (see [35] and [27]). Specifically, one of these distributions comes from the power transformation $W_i = X_i^{\beta_i}$, which follows the distribution

$$F_{W_i}(w_i) = P(W_i \le w_i) = P\left(X_i^{\beta_i} \le w_i\right) = P\left(X_i \le w_i^{1/\beta_i}\right)$$

and therefore

$$F_{W_i}(w_i \mid \alpha_i, \beta_i) = \begin{cases} 1 - \exp\left(-\dfrac{w_i}{\alpha_i^{\beta_i}}\right), & w_i \ge 0, \\ 0, & w_i < 0, \end{cases} \tag{12.18}$$

where $\alpha_i, \beta_i > 0$. Thus, from Eq. (12.18), $W_i$ follows an exponential distribution, say $W_i \sim \text{Exponential}\left(\alpha_i^{\beta_i}\right)$, implying $X_i^{\beta_i} \sim \text{Exponential}\left(\alpha_i^{\beta_i}\right)$ (see [51]). Concerning Eq. (12.1), this results from $q = 0$, reflecting in turn an exponential-based RP (i.e., an HPP).

12.4.4.2 The WGRP power transformation
According to Oliveira et al. [35, 36], $W_i$ assumes the so-called WGRP power transformation:

$$W_i = (X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}. \tag{12.19}$$

By definition, $X_i$ is a continuous random variable with PDF $f_{X_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)$, given in Eq. (12.13), and $W_i = \varphi(X_i)$, where $\varphi(\cdot)$ is a strictly monotone and differentiable function in its domain. Thus, the PDF of $W_i$ can be determined as

$$f_{W_i}(w \mid v_{i-1}, \alpha, \beta) = f_{X_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) \cdot \left|\frac{dx}{dw}\right| \tag{12.20}$$

with $x = \varphi^{-1}(w)$. As $X_i + v_{i-1} = \left(W_i + v_{i-1}^{\beta}\right)^{1/\beta}$, consequently

$$\frac{dx}{dw} = \frac{1}{\beta}\left(w + v_{i-1}^{\beta}\right)^{1/\beta - 1}.$$

In turn, $X_i \ge 0$ leads to $\left(W_i + v_{i-1}^{\beta}\right)^{1/\beta} \ge v_{i-1} \Rightarrow W_i \ge 0$. Then, from Eq. (12.20) and assuming $\theta = \alpha^{\beta}$:

$$f_{W_i}(w \mid \theta) = \begin{cases} \dfrac{1}{\theta} \exp\left(-\dfrac{w}{\theta}\right), & w \ge 0, \\ 0, & w < 0, \end{cases} \tag{12.21}$$

with $\theta > 0$.
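From Subsections 12.4.4.1 and 12.4.4.2, and the works of some authors [34, 35, 36], we have the following Theorem 12.4.1:

Theorem 12.4.1 (WGRP power transformation). Let $W = \{W_i\}_{i=1}^{n}$ be a random vector of size $n$, where $W_i = (X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}$, and $X_i$, conditioned on the $(i-1)$th virtual age $v_{i-1}$, follows a $WGRP(\alpha, \beta, q, c_y)$. Thus:
(i) $W_i \sim \text{Exponential}(\theta)$, where $\theta = \alpha^{\beta}$ ($> 0$) is the expected value of $W_i$.
(ii) $W_1, W_2, \cdots, W_n$ are iid.

Proof. (i) From $W_i = (X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}$:

$$\begin{aligned} F_{W_i}(w \mid v_{i-1}, \alpha, \beta) &= P(W_i \le w \mid v_{i-1}, \alpha, \beta) \\ &= P\left((X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta} \le w \mid v_{i-1}, \alpha, \beta\right) \\ &= P\left(X_i + v_{i-1} \le \left(w + v_{i-1}^{\beta}\right)^{1/\beta} \mid v_{i-1}, \alpha, \beta\right). \end{aligned} \tag{12.22}$$

The last step in Eq. (12.22) involves the uncertainty regarding the remaining time to intervene on the system, $X_i + v_{i-1}$, since the last intervention time, adjusted according to its virtual age $v_{i-1}$. Such reasoning is in alignment with the GRP framework. In this way, it is also possible to model $[X_i + v_{i-1} \mid v_{i-1}]$ as $[X_1 + v_{i-1} \mid v_{i-1}]$, so that Eq. (12.12) leads to

$$\begin{aligned} P\left(X_i + v_{i-1} \le \left(w + v_{i-1}^{\beta}\right)^{1/\beta} \mid v_{i-1}, \alpha, \beta\right) &= \frac{F_{T_1}\left(\left(w + v_{i-1}^{\beta}\right)^{1/\beta}\right) - F_{T_1}(v_{i-1})}{1 - F_{T_1}(v_{i-1})} \\ &= \frac{1 - \exp\left(-\dfrac{w + v_{i-1}^{\beta}}{\alpha^{\beta}}\right) - 1 + \exp\left(-\dfrac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right)}{1 - 1 + \exp\left(-\dfrac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right)}. \end{aligned}$$

Thus,

$$F_{W_i}(w \mid v_{i-1}, \theta) = \begin{cases} 1 - \exp\left(-\dfrac{w}{\theta}\right), & w \ge 0, \\ 0, & w < 0, \end{cases}$$

where $\theta = \alpha^{\beta} > 0$.

(ii) The likelihood function is given by Eq. (12.16):

$$f(x \mid \alpha, \beta, q) = \prod_{i=1}^{n} f_{T_i}(x_i + v_{i-1} \mid v_{i-1}, \alpha, \beta). \tag{12.23}$$

Now, to show that the variables $W_1, W_2, \cdots, W_n$ are mutually independent, we need to prove that Eq. (12.23) implies

$$f(w \mid \theta) = \prod_{i=1}^{n} f_{W_i}(w_i \mid \theta). \tag{12.24}$$

By Item (i), the right side of Eq. (12.23) implies the right side of Eq. (12.24). Now, it will be proved that the left side of Eq. (12.23) implies the left side of Eq. (12.24). Making a change of variable in Eq. (12.16), that is, $W_i = (X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}$, and then solving for $(X_i + v_{i-1})$ in terms of $W_i$, we get $X_i + v_{i-1} = \left(W_i + v_{i-1}^{\beta}\right)^{1/\beta}$. The joint PDF of $W_1, W_2, \cdots, W_n$, as in Eq. (12.20), is given by:

$$f_{W}(w_1, \cdots, w_n \mid v_{i-1}, \alpha, \beta) = f\left(\left(w_1 + v_0^{\beta}\right)^{1/\beta}, \cdots, \left(w_n + v_{n-1}^{\beta}\right)^{1/\beta} \mid \alpha, \beta, q\right) |J|, \tag{12.25}$$

where the Jacobian $J$ is defined by:

$$J = \begin{vmatrix} \dfrac{\left(w_1 + v_0^{\beta}\right)^{1/\beta - 1}}{\beta} & 0 & \cdots & 0 \\ 0 & \dfrac{\left(w_2 + v_1^{\beta}\right)^{1/\beta - 1}}{\beta} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{\left(w_n + v_{n-1}^{\beta}\right)^{1/\beta - 1}}{\beta} \end{vmatrix} = \prod_{i=1}^{n} \frac{\left(w_i + v_{i-1}^{\beta}\right)^{1/\beta - 1}}{\beta}.$$

Thus, according to Eq. (12.25), we have:

$$\begin{aligned} \prod_{i=1}^{n} f_{W_i}(w_i \mid \alpha, \beta) &= \frac{\beta^n}{\alpha^{n\beta}} \prod_{i=1}^{n} \left(w_i + v_{i-1}^{\beta}\right)^{1 - 1/\beta} \times \exp\left(-\frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} w_i\right) \prod_{i=1}^{n} \frac{\left(w_i + v_{i-1}^{\beta}\right)^{1/\beta - 1}}{\beta} \\ &= \frac{1}{\alpha^{n\beta}} \exp\left(-\frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} w_i\right) \\ &= \frac{1}{\theta^n} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{n} w_i\right). \end{aligned}$$

Therefore, $W_1, W_2, \ldots, W_n$ are iid.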
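The transformation of Theorem 12.4.1 can be applied to a data set by tracking the virtual ages alongside Eq. (12.19). The following sketch is an illustration under assumptions not stated in the source: the function name `wgrp_power_transform` is hypothetical, `c[i]` stands for the weight $c_{y_i}$ of the $i$th intervention, $v_0 = 0$, and $q \ge 0$.

```python
import math
from typing import List, Sequence


def wgrp_power_transform(x: Sequence[float], c: Sequence[float],
                         beta: float, q: float) -> List[float]:
    """Map WGRP times x_i to w_i = (x_i + v_{i-1})**beta - v_{i-1}**beta, Eq. (12.19).

    Virtual ages follow Eq. (12.1) with v_0 = 0; q >= 0 is assumed.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    w = []
    v_prev = 0.0
    for x_i, c_i in zip(x, c):
        w.append((x_i + v_prev) ** beta - v_prev ** beta)
        v_prev = c_i * (v_prev + q * x_i) + (1.0 - c_i) * q * (v_prev + x_i)
    return w
```

If the times were generated through Eq. (12.15) from uniform numbers $u_i$, substituting that expression into Eq. (12.19) gives $w_i = -\alpha^{\beta} \ln(u_i)$, which is the inverse-transform draw of an exponential variable with mean $\theta = \alpha^{\beta}$, in agreement with Item (i).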
12.4.5 Some contributions of power law transformation
In this subsection, some contributions of Theorem 12.4.1 are presented. In particular, central moments (see [34]), a goodness-of-fit test (see [35]), and asymptotic confidence intervals (see [36]) for the WGRP model are summarized.
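12.4.5.1 Moments for the WGRP model
Knowledge of the moments in a probability model, when they all exist, is generally equivalent to knowledge of the model, in the sense that it is possible to express all the properties of the model in terms of the moments [9].
Theorem 12.4.2 (Moments WGRP). For every positive integer $k$, the $k$th moment of $X + v$, $\mu'_k$, is
\[
\mu'_k = E[(X + v)^{k} \mid \alpha, \beta, v] = \alpha^{k} \cdot \Gamma\!\left(1 + \frac{k}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \cdot \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right). \tag{12.26}
\]
The $k$th central moment of $X + v$, $\mu_k$, is
\[
\begin{aligned}
\mu_k = E\!\left[\left((X + v) - \mu\right)^{k} \mid \alpha, \beta, v\right]
&= \sum_{j=0}^{k} \binom{k}{k-j} \cdot \mu'_{k-j} \cdot [-\mu]^{j} \\
&= \binom{k}{k} \mu'_k - \binom{k}{k-1} \mu'_{k-1}\, \mu + \binom{k}{k-2} \mu'_{k-2}\, \mu^{2} - \cdots + (-1)^{k} \mu^{k},
\end{aligned} \tag{12.27}
\]
where $\mu = \mu'_1 = E[(X + v) \mid \alpha, \beta, v]$.
Proof. First, Eq. (12.26) will be demonstrated. It is known that $W = (X + v)^{\beta} - v^{\beta}$ implies $X + v = (W + v^{\beta})^{1/\beta}$. So, according to Item (i) of Theorem 12.4.1, for $k = 1, 2, \ldots$, the $k$th moment of the variable $X + v$ is
\[
\begin{aligned}
E[(X + v)^{k} \mid \alpha, \beta, v] = E\!\left[(W + v^{\beta})^{k/\beta}\right]
&= \int_{0}^{\infty} (w + v^{\beta})^{k/\beta} \cdot \frac{1}{\alpha^{\beta}} \cdot \exp\!\left(-\frac{w}{\alpha^{\beta}}\right) dw \\
&= \int_{0}^{\infty} \frac{(w + v^{\beta})^{k/\beta}}{(\alpha^{\beta})^{k/\beta}} \cdot \frac{(\alpha^{\beta})^{k/\beta}}{\alpha^{\beta}} \cdot \exp\!\left(-\frac{w}{\alpha^{\beta}}\right) dw.
\end{aligned}
\]
Now, setting $t = \dfrac{w + v^{\beta}}{\alpha^{\beta}}$, it follows that $-\dfrac{w}{\alpha^{\beta}} = -t + \dfrac{v^{\beta}}{\alpha^{\beta}}$ and $dw = \alpha^{\beta}\, dt$. When $w \downarrow 0$ it follows that $t \downarrow \dfrac{v^{\beta}}{\alpha^{\beta}}$, and $w \uparrow \infty$ implies $t \uparrow \infty$. Thus,
\[
\begin{aligned}
E[(X + v)^{k} \mid \alpha, \beta, v]
&= \int_{v^{\beta}/\alpha^{\beta}}^{\infty} t^{k/\beta} \cdot \frac{\alpha^{k}}{\alpha^{\beta}} \cdot \exp\!\left(-t + \frac{v^{\beta}}{\alpha^{\beta}}\right) \alpha^{\beta}\, dt \\
&= \alpha^{k} \cdot \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right) \cdot \int_{v^{\beta}/\alpha^{\beta}}^{\infty} t^{\frac{k}{\beta} + 1 - 1} \cdot \exp(-t)\, dt.
\end{aligned}
\]
The incomplete Gamma function is defined by [1]:
\[
\Gamma(a, z) = \int_{z}^{\infty} t^{a-1} \cdot e^{-t}\, dt \quad \text{for } a > 0 \text{ and } z > 0.
\]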
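Therefore, the closed form for the $k$th WGRP moment is
\[
E[(X + v)^{k} \mid \alpha, \beta, v] = \alpha^{k} \cdot \Gamma\!\left(1 + \frac{k}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \cdot \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right).
\]
To prove Eq. (12.27), one only needs to use the Binomial Theorem and the result in Eq. (12.26).
Example 12.4.1. The expectation of the WGRP model is
\[
\mu = E[(X + v) \mid \alpha, \beta, v] = \alpha \cdot \Gamma\!\left(1 + \frac{1}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \cdot \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right).
\]
This result is presented in Ferreira et al. [14]. These authors found a closed form only for the first-order moment.
Example 12.4.2. The variance of the WGRP model is
\[
\begin{aligned}
\operatorname{Var}[(X + v) \mid \alpha, \beta, v] = \mu_2 &= \mu'_2 - (\mu'_1)^{2} \\
&= \alpha^{2}\, \Gamma\!\left(1 + \frac{2}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right)
 - 2 \alpha \mu\, \Gamma\!\left(1 + \frac{1}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right) + \mu^{2} \\
&= \alpha^{2} \cdot \left[ \Gamma\!\left(1 + \frac{2}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{v^{\beta}}{\alpha^{\beta}}\right)
 - \Gamma\!\left(1 + \frac{1}{\beta}, \frac{v^{\beta}}{\alpha^{\beta}}\right)^{2} \exp\!\left(\frac{2 v^{\beta}}{\alpha^{\beta}}\right) \right].
\end{aligned}
\]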
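The closed forms in Eq. (12.26) and Example 12.4.2 can be evaluated with any routine for the upper incomplete Gamma function. The sketch below is one possible standard-library version, assuming the textbook series and continued-fraction expansions of the regularized incomplete Gamma function; the function names are hypothetical and this is not the computation used by the cited authors. A convenient sanity check is $\beta = 1$, where the memoryless exponential case gives $E[X + v] = \alpha + v$ and $\operatorname{Var}[X + v] = \alpha^{2}$.

```python
import math


def upper_incomplete_gamma(a, z, eps=1e-14, max_iter=10_000):
    """Gamma(a, z) = integral from z to infinity of t^(a-1) e^(-t) dt, a > 0, z >= 0."""
    if z == 0.0:
        return math.gamma(a)
    log_pref = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Power series for the regularized lower function P(a, z); Q = 1 - P.
        term = total = 1.0 / a
        ap = a
        for _ in range(max_iter):
            ap += 1.0
            term *= z / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        q_reg = 1.0 - total * math.exp(log_pref)
    else:
        # Continued fraction (modified Lentz) for the regularized upper function Q(a, z).
        tiny = 1e-300
        b = z + 1.0 - a
        c = 1.0 / tiny
        d = 1.0 / b
        h = d
        for i in range(1, max_iter):
            an = -i * (i - a)
            b += 2.0
            d = an * d + b
            d = tiny if abs(d) < tiny else d
            c = b + an / c
            c = tiny if abs(c) < tiny else c
            d = 1.0 / d
            delta = d * c
            h *= delta
            if abs(delta - 1.0) < eps:
                break
        q_reg = h * math.exp(log_pref)
    return q_reg * math.gamma(a)


def wgrp_raw_moment(k, alpha, beta, v):
    """k-th raw moment of X + v from Eq. (12.26)."""
    z = (v / alpha) ** beta
    return alpha ** k * upper_incomplete_gamma(1.0 + k / beta, z) * math.exp(z)


def wgrp_variance(alpha, beta, v):
    """Variance of X + v as in Example 12.4.2."""
    return wgrp_raw_moment(2, alpha, beta, v) - wgrp_raw_moment(1, alpha, beta, v) ** 2


if __name__ == "__main__":
    print(wgrp_raw_moment(1, 2.0, 1.0, 3.0))  # exponential case: alpha + v
    print(wgrp_variance(2.0, 1.0, 3.0))       # exponential case: alpha^2
```

For very large $v^{\beta}/\alpha^{\beta}$ the factor $\exp(v^{\beta}/\alpha^{\beta})$ may overflow; working on the log scale would be a natural refinement.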
12.4.5.2 Goodness-of-fit test for the WGRP (WGRP GOFT)
From Oliveira et al. [35], the WGRP GOFT involves the following hypothesis test, based on the performance data set $(x, y) = ((x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n))$:
H0: The sample $(x, y)$ comes from a WGRP;
H1: The sample $(x, y)$ does not come from a WGRP.
Let TWGRP be the vector resulting from the WGRP power transformation in Eq. (12.19), according to a given set of parameter estimates. In other terms, TWGRP $= (W_1, \ldots, W_i, \ldots, W_n)$, in which $W_i = (X_i + v_{i-1})^{\beta} - v_{i-1}^{\beta}$, with $v_i$ given by Eq. (12.1). Thus a WGRP GOFT approach can be easily designed. Oliveira et al. [35] have introduced algorithms in this way. In summary, it is argued that the sample $(x, y)$ will come from a WGRP only if the respective TWGRP realization $(w)$ comes from an exponential distribution; otherwise $(x, y)$ would come from another stochastic process. Obviously, this GOFT for the exponential distribution could be replaced by an HPP GOFT as well. Oliveira et al. [35] have worked with the Kolmogorov-Smirnov (K-S), Bartlett (B), Cramér-von Mises (C-M), and Anderson-Darling (A-D) statistics.
Let $D$ generically represent any of the aforementioned statistics and let $d$ be an instance of $D$. The hypothesis test is then based on $p^{*} = P(D \ge d \mid H_0 \text{ is true})$, the probability of observing a statistic at least as extreme as $d$, given that $H_0$ is true. The probability $p^{*}$ is also named the p-value. This measure decreases as the discrepancies observed between the WGRP model and the empirical data increase. Therefore, the smaller the p-value, the weaker the evidence that $(x, y)$ comes from a WGRP. The hypothesis $H_0$ is then rejected at a significance level of $\eta$ if $p^{*} < \eta$.
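As an illustration of the final step of the test, the sketch below computes the Kolmogorov-Smirnov distance between a transformed sample $w$ and the exponential distribution with mean $\theta$, together with an approximate p-value. It is only a sketch under stated assumptions: the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, it ignores the fact that $\theta$ is usually estimated from the same data, and it is not the algorithm of Oliveira et al. [35]. The function name is hypothetical.

```python
import math


def ks_exponential(w, theta):
    """KS distance between sample w and Exp(mean=theta), with an approximate p-value."""
    n = len(w)
    cdf = sorted(1.0 - math.exp(-wi / theta) for wi in w)
    # Largest gap between the empirical and the hypothesized CDF.
    d = max(max((i + 1) / n - f, f - i / n) for i, f in enumerate(cdf))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    theta = 2.0
    # Sample placed exactly at the (i - 0.5)/n quantiles of Exp(theta).
    w = [-theta * math.log(1.0 - (i - 0.5) / 10) for i in range(1, 11)]
    d, p = ks_exponential(w, theta)
    eta = 0.05
    print(d, p, "reject H0" if p < eta else "do not reject H0")
```

The same pattern applies to the other statistics mentioned above: compute the statistic on $w$, obtain $p^{*}$, and compare it with $\eta$.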
12.4.6 Asymptotic confidence intervals for the WGRP parameters
Following the ideas of Oliveira et al. [36], in this section it is shown how to use Theorem 12.4.1 for computing the Fisher information and covariance matrices related to the triple $(\alpha, \beta, q)$. The cases involving more than one intervention type, that is, $c_y$, can also be derived. Specifically, it is presented how these matrices allow one, via asymptotic theory and concepts, to obtain confidence intervals for such parameters in the light of the performance data set $x$ and the respective maximum likelihood estimates $(\hat{\alpha}_{obs}, \hat{\beta}_{obs}, \hat{q}_{obs})$. In this way, let $(\hat{\alpha}, \hat{\beta}, \hat{q})$ be the maximum likelihood estimators of $(\alpha, \beta, q)$. Thus, $(\hat{\alpha}_{obs}, \hat{\beta}_{obs}, \hat{q}_{obs})$ is an instance of $(\hat{\alpha}, \hat{\beta}, \hat{q})$ related to the realization $x$. The algebra is based on the simplification of Eq. (12.17) by using Theorem 12.4.1.
12.4.6.1 A simple and random version of the LL function of WGRP
Without loss of information and considering a random WGRP, $X$, one can introduce the equation $X_i + V_{i-1} = (W_i + V_{i-1}^{\beta})^{1/\beta}$, leading to a random version of the LL function of WGRP in Eq. (12.17) based on Theorem 12.4.1:
\[
\ell_1 = n[\ln(\beta) - \beta \ln(\alpha)] + \frac{(\beta - 1)}{\beta} \sum_{i=1}^{n} \ln\!\left(W_i + V_{i-1}^{\beta}\right) - \frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} W_i, \tag{12.28}
\]
where $V_{i-1}$ is the random version of $v_{i-1}$, due to the uncertainty regarding $X_{i-1}$. In turn, in the light of an instance of $X$, $x$, the equations $\ell$ and $\ell_1$ are equivalent and thus lead to the same parameter estimates $(\hat{\alpha}_{obs}, \hat{\beta}_{obs}, \hat{q}_{obs})$.
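The claimed equivalence can be verified numerically for a given realization. The sketch below compares the transformed form in Eq. (12.28) with a direct sum of conditional Weibull log-densities. It assumes that Eq. (12.17) is that direct sum and that the virtual ages follow a Kijima Type I rule, $v_i = v_{i-1} + q x_i$ with $v_0 = 0$; both are assumptions of this sketch and the function names are hypothetical.

```python
import math


def virtual_ages_kijima1(x, q):
    """v_0, ..., v_{n-1} under the assumed rule v_i = v_{i-1} + q * x_i, v_0 = 0."""
    v = [0.0]
    for xi in x[:-1]:
        v.append(v[-1] + q * xi)
    return v


def loglik_direct(x, alpha, beta, q):
    """Sum of conditional Weibull log-densities (assumed form of Eq. 12.17)."""
    total = 0.0
    for xi, vi in zip(x, virtual_ages_kijima1(x, q)):
        total += (math.log(beta) - beta * math.log(alpha)
                  + (beta - 1.0) * math.log(xi + vi)
                  - ((xi + vi) ** beta - vi ** beta) / alpha ** beta)
    return total


def loglik_transformed(x, alpha, beta, q):
    """Eq. (12.28), written in terms of W_i = (x_i + v_{i-1})^beta - v_{i-1}^beta."""
    v = virtual_ages_kijima1(x, q)
    w = [(xi + vi) ** beta - vi ** beta for xi, vi in zip(x, v)]
    n = len(x)
    return (n * (math.log(beta) - beta * math.log(alpha))
            + (beta - 1.0) / beta * sum(math.log(wi + vi ** beta) for wi, vi in zip(w, v))
            - sum(w) / alpha ** beta)


if __name__ == "__main__":
    x = [1.2, 0.7, 2.5, 0.4]
    print(loglik_direct(x, 2.0, 1.5, 0.3), loglik_transformed(x, 2.0, 1.5, 0.3))
```

Because the two functions agree for every parameter triple, maximizing either one yields the same estimates.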
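12.4.6.2 Fisher information matrix of WGRP
From asymptotic theory [30, Appendix B], the Fisher information matrix (or matrix of expected information) of a WGRP $X$ can be defined as the expected values:
\[
I(\alpha, \beta, q) =
\begin{bmatrix}
E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \alpha^{2}}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \alpha \partial \beta}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \alpha \partial q}\right] \\[2ex]
E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \beta \partial \alpha}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \beta^{2}}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial \beta \partial q}\right] \\[2ex]
E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial q \partial \alpha}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial q \partial \beta}\right] & E\!\left[-\dfrac{\partial^{2} \ell_1}{\partial q^{2}}\right]
\end{bmatrix} \tag{12.29}
\]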
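It is well known that the inverse of Eq. (12.29), say $I^{-1}(\alpha, \beta, q)$, is the covariance matrix of WGRP with respect to its parameter estimators. In turn, one has
\[
\frac{\partial^{2} \ell_1}{\partial \alpha^{2}} = \frac{n\beta}{\alpha^{2}} - \frac{\beta^{2}}{\alpha^{\beta+2}} \sum_{i=1}^{n} W_i - \frac{\beta}{\alpha^{\beta+2}} \sum_{i=1}^{n} W_i.
\]
So,
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha^{2}}\right] = -\frac{n\beta}{\alpha^{2}} + \frac{\beta^{2}}{\alpha^{\beta+2}} E\!\left[\sum_{i=1}^{n} W_i\right] + \frac{\beta}{\alpha^{\beta+2}} E\!\left[\sum_{i=1}^{n} W_i\right], \tag{12.30}
\]
and
\[
\frac{\partial^{2} \ell_1}{\partial \alpha \partial \beta} = -\frac{n}{\alpha} - \frac{\beta \ln(\alpha)}{\alpha^{\beta+1}} \sum_{i=1}^{n} W_i + \frac{\beta}{\alpha^{\beta+1}} \frac{\partial}{\partial \beta} \sum_{i=1}^{n} W_i + \frac{1}{\alpha^{\beta+1}} \sum_{i=1}^{n} W_i,
\]
therefore
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial \beta}\right] = \frac{n}{\alpha} + \frac{\beta \ln(\alpha)}{\alpha^{\beta+1}} E\!\left[\sum_{i=1}^{n} W_i\right] - \frac{\beta}{\alpha^{\beta+1}} E\!\left[\frac{\partial}{\partial \beta} \sum_{i=1}^{n} W_i\right] - \frac{1}{\alpha^{\beta+1}} E\!\left[\sum_{i=1}^{n} W_i\right]. \tag{12.31}
\]
Now,
\[
\frac{\partial^{2} \ell_1}{\partial \alpha \partial q} = \frac{\beta}{\alpha^{\beta+1}} \frac{\partial}{\partial q} \sum_{i=1}^{n} W_i,
\]
therefore
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial q}\right] = -\frac{\beta}{\alpha^{\beta+1}} E\!\left[\frac{\partial}{\partial q} \sum_{i=1}^{n} W_i\right],
\qquad
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta \partial \alpha}\right] = E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial \beta}\right]. \tag{12.32}
\]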
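Finally,
\[
\frac{\partial^{2} \ell_1}{\partial \beta^{2}} = -\frac{n}{\beta^{2}} - \frac{\ln(\alpha)^{2}}{\alpha^{\beta}} \sum_{i=1}^{n} W_i + \frac{2 \ln(\alpha)}{\alpha^{\beta}} \frac{\partial}{\partial \beta} \sum_{i=1}^{n} W_i - \frac{1}{\alpha^{\beta}} \frac{\partial^{2}}{\partial \beta^{2}} \sum_{i=1}^{n} W_i,
\]
hence
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta^{2}}\right] = \frac{n}{\beta^{2}} + \frac{\ln(\alpha)^{2}}{\alpha^{\beta}} E\!\left[\sum_{i=1}^{n} W_i\right] - \frac{2 \ln(\alpha)}{\alpha^{\beta}} E\!\left[\frac{\partial}{\partial \beta} \sum_{i=1}^{n} W_i\right] + \frac{1}{\alpha^{\beta}} E\!\left[\frac{\partial^{2}}{\partial \beta^{2}} \sum_{i=1}^{n} W_i\right]. \tag{12.33}
\]
Now, assuming the derivatives of first and second orders of $V_{i-1}$ in Eq. (12.1) with respect to $q$,
\[
\begin{aligned}
V'_{i-1} &= c_{y_{i-1}} \left[V'_{i-2} + x_{i-1}\right] + (1 - c_{y_{i-1}}) \left[V_{i-2} + x_{i-1} + q V'_{i-2}\right], \\
V''_{i-1} &= c_{y_{i-1}} V''_{i-2} + (1 - c_{y_{i-1}}) \left[2 V'_{i-2} + q V''_{i-2}\right],
\end{aligned}
\]
then
\[
\frac{\partial^{2} \ell_1}{\partial \beta \partial q} = \sum_{i=1}^{n} \left( \frac{V'_{i-1}}{\left(W_i + V_{i-1}^{\beta}\right)^{1/\beta}} \right) + \frac{\ln(\alpha)}{\alpha^{\beta}} \frac{\partial}{\partial q} \sum_{i=1}^{n} W_i - \frac{1}{\alpha^{\beta}} \frac{\partial^{2}}{\partial \beta \partial q} \sum_{i=1}^{n} W_i.
\]
As a consequence, we have
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta \partial q}\right] = -E\!\left[\sum_{i=1}^{n} \left( \frac{V'_{i-1}}{\left(W_i + V_{i-1}^{\beta}\right)^{1/\beta}} \right)\right] - \frac{\ln(\alpha)}{\alpha^{\beta}} E\!\left[\frac{\partial}{\partial q} \sum_{i=1}^{n} W_i\right] + \frac{1}{\alpha^{\beta}} E\!\left[\frac{\partial^{2}}{\partial \beta \partial q} \sum_{i=1}^{n} W_i\right], \tag{12.34}
\]
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial q \partial \alpha}\right] = E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial q}\right]
\quad \text{and} \quad
E\!\left[-\frac{\partial^{2} \ell_1}{\partial q \partial \beta}\right] = E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta \partial q}\right].
\]
Further,
\[
\frac{\partial^{2} \ell_1}{\partial q^{2}} = (\beta - 1) \sum_{i=1}^{n} \left( \frac{V''_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1/\beta} - [V'_{i-1}]^{2}}{\left(W_i + V_{i-1}^{\beta}\right)^{2/\beta}} \right) - \frac{1}{\alpha^{\beta}} \frac{\partial^{2}}{\partial q^{2}} \sum_{i=1}^{n} W_i,
\]
implying
\[
E\!\left[-\frac{\partial^{2} \ell_1}{\partial q^{2}}\right] = (1 - \beta) \sum_{i=1}^{n} \left\{ E\!\left[\frac{V''_{i-1}}{\left(W_i + V_{i-1}^{\beta}\right)^{1/\beta}}\right] - E\!\left[\frac{[V'_{i-1}]^{2}}{\left(W_i + V_{i-1}^{\beta}\right)^{2/\beta}}\right] \right\} + \frac{1}{\alpha^{\beta}} E\!\left[\frac{\partial^{2}}{\partial q^{2}} \sum_{i=1}^{n} W_i\right]. \tag{12.35}
\]
In the previous summations, it is considered that $V_0$ tends to zero. The expectations in Eqs. (12.30)–(12.35) can be calculated as follows:
\[
E\!\left[\frac{\partial}{\partial \beta} \sum_{i=1}^{n} W_i\right] = \frac{1}{\beta} E[W_1 \ln(W_1)] + \frac{1}{\beta} \sum_{i=2}^{n} E\!\left[\left(W_i + V_{i-1}^{\beta}\right) \ln\!\left(W_i + V_{i-1}^{\beta}\right)\right] - \underbrace{\sum_{i=1}^{n} V_{i-1}^{\beta} \log(V_{i-1})}_{A_1}, \tag{12.36}
\]
\[
E\!\left[\frac{\partial^{2}}{\partial \beta^{2}} \sum_{i=1}^{n} W_i\right] = \frac{1}{\beta^{2}} E\!\left[W_1 \ln(W_1)^{2}\right] + \frac{1}{\beta^{2}} \sum_{i=2}^{n} E\!\left[\left(W_i + V_{i-1}^{\beta}\right) \ln\!\left(W_i + V_{i-1}^{\beta}\right)^{2}\right] - \underbrace{\sum_{i=1}^{n} V_{i-1}^{\beta} \log(V_{i-1})^{2}}_{A_2}, \tag{12.37}
\]
\[
E\!\left[\frac{\partial}{\partial q} \sum_{i=1}^{n} W_i\right] = \beta \sum_{i=1}^{n} E\!\left[V'_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] - \beta \underbrace{\sum_{i=1}^{n} V'_{i-1} V_{i-1}^{\beta-1}}_{A_3}, \tag{12.38}
\]
\[
\begin{aligned}
E\!\left[\frac{\partial^{2}}{\partial \beta \partial q} \sum_{i=1}^{n} W_i\right]
&= \sum_{i=1}^{n} E\!\left[V'_{i-1} \ln\!\left(W_i + V_{i-1}^{\beta}\right) \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] - \underbrace{\sum_{i=1}^{n} V'_{i-1} V_{i-1}^{\beta-1}}_{A_3} \\
&\quad + \sum_{i=1}^{n} E\!\left[V'_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] - \beta \underbrace{\sum_{i=1}^{n} V'_{i-1} V_{i-1}^{\beta-1} \ln(V_{i-1})}_{A_4}.
\end{aligned} \tag{12.39}
\]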
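\[
\begin{aligned}
E\!\left[\frac{\partial^{2}}{\partial q^{2}} \sum_{i=1}^{n} W_i\right]
&= \beta(\beta - 1) \sum_{i=2}^{n} E\!\left[[V'_{i-1}]^{2} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{2}{\beta}}\right]
 + \beta \sum_{i=1}^{n} E\!\left[V''_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] \\
&\quad - \beta \underbrace{\sum_{i=1}^{n} V''_{i-1} V_{i-1}^{\beta-1}}_{A_5}
 - \beta(\beta - 1) \underbrace{\sum_{i=1}^{n} [V'_{i-1}]^{2} V_{i-1}^{\beta-2}}_{A_6}.
\end{aligned} \tag{12.40}
\]
In turn, the expectations in Eqs. (12.35), (12.36), (12.37), (12.38), (12.39), and (12.40) can be calculated as follows:
\[
E[W_1 \ln(W_1)] = \alpha^{\beta} \left(\beta \ln(\alpha) + 1 - \gamma\right). \tag{12.41}
\]
\[
E\!\left[W_1 \ln(W_1)^{2}\right] = \alpha^{\beta} \left(\beta^{2} \ln(\alpha)^{2} - 2 \beta \gamma \ln(\alpha) + 2 \beta \ln(\alpha) + \frac{\pi^{2}}{6} + \gamma^{2} - 2\gamma\right). \tag{12.42}
\]
\[
E\!\left[\left(W_i + V_{i-1}^{\beta}\right) \ln\!\left(W_i + V_{i-1}^{\beta}\right)\right] = \alpha^{\beta}\, \Gamma\!\left(0, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) + \beta \alpha^{\beta} \ln(V_{i-1}) + \beta V_{i-1}^{\beta} \ln(V_{i-1}) + \alpha^{\beta}, \tag{12.43}
\]
where $\Gamma(\cdot, \cdot)$ is the incomplete gamma function. Following, for $\beta > \frac{1}{2}$:
\[
E\!\left[V'_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] = \alpha^{\beta-1}\, V'_{i-1}\, \Gamma\!\left(2 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right), \tag{12.44}
\]
\[
E\!\left[V''_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}}\right] = \alpha^{\beta-1}\, V''_{i-1}\, \Gamma\!\left(2 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right). \tag{12.45}
\]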
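Now, for $\beta > 1$:
\[
E\!\left[[V'_{i-1}]^{2} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{2}{\beta}}\right] = \alpha^{\beta-2}\, [V'_{i-1}]^{2}\, \Gamma\!\left(2 - \frac{2}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right), \tag{12.46}
\]
\[
E\!\left[\frac{V'_{i-1}}{\left(W_i + V_{i-1}^{\beta}\right)^{1/\beta}}\right] = \frac{1}{\alpha}\, V'_{i-1}\, \Gamma\!\left(1 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right), \tag{12.47}
\]
\[
E\!\left[\frac{V''_{i-1}}{\left(W_i + V_{i-1}^{\beta}\right)^{1/\beta}}\right] = \frac{1}{\alpha}\, V''_{i-1}\, \Gamma\!\left(1 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right). \tag{12.48}
\]
And, for $\beta > 2$:
\[
E\!\left[\frac{[V'_{i-1}]^{2}}{\left(W_i + V_{i-1}^{\beta}\right)^{2/\beta}}\right] = \frac{1}{\alpha^{2}}\, [V'_{i-1}]^{2}\, \Gamma\!\left(1 - \frac{2}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right). \tag{12.49}
\]
It is worthwhile to mention that the Fisher information matrix of WGRP is thus constrained to $\beta > 2$, according to Eq. (12.49). Following,
\[
\begin{aligned}
E\!\left[\left(W_i + V_{i-1}^{\beta}\right) \ln\!\left(W_i + V_{i-1}^{\beta}\right)^{2}\right]
&= \left(V_{i-1}^{\beta} + \alpha^{\beta}\right) \ln\!\left(V_{i-1}^{\beta}\right)^{2} + 2 \alpha^{\beta} \ln\!\left(V_{i-1}^{\beta}\right) \\
&\quad + 2 \alpha^{\beta} \left(1 + \ln\!\left(V_{i-1}^{\beta}\right)\right) \Gamma\!\left(0, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \\
&\quad + \alpha^{\beta} \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \left[ \left(\gamma + \ln\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right)\right)^{2} - \frac{2 V_{i-1}^{\beta}}{\alpha^{\beta}}\; {}_3F_3\!\left([2, 2, 2], [3, 3, 3], -\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) + \frac{\pi^{2}}{6} \right],
\end{aligned} \tag{12.50}
\]
where the constant $\gamma = 0.5772156649$ was first introduced by Leonhard Euler [42], and
\[
\begin{aligned}
E\!\left[V'_{i-1} \left(W_i + V_{i-1}^{\beta}\right)^{1-\frac{1}{\beta}} \ln\!\left(W_i + V_{i-1}^{\beta}\right)\right]
&= V'_{i-1} \Bigg\{ V_{i-1}^{\beta-1} \ln\!\left(V_{i-1}^{\beta}\right) + \frac{\alpha^{\beta}}{\alpha}\, \Gamma\!\left(1 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \\
&\quad + \left(1 - \frac{1}{\beta}\right) \cdot \exp\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \cdot \Bigg[ \frac{\alpha^{\beta}}{\alpha} \ln\!\left(V_{i-1}^{\beta}\right) \Gamma\!\left(1 - \frac{1}{\beta}, \frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \\
&\qquad + \frac{\alpha^{\beta} \pi}{\alpha\, \Gamma(\beta^{-1})} \left( -\ln\!\left(\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) + \Psi(\beta^{-1}) - \pi \cot\!\left(\frac{\pi (\beta - 1)}{\beta}\right) \right) \times \csc\!\left(\frac{\pi (\beta - 1)}{\beta}\right) \\
&\qquad + \frac{\beta^{2} V_{i-1}^{\beta-1}}{(\beta - 1)^{2}} \times {}_2F_2\!\left(\left[1 - \frac{1}{\beta}, 1 - \frac{1}{\beta}\right], \left[2 - \frac{1}{\beta}, 2 - \frac{1}{\beta}\right], -\frac{V_{i-1}^{\beta}}{\alpha^{\beta}}\right) \Bigg] \Bigg\},
\end{aligned} \tag{12.51}
\]
where ${}_pF_q\!\left([a_1, \ldots, a_p], [b_1, \ldots, b_q], z\right)$ is the generalized hypergeometric function:
\[
{}_pF_q\!\left([a_1, \ldots, a_p], [b_1, \ldots, b_q], z\right) = \sum_{n=0}^{\infty} \frac{(a_1)_n (a_2)_n \cdots (a_p)_n}{(b_1)_n (b_2)_n \cdots (b_q)_n} \frac{z^{n}}{n!}, \tag{12.52}
\]
with $a_j$ $(j = 1, 2, \ldots, p)$ and $b_j$ $(j = 1, 2, \ldots, q)$ being complex and positive numbers; and $(\lambda)_n$ is the Pochhammer symbol or rising factorial:
\[
(\lambda)_n =
\begin{cases}
1, & \text{if } n = 0, \\
\dfrac{\Gamma(\lambda + n)}{\Gamma(\lambda)} = \lambda \cdot (\lambda + 1) \cdot (\lambda + 2) \cdots (\lambda + n - 1), & \text{if } n \in \{1, 2, 3, \ldots\}.
\end{cases}
\]
The generalized hypergeometric function in Eq. (12.52) can be obtained computationally via many software packages, for example Maple, Mathematica, Matlab, and R. For instance, in the free software R, it is computed via the function genhypergeo_series() from the package hypergeo [17], that is,
\[
{}_3F_3\!\left([2, 2, 2], [3, 3, 3], -\frac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right) = \texttt{genhypergeo\_series}\!\left(U = c(2, 2, 2),\; L = c(3, 3, 3),\; z = -\frac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right),
\]
and
\[
{}_2F_2\!\left(\left[1 - \frac{1}{\beta}, 1 - \frac{1}{\beta}\right], \left[2 - \frac{1}{\beta}, 2 - \frac{1}{\beta}\right], -\frac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right) = \texttt{genhypergeo\_series}\!\left(U = c\!\left(1 - \frac{1}{\beta}, 1 - \frac{1}{\beta}\right),\; L = c\!\left(2 - \frac{1}{\beta}, 2 - \frac{1}{\beta}\right),\; z = -\frac{v_{i-1}^{\beta}}{\alpha^{\beta}}\right).
\]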
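For readers without access to the packages mentioned above, the series in Eq. (12.52) can be summed directly. The following is a minimal sketch with a hypothetical function name; it truncates the series once a term becomes negligible and is only suitable for moderate $|z|$ with $p \le q$, since alternating terms lose accuracy for large negative arguments. It is not a substitute for a library routine.

```python
import math


def hypergeometric_pfq(a, b, z, tol=1e-15, max_terms=500):
    """Truncated series for pFq([a_1..a_p], [b_1..b_q], z) as in Eq. (12.52)."""
    total = term = 1.0  # the n = 0 term equals 1
    for n in range(max_terms):
        num = 1.0
        for aj in a:
            num *= aj + n
        den = 1.0
        for bj in b:
            den *= bj + n
        # Ratio of consecutive terms: prod(a_j + n) / prod(b_j + n) * z / (n + 1).
        term *= num / den * z / (n + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


if __name__ == "__main__":
    beta, ratio = 2.5, 0.7  # ratio plays the role of v^beta / alpha^beta
    print(hypergeometric_pfq([2, 2, 2], [3, 3, 3], -ratio))
    s = 1.0 - 1.0 / beta
    print(hypergeometric_pfq([s, s], [s + 1.0, s + 1.0], -ratio))
```

Two easy checks are ${}_0F_0(z) = e^{z}$ and ${}_1F_1([1],[2],z) = (e^{z} - 1)/z$.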
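Therefore, for obtaining the Fisher information matrix of WGRP from Eq. (12.29), constrained to $\beta > 2$, one can summarize:
(1) $A_1$ and $A_2$ in Eqs. (12.36) and (12.37), respectively, tend to zero when $V_0 \to 0$. $A_3$, $A_4$, $A_5$, and $A_6$ in Eqs. (12.38), (12.39) and (12.40), equal zero when $\beta > 2$;
(2) Let $Y_1$ and $Y_2$ represent Eqs. (12.41) and (12.42), respectively;
(3) Let $Y_3, Y_4, Y_5, Y_6, Y_7, Y_8, Y_9, Y_{10}, Y_{11}$ be, respectively, Eqs. (12.43), (12.44), (12.45), (12.46), (12.47), (12.48), (12.49), (12.50), and (12.51). To complete the $n$ terms in the summations $Y_3$ and $Y_{10}$, the value zero was added, so the sum does not change the result;
(4) Now, $A_1, A_2, A_3, A_4, A_5, A_6, Y_1, Y_2, Y_3, Y_4, Y_5, Y_6, Y_7, Y_8, Y_9, Y_{10}$, and $Y_{11}$ can be multiplied and divided by $n$.
Replacing the results from item (4) in Eqs. (12.36), (12.37), (12.38), (12.39), and (12.40) leads to:
\[
\begin{aligned}
D_{\beta} &= \frac{n}{\beta} \underbrace{\left[\frac{Y_1}{n} + \bar{E}_3 - \beta \bar{A}_1\right]}_{D^{*}_{\beta}} = \frac{n}{\beta} \cdot D^{*}_{\beta}, \\
D_{\beta\beta} &= \frac{n}{\beta^{2}} \underbrace{\left[\frac{Y_2}{n} + \bar{E}_{10} - \beta^{2} \bar{A}_2\right]}_{D^{*}_{\beta\beta}} = \frac{n}{\beta^{2}} \cdot D^{*}_{\beta\beta}, \\
D_{q} &= n\beta \underbrace{\left[\bar{E}_4 - \bar{A}_3\right]}_{D^{*}_{q}} = n\beta \cdot D^{*}_{q}, \\
D_{\beta q} &= n \underbrace{\left[\bar{E}_{11} - \bar{A}_3 + \bar{E}_4 - \beta \bar{A}_4\right]}_{D^{*}_{\beta q}} = n \cdot D^{*}_{\beta q}, \\
D_{qq} &= n \underbrace{\left[\beta(\beta - 1)\left(\bar{E}_6 - \bar{A}_6\right) + \beta\left(\bar{E}_5 - \bar{A}_5\right)\right]}_{D^{*}_{qq}} = n \cdot D^{*}_{qq},
\end{aligned}
\]
where $\bar{A}_i$ ($\bar{E}_i$) is the mean value of $A_i$ ($Y_i$). Thus, the elements of the matrix in Eq. (12.29) are given by
\[
\begin{aligned}
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha^{2}}\right] &= n \cdot \frac{\beta^{2}}{\alpha^{2}}, \\
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial \beta}\right] &= \frac{n}{\alpha^{\beta+1}} \left[\beta \alpha^{\beta} \log(\alpha) - D^{*}_{\beta}\right], \\
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \alpha \partial q}\right] &= -\frac{n \beta^{2}}{\alpha^{\beta+1}} \cdot D^{*}_{q}, \\
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta^{2}}\right] &= \frac{n}{\beta^{2} \alpha^{\beta}} \left[\alpha^{\beta} + \alpha^{\beta} \beta^{2} \log(\alpha)^{2} - 2 \beta \log(\alpha) \cdot D^{*}_{\beta} + D^{*}_{\beta\beta}\right], \\
E\!\left[-\frac{\partial^{2} \ell_1}{\partial \beta \partial q}\right] &= \frac{n}{\alpha^{\beta}} \left[-\alpha^{\beta} \bar{E}_7 - \beta \log(\alpha) \cdot D^{*}_{q} + D^{*}_{\beta q}\right], \\
E\!\left[-\frac{\partial^{2} \ell_1}{\partial q^{2}}\right] &= \frac{n}{\alpha^{\beta}} \left[(1 - \beta)\left(\bar{E}_8 - \bar{E}_9\right) \alpha^{\beta} + D^{*}_{qq}\right].
\end{aligned}
\]
Therefore, one has $I(\alpha, \beta, q) = n\, \mathcal{I}(\alpha, \beta, q)$, leading to $I^{-1}(\alpha, \beta, q) = n^{-1}\, \mathcal{I}^{-1}(\alpha, \beta, q)$, according to Meeker and Escobar [30]. As follows, the obtained Fisher information matrix in Eq. (12.29) is studied with respect to the Kijima virtual age functions generically presented in Eq. (12.1) and the value of $q$.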
12.4.6.3 The asymptotic confidence intervals of the WGRP parameters
According to asymptotic theory, for $n$ large enough, it is possible to infer the confidence intervals of the WGRP parameters, assuming the respective maximum likelihood estimators follow a multivariate normal distribution. Thus, the maximum likelihood estimators, $(\hat{\alpha}, \hat{\beta}, \hat{q})$, might follow a normal distribution with mean $(\alpha, \beta, q)$ and covariance matrix $I^{-1}(\alpha, \beta, q)$, that is,
\[
\left(\hat{\alpha}, \hat{\beta}, \hat{q}\right) \sim N\!\left((\alpha, \beta, q),\, I^{-1}(\alpha, \beta, q)\right),
\]
and the $100 \cdot (1 - \eta)\%$ confidence intervals of $\alpha$, $\beta$, and $q$ can be, respectively, approximated by
\[
\begin{aligned}
&\hat{\alpha}_{obs} \pm z_{\eta/2} \sqrt{I^{-1}_{11}(\alpha, \beta, q)_{obs}}, \\
&\hat{\beta}_{obs} \pm z_{\eta/2} \sqrt{I^{-1}_{22}(\alpha, \beta, q)_{obs}}, \\
&\hat{q}_{obs} \pm z_{\eta/2} \sqrt{I^{-1}_{33}(\alpha, \beta, q)_{obs}},
\end{aligned} \tag{12.53}
\]
where $z_{\eta/2}$ is the $\frac{\eta}{2}$ quantile of the standard normal distribution and $I^{-1}(\alpha, \beta, q)_{obs}$ is the instance of $I^{-1}(\alpha, \beta, q)$ in the light of the data $x$ (see Subsection 12.2.2). Anyway, according to Subsection 12.4.6.2, $I^{-1}$ can be adequately computed only if $\beta > 2$ (for $q \neq 0$ and $q \neq 1$). Further, reliable results are achieved when the Kijima Type I virtual age model with $q$ near 0 (e.g., RP) and Kijima Type II with $q$ not near 1 are taken into account. Subsection 12.4.6.4, as follows, presents instances of the obtained confidence intervals of WGRP parameters in the well-known special cases of RP ($q = 0$) and NHPP ($q = 1$).
12.4.6.4 Confidence intervals for special cases of WGRP This subsection deals with the confidence intervals of WGRP parameters when RP (i.e., q = 0) and NHPP (q = 1) are taken into account.
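12.4.6.5 Renewal Processes - RP
Assuming $q = 0$ in Eq. (12.28) leads one to the simplified version of the log-likelihood function related to RP:
\[
\ell_2(\alpha, \beta) = n[\ln(\beta) - \beta \ln(\alpha)] + \frac{(\beta - 1)}{\beta} \sum_{i=1}^{n} \ln(W_i) - \frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} W_i,
\]
where one has, now, $W_i = X_i^{\beta}$. Thus, the Fisher information matrix in Eq. (12.29) is simplified to
\[
I(\alpha, \beta) =
\begin{bmatrix}
E\!\left[-\dfrac{\partial^{2} \ell_2}{\partial \alpha^{2}}\right] & E\!\left[-\dfrac{\partial^{2} \ell_2}{\partial \alpha \partial \beta}\right] \\[2ex]
E\!\left[-\dfrac{\partial^{2} \ell_2}{\partial \beta \partial \alpha}\right] & E\!\left[-\dfrac{\partial^{2} \ell_2}{\partial \beta^{2}}\right]
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{n \beta^{2}}{\alpha^{2}} & \dfrac{n(\gamma - 1)}{\alpha} \\[2ex]
\dfrac{n(\gamma - 1)}{\alpha} & \dfrac{n}{\beta^{2}} \left(\dfrac{\pi^{2}}{6} + (\gamma - 1)^{2}\right)
\end{bmatrix}
= n\, \mathcal{I}(\alpha, \beta).
\]
Therefore, as $n$ increases, the maximum likelihood estimators $(\hat{\alpha}, \hat{\beta})$ of $(\alpha, \beta)$ might follow a two-dimensional Normal distribution with mean $(\alpha, \beta)$ and covariance matrix
\[
I^{-1}(\alpha, \beta) =
\begin{bmatrix}
\dfrac{\alpha^{2} \left(\pi^{2} + 6(\gamma - 1)^{2}\right)}{n \beta^{2} \pi^{2}} & -\dfrac{6 \alpha (\gamma - 1)}{n \pi^{2}} \\[2ex]
-\dfrac{6 \alpha (\gamma - 1)}{n \pi^{2}} & \dfrac{6 \beta^{2}}{n \pi^{2}}
\end{bmatrix}
= n^{-1}\, \mathcal{I}^{-1}(\alpha, \beta).
\]
The respective $100 \cdot (1 - \eta)\%$ confidence interval estimates of the Weibull-based RP parameters are
\[
\hat{\alpha}_{obs} \pm z_{\eta/2} \frac{\hat{\alpha}_{obs}}{\hat{\beta}_{obs} \sqrt{n}} \sqrt{\frac{\pi^{2} + 6(\gamma - 1)^{2}}{\pi^{2}}}
\quad \text{and} \quad
\hat{\beta}_{obs} \pm z_{\eta/2} \frac{\hat{\beta}_{obs}}{\sqrt{n}} \sqrt{\frac{6}{\pi^{2}}}.
\]
Such algebra has also been presented by Kotz et al. [25].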
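12.4.6.6 Nonhomogeneous Poisson Processes - NHPP
Now, taking $q = 1$ in Eq. (12.28) leads WGRP to a NHPP, and one has the log-likelihood function
\[
\ell_3(\alpha, \beta) = n[\ln(\beta) - \beta \ln(\alpha)] + \frac{(\beta - 1)}{\beta} \sum_{i=1}^{n} \ln\!\left(W_i + V_{i-1}^{\beta}\right) - \frac{1}{\alpha^{\beta}} \sum_{i=1}^{n} W_i,
\]
where $V_{i-1} = \sum_{k=1}^{i-1} X_k$, $\sum_{i=1}^{n} W_i = \left(\sum_{i=1}^{n} X_i\right)^{\beta}$, and $\sum_{i=1}^{n} \ln\!\left(W_i + V_{i-1}^{\beta}\right) = \beta \sum_{i=1}^{n} \ln(X_i + V_{i-1})$.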
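The Fisher information matrix is then
\[
I(\alpha, \beta) =
\begin{bmatrix}
E\!\left[-\dfrac{\partial^{2} \ell_3}{\partial \alpha^{2}}\right] & E\!\left[-\dfrac{\partial^{2} \ell_3}{\partial \alpha \partial \beta}\right] \\[2ex]
E\!\left[-\dfrac{\partial^{2} \ell_3}{\partial \beta \partial \alpha}\right] & E\!\left[-\dfrac{\partial^{2} \ell_3}{\partial \beta^{2}}\right]
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{n \beta^{2}}{\alpha^{2}} & -\dfrac{n\, \Psi(n+1)}{\alpha} \\[2ex]
-\dfrac{n\, \Psi(n+1)}{\alpha} & \dfrac{n \left(\Psi(n+1)^{2} + \Psi(1, n+1) + 1\right)}{\beta^{2}}
\end{bmatrix}
\]
and, when $n$ increases, the maximum likelihood estimators $(\hat{\alpha}, \hat{\beta})$ of $(\alpha, \beta)$ converge to a normal distribution with mean $(\alpha, \beta)$ and covariance matrix
\[
I^{-1}(\hat{\alpha}, \hat{\beta}) =
\begin{bmatrix}
\dfrac{\alpha^{2} \left(\Psi(n+1)^{2} + \Psi(1, n+1) + 1\right)}{n \beta^{2} \left(\Psi(1, n+1) + 1\right)} & \dfrac{\alpha\, \Psi(n+1)}{n \left(\Psi(1, n+1) + 1\right)} \\[2ex]
\dfrac{\alpha\, \Psi(n+1)}{n \left(\Psi(1, n+1) + 1\right)} & \dfrac{\beta^{2}}{n \left(\Psi(1, n+1) + 1\right)}
\end{bmatrix}
= n^{-1}\, \mathcal{I}^{-1}(\alpha, \beta).
\]
A similar result can be found in Gaudoin et al. [15], by using a power law process. The respective $100 \cdot (1 - \eta)\%$ confidence interval estimates of the Weibull-based NHPP are given by
\[
\hat{\alpha}_{obs} \pm z_{\eta/2} \frac{\hat{\alpha}_{obs}}{\hat{\beta}_{obs} \sqrt{n}} \sqrt{\frac{\Psi(n+1)^{2} + \Psi(1, n+1) + 1}{\Psi(1, n+1) + 1}},
\qquad
\hat{\beta}_{obs} \pm z_{\eta/2} \frac{\hat{\beta}_{obs}}{\sqrt{n}} \sqrt{\frac{1}{\Psi(1, n+1) + 1}},
\]
where the Digamma function $\Psi(z)$ is the logarithmic derivative of $\Gamma(z)$ and the Polygamma function is the $m$th derivative of the Digamma function, that is,
\[
\frac{\partial^{m}}{\partial z^{m}} \Psi(z) = \Psi(m, z).
\]
It must be emphasized that the Gamma, Incomplete Gamma, Digamma, Polygamma, Logarithm, and Exponential functions can be described by precise asymptotic expansions, that is, these functions can be expressed by mathematical series that achieve a predetermined precision as the number of terms increases [4]. This facilitates the computation of the above-mentioned algebra. As follows, the possibility of adopting the proposed confidence intervals for performing hypothesis testing regarding the stage experienced by a system is also introduced.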
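Since the Digamma and Polygamma functions are evaluated at the integer $n + 1$, they reduce to finite sums, which makes the NHPP intervals easy to compute without special-function libraries. The following is a minimal sketch with hypothetical function names; the inputs in the usage lines (the rounded NHPP estimates reported in this chapter for the offshore data set, with $n = 84$) are assumptions for illustration.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def digamma_int(m):
    """Psi(m) for a positive integer m: -gamma + sum_{k=1}^{m-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))


def trigamma_int(m):
    """Psi(1, m) for a positive integer m: pi^2/6 - sum_{k=1}^{m-1} 1/k^2."""
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, m))


def nhpp_confidence_intervals(alpha_hat, beta_hat, n, eta=0.05):
    """Asymptotic 100(1 - eta)% intervals for the Weibull-based NHPP parameters."""
    z = NormalDist().inv_cdf(1.0 - eta / 2.0)
    psi, psi1 = digamma_int(n + 1), trigamma_int(n + 1)
    se_alpha = alpha_hat / (beta_hat * math.sqrt(n)) * math.sqrt(
        (psi ** 2 + psi1 + 1.0) / (psi1 + 1.0))
    se_beta = beta_hat / math.sqrt(n) * math.sqrt(1.0 / (psi1 + 1.0))
    return ((alpha_hat - z * se_alpha, alpha_hat + z * se_alpha),
            (beta_hat - z * se_beta, beta_hat + z * se_beta))


if __name__ == "__main__":
    ci_alpha, ci_beta = nhpp_confidence_intervals(2.391, 0.694, 84)
    print(ci_alpha)  # roughly (-0.94, 5.72)
    print(ci_beta)   # roughly (0.55, 0.84)
```

Note that the normal approximation can return a negative lower limit for the positive parameter $\alpha$, as happens with these inputs.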
12.4.6.7 Testing the stage experienced by a system
Here, following Oliveira et al. [36], some insights about how to make use of the WGRP confidence intervals for evaluating repairable systems are provided. With this purpose, the relationship between hypothesis testing and confidence intervals is considered. In fact, as emphasized by authors such as Casella and Berger [5], there is a very strong correspondence between hypothesis testing and interval estimation. These authors claim that generally every confidence set corresponds to a test and vice versa. It is worth recalling that hypothesis tests for a parameter $\vartheta$ classically evaluate whether the corresponding estimate, $\hat{\vartheta}_{obs}$, belongs to a rejection region regarding a given hypothesis, for a fixed significance level $\eta$. Such a decision rule is indeed performed by inverting the test statistic, which also leads to a $100 \cdot (1 - \eta)\%$ confidence region. Casella and Berger [5] bring more details about this theme.
In this way, let $\vartheta^{(s)}$ be a WGRP parameter $\vartheta$ with respect to the system $s$. In turn, let $LL_{\vartheta^{(s)}}$ and $UL_{\vartheta^{(s)}}$ be, in this order, the lower and upper limits of the $100 \cdot (1 - \eta)\%$ confidence interval of $\vartheta^{(s)}$. Thus, the resulting confidence interval-based decision rule for comparing two systems, say $a$ and $b$, with respect to $\vartheta$ and at a $100 \cdot \eta\%$ significance level might be
\[
\vartheta^{(a)} \text{ is }
\begin{cases}
\text{lesser than } \vartheta^{(b)}, & \text{if } UL_{\vartheta^{(a)}} < LL_{\vartheta^{(b)}} \; (H_0); \\
\text{greater than } \vartheta^{(b)}, & \text{if } LL_{\vartheta^{(a)}} > UL_{\vartheta^{(b)}} \; (H_1); \\
\text{equal to } \vartheta^{(b)}, & \text{otherwise } (H_2).
\end{cases}
\]
Furthermore, from the WGRP literature, it is well known that the stage faced by a system (whether stable, deteriorating or improving) can be characterized by the value of the underlying WGRP shape parameter $\beta$, in such a way that the system is stable if $\beta = 1$, is improving if $\beta < 1$, and is deteriorating if $\beta > 1$ [14]. Thus, the proposed confidence intervals also provide an approach for performing a WGRP test for inferring the stage experienced by the system. The resulting confidence interval-based decision rule, at a $100 \cdot \eta\%$ significance level, might be
\[
\text{The system is }
\begin{cases}
\text{stable}, & \text{if } LL_{\beta} \le 1 \le UL_{\beta} \; (H_0); \\
\text{improving}, & \text{if } UL_{\beta} < 1 \; (H_1); \\
\text{deteriorating}, & \text{if } 1 < LL_{\beta} \; (H_2).
\end{cases}
\]
As previously emphasized, the quality of both the system itself and its intervention process can be summarized by the WGRP parameters, as discussed by Ferreira et al. [14]. For instance, if $\beta > 1$ (i.e., a deteriorating system), then the greater the $q$, the worse the response of the system to the performed interventions. This allows one to negatively evaluate the intervention team of the system $a$ in relation to the one of the system $b$ if $q^{(a)} > q^{(b)}$. In the same fashion, one can conclude that $a$ is ageing faster than $b$ (in the sense that $a$ deteriorates more quickly than $b$) if $\beta^{(a)} > \beta^{(b)} > 1$ and $q^{(a)} > q^{(b)} > 0$.
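The two decision rules above translate directly into code. The sketch below is illustrative only, with hypothetical function names; the limits used in the usage lines are the RP shape-parameter limits reported in Table 12.4 for the offshore, windshield, and transformers data sets.

```python
def compare_parameter(ci_a, ci_b):
    """Compare systems a and b from the (lower, upper) limits of one parameter."""
    (ll_a, ul_a), (ll_b, ul_b) = ci_a, ci_b
    if ul_a < ll_b:
        return "lesser"
    if ll_a > ul_b:
        return "greater"
    return "equal"


def system_stage(ci_beta):
    """Infer the stage of a system from the confidence limits of beta."""
    ll, ul = ci_beta
    if ul < 1.0:
        return "improving"
    if ll > 1.0:
        return "deteriorating"
    return "stable"


if __name__ == "__main__":
    offshore, windshield, transformers = (0.658, 0.924), (0.744, 1.05), (1.277, 1.899)
    for name, ci in [("offshore", offshore), ("windshield", windshield),
                     ("transformers", transformers)]:
        print(name, system_stage(ci))
    print(compare_parameter(offshore, transformers))
```

Overlapping intervals are treated as "equal", which mirrors the third branch of the comparison rule and is conservative by construction.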
12.4.7 Case studies of WGRP
This section presents the behavior of the WGRP GOFT and confidence intervals using case studies based on Oliveira et al. [34–36]. Some results involving the adjustment of WGRP models to real cases from the literature, the respective WGRP GOFT, and the 95% confidence intervals of the respective parameters are then presented. The RP, NHPP, Kijima I, and Kijima II models were adjusted to each data set, via MLE, where the LL function in Eq. (12.17) was optimized according to the generalized simulated annealing algorithm provided by the GenSA package [48] of the free R software [38]. In this context, the following space of possibilities was considered for $\hat{\beta}_{obs}$ and $\hat{q}_{obs}$: $(\hat{\beta}_{obs}, \hat{q}_{obs}) \in [10^{-100}, 10] \times [-3, 3]$. In turn, $\hat{\alpha}_{obs}$ was obtained as a function of the pair $(\hat{\beta}_{obs}, \hat{q}_{obs})$, as performed by Ferreira et al. [14]. The performance data sets taken into account are described as follows:
(a) The first data set is from Langseth and Lindqvist [28], where 84 stoppage times of a compressor system of an offshore facility are considered;
(b) The second data set involves 80 stoppage times regarding intervention actions on a windshield system, from Murthy et al. [32];
(c) The third data set corresponds to 61 transformer stoppage events presented by Cristino [10].
These data sets reflect situations where failure could lead to relevant financial as well as social and environmental losses. For instance, the compressor system under consideration is dedicated to the oil and gas industry (exploration and production), involving failure modes such as leakage [37] that could result in waste, explosion, and environmental pollution. In turn, regarding the windshield system, the potential damage resulting from bird/aircraft collisions has greatly increased, as a result of raised aircraft speeds and low-altitude flights [44]. Such damage may result in an aborted mission, loss of human lives, and loss of the aircraft and financial resources. In the same way, transformer failure might directly damage power generation and transmission lines and, subsequently, components, equipment, and production systems in general (e.g., commercial and domestic electrical and/or electronic devices), usually involving financial penalties to the energy distribution companies.
Thus, it is paramount to adequately model the times between interventions of these systems to promote optimized intervention policies and quality control.
12.4.8 Applying the WGRP GOFT
Tables 12.1, 12.2, and 12.3 present the WGRP GOFT for the different WGRP models taken into account for these data sets. One can see ($p^{*}$ of K-S ($p^{*}_{K\text{-}S}$) in the TWGRP columns of Table 12.1) that each WGRP model (whether RP, NHPP, Kijima I or Kijima II) might adhere to the offshore facility data set, under significance levels ($\eta$) lower than 10.40% (as usually adopted in hypothesis testing practice). Fig. 12.1 exhibits the observed cumulative times, the respective WGRP sample points and 95% interval estimates, as well as some of the series simulated from the best model according to Ferreira et al. [14], namely the Kijima II. These authors considered the mean squared error (MSE) and log-likelihood (LL) metrics, instead of the p-values studied here. On the other hand, according to $p^{*}_{C\text{-}M}$ and $p^{*}_{A\text{-}D}$, for $\eta > 4.4\%$ one could conclude that the NHPP model does not fit the data set, that is, the system
TABLE 12.1 WGRP GOFT for WGRP models adjusted by Oliveira et al. [35] to the offshore facility performance data set from Langseth and Lindqvist [28]

WGRP
Model       α̂        β̂       q̂       p*K-S    p*C-M    p*A-D    p*B
RP          14.394   0.791   0       0.011    0.007    0.004    0.016
NHPP        2.392    0.694   1       0.011    0.007    0.004    0.016
Kijima I    3.299    0.517   0.024   0.011    0.007    0.004    0.016
Kijima II   5.855    0.951   1.500   0.011    0.007    0.004    0.016

TWGRP
Model       θ̂ (α̂^β̂)   p*K-S    p*C-M    p*A-D    p*B
RP          8.247      0.229    0.326    0.241    0.594
NHPP        1.832      0.107    0.044    0.034    0.110
Kijima I    1.853      0.568    0.812    0.638    0.500
Kijima II   5.374      0.104    0.080    0.054    0.207

p*K-S ≡ p-value of the Kolmogorov-Smirnov test, p*C-M ≡ p-value of the Cramér-von Mises test, p*A-D ≡ p-value of the Anderson-Darling test, and p*B ≡ p-value of Bartlett's test.
TABLE 12.2 WGRP GOFT for WGRP models adjusted by Oliveira et al. [35] to the windshield system from Murthy et al. [32]

WGRP
Model       α̂       β̂       q̂       p*K-S    p*C-M    p*A-D    p*B
RP          0.028   0.897   0       0.444    0.436    0.504    0.366
NHPP        0.120   1.465   1       0.444    0.436    0.504    0.366
Kijima I    0.111   1.489   0.662   0.444    0.436    0.504    0.366
Kijima II   0.060   1.045   1.496   0.444    0.436    0.504    0.366

TWGRP
Model       θ̂ (α̂^β̂)   p*K-S    p*C-M    p*A-D    p*B
RP          0.0406     0.938    0.948    0.915    0.708
NHPP        0.0447     0.541    0.818    0.806    0.687
Kijima I    0.0377     0.540    0.800    0.789    0.653
Kijima II   0.0531     0.874    0.922    0.926    0.971

p*K-S ≡ p-value of the Kolmogorov-Smirnov test, p*C-M ≡ p-value of the Cramér-von Mises test, p*A-D ≡ p-value of the Anderson-Darling test, and p*B ≡ p-value of Bartlett's test.
TABLE 12.3 WGRP GOFT computed by Oliveira et al. [35] for WGRP models adjusted to the transformers system from Cristino [10]

WGRP
Model       α̂         β̂       q̂       p*K-S    p*C-M    p*A-D    p*B
RP          179.770   1.589   0       0.003    0.014    0.016    0.0008
NHPP        227.142   1.089   1       0.003    0.014    0.016    0.0008
Kijima I    210.343   1.910   0.006   0.003    0.014    0.016    0.0008
Kijima II   273.109   2.336   0.381   0.003    0.014    0.016    0.0008

TWGRP
Model       θ̂ (α̂^β̂)   p*K-S    p*C-M    p*A-D    p*B
RP          3816.2     0.651    0.602    0.529    0.647
NHPP        368.07     0.002    0.013    0.016    0.0008
Kijima I    27369      0.497    0.438    0.462    0.724
Kijima II   492524     0.756    0.727    0.789    0.966

p*K-S ≡ p-value of the Kolmogorov-Smirnov test, p*C-M ≡ p-value of the Cramér-von Mises test, p*A-D ≡ p-value of the Anderson-Darling test, and p*B ≡ p-value of Bartlett's test.
would not return from the interventions "as bad as old." In turn, the original data set would not come from an exponential distribution, for $\eta > 1.6\%$ (see $p^{*}$ for the WGRP columns); thus the system is not stable. In fact, based on the hazard function in Eq. (12.14), as $\hat{\beta} < 1$ for every WGRP model, it seems evident that the system is improving. Thus, the interventions might be bringing the system to a "better than new" condition, generating times between interventions that grow with the number of interventions (as can be seen in Fig. 12.1). Further, the TWGRP p-values could also be considered for evaluating the quality of adjustment of the models, since, similarly to MSE, $p^{*}$ decreases as the discrepancy observed between empirical data and fitted model increases. In these terms, there is evidence of the best adjustment of the Kijima I for the offshore system, with $p^{*}_{K\text{-}S}$ assuming 0.568, for instance. It must be highlighted that Ferreira et al. [14] suggest the Kijima II model as the best one, mainly in terms of MSE, since via LL the Kijima I is only slightly beaten. Therefore, the proposed WGRP GOFT may also be applied for comparing WGRP-based models, in addition to LL and MSE. In fact, authors such as Moura et al. [31] have also resorted to hypothesis testing to compare WGRP models. Taking the windshield system into account (Table 12.2), the adherence of the WGRP would be confirmed for any $\eta < 54\%$ (see $p^{*}$ of the TWGRP columns). This behavior could be justified by the fact that the original data come from an exponential distribution ($\beta = 1$) for any $\eta < 36.6\%$ (see $p^{*}$ of the WGRP columns). In turn, assuming $p^{*}$ as a comparison metric, one can see that from
FIGURE 12.1 WGRP adjustment to data set of Langseth and Lindqvist [28], following Ferreira et al. [14], from Oliveira et al. [35]. (Plot of cumulative times (Langseth) against intervention index, showing the observed series, WGRP samples, WGRP samples 95% confidence intervals, and WGRP samples means, for (a, b, q) = (5.87, 0.95, 1.5).)
$p^{*}_{K\text{-}S}$ and $p^{*}_{C\text{-}M}$ ($p^{*}_{A\text{-}D}$ and $p^{*}_{B}$) the RP (Kijima II) model must be assumed. For this case, Ferreira et al. [14] suggest the Kijima II model, based on MSE and LL, thus agreeing with $p^{*}_{A\text{-}D}$ and $p^{*}_{B}$. Under the inferred Kijima II model, one could conclude that the system is deteriorating ($\hat{\beta} > 1.0$). In this case, however, there is no unanimity regarding the phase faced by the system (whether deteriorating, improving or stable), since $\hat{\beta} < 1$ for RP. On the other hand, in the transformers system case (Table 12.3), the WGRP GOFT suggests adherence of every WGRP model only if $\eta < 0.08\%$. In fact, for $\eta > 1.6\%$, something usual in practice, Oliveira et al. [35] indicate that the maximum likelihood-based NHPP model is inappropriate for the transformers system stoppage events, regardless of using $p^{*}_{A\text{-}D}$, $p^{*}_{K\text{-}S}$, $p^{*}_{C\text{-}M}$, or $p^{*}_{B}$. In the same fashion, the exponential distribution would not adhere to the original data set for any $\eta > 1.6\%$. Considering $p^{*}$ as a quality metric, one can also assume Kijima II, based on any of the computed p-values, in agreement with the LL metric [14]. In this case, every WGRP model suggests that the system is deteriorating ($\hat{\beta} > 1$).
TABLE 12.4 Ninety-five percent confidence intervals of WGRP parameters adjusted to offshore, windshield, and transformers systems performance data sets from Langseth and Lindqvist [28], Murthy et al. [32], and Cristino [10], respectively

Case   Model       α̂obs      β̂obs    q̂obs    LLα       ULα       rα̂obs    LLβ     ULβ     rβ̂obs    LLq     ULq     rq̂obs
[28]   RP          14.395    0.791   0       10.273    18.516    0.5726   0.658   0.924   0.3363   0       0       -
[28]   NHPP        2.391     0.694   1       -0.951    5.734     2.7959   0.546   0.843   0.4280   1       1       -
[28]   Kijima I    3.299     0.517   0.024   -         -         -        -       -       -        -       -       -
[28]   Kijima II   5.855     0.951   1.499   -         -         -        -       -       -        -       -       -
[32]   RP          0.028     0.897   0       0.021     0.035     0.5000   0.744   1.05    0.3411   0       0       -
[32]   NHPP        0.12      1.465   1       0.04      0.2       1.3333   1.146   1.784   0.4355   1       1       -
[32]   Kijima I    0.111     1.489   0.662   -         -         -        -       -       -        -       -       -
[32]   Kijima II   5.855     0.951   1.499   -         -         -        -       -       -        -       -       -
[10]   RP          179.77    1.588   0       149.867   209.673   0.3327   1.277   1.899   0.3917   0       0       -
[10]   NHPP        227.142   1.089   1       6.955     447.329   1.9388   0.818   1.36    0.4977   1       1       -
[10]   Kijima I    210.343   1.91    0.006   -         -         -        -       -       -        -       -       -
[10]   Kijima II   273.109   2.336   0.381   124.712   421.505   1.0867   0.926   3.747   1.2076   0.055   0.706   1.7087

LL ≡ Lower Limit and UL ≡ Upper Limit of the 95% confidence intervals
12.4.9 Applying the WGRP confidence intervals
These real cases allow us to study the confidence intervals for WGRP parameters in the light of small sample sizes, under the asymptotic theory perspective. In this way, one can perform relative comparisons of magnitude with respect to the level of information provided by the interval estimates, via the ratio $r_{\vartheta} = \bar{L}_{\vartheta} / \vartheta$, where $\bar{L}_{\vartheta}$ is the mean length of the confidence intervals of the parameter $\vartheta$ from the set $(\alpha, \beta, q)$. Obviously, the lesser $r_{\vartheta}$, the greater the level of information underlying the confidence interval dedicated to $\vartheta$. Table 12.4 summarizes the performance of the confidence intervals depending on the WGRP model. As highlighted in Section 12.4.6, the calculation of the intervals of the Kijima type models was computationally possible only when $\hat{\beta}_{obs} > 2$, notably in the case of the transformers data set from Cristino [10]. As expected, the confidence intervals associated with RP have presented better relative precision than the ones related to the alternative WGRP models taken into account (see columns $r_{\hat{\vartheta}_{obs}}$). In addition, RP would adhere to the performance data sets under study for any significance level ($\eta$) lesser than 0.229, following Oliveira et al. [35]. These authors have proposed a goodness-of-fit test for WGRP models via Theorem 12.4.1 and they have performed several tests considering Kolmogorov-Smirnov (KS), Cramér-von Mises (CM), Anderson-Darling (AD), and Bartlett (B) statistics. On the other hand, NHPP would not adhere to the offshore facility and transformers data if $\eta > 0.034$ (taking the AD statistic) and $\eta > 0.0008$ (taking the B statistic), respectively. Further, considering the RP models adjusted to the performance data sets taken into account and the hypothesis testing approach presented at the end of Subsection 12.4.6.7, one can thus infer, at $\eta = 5\%$, that the offshore facility is improving, the windshield system is stable, and the transformers one is deteriorating.
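The ratio can be reproduced from the limits in Table 12.4. In the sketch below the ratio is taken as the interval length divided by the point estimate, which is how the tabulated values appear to have been obtained; the function name is hypothetical.

```python
def relative_length(lower, upper, estimate):
    """Interval length divided by the point estimate (smaller means more informative)."""
    return (upper - lower) / estimate


if __name__ == "__main__":
    # RP and NHPP rows for the scale parameter of the offshore data set (Table 12.4).
    print(round(relative_length(10.273, 18.516, 14.395), 4))
    print(round(relative_length(-0.951, 5.734, 2.391), 4))
```

Comparing the two printed values shows why the RP interval for $\alpha$ is regarded as far more informative than the NHPP one for this data set.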
12.5 The Gumbel GRP (GuGRP) modeling
Recently, the Gumbel-based generalized RP (GuGRP) has been introduced [11], taking the forecasting of economic recessions as a case study.
12.5.1 The Gumbel distribution
The GuGRP is especially useful for dealing with extreme value random variables. Let X = (X1, X2, ..., Xn) be a random vector, where each Xi (i = 1, 2, ..., n) follows a Gumbel model [43] with parameters α ∈ R and β > 0. Then the CDF of Xi is

$$F_{X_i}(x_i \mid \alpha, \beta) = 1 - \exp\left[-\exp\left(\frac{x_i - \alpha}{\beta}\right)\right], \quad -\infty < x_i < \infty, \tag{12.54}$$

where α and β are the location and scale parameters, respectively.
The respective PDF of Xi is

$$f_{X_i}(x \mid \alpha, \beta) = \frac{1}{\beta}\exp\left[\frac{x - \alpha}{\beta} - \exp\left(\frac{x - \alpha}{\beta}\right)\right].$$

The Gumbel distribution is known as the generalized extreme value distribution Type-I and is used to model the distribution of the maximum (or the minimum) of samples from various distributions. It is also referred to as the smallest extreme value distribution [24].
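As a quick numerical illustration of Eq. (12.54) and its PDF, the following minimal Python sketch evaluates both functions and checks by finite differences that the PDF is the derivative of the CDF. The function names and the parameter values (α = 4.0, β = 0.8) are illustrative assumptions, not values from the chapter.

```python
import math


def gumbel_cdf(x, alpha, beta):
    """CDF of Eq. (12.54): 1 - exp(-exp((x - alpha) / beta))."""
    return 1.0 - math.exp(-math.exp((x - alpha) / beta))


def gumbel_pdf(x, alpha, beta):
    """PDF associated with Eq. (12.54): (1/beta) * exp(z - exp(z))."""
    z = (x - alpha) / beta
    return math.exp(z - math.exp(z)) / beta


# Assumed illustrative values: alpha is the location, beta the scale.
alpha, beta = 4.0, 0.8
x, h = 3.5, 1e-5

# Central finite difference of the CDF, to be compared with the PDF.
numeric_pdf = (gumbel_cdf(x + h, alpha, beta) - gumbel_cdf(x - h, alpha, beta)) / (2 * h)

print(gumbel_cdf(x, alpha, beta), gumbel_pdf(x, alpha, beta), numeric_pdf)
```

At x = α the CDF equals 1 − e^{−1} regardless of β, which is a convenient sanity check when coding this parameterization.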
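12.5.2 The GuGRP model
Let Xi {i = 1, 2, ..., n} be a sequence of times between interventions following a GuGRP. Using Eq. (12.2) and Eq. (12.54), the distribution of the cumulative time to the ith intervention, Ti, at (x + v_{i−1}) following the GuGRP can be written as

$$\begin{aligned}
F_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)
&= \frac{F_{T_1}(x + v_{i-1}) - F_{T_1}(v_{i-1})}{1 - F_{T_1}(v_{i-1})} \\
&= \frac{1 - \exp\left[-\exp\left(\frac{x + v_{i-1} - \alpha}{\beta}\right)\right] - 1 + \exp\left[-\exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)\right]}{1 - 1 + \exp\left[-\exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)\right]} \\
&= 1 - \exp\left[\exp\left(\frac{v_{i-1} - \alpha}{\beta}\right) - \exp\left(\frac{x - \alpha + v_{i-1}}{\beta}\right)\right].
\end{aligned} \tag{12.55}$$

The respective GuGRP PDF is

$$f_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) = \frac{1}{\beta}\exp\left[\frac{x - \alpha + v_{i-1}}{\beta} - \exp\left(\frac{x - \alpha + v_{i-1}}{\beta}\right) + \exp\left(\frac{-\alpha + v_{i-1}}{\beta}\right)\right]. \tag{12.56}$$

In turn, the GuGRP hazard function is given by

$$h_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta) = \frac{f_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)}{1 - F_{T_i}(x + v_{i-1} \mid v_{i-1}, \alpha, \beta)} = \frac{1}{\beta}\exp\left(\frac{x - \alpha + v_{i-1}}{\beta}\right). \tag{12.57}$$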
From Eq. (12.57) one can see that the greater the argument (x + v_{i−1}), the greater the hazard. Thus, GuGRP might model neither stable nor improving systems.
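The relation among Eqs. (12.55) to (12.57) can be checked numerically. The sketch below, with assumed parameter values and hypothetical function names, evaluates the conditional CDF, PDF, and hazard for a few virtual ages and compares the hazard with the ratio PDF/(1 − CDF).

```python
import math


def gugrp_cdf(x, v, alpha, beta):
    """Conditional CDF of Eq. (12.55), given the virtual age v = v_{i-1}."""
    return 1.0 - math.exp(math.exp((v - alpha) / beta) - math.exp((x + v - alpha) / beta))


def gugrp_pdf(x, v, alpha, beta):
    """Conditional PDF of Eq. (12.56)."""
    z = (x + v - alpha) / beta
    return math.exp(z - math.exp(z) + math.exp((v - alpha) / beta)) / beta


def gugrp_hazard(x, v, alpha, beta):
    """Hazard of Eq. (12.57): (1/beta) * exp((x + v - alpha) / beta)."""
    return math.exp((x + v - alpha) / beta) / beta


alpha, beta = 4.0, 0.8  # assumed illustrative values
x = 2.0
for v in (0.0, 1.0, 2.0):
    ratio = gugrp_pdf(x, v, alpha, beta) / (1.0 - gugrp_cdf(x, v, alpha, beta))
    print(v, gugrp_hazard(x, v, alpha, beta), ratio)
```

For a fixed x, the printed hazard grows with the virtual age, which is the monotonicity noted after Eq. (12.57).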
According to Eqs. (12.6) and (12.55), the inverse GuGRP function is given by

$$x_i = \beta \log\left[\exp\left(\frac{-\alpha + v_{i-1}}{\beta}\right) - \log(1 - u_i)\right] + \alpha - v_{i-1}, \quad i = 1, 2, \ldots, \tag{12.58}$$

where u_i is a random number from the interval [0, 1], v_{i−1} is computed according to Eq. (12.1), and v0 = 0.
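A minimal sketch of inverse-transform sampling follows. It solves F(x) = u directly from Eq. (12.55), which gives x = β log[exp((v − α)/β) − log(1 − u)] + α − v, and verifies the round trip F(F⁻¹(u)) = u. The parameter values, the fixed virtual age, and the function names are assumptions for illustration.

```python
import math
import random


def gugrp_cdf(x, v, alpha, beta):
    """Conditional CDF of Eq. (12.55), given the virtual age v = v_{i-1}."""
    return 1.0 - math.exp(math.exp((v - alpha) / beta) - math.exp((x + v - alpha) / beta))


def gugrp_inverse(u, v, alpha, beta):
    """Solve gugrp_cdf(x, v, alpha, beta) = u for x, with 0 <= u < 1."""
    return beta * math.log(math.exp((v - alpha) / beta) - math.log(1.0 - u)) + alpha - v


alpha, beta, v = 4.0, 0.8, 1.5  # assumed illustrative values
rng = random.Random(2021)       # seeded so that the draws are reproducible

# Round trip: applying the CDF to the inverse should return u.
round_trip_errors = []
for u in (0.05, 0.5, 0.95):
    x = gugrp_inverse(u, v, alpha, beta)
    round_trip_errors.append(abs(gugrp_cdf(x, v, alpha, beta) - u))

# A few simulated times to the next intervention for this fixed virtual age.
draws = [gugrp_inverse(rng.random(), v, alpha, beta) for _ in range(5)]
print(round_trip_errors, draws)
```

`random.Random.random()` returns values in [0, 1), so `log(1 - u)` is always defined.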
12.5.3 Maximum likelihood estimation of the GuGRP model
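For MLE, the joint GuGRP PDF is

$$f(x, y \mid \alpha, \beta, q, c_y) = \prod_{j=1}^{n} \frac{1}{\beta}\exp\left[\frac{x_j - \alpha + v_{j-1}}{\beta} - \exp\left(\frac{x_j - \alpha + v_{j-1}}{\beta}\right) + \exp\left(\frac{-\alpha + v_{j-1}}{\beta}\right)\right], \tag{12.59}$$

with v_{j−1} given by Eq. (12.1) (taking v0 = 0). Consequently, the LL function is

$$\ell(x, y \mid \alpha, \beta, q, c_y) = \log f(x, y \mid \alpha, \beta, q, c_y) = -n\log(\beta) + \sum_{j=1}^{n}\left[\frac{x_j - \alpha + v_{j-1}}{\beta} + \exp\left(\frac{-\alpha + v_{j-1}}{\beta}\right) - \exp\left(\frac{x_j - \alpha + v_{j-1}}{\beta}\right)\right]. \tag{12.60}$$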
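The log-likelihood in Eq. (12.60) is straightforward to code once the virtual ages are available. The sketch below is only an illustration: Eq. (12.1) is not reproduced here, so the virtual age rule v_i = v_{i−1} + q·x_i (Kijima Type I) is an assumption, as are the data, the parameter values, and the function names.

```python
import math


def virtual_ages_kijima1(x, q):
    """Return v_0, ..., v_{n-1} under the assumed rule v_i = v_{i-1} + q * x_i."""
    ages, v = [], 0.0
    for xi in x:
        ages.append(v)  # v_{i-1}: the age in force while x_i elapses
        v += q * xi
    return ages


def gugrp_loglik(x, v_prev, alpha, beta):
    """Log-likelihood of Eq. (12.60); v_prev[j] holds v_{j-1} for x[j]."""
    total = -len(x) * math.log(beta)
    for xj, vj in zip(x, v_prev):
        total += ((xj - alpha + vj) / beta
                  + math.exp((vj - alpha) / beta)
                  - math.exp((xj - alpha + vj) / beta))
    return total


def gugrp_log_pdf(xj, vj, alpha, beta):
    """Logarithm of the conditional PDF in Eq. (12.56)."""
    z = (xj - alpha + vj) / beta
    return -math.log(beta) + z - math.exp(z) + math.exp((vj - alpha) / beta)


# Hypothetical times between interventions (years) and parameter values.
x = [3.2, 4.1, 2.7, 3.9, 2.2]
alpha, beta, q = 4.0, 0.8, 0.3

v_prev = virtual_ages_kijima1(x, q)
ll = gugrp_loglik(x, v_prev, alpha, beta)
ll_check = sum(gugrp_log_pdf(xj, vj, alpha, beta) for xj, vj in zip(x, v_prev))
print(ll, ll_check)
```

In practice `gugrp_loglik` would be passed (negated) to a numerical optimizer over (α, β, q); that step is omitted here.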
12.5.4 Goodness-of-fit test for GuGRP
Similarly to Oliveira et al. [35, 36], Cristino et al. [11] have introduced the GuGRP GOFT.

Theorem 12.5.1 (GuGRP power transformation). Let (W1, ..., Wn), n ≥ 1, be a random vector such that

$$W_i = \exp\left(\frac{X_i + V_{i-1} - \alpha}{\beta}\right) - \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right), \tag{12.61}$$

where Xi conditioned on the virtual age V_{i−1} follows a GuGRP with parameters α and β. Then the following claims hold true: (i) Wi follows an exponential distribution with parameter 1; (ii) W1, ..., Wn are iid.
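Proof. (i) Directly, we obtain, for each i = 1, ..., n:

$$\begin{aligned}
P(W_i \le w_i \mid V_{i-1}; \alpha, \beta)
&= P\left(\exp\left(\frac{X_i + V_{i-1} - \alpha}{\beta}\right) - \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right) \le w_i \,\Big|\, V_{i-1}; \alpha, \beta\right) \\
&= P\left(\exp\left(\frac{X_i + V_{i-1} - \alpha}{\beta}\right) \le w_i + \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right) \,\Big|\, V_{i-1}; \alpha, \beta\right) \\
&= P\left(X_i \le \beta \log\left[w_i + \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right)\right] - V_{i-1} + \alpha \,\Big|\, V_{i-1}; \alpha, \beta\right) \\
&= 1 - \exp\left(\exp\left(\frac{V_{i-1} - \alpha}{\beta}\right) - \exp\left(\frac{\beta \log\left[w_i + \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right)\right]}{\beta}\right)\right) \\
&= 1 - \exp(-w_i).
\end{aligned}$$

So, we have Wi ∼ Exponential(1).

(ii) The joint PDF of (X, Y) = ((X1, Y1), ..., (Xn, Yn)) is given by

$$f_{(X,Y)}(x, y \mid v_{i-1}; \alpha, \beta) = \prod_{i=1}^{n} f_{T_i}(x_i + v_{i-1} \mid v_{i-1}; \alpha, \beta). \tag{12.62}$$

Now, to show that the variables W1, ..., Wn are independent of each other, we need to show that Eq. (12.62) implies

$$f_{W_1, \ldots, W_n}(w_1, \ldots, w_n \mid v_{i-1}; \alpha, \beta) = \prod_{i=1}^{n} f_{W_i}(w_i \mid v_{i-1}; \alpha, \beta). \tag{12.63}$$

By item (i), the right side of Eq. (12.62) implies the right side of Eq. (12.63). Now, we will prove that the left side of Eq. (12.62) implies the left side of Eq. (12.63). Making the change of variables in Eq. (12.61), and solving for the (Xi + V_{i−1})'s in terms of the Wi's, we obtain

$$X_i + V_{i-1} = \beta \log\left[w_i + \exp\left(\frac{V_{i-1} - \alpha}{\beta}\right)\right] + \alpha.$$

The joint PDF of W = (W1, ..., Wn) is given by

$$f_W(w_1, \ldots, w_n \mid v_{i-1}; \alpha, \beta) = f\big(\phi_1(x_1 + v_0, \ldots, x_n + v_{n-1}), \ldots, \phi_n(x_1 + v_0, \ldots, x_n + v_{n-1})\big)\,|J|,$$

where

$$\phi_i(x_1 + v_0, \ldots, x_n + v_{n-1}) = \beta \log\left[w_i + \exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)\right] + \alpha, \quad i = 1, \ldots, n,$$

and J is the determinant of the n × n matrix given by

$$J = \det\left(\frac{\partial \phi_i}{\partial w_j}\right)_{i,j} =
\begin{vmatrix}
\dfrac{\beta}{w_1 + \exp\left(\frac{v_0 - \alpha}{\beta}\right)} & 0 & \cdots & 0 \\
0 & \dfrac{\beta}{w_2 + \exp\left(\frac{v_1 - \alpha}{\beta}\right)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \dfrac{\beta}{w_n + \exp\left(\frac{v_{n-1} - \alpha}{\beta}\right)}
\end{vmatrix}
= \prod_{i=1}^{n} \frac{\beta}{w_i + \exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)}. \tag{12.64}$$

Thus, according to Eq. (12.64), we have that

$$f_W(w_1, \ldots, w_n \mid v_{i-1}; \alpha, \beta) = \prod_{j=1}^{n}\left\{\frac{1}{\beta}\exp\left[\frac{x_j + v_{j-1} - \alpha}{\beta} - \exp\left(\frac{x_j + v_{j-1} - \alpha}{\beta}\right) + \exp\left(\frac{v_{j-1} - \alpha}{\beta}\right)\right]\right\} \times \prod_{i=1}^{n} \frac{\beta}{w_i + \exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)}.$$

Replacing x_j + v_{j−1} in the last equality, we obtain

$$\begin{aligned}
f_{W_1, \ldots, W_n}(w_1, \ldots, w_n \mid v_{i-1}; \alpha, \beta)
&= \frac{1}{\beta^n}\prod_{j=1}^{n}\exp\left[\frac{\beta \log\left(w_j + \exp\left(\frac{V_{j-1} - \alpha}{\beta}\right)\right) + \alpha - \alpha}{\beta} - w_j\right] \times \beta^n \prod_{i=1}^{n} \frac{1}{w_i + \exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)} \\
&= \prod_{j=1}^{n}\left[w_j + \exp\left(\frac{V_{j-1} - \alpha}{\beta}\right)\right]\exp(-w_j) \times \prod_{i=1}^{n} \frac{1}{w_i + \exp\left(\frac{v_{i-1} - \alpha}{\beta}\right)} \\
&= \prod_{j=1}^{n}\exp(-w_j).
\end{aligned}$$

The last expression does not depend on the v_{i−1}. Thus, W1, W2, ..., Wn are iid.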
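Theorem 12.5.1 suggests a simple computational check: transform the observed times with Eq. (12.61) and compare the result with an Exponential(1) sample. The sketch below simulates a GuGRP path and computes the Kolmogorov-Smirnov distance by hand. The parameter values, the virtual age rule v_i = v_{i−1} + q·x_i, and all function names are assumptions for illustration; a full test would also need critical values or p values, which are omitted.

```python
import math
import random


def gugrp_inverse(u, v, alpha, beta):
    """Inverse of the conditional CDF in Eq. (12.55), for 0 <= u < 1."""
    return beta * math.log(math.exp((v - alpha) / beta) - math.log(1.0 - u)) + alpha - v


def power_transform(x, v_prev, alpha, beta):
    """W_i of Eq. (12.61), computed from times x and virtual ages v_prev."""
    return [math.exp((xi + vi - alpha) / beta) - math.exp((vi - alpha) / beta)
            for xi, vi in zip(x, v_prev)]


def ks_statistic_exp1(w):
    """Kolmogorov-Smirnov distance between the sample w and Exponential(1)."""
    ws = sorted(w)
    n = len(ws)
    d = 0.0
    for i, wi in enumerate(ws, start=1):
        cdf = 1.0 - math.exp(-wi)
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Simulate one path under assumed values and the assumed virtual age rule.
alpha, beta, q, n = 4.0, 0.8, 0.3, 2000
rng = random.Random(12)
x, v_prev, v = [], [], 0.0
for _ in range(n):
    xi = gugrp_inverse(rng.random(), v, alpha, beta)
    x.append(xi)
    v_prev.append(v)
    v += q * xi

w = power_transform(x, v_prev, alpha, beta)
mean_w = sum(w) / n
ks = ks_statistic_exp1(w)
print(mean_w, ks)
```

With real data, α and β would be replaced by their estimates, and a sample mean of the W_i far from 1 or a large KS distance would indicate lack of fit.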
FIGURE 12.2 Representation (example) of the time during which an undesirable event (e.g., a recession) occurs, from Cristino et al. [11]. The occurrence of an undesirable event is identified as the decreasing period of the curve.
12.5.5 Case studies of GuGRP
The economy of the United States of America is one of the most important in the world. Thus, Cristino et al. [11] have used GuGRP to model and forecast times between consecutive US recessions, considering data from NBER [33]. Throughout this section, we refer to these undesirable events as American recessions. Because a recession itself spans a relevant time period, adaptations of the usual GRP framework have been required. In fact, GRP are point processes, that is, the duration of the event occurrence is negligible. Thus, Cristino et al. [11] have adapted the virtual age function, as follows.
12.5.5.1 Virtual age model for interval interventions
From here on it is supposed that some value evolves in nonstationary cycles, meaning that amplitude and frequency are neither regular (constant) nor negligible. Also, Cristino et al. [11] were interested in events characterized by periods of increasing values (times between consecutive events) and decreasing values (duration of the event). So, in Eqs. (12.55) and (12.56) the virtual age, v_{i−1}, will be considered as a function of the duration of the event of interest. Let τ_i (see Figs. 12.2 and 12.3) be the duration of the ith recession. Specifically, τ_{i−1} is the time between the instant at which the economy indicator suggests the beginning of the (i − 1)th recession period (max_{i−1}) and the instant at which the indicator suggests that the recovery has begun (min_{i−1}). The respective recession amplitude is thus L_{i−1} = max_{i−1} − min_{i−1}. Note that the analyzed data behave like a wave whose maximum is the "goal event." Therefore the time between consecutive events is the maximum time until the recession period, and Cristino et al. [11] thereby justify the application of the Gumbel distribution [3, 16].

FIGURE 12.3 Times between consecutive undesirable events (e.g., recessions), from Cristino et al. [11].

They define two types of virtual age functions,

$$v_i^{I} = v_{i-1}^{I} + q\left(1 - \exp[-\lambda(\tau_i, L_i)]\right)x_i \tag{12.65}$$

and

$$v_i^{II} = q\left[v_{i-1}^{II} + \left(1 - \exp[-\lambda(\tau_i, L_i)]\right)x_i\right], \tag{12.66}$$

where v_i is the virtual age after the occurrence of i events of interest, x_i is the (observed) time between the (i − 1)th and the ith events, λ(·) is a positive and monotonic function of τ_i and L_i, and q is a restoration coefficient. In fact, the coefficient q in Eqs. (12.65) and (12.66) is the ageing parameter of the model. It is easy to see that

$$v_i^{I} = v_{i-1}^{I} + q\left(1 - \exp[-\lambda(\tau_i, L_i)]\right)x_i = q\sum_{r=1}^{i}\left(1 - \exp[-\lambda(\tau_r, L_r)]\right)x_r \tag{12.67}$$

and

$$v_i^{II} = \sum_{r=1}^{i} q^{r}\left(1 - \exp[-\lambda(\tau_{i-r+1}, L_{i-r+1})]\right)x_{i-r+1}. \tag{12.68}$$

Similarly to the literature, $v_i^{I}$ reflects the virtual age Type I model and $v_i^{II}$ the virtual age Type II model.
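The two recursions and their closed forms can be compared numerically. The following sketch uses hypothetical times x_i and durations τ_i and, as an assumption, the linear form λ(τ, L) = δ·τ (the amplitude L is ignored); function names are illustrative only.

```python
import math


def effective_times(x, tau, delta):
    """y_i = (1 - exp(-lambda_i)) * x_i, assuming lambda(tau_i, L_i) = delta * tau_i."""
    return [(1.0 - math.exp(-delta * t)) * xi for xi, t in zip(x, tau)]


def type1_recursive(y, q):
    """Virtual ages v_1, ..., v_n from Eq. (12.65), starting at v_0 = 0."""
    ages, v = [], 0.0
    for yi in y:
        v = v + q * yi
        ages.append(v)
    return ages


def type2_recursive(y, q):
    """Virtual ages v_1, ..., v_n from Eq. (12.66), starting at v_0 = 0."""
    ages, v = [], 0.0
    for yi in y:
        v = q * (v + yi)
        ages.append(v)
    return ages


def type1_closed(y, q, i):
    """Closed form of Eq. (12.67) for v_i (i is 1-based)."""
    return q * sum(y[:i])


def type2_closed(y, q, i):
    """Closed form of Eq. (12.68) for v_i: sum over r of q^r * y_{i-r+1}."""
    return sum(q ** r * y[i - r] for r in range(1, i + 1))


# Hypothetical data: times between events and event durations, in years.
x = [3.2, 4.1, 2.7, 3.9]
tau = [1.0, 0.8, 1.5, 0.6]
delta, q = 0.5, 0.4

y = effective_times(x, tau, delta)
v1, v2 = type1_recursive(y, q), type2_recursive(y, q)
for i in range(1, len(y) + 1):
    print(i, v1[i - 1], type1_closed(y, q, i), v2[i - 1], type2_closed(y, q, i))
```

With 0 < q < 1, Type II discounts older contributions geometrically, whereas Type I weights all past effective times equally.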
12.5.5.2 Explicit model using the virtual age definition
Considering the definition of virtual age types in Eqs. (12.65) and (12.66), the GuGRP CDF is given by Eq. (12.55) with explicit virtual age defined by

$$v_i = \gamma \cdot q \cdot \sum_{r=1}^{i}\left(1 - e^{-\lambda(\tau_r, L_r)}\right)x_r + (1 - \gamma)\sum_{r=1}^{i} q^{i-r}\left(1 - e^{-\lambda(\tau_r, L_r)}\right)x_r, \tag{12.69}$$

with γ ∈ [0, 1]. The same definition in Eq. (12.69) is used in Eq. (12.56) to make the GuGRP PDF explicit. The joint GuGRP PDF, considering the parameter set θ = (α, β, γ, q), is¹

$$L = f(x_1, \ldots, x_n \mid \theta) = \prod_{j=1}^{n} f_{T_j}(x_j + v_{j-1} \mid \alpha, \beta, q) = \prod_{j=1}^{n} \frac{1}{\beta}\exp\left[\frac{x_j - \alpha + v_{j-1}}{\beta} - e^{\frac{x_j - \alpha + v_{j-1}}{\beta}} + e^{\frac{-\alpha + v_{j-1}}{\beta}}\right],$$

where v_{j−1} is given by Eq. (12.69), and v0 = 0. The log of the joint GuGRP PDF is then

$$\ell(x_1, \ldots, x_n \mid \theta) = \log L(x_1, \ldots, x_n \mid \theta) = -n\log(\beta) + \sum_{j=1}^{n}\left[\frac{x_j - \alpha + v_{j-1}}{\beta} - e^{\frac{x_j - \alpha + v_{j-1}}{\beta}} + e^{\frac{-\alpha + v_{j-1}}{\beta}}\right].$$

The MLE for θ is given by solving the system

$$\frac{\partial \ell}{\partial \alpha} = 0; \quad \frac{\partial \ell}{\partial \beta} = 0; \quad \frac{\partial \ell}{\partial \gamma} = 0; \quad \frac{\partial \ell}{\partial q} = 0.$$

¹ After the definition of λ(τ_i, L_i), it will be necessary to analyze other parameters. See next section.
Explicitly, writing for brevity

$$v_{j-1} = \gamma q s_{j-1} + (1 - \gamma)\sum_{r=0}^{j-1} q^{r} y_{j-r},$$

the system reads

$$\begin{cases}
e^{-\frac{\alpha}{\beta}}\displaystyle\sum_{j=1}^{n}\left[e^{\frac{x_j + v_{j-1}}{\beta}} - e^{\frac{v_{j-1}}{\beta}}\right] = n \\[2ex]
\displaystyle\sum_{j=1}^{n}\left\{\frac{x_j - \alpha + v_{j-1}}{\beta}\left[e^{\frac{x_j - \alpha + v_{j-1}}{\beta}} - 1\right] - \frac{-\alpha + v_{j-1}}{\beta}\, e^{\frac{-\alpha + v_{j-1}}{\beta}}\right\} = n \\[2ex]
\displaystyle\sum_{j=1}^{n}\frac{q s_{j-1} - \sum_{r=0}^{j-1} q^{r} y_{j-r}}{\beta} \times \left[1 - e^{\frac{x_j - \alpha + v_{j-1}}{\beta}} + e^{\frac{-\alpha + v_{j-1}}{\beta}}\right] = 0 \\[2ex]
\displaystyle\sum_{j=1}^{n}\frac{\gamma s_{j-1} + (1 - \gamma)\sum_{r=0}^{j-1} r q^{r-1} y_{j-r}}{\beta} \times \left[1 - e^{\frac{x_j - \alpha + v_{j-1}}{\beta}} + e^{\frac{-\alpha + v_{j-1}}{\beta}}\right] = 0
\end{cases} \tag{12.70}$$

where $s_k = \sum_{r=1}^{k}(1 - \exp[-\lambda(\tau_r, L_r)])x_r$ and $y_k = (1 - \exp[-\lambda(\tau_k, L_k)])x_k$. From the first equation in Eq. (12.70), one obtains

$$\hat{\alpha} = \hat{\beta}\log\left(\frac{1}{n}\sum_{j=1}^{n}\left[\exp\left(\frac{x_j + \hat{v}_{j-1}}{\hat{\beta}}\right) - \exp\left(\frac{\hat{v}_{j-1}}{\hat{\beta}}\right)\right]\right), \qquad \hat{v}_{j-1} = \hat{\gamma}\hat{q}s_{j-1} + (1 - \hat{\gamma})\sum_{r=0}^{j-1}\hat{q}^{r} y_{j-r},$$

where $(\hat{\alpha}, \hat{\beta}, \hat{\gamma}, \hat{q})$ is the MLE of (α, β, γ, q). So, the estimator of α only depends on the estimators of β, γ, and q.
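Because the estimator of α has a closed form once the other parameters are fixed, it can be profiled out of the optimization. The sketch below illustrates this under assumptions: the virtual ages are supplied directly as a hypothetical list (rather than built from γ, q, and λ), and the data, β, and the function names are invented for the example. It checks that the first equation of the system, the sum of exp((x_j − α + v_{j−1})/β) − exp((v_{j−1} − α)/β) equal to n, holds at the computed value.

```python
import math


def alpha_hat(x, v_prev, beta):
    """Closed-form estimate of alpha for given beta and virtual ages v_prev[j] = v_{j-1}."""
    n = len(x)
    s = sum(math.exp((xj + vj) / beta) - math.exp(vj / beta)
            for xj, vj in zip(x, v_prev))
    return beta * math.log(s / n)


def score_alpha_sum(x, v_prev, alpha, beta):
    """Left-hand side of the first likelihood equation; it equals n at alpha_hat."""
    return sum(math.exp((xj - alpha + vj) / beta) - math.exp((vj - alpha) / beta)
               for xj, vj in zip(x, v_prev))


# Hypothetical times between events and hypothetical virtual ages v_0, ..., v_{n-1}.
x = [3.2, 4.1, 2.7, 3.9, 2.2]
v_prev = [0.0, 0.9, 1.6, 2.1, 2.8]
beta = 0.8

a_hat = alpha_hat(x, v_prev, beta)
lhs = score_alpha_sum(x, v_prev, a_hat, beta)
print(a_hat, lhs, len(x))
```

A numerical optimizer then only has to search over the remaining parameters, calling `alpha_hat` inside the objective.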
12.5.5.3 GuGRP fit to the American recession data
Cristino et al. [11] split the recession data into two subsets: the first is composed of the recessions before the Great Depression (a depression, treated as an outlier) of August 1929 (20 recessions), and the second of the recessions after the Great Depression, taking the recession of May 1937 as the first of this period (13 recessions). These two subsets are defined based on Chauvet [6], Kauppi and Saikkonen [20], and Liu and Moench [29]. Adequacy tests were done for all American recessions taken together, without good results. The authors used λ(τ_i, L_i) = δ1 τ_i + δ2 L_i, where δ1 and δ2 are coefficients, with δ1 in units per year. Using this λ function in Eq. (12.55), one has

$$F_{T_i}(x_i + v_{i-1} \mid \alpha, \beta, \gamma, \delta, q) = 1 - \exp\left[e^{\frac{-\alpha + v_{i-1}}{\beta}} - e^{\frac{x_i - \alpha + v_{i-1}}{\beta}}\right],$$

where

$$v_{i-1} = \gamma \cdot q\sum_{r=1}^{i-1}\left(1 - e^{-\delta_1\tau_r - \delta_2 L_r}\right)x_r + (1 - \gamma)\sum_{r=1}^{i-1} q^{i-r}\left(1 - e^{-\delta_1\tau_r - \delta_2 L_r}\right)x_r. \tag{12.71}$$

As the data set from NBER [33] does not provide values of L_i for each recession, Cristino et al. [11] used δ1 = δ and δ2 = 0. The complete log-likelihood is then, explicitly,

$$\ell(\alpha, \beta, \gamma, \delta, q, n) = \sum_{j=1}^{n}\left[\exp\left(\frac{-\alpha + \nu_j}{\beta}\right) - \exp\left(\frac{x_j - \alpha + \nu_j}{\beta}\right) + \frac{x_j - \alpha + \nu_j}{\beta}\right] - n\log(\beta),$$

where, for brevity,

$$\nu_j = (1 - \gamma)\sum_{r=1}^{j} q^{r} x_{j-r+1}\left(1 - e^{-\delta\cdot\tau_{j-r+1}}\right) + \gamma \cdot q \cdot \sum_{r=1}^{j} x_r\left(1 - e^{-\delta\cdot\tau_r}\right).$$

TABLE 12.5 GuGRP parameter estimates for the recessions data set from Cristino et al. [11], per period

Period    α̂          β̂          γ̂           δ̂          q̂
B.G.D.    4.13292    0.788003   0.0636201   21.3288    0.395581
A.G.D.    3.49627    3.01529    0.403901    0.249004   −0.874915
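Note that the system in Eq. (12.70) gains a new equation:

$$\frac{\partial \ell}{\partial \delta} = 0 \Rightarrow \sum_{j=1}^{n}\sum_{r=0}^{j-1}\left(\tau_r e^{-\delta\tau_r}\right)x_r\left[1 - e^{\frac{x_j - \alpha + q y_r}{\beta}} + e^{\frac{-\alpha + q y_r}{\beta}}\right] = 0, \tag{12.72}$$

where $y_r = \sum_{r=0}^{j-1}\left(1 - e^{-\delta\tau_r}\right)x_r$.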
Using the MLE (numerically obtained from Wolfram Mathematica), Cristino et al. [11] have estimated the GuGRP parameters for the periods before the Great Depression (B.G.D.) and after the Great Depression (A.G.D.). See Table 12.5. For the recessions before the Great Depression, the authors performed a Gumbel GOFT using an Anderson-Darling test, and the null hypothesis (the data are distributed according to the Gumbel distribution with parameters α = 4.13292 and β = 0.788003) is rejected at the 5% level (p value < 10^{−4}). In turn, for the data after the Great Depression, the null hypothesis that the data are distributed according to the Gumbel distribution (with parameters α = 3.49627 and β = 3.01529) is not rejected at the 5% significance level based on the Pearson χ² test. On the other hand, the GuGRP models summarized in Table 12.5 adhere to the B.G.D. and A.G.D. data at significance levels below 0.051 [11], according to the GuGRP GOFT based on Eq. (12.61).

Cristino et al. [11] also simulated two other possibilities for the modeling: use of the pure Type I (γ = 1) and the pure Type II (γ = 0) virtual age (see the definition in Eq. (12.71)). In both cases, the model complexity decreases (see Table 12.6). For the two virtual age types, some adherence tests have been computed (Table 12.7). These tests can be used to select the best model among three options: using the MLE for γ (mixed virtual age), using γ = 1 (Type I), and using γ = 0 (Type II). Considering only the hypothesis tests, we must select the estimation using γ = 1, i.e., v_i = v_{i−1} + q x_i, because the performed tests do not reject H0 at the 5% significance level for this case (H0: the times between consecutive events (recessions) follow the Gumbel distribution with the indicated parameters). Cristino et al. [11] have also considered other criteria, such as simulations and predictive power.

TABLE 12.6 MLE for the pure virtual age types, from Cristino et al. [11]

Virtual age type                              α̂          β̂          δ̂         q̂
γ = 1 (virtual age Type I, Eq. (12.67))       2.82203    0.926477   105       0.020839
γ = 0 (virtual age Type II, Eq. (12.68))      3.52511    0.883711   22.5785   0.366962
12.5.5.4 Simulation from real data
Considering the GuGRP inverse function, Eq. (12.58), one now has

$$v_{i-1} = \gamma \cdot q \cdot \sum_{r=1}^{i-1}\left(1 - e^{-\lambda(\tau_r, L_r)}\right)x_r + (1 - \gamma)\sum_{r=1}^{i-1} q^{i-r}\left(1 - e^{-\lambda(\tau_r, L_r)}\right)x_r.$$
A total of 100,000 samples of size 20 were generated; Fig. 12.4 shows the results for the cumulative times to recessions before the Great Depression. Fig. 12.5 summarizes the variability of the next 13 American recessions. One can see that the GuGRP model envelops the real series. The durations of the recessions were modeled as independent Weibull random variables, based on an adequacy hypothesis test. The null hypothesis that the data are distributed according to the Weibull distribution with parameters (α, β, μ) = (0.845723, 0.891526, 0.591218)² is not rejected at the 5% significance level based on the Cramér-von Mises test (p value 0.712).

² α is the shape parameter, β the scale parameter, and μ the location parameter [19].
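A Monte Carlo envelope of this kind can be sketched as follows. The code is only an illustration and makes no claim to reproduce Figs. 12.4 and 12.5: it uses the pure Type I estimates reported in Table 12.6 as inputs, the simplified rule v_i = v_{i−1} + q·x_i, far fewer replicates than the 100,000 mentioned above, and hypothetical function names.

```python
import math
import random


def gugrp_inverse(u, v, alpha, beta):
    """Inverse of the conditional CDF in Eq. (12.55), for 0 <= u < 1."""
    return beta * math.log(math.exp((v - alpha) / beta) - math.log(1.0 - u)) + alpha - v


def simulate_cumulative_times(n_events, alpha, beta, q, rng):
    """One simulated path of cumulative times, assuming v_i = v_{i-1} + q * x_i."""
    t, v, path = 0.0, 0.0, []
    for _ in range(n_events):
        x = gugrp_inverse(rng.random(), v, alpha, beta)
        t += x
        v += q * x
        path.append(t)
    return path


def quantile(sorted_values, p):
    """Nearest-rank empirical quantile of an already sorted list."""
    k = min(len(sorted_values) - 1, max(0, int(round(p * (len(sorted_values) - 1)))))
    return sorted_values[k]


alpha, beta, q = 2.82203, 0.926477, 0.020839  # Type I row of Table 12.6
n_events, n_paths = 13, 2000                   # reduced number of replicates, for speed
rng = random.Random(7)

paths = [simulate_cumulative_times(n_events, alpha, beta, q, rng) for _ in range(n_paths)]

# 5%, 50%, and 95% quantiles of the cumulative time to each event.
envelope = []
for i in range(n_events):
    column = sorted(p[i] for p in paths)
    envelope.append((quantile(column, 0.05), quantile(column, 0.50), quantile(column, 0.95)))

for i, (low, median, high) in enumerate(envelope, start=1):
    print(i, round(low, 2), round(median, 2), round(high, 2))
```

Plotting the observed cumulative times against these quantile bands gives the kind of visual adherence check described in the text.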
TABLE 12.7 Some hypothesis tests for times between recessions after the Great Depression considering the Gumbel distribution, from Cristino et al. [11], per virtual age type (parameters α = 2.82203 and β = 0.926477 for Type I; α = 3.52511 and β = 0.883711 for Type II)

                      Type I                    Type II
Measure               Statistic   p-value       Statistic   p-value
Anderson-Darling      2.08362     0.0832902     8.38300     0.000111
Cramér-von Mises      0.441877    0.0561412     1.65003     0.000078
Pearson χ²            11.5        0.0740991     24.000      0.000522
FIGURE 12.4 Cumulative times (years) to recessions, from Cristino et al. [11]. The blue line represents the real data [33] and the box plots represent simulated data for the first 20 American recessions.
In Wright [45] the duration of recessions was modeled by an Exponential distribution, a special case of the Weibull distribution. Thus the GuGRP seems useful for modeling series of times between economic crises (recessions). The partition of the data required to carry out the applications indicates that the phenomenon under study is clearly complex. The analysis of the parameter q shows that this factor represents two different aspects: the action of market agents and governments in the face of crises, and the market reaction to such actions. In other words, the parameter q indicates how efficient the decisions are in the light of an economic crisis and how the economy behaves when facing these periods.

FIGURE 12.5 Cumulative times (years) to recessions, from Cristino et al. [11]. The blue points represent the real data [33] and the box plots represent simulated data for the last 13 American recessions.
12.6 Conclusion
Modeling and forecasting stochastic processes is a challenging task, and the study of systems exposed to interventions is especially intriguing. The interventions might promote rejuvenation, deterioration, or stabilization states. GRP seek to capture the response of the system to the interventions, allowing one to forecast when new interventions might arise and to evaluate the performance of the system itself as well as of the intervention process. The chapter aimed at introducing the GRP framework in general terms and then in the specific cases of the UGRP, WGRP, and GuGRP models. All of these alternatives allow one to model deteriorating systems, that is, systems for which times between undesirable events decrease over time. Further, WGRP also capture stable and improving systems. UGRP have been useful for illustrating GRP concepts (e.g., the CDF, PDF, and hazard function of the time to the occurrence of relevant events). In turn, cases contemplating interventions on oil and gas, aircraft, power generation and transmission, and economic systems have been studied to illustrate the usefulness of the WGRP and GuGRP approaches. Hypothesis tests, confidence intervals, and forecasting exercises have been performed in this way. Though GRP are point processes (i.e., the time during which the event of interest occurs is negligible), some authors have introduced flexible alternatives, allowing the use of GuGRP to model and forecast the time to the occurrence of recessions.
Some mathematical steps still prevent a wide applicability of the GRP models. For instance, analytical ML estimates are hard to obtain, so one has to resort to approximate solutions. In the same way, confidence intervals of the WGRP parameters are only available for a very restrictive parameter set (involving deteriorating systems only). Furthermore, GRP models based on alternative probability distributions, such as the Gamma, normal, and beta, seem to be absent from the literature.
Acknowledgement
The authors thank the financial support from the National Council for Scientific and Technological Development (CNPq) Brazil under grant numbers 308,725/2015-8 and 315,027/2018-5.
References
Abramowitz, M., Stegun, I., 1964. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover Publications.
Bakay, G.A., Shklyaev, A.V., 2020. Large deviations of generalized renewal process. Discrete Mathematics and Applications 30 (4), 215–241.
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., 2006. Statistics of extremes: theory and applications. John Wiley & Sons.
Bleistein, N., Handelsman, R.A., 1975. Asymptotic expansions of integrals. Courier Corporation.
Casella, G., Berger, R.L., 2002. Statistical inference, 2. Duxbury Pacific Grove, CA.
Chauvet, M., 1998. An econometric characterization of business cycle dynamics with factor structure and regime switching. International Economic Review 39 (4), 969–996.
Cordeiro, G.M., 1999. Introdução à Teoria Assintótica. Conselho Nacional de Desenvolvimento Scientifico e Tecnológico, Instituto de Matemática Pura e Aplicada. Rio de Janeiro.
Cramér, H., 1946. Mathematical methods of statistics. Princeton landmarks in mathematics and physics, 9. Princeton University Press.
Cramér, H., 1946. Mathematical methods of statistics. Princeton University Press, Princeton, NJ, 367–369.
Cristino, C.T., 2008. Risco e Confiabilidade sobre Estruturas Combinatórias: Uma Modelagem para Redes Elétricas. Mathematics Department of Universidade Federal de Pernambuco, Recife, PE.
Cristino, C.T., Żebrowski, P., Wildemeersch, M., 2020. Assessing the time intervals between economic recessions. PLOS ONE 15 (5), 1–20. doi:10.1371/journal.pone.0232615.
Daley, D.J., Vere-Jones, D., 2007. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media.
Ebeling, C., 2004. An introduction to reliability and maintainability engineering. McGraw-Hill.
Ferreira, R.J., Firmino, P.R.A., Cristino, C.T., 2015. A mixed Kijima model using the Weibull-based generalized renewal processes. PLoS ONE 10 (7), e0133772. doi:10.1371/journal.pone.0133772.
Gaudoin, O., Yang, B., Xie, M., 2006. Confidence intervals for the scale parameter of the power-law process. Communications in Statistics - Theory and Methods 35 (8), 1525–1538. doi:10.1080/03610920600637412.
Gumbel, E.J., 2012. Statistics of extremes. Courier Corporation.
Hankin, R.K.S., 2015. hypergeo: The Gauss Hypergeometric Function. R package version 1.2-11.
Jain, M., Maheshwari, S., 2006. Generalized renewal process (GRP) for the analysis of software reliability growth model. Asia-Pacific Journal of Operational Research 23 (02), 215–227. doi:10.1142/S0217595906000917.
Johnson, N.L., Kemp, A.W., Kotz, S., 2005. Univariate discrete distributions, 444. John Wiley & Sons.
Kauppi, H., Saikkonen, P., 2008. Predicting US recessions with dynamic binary response models. The Review of Economics and Statistics 90 (4), 777–791.
Kijima, M., 1989. Some results for repairable systems with general repair. Journal of Applied Probability 89–102.
Kijima, M., Morimura, H., Suzuki, Y., 1988. Periodical replacement problem without assuming minimal repair. European Journal of Operational Research 37 (2), 194–203.
Kijima, M., Sumita, U., 1986. A useful generalization of renewal theory: counting processes governed by non-negative Markovian increments. Journal of Applied Probability 71–88.
Kotz, S., Balakrishnan, N., Read, C., Vidakovic, B., 2006. Encyclopedia of Statistical Sciences. Number v. 4 Encyclopedia of Statistical Sciences. Wiley-Interscience.
Kotz, S., Johnson, N., Read, C., 1988. Encyclopedia of statistical sciences. Number v. 2 Encyclopedia of Statistical Sciences. Wiley.
Koutsellis, T., Mourelatos, Z.P., Hu, Z., 2019. Numerical estimation of expected number of failures for repairable systems using a generalized renewal process model. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering 5 (2).
Lai, C.-D., Murthy, D., Xie, M., 2006. Weibull distributions and their applications. Springer Handbook of Engineering Statistics. Springer.
Langseth, H., Lindqvist, B.H., 2006. Competing risks for repairable systems: a data study. Journal of Statistical Planning and Inference 136 (5), 1687–1700.
Liu, W., Moench, E., 2016. What predicts US recessions? International Journal of Forecasting 32 (4), 1138–1150.
Meeker, W.Q., Escobar, L.A., 1998. Statistical methods for reliability data. John Wiley & Sons.
Moura, M., Droguett, E., Ferreira, R., Firmino, P., 2014. A competing risk model for dependent and imperfect condition-based preventive and corrective maintenances. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability 228 (6), 590–605.
Murthy, D.P., Xie, M., Jiang, R., 2004. Weibull models, 505. John Wiley & Sons.
NBER, 2010. US business cycle expansions and contractions. http://www.nber.org/cycles.html. [Accessed: 08-March-2017].
Oliveira, C.C.F., 2016. Transformation of Generalized Renewal Processes in Poisson Homogeneous Processes and their Developments. Ph.D. thesis. Department of Statistics & Informatics, Federal Rural University of Pernambuco.
Oliveira, C.C.F., Cristino, C.T., Firmino, P.R.A., 2016. In the kernel of modelling repairable systems: a goodness of fit test for Weibull-based generalized renewal processes. Journal of Cleaner Production 133, 358–367.
Oliveira, C.C.F., Firmino, P.R.A., Cristino, C.T., 2019. A tool for evaluating repairable systems based on generalized renewal processes. Reliability Engineering & System Safety 183, 281–297.
OREDA, 2002. Offshore Reliability Data Handbook. Det Norske Veritas. 4th edition.
R Core Team, 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria.
Rai, R.N., Sharma, G., 2017. Goodness-of-fit test for generalised renewal process. International Journal of Reliability and Safety 11 (1-2), 116–131.
Ross, S.M., Kelly, J.J., Sullivan, R.J., Perry, W.J., Mercer, D., Davis, R.M., Washburn, T.D., Sager, E.V., Boyce, J.B., Bristow, V.L., 1996. Stochastic processes, 2. Wiley New York.
Smith, W.L., Leadbetter, M.R., 1963. On the renewal function for the Weibull distribution. Technometrics 5 (3), 393–396. doi:10.1080/00401706.1963.10490107.
Sweeney, D.W., 1963. On the computation of Euler's constant. Mathematics of Computation 17 (82), 170–178.
Weibull, W., 1951. Wide applicability. Journal of Applied Mechanics 51, 293–297.
West, B., 1984. Improving the Birdstrike Resistance and Durability of Aircraft Windshield Systems: Program Technical Summary. Technical Report. DTIC Document.
Wright, I., 2005. The duration of recessions follows an exponential not a power law. Physica A: Statistical Mechanics and its Applications 345 (3), 608–610.
Wu, S., 2019. A failure process model with the exponential smoothing of intensity functions. European Journal of Operational Research 275 (2), 502–513.
Wu, S., Scarf, P., 2017. Two new stochastic models of the failure process of a series system. European Journal of Operational Research 257 (3), 763–772.
Xiang, Y., Gubian, S., Suomela, B., Hoeng, J., 2013. Generalized simulated annealing for global optimization: The GenSA package for R. The R Journal 5 (1).
Yanez, M., Joglar, F., Modarres, M., 2002. Generalized renewal process for analysis of repairable systems with limited failure experience. Reliability Engineering & System Safety 77 (2), 167–180.
Yang, D., He, Z., He, S., 2016. Warranty claims forecasting based on a general imperfect repair model considering usage rate. Reliability Engineering & System Safety 145, 147–154.
Zhang, Y., Meeker, W.Q., 2005. Bayesian life test planning for the Weibull distribution with given shape parameter. Metrika 61 (3), 237–249.
Zhang, Y., Wang, L., Wang, S., Wang, P., Liao, H., Peng, Y., 2018. Auxiliary power unit failure prediction using quantified generalized renewal process. Microelectronics Reliability 84, 215–225.
Abstract
Generalized renewal processes (GRP) are a flexible formalism for modeling, forecasting, and evaluating repairable systems. Via a virtual age function, GRP extend classical reliability engineering alternatives, such as renewal and Poisson processes. We introduce GRP, taking Uniform-, Weibull-, and Gumbel-based models as exercises. Specifically, mean, variance, random generator function, hypothesis tests, and forecasting are reviewed in the light of recent literature [11, 14, 35, 36].

Keywords
Reliability models; Virtual age functions; Uniform distribution; Weibull distribution; Gumbel distribution; Exponential distribution; Maximum likelihood estimation; Hypothesis test; Forecasting; Asymptotic confidence interval
Chapter 13
Multiresponse maintenance modeling using desirability function and Taguchi methods
Suraj Rane a, Raghavendra Pai b, Anusha Pai c and Santosh B. Rane d
a Professor, Mechanical Engineering Department, Goa College of Engineering, Farmagudi, Goa, India, 403401.
b Project Management Office Lead (Asia Pacific), Syngenta, Corlim, Ilhas, Goa, India, 403110.
c Associate Professor, Computer Engineering Department, Padre Conceicao College of Engineering, Verna, Goa, India, 403722.
d Dean-Academics, Sardar Patel College of Engineering, Andheri, Mumbai, India, 400058
13.1 Introduction Efficiency of a production system is critically dependent on the condition of plant machinery. The quality of machine performance over a period of time during its life cycle depends on the way the machine is maintained. Often, flawless operation of the machines is taken for granted and it is only during production interruptions owing to machine breakdowns, the true importance of uninterrupted performance is understood. The on-time reliability of any machinery is dependent on quality of maintenance work performed under various constraints. The common constraints faced could be availability of maintenance personnel, repair facilities, spares, and consumables, and the most important of all is the downtime during maintenance work. Proper planning and implementation of maintenance practices is the key to successful maintenance. The objectives of maintenance team is to improve availability and operating condition of the machine, utilize optimum resources during maintenance and improve the overall equipment life. The aim of this work is to study simultaneously the responses with respect to various input variables. The approach will provide the maintenance planners with a holistic view in the planning task. Maintenance process from a multinational pharmaceutical industry is used to demonstrate the working of this approach. These industries are working with a very tight schedule and at the same time needs to comply with very critical regulations of food and drug industry. Improper maintenance of the production system can Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00013-1 Copyright © 2021 Elsevier Inc. All rights reserved.
353
354
Safety and reliability modeling and its applications
affect the product quality, production scheduling, equipment utilization, spare parts utilization, and delivery schedule. Preventive maintenance is usually classified into time-based maintenance risk-based maintenance, condition-based maintenance, and reliability centered maintenance. Literature studies have shown that considerable amount of research is being undertaken in these maintenance philosophies. Each of these approaches has its own philosophies, strengths and application areas. In time-based maintenance, time determines initiation of maintenance activity. The time is obtained from past data, manufacturer’s specification or any other relevant conditions. As the machine gets aged the uptime time reduces, thus making maintenance activity more frequent. An overview of time-based maintenance is presented by authors in (Ahmad and Kamaruddin, 2012), which focused on replacement and repair-replacement problems faced by industry. Authors in (Jonge et al., 2015) have investigated the benefits of postponing time-based maintenance activities, which revealed that more effective planning of maintenance is possible but the cost also increases in the initial phase. In risk-based maintenance, risk of failure is the focal point around which decisions regarding maintenance are undertaken. The approach takes into account probability of failure of the machine and its quantified impact on the ecosystems. A novel risk-based maintenance methodology is presented by authors in (Leoni et al., 2019) to estimate optimal maintenance interval so as to reduce the cost of maintenance and non-productive time of operation in a natural gas production measuring station. The risk and associated uncertainties is modeled using Bayesian networks. Techniques like net present value discounted cash flow index and the associated risk criteria in developing a methodology to improve the availability and lengthen life of power unit elements (Rusin and Wojaczek, 2019). 
The method was used in testing boiler tubes that were failing due to corrosion. Robust portfolio modeling is applied in risk-based maintenance decision-making to reduce the severity and likelihood of failures in gas networks. Such a framework can handle incomplete knowledge regarding technical parameters, operating conditions and degradation status of the components (Sacco et al., 2019). A dynamic risk-based maintenance methodology is applied to an offshore managed pressure drilling operation to improve the safety and reliability of systems; Bayesian networks were used to develop the model (Pui et al., 2017).

Condition-based maintenance reduces the dependence on time-based maintenance decisions. It relies solely on the health of the machine, which is monitored using sensors that provide data to decision-making algorithms. Condition-based maintenance was applied to a liquefied natural gas floating production storage and offloading vessel in order to identify abnormalities of equipment, diagnose fault conditions, predict the deteriorated states of equipment and provide timely maintenance support (Hwang et al., 2018). A dynamic condition-based maintenance and inspection policy based on partially observable Markov decision processes is proposed in order to reduce the total cost of a condition-based maintenance program (Nguyen et al., 2019). Use of stochastic Petri nets is made in reliability
Multiresponse maintenance modeling using desirability Chapter | 13
analysis of a subsea blowout preventer with condition-based maintenance. The authors concluded that fault coverage and redundancy have a significant impact on reliability, availability and mean time between failures. A condition-based maintenance decision framework is developed using a rolling-horizon approach for a multi-component system subject to a system reliability requirement. The approach identifies components for maintenance so as to meet the reliability goal at least cost (Shi et al., 2020). To address the likely risk posed by the environment to long-life assets, a condition-based maintenance policy is proposed by Liang et al. (2020), which resulted in lower asset lifecycle costs.

Reliability centered maintenance is a cost-effective maintenance strategy which focuses on improving reliability by identifying failure modes and developing maintenance procedures to tackle them. The availability of pipelines, which is critical in the transfer of natural gas, is studied using an availability-based reliability centered maintenance planning procedure (Zakikhani et al., 2020). The framework takes external corrosion of the pipeline into consideration and also considers the reliability profile obtained from Monte Carlo simulation. Rahmati et al. (2018) developed a novel stochastic reliability centered maintenance mechanism for a joint maintenance and production planning problem; the method also makes use of a stochastic condition-based maintenance approach. A reliability-based approach is used in the maintenance optimization of different design strategies for bridge decks in coastal environments (Navarro et al., 2019). In maintenance research studies, various soft computing techniques have also been used, such as the particle swarm-based approach (Chalabi et al., 2016), the fuzzy approach (Zhong et al., 2019), simulated annealing (Schlünz and Vuuren, 2013), and the genetic algorithm (Sin and Chung, 2019).
Industrial engineering tools and techniques have been used by various authors for optimizing maintenance parameters along with production and quality issues. Simultaneous use of a statistical process control chart and a preventive maintenance policy is implemented for manufacturing processes that go out-of-control due to an equipment failure (Cassady et al., 2000). The Taguchi loss function was used to model the simultaneous optimization of the preventive maintenance interval and control parameters (Pandey et al., 2012). Joint optimization of quality control and maintenance strategies for a two-machine production line was performed using the Taguchi method to minimize the average total cost (Azadeh et al., 2017). A control policy of integrated production, maintenance, and quality control planning for a continuous production system subject to quality deterioration is determined in order to minimize the expected average incurred cost and satisfy a quality constraint (Rivera-Gómez et al., 2020). A joint optimization problem of opportunistic preventive maintenance and production scheduling in a batch production system under varying operating conditions is solved using an improved genetic algorithm based on random keys, convex set theory, and the Jaya algorithm (Xiao et al., 2020). Joint maintenance and spare parts inventory optimization for multi-unit systems, with imperfect maintenance actions considered as random improvement factors, is attempted using stochastic
dynamic programming (Yan et al., 2020). Joint production and maintenance optimization in flexible hybrid manufacturing–remanufacturing systems under age-dependent deterioration is addressed using a stochastic dynamic programming approach, leading to the numerical solution of Hamilton-Jacobi-Bellman equations (Polotski et al., 2019). A model that integrates and optimizes production, maintenance and process control decisions simultaneously for a single machine is developed; it finds an optimal preventive maintenance schedule and then determines the decision variables that optimize the total cost per unit time (Duffuaa et al., 2020). An integrated model of production, quality, and maintenance is developed by Wang et al. (2020) to minimize the total cost using a simulation-based optimization method. A reinforcement learning-based approach is proposed in order to obtain the optimal joint production, maintenance and product quality control policies (Paraschos et al., 2020); an optimal trade-off between conflicting performance metrics was found in the optimization of the total expected profit of the system. The problem of integrated production, quality, and maintenance control of production lines where machines are subject to quality and reliability operation-dependent degradation is addressed with a two-machine line model in order to minimize the total cost incurred under a constraint on the outgoing quality (Bouslah et al., 2018).

The chapter is organized as follows: Section 13.2 discusses the concepts used to develop the strategy. The methodology adopted in the multi-response optimization work is presented in Section 13.3. Implementation of a case study in the pharmaceutical industry is discussed in Section 13.4. Section 13.5 presents the analysis of the results. Section 13.6 concludes the work and provides directions for future research.
13.2 Related works
13.2.1 Taguchi method
Genichi Taguchi developed the concept of quality engineering, in which variability reduction is the key philosophy (Taguchi, 1990). Taguchi’s viewpoint on quality is that whenever the value of a quality characteristic differs from its target value, there is a loss to society. Taguchi proposed a quadratic loss function to express this loss in terms of the deviation of the quality characteristic from the target. The method uses a small number of trials in the experimentation stage to obtain optimal results. The objective is to set the levels of the control variables in such a way that the effect of noise variables on the response is minimized (Ross, 1998). The approach uses important tools like the orthogonal array (OA), signal-to-noise (S/N) ratio, and analysis of variance (ANOVA). An orthogonal array is balanced: for every pair of columns, each combination of levels occurs equally often. Based on the number of variables to be studied, OAs are classified as L4, L8, L12, L16, etc. for two-level variables; L9, L18, L27, etc. for three-level variables, and so on
(Phadke, 1989). The S/N ratio is used to capture variability in the data. Depending on the type of quality characteristic, there are three types of S/N ratio: lower-the-better, higher-the-better and nominal-the-better, expressed in Eqs. (13.1)–(13.3), respectively.

\[ S/N_{LB} = -10 \log\left( \frac{1}{n} \sum_{i=1}^{n} y_i^{2} \right) \tag{13.1} \]

\[ S/N_{HB} = -10 \log\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{y_i^{2}} \right) \tag{13.2} \]

\[ S/N_{NB} = -10 \log\left( \frac{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}{n-1} \right) \tag{13.3} \]
where $y_i$ is the ith data value, $\bar{y}$ is the average of the data and n is the total number of data points. The ANOVA technique is used to identify the variables that have a statistically significant effect on the response. The F-value obtained from the experiment is compared with the standard F-value at different confidence levels to check the significance of each variable. Variables which affect the average are shown as significant in ANOVA on means, whereas variables which affect variation are shown as significant in ANOVA on S/N ratio. The technique uses other metrics such as sum of squares, degrees of freedom and mean square error. The significant variables must be controlled at their optimal levels; otherwise the solution will not be optimal.
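The three ratios in Eqs. (13.1)–(13.3) translate directly into code. The following Python sketch is illustrative only: the function names are hypothetical, and the logarithm is assumed to be base 10, as is conventional for S/N ratios expressed in decibels.

```python
import math


def sn_lower_the_better(y):
    """Eq. (13.1): penalizes large values of the response."""
    return -10 * math.log10(sum(v ** 2 for v in y) / len(y))


def sn_higher_the_better(y):
    """Eq. (13.2): penalizes small values (all values must be nonzero)."""
    return -10 * math.log10(sum(1 / v ** 2 for v in y) / len(y))


def sn_nominal_the_better(y):
    """Eq. (13.3): based on the sample variance (needs n >= 2 and some spread)."""
    mean = sum(y) / len(y)
    variance = sum((v - mean) ** 2 for v in y) / (len(y) - 1)
    return -10 * math.log10(variance)


# Hypothetical pair of repeated observations of a higher-the-better response.
observations = [0.9, 0.6]
print(round(sn_higher_the_better(observations), 3))
```

Because every ratio carries the leading factor of −10, a larger S/N value always indicates the more favorable (less variable) setting, whichever of the three forms is used.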
13.2.2 Desirability function
There are several techniques to optimize a single response for different sets of variables. When more than one response has to be optimized simultaneously, the desirability function approach is widely used. The approach, developed by Harrington (1965) and later modified by Derringer and Suich (1980), computes a desirability value $d_i$ (lying between 0 and 1) for each individual response, using Eq. (13.4) when the response is to be maximized and Eq. (13.5) when it is to be minimized.

\[ d_i = \begin{cases} 0, & Y_i \le Y_{Li} \\ \left( \dfrac{Y_i - Y_{Li}}{Y_{Ti} - Y_{Li}} \right)^{r}, & Y_{Li} < Y_i < Y_{Ti} \\ 1, & Y_i \ge Y_{Ti} \end{cases} \tag{13.4} \]
\[ d_i = \begin{cases} 1, & Y_i \le Y_{Ti} \\ \left( \dfrac{Y_{Ui} - Y_i}{Y_{Ui} - Y_{Ti}} \right)^{r}, & Y_{Ti} < Y_i < Y_{Ui} \\ 0, & Y_i \ge Y_{Ui} \end{cases} \tag{13.5} \]

where $Y_i$ is the observed value of the ith response, $Y_{Li}$ is the lower limit, $Y_{Ti}$ is the target value, $Y_{Ui}$ is the upper limit, and r is the index used to describe the form of the desirability function. Eq. (13.6) provides the overall desirability value for n responses.

\[ D = (d_1 \cdot d_2 \cdots d_n)^{1/n} \tag{13.6} \]
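Eqs. (13.4)–(13.6) can be sketched in a few lines of Python. The function names and the numerical inputs below are hypothetical and serve only to exercise the formulas; they are not taken from the case study.

```python
import math


def desirability_maximize(y, lower, target, r=1.0):
    """Eq. (13.4): 0 at or below the lower limit, 1 at or above the target."""
    if y <= lower:
        return 0.0
    if y >= target:
        return 1.0
    return ((y - lower) / (target - lower)) ** r


def desirability_minimize(y, target, upper, r=1.0):
    """Eq. (13.5): 1 at or below the target, 0 at or above the upper limit."""
    if y <= target:
        return 1.0
    if y >= upper:
        return 0.0
    return ((upper - y) / (upper - target)) ** r


def overall_desirability(values):
    """Eq. (13.6): geometric mean of the individual desirability values."""
    return math.prod(values) ** (1 / len(values))


# Hypothetical responses and limits, chosen only to exercise the functions.
d_yield = desirability_maximize(70.0, lower=50.0, target=100.0)
d_cost = desirability_minimize(30.0, target=20.0, upper=60.0)
print(d_yield, d_cost, overall_desirability([d_yield, d_cost]))
```

Since D is a geometric mean, a single response with zero desirability drives the overall value to zero, so no response can be sacrificed entirely in favor of another.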
The integrated use of the desirability function and Taguchi methods has found wide acceptance in the research community. Hsieh et al. (2005) proposed a procedure involving regression analysis and the desirability function to optimize a multi-response problem with Taguchi’s dynamic system. The effectiveness of this approach is demonstrated using the biological reduction of ethyl acetoacetate. An application of the desirability function and Taguchi methods to optimizing the machining parameters for turning glass-fibre reinforced plastic (GFRP) pipes is demonstrated by Sait et al. (2009). The approach has also been applied successfully in software engineering, in a study of the number of defects in software and the effort required to correct these defects (Pai et al., 2019).
13.2.3 Regression analysis
Regression analysis is an important statistical tool used to establish the relationship between independent input variables and the dependent responses. Based on the underlying relation between input variables and responses, regression models can be classified as simple linear regression and multiple linear regression. A first-order regression model with two independent variables ($x_1$ and $x_2$) is shown in Eq. (13.7).

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{13.7} \]
where y is the dependent response, the βs are regression coefficients and ε is the error term (Myers et al., 2016). The value of $R^2_{adj}$ obtained from the model is used to check how well the independent variables explain the response. Analysis of variance (ANOVA) is performed on the model to check its significance; ANOVA decomposes the variability in the dependent response. Regression analysis has been applied by various authors to problems such as optimization of maintenance cost (Edwards et al., 2000; Kim et al., 2018), drilling processes (Mondal et al., 2019), product analysis (Jalal et al., 2020), etc.
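For reference, the standard least-squares quantities behind these statements can be written as follows. This is textbook notation rather than anything specific to this chapter: $X$ (the $n \times p$ model matrix, including the intercept column), $p$ (the number of estimated coefficients) and $\hat{y}_i$ (the fitted value) are symbols introduced here for illustration.

```latex
\hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}, \qquad
SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2

R^2 = 1 - \frac{SS_E}{SS_T}, \qquad
R^2_{adj} = 1 - \frac{SS_E / (n - p)}{SS_T / (n - 1)}
```

Unlike $R^2$, the adjusted value penalizes each additional coefficient, which matters when a model with many terms is fitted to only a handful of runs.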
13.3 Methodology
The approach used in this work integrates the Taguchi method with the desirability function. The optimization process involves working on two responses simultaneously. It is very important to select the appropriate responses that affect the maintenance process. Next, the input variables which need to be controlled in order to obtain those responses are identified using quality control tools like brainstorming and the cause-and-effect (C-E) diagram. The variables obtained from the C-E diagram are then subjected to Pareto analysis to shortlist the critical few input variables for further study. The levels of the input variables are determined based on previous operating procedures or experience. Based on the number of input variables and their levels, an appropriate orthogonal array (OA) is selected. Using this OA, experiments are run and the values of the responses are recorded for analysis. The responses are converted into desirability values using Eq. (13.4) or (13.5), depending upon the type of response. These desirability values are then merged into a single response using Eq. (13.6) to obtain the overall desirability value. Finally, analysis of variance (ANOVA) and regression analysis are performed on the overall desirability value. Thus, the methodology used in the work involves the following steps:
1. Identify the responses and input variables.
2. Identify the levels of input variables.
3. Obtain an appropriate orthogonal array.
4. Conduct trials and record response values.
5. Transform response values to desirability values and obtain the overall desirability value.
6. Perform ANOVA on the overall desirability value.
7. Obtain the optimal levels of input variables.
8. Establish the relation between input variables and response.
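Step 3 can be illustrated with the two-level L8 array mentioned in Section 13.2.1. The Python sketch below builds the array from three base columns and checks the balance property; the function names are hypothetical, and the column labels (A, B, AB, C, AC, BC, D) are one common assignment for four factors and three two-factor interactions, assumed here for illustration.

```python
from collections import Counter
from itertools import combinations, product

COLUMN_NAMES = ("A", "B", "AB", "C", "AC", "BC", "D")


def l8_array():
    """Build the two-level L8 array; levels are coded 1 and 2."""
    rows = []
    for a, b, c in product((0, 1), repeat=3):
        # Interaction columns are the XOR of their parent columns; the
        # seventh column (a ^ b ^ c) is free to carry a fourth factor.
        coded = (a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c)
        rows.append(tuple(v + 1 for v in coded))
    return rows


def is_orthogonal(rows):
    """True if every pair of columns holds each level pair equally often."""
    for i, j in combinations(range(len(rows[0])), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


array = l8_array()
for trial, row in enumerate(array, start=1):
    print(trial, dict(zip(COLUMN_NAMES, row)))
print("orthogonal:", is_orthogonal(array))
```

Eight trials thus cover four two-level factors and three interactions, where a full factorial design would need sixteen.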
13.4 Case study
Pharmaceutical companies require that equipment is always available for producing various formulations with no margin of error. The pharmaceutical company considered in this work produces 51 different types of formulations in solid and liquid form. The stringent quality expected of the products requires that the machines producing them are also in very good condition. With age and usage, any machine will show signs of deterioration. A small variation in equipment performance will have a large negative impact on product quality. It is in these circumstances that the maintenance crew plays an important role. The working environment of the maintenance crew is always stressful due to constraints on time, resources and procedures. The challenge is to restore the failed or degraded machine to a functionally ‘as-good-as-new’ condition. Assumptions considered in this work are:
TABLE 13.1 Input variables and their levels

| Sr. No. | Variable description | Level 1 | Level 2 |
|---|---|---|---|
| 1 | Number of machines (A) | 8 | 12 |
| 2 | Number of standby machines (B) | 2 | 4 |
| 3 | Number of repair crew (C) | 3 | 5 |
| 4 | Mean time between failure (days) (D) | 34 | 40 |
(i) After maintenance, the machine is restored to a functionally ‘as-good-as-new’ condition.
(ii) The time-to-failure and time-to-repair of the machine follow an exponential distribution.
(iii) Repair is identical for all the machines considered in this study.

Most of the time, maintenance activities are performed when degradation is observed during the operation of the equipment. The degradation in performance results in lower productivity and poor-quality parts. Maintenance needs to be performed before the equipment breaks down completely. Also, the more time it takes to repair the equipment, the lower its availability for production. Thus the condition of the equipment at the time degradation is diagnosed is critical for its overall utilization after maintenance. Hence, equipment utilization is considered as one of the responses in the current study, which focuses on the effect of maintenance procedures on utilization.

The time taken to perform a maintenance activity is a random variable whose magnitude is determined by various factors such as the status and age of the equipment, the type of component under observation, the skill set of the crew and the availability of resources. Each of these factors in turn depends on sub-factors which are outside the control of the maintenance crew. The only recourse for the crew is to bring the equipment to working condition, or improve its condition, as early as possible within the given constraints. The equipment will be available for operation at the earliest only if it is repaired in the least possible time, thus improving its utilization. Hence, mean time-to-repair is taken as the second response in the study. Thus the two responses of interest are equipment utilization and mean time-to-repair (hours), considering the production area as the inspection unit. It is worth noting that these two parameters have an inverse relation to each other.
The data are recorded by observing machine downtime and uptime over a period of eight consecutive months. The input variables are number of machines, number of standby machines, crew size and mean time between failure (days). All the machines considered are identical to each other. The levels of each input variable are given in Table 13.1.
Since there are four input variables at two levels each, the Taguchi L8 orthogonal array (OA) is an appropriate choice (Ross, 1998). The experiment also studies interactions between input variables. Maintenance is performed as per the combination given for each trial, and the two responses, equipment utilization and mean time-to-repair (MTTR), are recorded as shown in Table 13.2. Two observations are recorded for each experiment. Desirability values ($d_i$) are computed using Eq. (13.4) for equipment utilization and Eq. (13.5) for mean time-to-repair, since they are to be maximized and minimized, respectively. The lower limit and target value of equipment utilization are taken as 0.65 and 0.9, respectively. The target value and upper limit of MTTR are taken as 3 and 6.1, respectively. These values were obtained either from past data or from operational requirements. The value of r is taken as unity, giving a linear desirability function. Subsequently, the overall desirability value (D) is computed using Eq. (13.6).
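As a worked illustration of this calculation, the first observation of experiment no. 1 in Table 13.2 (equipment utilization 0.865, MTTR 3.5 hours) gives, with r = 1 and the limits stated above:

```latex
d_1 = \frac{0.865 - 0.65}{0.90 - 0.65} = \frac{0.215}{0.25} = 0.860

d_2 = \frac{6.1 - 3.5}{6.1 - 3.0} = \frac{2.6}{3.1} \approx 0.839

D = (d_1 \cdot d_2)^{1/2} = \sqrt{0.860 \times 0.839} \approx 0.849
```

These values agree with the entries tabulated for that observation.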
13.5 Result analysis
Overall desirability values from the experiment are analyzed to obtain the optimal levels and the significant input variables. The data are analyzed for means and S/N ratios. Optimal levels are found using response plots. Fig. 13.1 and Fig. 13.2 show the response plots for means and S/N ratios, respectively. Since both the overall desirability value and the S/N ratio are higher-the-better characteristics, the level with the higher value is the optimal level. Based on these two plots, the optimal levels for the input variables are Level 1, Level 2, Level 2 and Level 1 for number of machines, number of standby machines, crew size, and mean time between failure, respectively. When fewer machines are dedicated to the production process and more standby machines are available, the overall desirability value is higher, because less time is spent on repair and standby machines can replace a failed machine. This also improves machine utilization. Maintenance crew size has a direct effect on improved equipment utilization and reduced mean time-to-repair, which is reflected in an increase in the overall desirability value. There is not much change in the overall desirability value when mean time between failure is changed from Level 1 to Level 2. However, a small decrease in the overall desirability value is observed when mean time between failure increases; this can be attributed to the combined effect of the machine running for a longer time, which increases utilization but results in more time-to-repair in the event of failure. The overall desirability value tilts toward whichever response has the dominant effect. Robustness, defined as the least variability in the response due to uncontrollable variables, is captured by the S/N ratio. Response plots of the S/N ratio provide the optimal levels of the variables which lead to robustness. The S/N ratio is a maximization-type metric.
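The level averages that underlie a response plot on means can be recomputed from the overall desirability values in Table 13.2. The Python sketch below does so and also forms an additive estimate at the optimal levels of factors A and C, the two factors reported as significant in Table 13.3. The helper names are hypothetical, and the additive estimate is the usual Taguchi prediction, assumed here rather than confirmed as the authors' exact calculation.

```python
# Levels of factors (A, B, C, D) and the two overall desirability
# observations of each trial, transcribed from Table 13.2.
TRIALS = [
    ((1, 1, 1, 1), (0.849, 0.405)),
    ((1, 1, 2, 2), (0.948, 0.522)),
    ((1, 2, 1, 2), (0.893, 0.344)),
    ((1, 2, 2, 1), (0.961, 0.549)),
    ((2, 1, 1, 2), (0.546, 0.051)),
    ((2, 1, 2, 1), (0.698, 0.285)),
    ((2, 2, 1, 1), (0.815, 0.217)),
    ((2, 2, 2, 2), (0.960, 0.505)),
]
FACTORS = ("A", "B", "C", "D")


def level_means(trials):
    """Average overall desirability at each level of each factor."""
    sums = {f: {1: 0.0, 2: 0.0} for f in FACTORS}
    counts = {f: {1: 0, 2: 0} for f in FACTORS}
    for levels, observations in trials:
        for factor, level in zip(FACTORS, levels):
            sums[factor][level] += sum(observations)
            counts[factor][level] += len(observations)
    return {f: {lv: sums[f][lv] / counts[f][lv] for lv in (1, 2)} for f in FACTORS}


means = level_means(TRIALS)
# Higher-the-better: pick the level with the larger average.
optimal = {f: max(means[f], key=means[f].get) for f in FACTORS}

all_values = [v for _, observations in TRIALS for v in observations]
grand_mean = sum(all_values) / len(all_values)
# Additive estimate using only the factors assumed significant (A and C).
estimate = grand_mean + sum(means[f][optimal[f]] - grand_mean for f in ("A", "C"))

print(optimal)
print(round(estimate, 3))
```

The level averages for mean time between failure differ only in the third decimal place, which is consistent with the remark above that changing this variable has little effect on the overall desirability value.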
The pattern of the response plots for S/N ratios is the same as that of the response plots for means, indicating that if these optimal levels
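TABLE 13.2 Orthogonal array, response values and overall desirability values

| Expt. no. | A | B | AB | C | AC | BC | D | Equipment utilization (1) | Equipment utilization (2) | MTTR (hrs) (1) | MTTR (hrs) (2) | d1 (1) | d1 (2) | d2 (1) | d2 (2) | D (1) | D (2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.865 | 0.862 | 3.5 | 5.5 | 0.860 | 0.848 | 0.839 | 0.194 | 0.849 | 0.405 |
| 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 0.882 | 0.861 | 3.1 | 5.1 | 0.928 | 0.844 | 0.968 | 0.323 | 0.948 | 0.522 |
| 3 | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 0.879 | 0.781 | 3.4 | 5.4 | 0.916 | 0.524 | 0.871 | 0.226 | 0.893 | 0.344 |
| 4 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 0.881 | 0.884 | 3.0 | 5.1 | 0.924 | 0.936 | 1.000 | 0.323 | 0.961 | 0.549 |
| 5 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 0.760 | 0.670 | 4.0 | 6.0 | 0.440 | 0.080 | 0.677 | 0.032 | 0.546 | 0.051 |
| 6 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 0.790 | 0.740 | 3.4 | 5.4 | 0.560 | 0.360 | 0.871 | 0.226 | 0.698 | 0.285 |
| 7 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 0.884 | 0.832 | 3.9 | 5.9 | 0.936 | 0.728 | 0.710 | 0.065 | 0.815 | 0.217 |
| 8 | 2 | 2 | 1 | 2 | 1 | 1 | 2 | 0.888 | 0.870 | 3.1 | 5.2 | 0.952 | 0.880 | 0.968 | 0.290 | 0.960 | 0.505 |

(1) and (2) denote the two observations recorded for each experiment.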
FIG. 13.1 Response plots on means

FIG. 13.2 Response plots on S/N ratios
TABLE 13.3 ANOVA results for mean-overall desirability value

| Source | Sum of squares | Degree of freedom | Mean square | F-value |
|---|---|---|---|---|
| Number of machines | 0.243 | 1 | 0.243 | 4.660^a |
| Number of standby machines | 0.111 | 1 | 0.111 | 2.121 |
| Crew size | 0.214 | 1 | 0.214 | 4.105^a |
| Mean time between failure | 0.00002 | 1 | 0.00002 | 0.00033 |
| Number of machines ∗ number of standby machines | 0.100 | 1 | 0.100 | 1.910 |
| Number of machines ∗ crew size | 0.014 | 1 | 0.014 | 0.264 |
| Number of standby machines ∗ crew size | 0.001 | 1 | 0.001 | 0.027 |
| Error | 0.574 | 11 | 0.052 | |
| Total | 1.256 | 15 | | |

^a Significant at 90% confidence level
are strictly followed, the overall desirability value will be maximized and robustness will be achieved. From the ANOVA on means shown in Table 13.3, it is clear that only two variables, i.e. number of machines and crew size, are significant at the 90% confidence level. This decision is based on the F-value. There appear to be no significant variables as far as the ANOVA on S/N ratios, shown in Table 13.4, is concerned. The results obtained require verification, which is done using an estimate from the data and its confidence interval. The optimal combination of Level 1, Level 2, Level 2 and Level 1 for number of machines, number of standby machines, crew size and mean time between failure, respectively, is available in the orthogonal array at experiment no. 4. The estimated value of overall desirability from the experiment, based on the significant variables, is 0.766. The confidence interval for this estimated mean was found to be ±0.544. This indicates that the verification test value should lie between 0.222 and 1.31. Using the optimal combination, verification tests were run for two months, and the equipment utilization and mean time-to-repair were found to be around 0.91 and 3.2 hours, respectively. The overall desirability value for these verification results is 0.967, which is higher than any value obtained in the experiment. Equipment utilization is thus improved and mean time-to-repair reduced, which validates the experimental results. The last seven maintenance activities related to machine failure
TABLE 13.4 Pooled ANOVA on S/N ratio-overall desirability value

| Source | Sum of squares | Degree of freedom | Mean square | F-value |
|---|---|---|---|---|
| Number of machines | 171.870 | 1 | 171.870 | 1.137 |
| Number of standby machines | 65.296 | 1 | 65.296 | 0.432 |
| Mean time between failure | 21.457 | 1 | 21.457 | 0.142 |
| Number of machines ∗ number of standby machines | 78.156 | 1 | 78.156 | 0.517 |
| Number of machines ∗ crew size | 60.536 | 1 | 60.536 | 0.401 |
| Pooled error | 302.227 | 2 | 151.114 | |
| Total | 560.851 | 7 | | |
FIG. 13.3 Observations based on optimal levels (MTTR in hrs)
of similar nature and performed using the optimal settings were observed. Equipment utilization was between 0.90 and 0.92, whereas mean time-to-repair was between 3.1 and 3.3 hours. Fig. 13.3 shows these observations superimposed on a single graph. The relationship between the response and the input variables is established using regression analysis. The regression model obtained for the mean overall desirability value is given in Eq. (13.8). Table 13.5 presents the ANOVA on this model, which resulted in $R^2$ = 99.80% and $R^2_{adj}$ = 98.58%. The high value of $R^2$ indicates a good fit of the model to the data. From the F-value we can conclude that the mean overall desirability value is significantly related to the input variables at the 90% confidence level.

\[
\begin{aligned}
\text{Overall desirability value} ={} & 1.790 - 0.1687\,(\text{Number of Machines}) - 0.2202\,(\text{Number of Standby Machines}) \\
& - 0.0219\,(\text{Crew Size}) - 0.00024\,(\text{Mean Time Between Failure}) \\
& + 0.02790\,(\text{Number of Machines})(\text{Number of Standby Machines}) \\
& + 0.01037\,(\text{Number of Machines})(\text{Crew Size})
\end{aligned}
\tag{13.8}
\]
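As a plausibility check, Eq. (13.8) can be evaluated at the settings of experiment no. 4 and compared with the mean of the two overall desirability values recorded for that trial in Table 13.2. The Python sketch below uses the coefficients as printed; the function and parameter names are hypothetical.

```python
def predicted_mean_desirability(machines, standby, crew, mtbf_days):
    """Evaluate Eq. (13.8) with the coefficients as printed in the chapter."""
    return (
        1.790
        - 0.1687 * machines
        - 0.2202 * standby
        - 0.0219 * crew
        - 0.00024 * mtbf_days
        + 0.02790 * machines * standby
        + 0.01037 * machines * crew
    )


# Settings of experiment no. 4 (Table 13.1, levels 1, 2, 2 and 1).
prediction = predicted_mean_desirability(machines=8, standby=4, crew=5, mtbf_days=34)

# Mean of the two overall desirability observations of experiment no. 4.
observed_mean = (0.961 + 0.549) / 2

print(f"predicted {prediction:.3f}, observed {observed_mean:.3f}")
```

The prediction comes out at roughly 0.75 against an observed mean of 0.755; the small gap is what one would expect from coefficients rounded for print.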
TABLE 13.5 ANOVA results for the regression model on mean-overall desirability value

| Source | Sum of squares | Degree of freedom | Mean square | F-value | P-value |
|---|---|---|---|---|---|
| Regression | 0.170315 | 6 | 0.0284 | 81.87 | 0.084 |
| Error | 0.000347 | 1 | 0.000347 | | |
| Total | 0.170662 | 7 | | | |
TABLE 13.6 ANOVA results for the regression model on S/N ratio-overall desirability value

| Source | Sum of squares | Degree of freedom | Mean square | F-value | P-value |
|---|---|---|---|---|---|
| Regression | 285.250 | 6 | 47.542 | 9.71 | 0.241 |
| Error | 4.897 | 1 | 4.897 | | |
| Total | 290.147 | 7 | | | |
The regression model on the S/N ratio of the overall desirability value is presented in Eq. (13.9). This model will help in achieving robustness in the maintenance decision-making process. Table 13.6 presents the ANOVA on this model, which resulted in $R^2$ = 98.31% and $R^2_{adj}$ = 88.19%. However, from the F-value we can state that the input variables are not significantly related to the S/N ratio. This was also observed earlier in the ANOVA on S/N ratio.

\[
\begin{aligned}
S/N ={} & 75.3 - 8.84\,(\text{Number of Machines}) - 9.03\,(\text{Number of Standby Machines}) \\
& - 6.44\,(\text{Crew Size}) - 0.386\,(\text{Mean Time Between Failure}) \\
& + 1.105\,(\text{Number of Machines})(\text{Number of Standby Machines}) \\
& + 0.973\,(\text{Number of Machines})(\text{Crew Size})
\end{aligned}
\tag{13.9}
\]
Developing such models will help the maintenance team to understand the sensitive nature of controlling the input variables so that the responses in the form of equipment utilization and mean time-to-repair can be obtained within desired limits.
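The goodness-of-fit figures quoted for the two regression models follow directly from the sums of squares and degrees of freedom in Tables 13.5 and 13.6. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical.

```python
def r_squared(ss_regression, ss_error, df_regression, df_error):
    """Return (R^2, adjusted R^2) from the rows of a regression ANOVA table."""
    ss_total = ss_regression + ss_error
    df_total = df_regression + df_error
    r2 = ss_regression / ss_total
    r2_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
    return r2, r2_adj


# Table 13.5: regression model on the mean overall desirability value.
mean_r2, mean_r2_adj = r_squared(0.170315, 0.000347, 6, 1)

# Table 13.6: regression model on the S/N ratio.
sn_r2, sn_r2_adj = r_squared(285.250, 4.897, 6, 1)

print(f"means: R2 = {mean_r2:.2%}, adjusted R2 = {mean_r2_adj:.2%}")
print(f"S/N:   R2 = {sn_r2:.2%}, adjusted R2 = {sn_r2_adj:.2%}")
```

With six model terms and only one error degree of freedom, the adjusted value is the more cautious of the two measures, which is why it falls noticeably below $R^2$ for the S/N model.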
13.6 Conclusion and future research directions
This chapter deliberates on the use of Taguchi methods and the desirability function approach in multi-response optimization of the preventive maintenance process. The input variables studied are number of machines, number of standby machines,
crew size and mean time between failure (days), whereas the two responses are equipment utilization and mean time-to-repair. The two responses are converted into a single response, called the overall desirability value, using the desirability function approach. Analysis of variance was performed on the overall desirability value to obtain the significant variables affecting the response. Optimal levels of the variables were obtained using response plots. The optimal combination was found to be 8 machines, 4 standby machines, a crew size of 5 and a mean time between failure of 34 days. Verification tests run using the optimal combination resulted in improved equipment utilization and reduced mean time-to-repair. Regression analysis was performed to relate the overall desirability value to the input variables. The approach will help in studying many independent factors with the aim of finding their effect on different dependent responses simultaneously. The larger the number of independent factors and dependent responses, the greater the effectiveness and benefits of this approach.

The approach presented in this chapter can be extended to different types of preventive maintenance techniques. In the case of risk-based maintenance, evaluation can be performed in terms of maintenance cost and equipment utilization. The dependent variables can be probability of failure and cost of failure. The health parameters of critical equipment can be considered as input parameters in condition-based maintenance studies. Responses like expected time to fail, expected time to repair and maintenance cost can be evaluated with our approach. In reliability centered maintenance, the possible input parameters are failure modes, failure probabilities, probability of occurrence of failure and failure risk. The responses can be maintenance cost and equipment utilization. The approach can be integrated into multi-criteria decision-making methodology for arriving at maintenance decisions.
This approach can be applied in the joint optimization of production and maintenance processes. In this case, raw material quality, production effectiveness and production scheduling can be considered as factors from the production area, whereas time-to-repair and resource constraints can be considered as factors from the maintenance area. These two aspects of joint optimization can also be studied in conjunction with supply chain management, with the overall goal of delivering high-quality products with the least lead time. Eventually, the approach can be integrated with Industry 4.0 by using concepts like artificial intelligence, machine learning and the industrial internet of things through analytical, practical or simulation studies. These concepts will enable stakeholders to automate the maintenance process and improve its effectiveness. The integrated proactive approach will reduce failure-related costs, improve crew morale and reduce waste on the shop floor. Industry 4.0 needs smart maintenance strategies coupled with real-time decision-making abilities. The challenges likely to be faced by maintenance decision makers are the availability of skilled manpower, programming skills, automation capabilities and integration-related bottlenecks. Overcoming these challenges will lead to a predictive, intelligent maintenance strategy.
References
Ahmad, R., Kamaruddin, S., 2012. An overview of time-based and condition-based maintenance in industrial application. Comput. Ind. Eng. 63, 135–149.
Azadeh, A., Sheikhalishahi, M., Mortazavi, S., Jooghi, E.A., 2017. Joint quality control and preventive maintenance strategy: A unique Taguchi approach. Int. J. Syst. Assurance Eng. Manag. 8, 123–134.
Bouslah, B., Gharbi, A., Pellerin, R., 2018. Joint production, quality and maintenance control of a two-machine line subject to operation-dependent and quality-dependent failures. Int. J. Prod. Econ. 195, 210–226.
Cassady, C.R., Bowden, R.O., Liew, L., Pohl, E.A., 2000. Combining preventive maintenance and statistical process control: a preliminary investigation. IIE Trans. 32, 471–478.
Chalabi, N., Dahane, M., Beldjilali, B., Neki, A., 2016. Optimisation of preventive maintenance grouping strategy for multi-component series systems: Particle swarm based approach. Comput. Ind. Eng. 102, 440–451.
Derringer, G., Suich, R., 1980. Simultaneous Optimization of Several Response Variables. J. Qual. Technol. 12, 214–219.
Duffuaa, S., Kolus, A., Al-Turki, U., El-Khalifa, A., 2020. An integrated model of production scheduling, maintenance and quality for a single machine. Comput. Ind. Eng. 142. doi:10.1016/j.cie.2019.106239.
Edwards, D.J., Holt, G.D., Harris, F.C., 2000. A comparative analysis between the multilayer perceptron "neural network" and multiple regression analysis for predicting construction plant maintenance costs. J. Qual. Maint. Eng. 6, 45–60.
Harrington, E.C., 1965. The desirability function. Ind. Qual. Control 21, 494–498.
Hsieh, K.L., Tong, L.I., Chiu, H.P., Yeh, H.Y., 2005. Optimization of a multi-response problem in Taguchi’s dynamic system. Comput. Ind. Eng. 49, 556–571.
Hwang, H.J., Lee, J.H., Hwang, J.S., Jun, H.B., 2018. A study of the development of a condition-based maintenance system for an LNG FPSO. Ocean Eng. 164, 604–615.
Jalal, M., Arabali, P., Grasley, Z., Bullard, J.W., Jalal, H., 2020. Behavior Assessment, Regression Analysis and Support Vector Machine (SVM) Modeling of Waste Tire Rubberized Concrete. J. Cleaner Prod. doi:10.1016/j.jclepro.2020.122960.
Jonge, B.D.S., Dijkstra, A., Romeijnders, W., 2015. Cost benefits of postponing time-based maintenance under lifetime distribution uncertainty. Reliab. Eng. Syst. Safe. 140, 15–21.
Kim, J.-M., Kim, T., Yu, Y.-J., Son, K., 2018. Development of a Maintenance and Repair Cost Estimation Model for Educational Buildings Using Regression Analysis. J. Asian Architect. Build Eng. 17, 307–312.
Leoni, L., Bahootoroody, A., Carlo, F.D., Paltrinieri, N., 2019. Developing a risk-based maintenance model for a Natural Gas Regulating and Metering Station using Bayesian Network. J. Loss Prev. Process Ind. 57, 17–24.
Liang, Z., Liu, B., Xie, M., Parlikad, A.K., 2020. Condition-based maintenance for long-life assets with exposure to operational and environmental risks. Int. J. Prod. Econ. doi:10.1016/j.ijpe.2019.09.00.
Mondal, N., Mandal, S., Mandal, M.C., 2019. FPA Based Optimization of Drilling Burr using Regression Analysis and ANN Model. Measurement. doi:10.1016/j.measurement.2019.107327.
Myers, R.H., Montgomery, D.C., Anderson-Cook, C.M., 2016. Response surface methodology: Process and product optimization using designed experiments. John Wiley & Sons, New Jersey.
Navarro, I.J., Martí, J.V., Yepes, V., 2019. Reliability-based maintenance optimization of corrosion preventive designs under a life cycle perspective. Environ. Impact Assess. Rev. 74, 23–34.
370
Safety and reliability modeling and its applications
Nguyen, K.T.P., Do, P., Huynh, K.T., Bérenguer, C., Grall, A, 2019. Joint optimization of monitoring quality and replacement decisions in condition-based maintenance. Reliab. Eng. Syst. Safe. 189, 177–195. Pai, A., Joshi, G., Rane, S., 2019. Multi-response optimization based on desirability function and Taguchi method in agile software development. Int. J. Syst. Assurance Eng. Manag. 10, 1444– 1452. Pandey, D., Kulkarni, M.S., Vrat, P., 2012. A methodology for simultaneous optimisation of design parameters for the preventive maintenance and quality policy incorporating Taguchi loss function. Int. J. Prod. Res. 50, 2030–2045. Paraschos, P.D., Koulinas, G.K., KOULOURIOTIS, D.E., 2020. Reinforcement learning for combined production-maintenance and quality control of a manufacturing system with deterioration failures. J. Manuf. Syst. 56, 470–483. Phadke, M.S., 1989. Quality Engineering Using Robust Design. Prentice Hall, Englewood. Polotski, V., Kenne, J.-P., Gharbi, A., 2019. Joint production and maintenance optimization in flexible hybrid Manufacturing–Remanufacturing systems under age-dependent deterioration. Int. J. Prod. Econ. 216, 239–254. Pui, G., Bhandari, J., Arzaghi, E., Abbassi, R., GARANIYA, V., 2017. Risk-based maintenance of offshore managed pressure drilling (MPD) operation. J. Petroleum Sci. Eng. 159, 513– 521. Rahmati, S.H.A., Ahmadi, A., Karimi, B, 2018. Multi-objective evolutionary simulation based optimization mechanism for a novel stochastic reliability centered maintenance problem. Swarm Evol. Comput. 40, 255–271. Rivera-Gómez, H., Gharbi, A., Kenné, J.-P., Montañoarango, O., Corona-Armenta, J.R, 2020. Joint optimization of production and maintenance strategies considering a dynamic sampling strategy for a deteriorating system. Comput. Ind. Eng. 140. doi:10.1016/j.cie.2020.106273. Ross, P.J., 1998. Taguchi Techniques for Quality Engineering. McGraw-Hill Publication, New York. Rusin, A., Wojaczek, A., 2019. 
Improving the availability and lengthening the life of power unit elements through the use of risk-based maintenance planning. Energy 180, 28–35. Sacco, T., Compare, M., Zio, E., Sansavini, G., 2019. Portfolio decision analysis for risk-based maintenance of gas networks. J. Loss Prev. Process Ind. 60, 269–281. Sait, A.N., Aravindan, S., Haq, A.N., 2009. Optimisation of machining parameters of glass-fibrereinforced plastic (GFRP) pipes by desirability function analysis using Taguchi technique. Int. J. Adv. Manuf. Technol. 43, 581–589. Schlünz, E.B., Vuuren, J.H.V, 2013. An investigation into the effectiveness of simulated annealing as a solution approach for the generator maintenance scheduling problem. Electr. Power Energy Syst. 53, 166–174. Shi, Y., Zhu, W., Xiang, Y., Feng, Q., 2020. Condition-based maintenance optimization for multicomponent systems subject to a system reliability requirement. Reliab. Eng. System Safe. 202. doi:10.1016/j.ress.2020.107042. Sin, I. H. & Chung, B. D. 2019. Bi-objective optimization approach for energy aware scheduling considering electricity cost and preventive maintenance using genetic algorithm. doi: 10.1016/j.jclepro.2019.118869. Taguchi, G., 1990. Introduction to Quality Engineering. Asian Productivity Organization, Tokyo, Japan. Wang, L., lu, Z., Ren, Y., 2020. Joint production control and maintenance policy for a serial system with quality deterioration and stochastic demand. Reliab. Syst. Safe. 199. doi:10.1016/j.ress.2020.106918.
Multiresponse maintenance modeling using desirability Chapter | 13
371
Xiao, L., Zhang, X., Tang, J., Zhou, Y., 2020. Joint optimization of opportunistic maintenance and production scheduling considering batch production mode and varying operational conditions. Reliab. Eng. Syst. Safe. 202. doi:10.1016/j.ress.2020.107047. Yan, T., Lei, Y., Wang, B., Han, T., Si, X., Li, N., 2020. Joint maintenance and spare parts inventory optimization for multi-unit systems considering imperfect maintenance actions. Reliab. Eng. Syst. Safe. 202. doi:10.1016/j.ress.2020.106994. Zakikhani, K., Nasiri, F., Zayed, T., 2020. Availability-based reliability-centered maintenance planning for gas transmission pipelines. Int. J. Pressure Vessels Piping 183. doi:10.1016/j.ijpvp.2020.104105. Zhong, S., Pantelous, A.A., Goh, M., Zhou, J., 2019. A reliability-and-cost-based fuzzy approach to optimize preventive maintenance scheduling for offshore wind farms. Mech. Syst. Sig. Process. 124, 643–663.
Non-Print Items

Abstract
Successful maintenance goes a long way toward determining both the service life of a machine and the quality of its performance. Maintenance decisions usually have to satisfy several objectives simultaneously. This chapter presents an application in which multiple responses are studied simultaneously to optimize a preventive maintenance process, using the desirability function approach together with the Taguchi method. The approach is illustrated with a case from the pharmaceutical industry. The input variables are the number of machines, the number of standby machines, the crew size, and the mean time between failures; the two responses to be optimized are equipment utilization and mean time to repair. The study resulted in improved response values. A regression model relating the input variables to the overall desirability value is also developed.

Keywords
Desirability function; Equipment utilization; Preventive maintenance; Regression analysis; Taguchi method
Chapter 14
Signature-based reliability study of r-within-consecutive-k-out-of-n: F systems
Ioannis S. Triantafyllou
Department of Computer Science & Biomedical Informatics, University of Thessaly, Lamia, Greece
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00011-8 Copyright © 2021 Elsevier Inc. All rights reserved.

14.1 Introduction
In the field of reliability modeling, an intriguing goal is the design of appropriate structures that are related to real-life applications or existing devices. A group of reliability models that has attracted considerable attention over the last decades is the family of consecutive-type systems. Because of their many applications in engineering, consecutive-type structures remain an active area of research. In general, a consecutive-type system consists of n linearly or circularly ordered components and stops operating whenever a prespecified consecutive-type condition (or more than one such condition) is fulfilled. Whether and when the failure rule of a structure is activated is closely related to its reliability characteristics, such as the reliability function, the mean residual lifetime, or the signature vector.

Within this framework, several systems have already been proposed in the literature. For example, a consecutive-k-out-of-n: F system consists of n linearly ordered components and stops operating if and only if at least k consecutive units break down (see, e.g., Derman et al., 1982; Triantafyllou and Koutras, 2008a, 2008b). Moreover, the m-consecutive-k-out-of-n: F system is a direct generalization of the traditional m-out-of-n: F system and the consecutive-k-out-of-n: F structure; it consists of n linearly ordered components and stops operating if and only if there are at least m nonoverlapping runs of k consecutive failed units (see, e.g., Griffith, 1986; Makri and Philippou, 1996; Eryilmaz et al., 2011). A further modification of the common consecutive-k-out-of-n: F system is the r-within-consecutive-k-out-of-n: F structure. This system was introduced by Tong (1985) and fails if and only if there exist k consecutive components that include at least r failed units (see also Griffith, 1986; Makri and Psillakis, 1996a, 1996b; Triantafyllou and Koutras, 2011). Many variations of the above structures have been suggested to accommodate more flexible operating principles. For some recent contributions in the field of consecutive-type structures, the interested reader is referred to Dafnis et al. (2019), Kumar and Ram (2018, 2019), Triantafyllou (2020a), and Kumar et al. (2019).

On the other hand, several real-life applications involve two different criteria that can lead to the failure of the corresponding device. To cover this application field, a variety of reliability structures have been proposed and studied in the literature. For instance, the (n, f, k) structure proposed by Chang et al. (1999) fails if and only if there exist at least f failed units or at least k consecutive failed units. Several reliability characteristics of (n, f, k) systems are studied in detail by Zuo et al. (2000) and Triantafyllou and Koutras (2014). Among others, the <n, f, k> structure (see, e.g., Cui et al., 2006; Triantafyllou, 2020b), the constrained (k, d)-out-of-n: F system (see, e.g., Eryilmaz and Zuo, 2010; Triantafyllou, 2020c), and the ((n1, n2, ..., nN), f, k) structure involving N modules (see, e.g., Cui and Xie, 2005; Eryilmaz and Tuncel, 2015) are indicative examples of consecutive-type reliability systems with two failure criteria. For detailed surveys of consecutive-type systems, we refer to the reviews by Chao et al. (1995) and Triantafyllou (2015) and to the monographs by Chang et al. (2000), Kuo and Zuo (2003), and Kumar et al. (2019). A survey of reliability approaches in various fields of engineering and the physical sciences is also provided by Ram (2013).

In the present chapter, the reliability characteristics of the r-within-consecutive-k-out-of-n: F structure are investigated. In Section 14.2, an algorithmic procedure for evaluating the coordinates of the signature vector of this structure is discussed, and numerical experiments under several choices of the design parameters are carried out. In Section 14.3, we present signature-based expressions for the reliability function, the mean residual lifetime, and the conditional mean residual lifetime of r-within-consecutive-k-out-of-n: F systems. Several numerical results, produced with the proposed algorithmic procedure, are also displayed and discussed in some detail. Section 14.4 provides signature-based comparisons of the r-within-consecutive-k-out-of-n: F structure with several well-known members of the class of consecutive-type systems. Finally, the Discussion section summarizes the outcomes of the chapter and highlights some conclusions drawn in the previous sections.
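The failure rules described in this introduction can be written down as simple predicates on the vector of component states. The following Python sketch is only an illustration: the function names are hypothetical, components are indexed from 0, and a state vector holds True for a failed component and False for a working one.

```python
from typing import Sequence


def m_consecutive_k_fails(failed: Sequence[bool], m: int, k: int) -> bool:
    """At least m nonoverlapping runs of k consecutive failed units."""
    runs = 0
    current = 0
    for is_failed in failed:
        current = current + 1 if is_failed else 0
        if current == k:      # one complete run of length k found
            runs += 1
            current = 0       # restart so that counted runs do not overlap
    return runs >= m


def consecutive_k_fails(failed: Sequence[bool], k: int) -> bool:
    """Consecutive-k-out-of-n: F rule: at least k consecutive failed units."""
    return m_consecutive_k_fails(failed, 1, k)


def r_within_k_fails(failed: Sequence[bool], r: int, k: int) -> bool:
    """r-within-consecutive-k-out-of-n: F rule: some window of k
    consecutive components contains at least r failed units."""
    n = len(failed)
    return any(sum(failed[s:s + k]) >= r for s in range(n - k + 1))


def nfk_fails(failed: Sequence[bool], f: int, k: int) -> bool:
    """(n, f, k) rule: at least f failed units OR at least k consecutive ones."""
    return sum(failed) >= f or consecutive_k_fails(failed, k)


if __name__ == "__main__":
    state = [True, False, True, False, False, False]
    # Two failures lie inside the window formed by positions 0..2.
    print(r_within_k_fails(state, r=2, k=3))
```

With r = k the r-within rule reduces to the consecutive-k rule, which gives a quick sanity check of the window logic.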
14.2 The signature vector of the r-within-consecutive-k-out-of-n: F structure
In this section, we develop a step-by-step algorithmic approach for computing the coordinates of the signature of r-within-consecutive-k-out-of-n: F structures with independent and identically distributed (i.i.d.) components. Let T denote the lifetime of a reliability structure consisting of n components with respective lifetimes $T_1, T_2, \ldots, T_n$. Under the assumption that the lifetimes $T_1, T_2, \ldots, T_n$ are i.i.d., the signature vector of the r-within-consecutive-k-out-of-n: F structure is given as $(s_1(r,k,n), s_2(r,k,n), \ldots, s_n(r,k,n))$ with

\[ s_i(r,k,n) = P(T = T_{i:n}), \quad i = 1, 2, \ldots, n, \tag{14.1} \]

where $T_{1:n} \le T_{2:n} \le \cdots \le T_{n:n}$ are the ordered component lifetimes. In other words, the probability $s_i(r,k,n)$, i = 1, 2, ..., n, is the ratio $A_i/n!$, where $A_i$ is the number of permutations of the components' lifetimes for which the ith component failure causes the system to stop. It is well known that the signature of a coherent system is closely related to several important reliability characteristics (see, e.g., Samaniego, 1985, 2007), which makes it a useful tool for evaluating the structure's performance.

For illustration purposes, we depict the r-within-consecutive-k-out-of-n: F structure by means of a diagram. As already mentioned, the system consists of n components and fails if and only if there exist k consecutive components that include at least r failed units. Fig. 14.1 illustrates the failure scenarios of the structure for specific values of its design parameters. A gray-filled box indicates a failed component, while a blank box indicates a working one. Each of these scenarios results in the overall failure of the structure. Indeed, if any of the depicted schemes occurs, the failure criterion of the 2-within-consecutive-3-out-of-6: F structure is met.

FIG. 14.1 The r-within-consecutive-k-out-of-n: F system for n = 6, r = 2, k = 3.

We next describe the algorithmic process for computing the coordinates of the signature vector of the r-within-consecutive-k-out-of-n: F system.
Step 0. Define the vector $w = (w_1, w_2, \ldots, w_n)$ with initial values $w_i = 0$, i = 1, 2, ..., n.
Step 1. Generate a random sample of size n from an arbitrary continuous distribution F. The resulting sample represents the component lifetimes of the r-within-consecutive-k-out-of-n: F structure.
Step 2. Determine the input parameters of the algorithm, namely the design parameters r and k, where 1 ≤ r ≤ k ≤ n.
Step 3. Define the random variable W as the rank of the chronologically ordered component failure that results in the stoppage of the r-within-consecutive-k-out-of-n: F model. The random quantity W ranges from 1 to n, according to which ordered failure turns out to be fatal for the system.
Step 4. Determine the value of W for the particular sample of size n produced in Step 1. Each time the random variable W takes a specific value i (i = 1, 2, ..., n), the respective coordinate of the vector w is increased by one, namely $w_i$ becomes $w_i + 1$.

Steps 1–4 are repeated m times, and the probability that the r-within-consecutive-k-out-of-n: F model fails at the ith ordered component failure is estimated by $w_i$ divided by m, namely $s_i(r,k,n) = w_i/m$, i = 1, 2, ..., n. The number of repetitions m should be as large as is practical.

We first apply this algorithmic procedure to the special case r = 2, namely to the 2-within-consecutive-k-out-of-n: F system. Although Triantafyllou and Koutras (2011) offered a recursive scheme for computing the corresponding signature in this case, we calculate the signature of the 2-within-consecutive-k-out-of-n: F system independently by means of the proposed simulation in order to verify its validity. For the simulation study carried out in this chapter, the MATLAB package was used, and 10,000 replications were performed to produce each numerical result. In short, we first implemented the algorithm in cases where the numerical results were already known (from an alternative approach), so that the correctness of the proposed method could be confirmed numerically. Table 14.1 displays the signature vector of the 2-within-consecutive-k-out-of-n: F model for different values of its design parameters. Both simulation-based and exact results are presented in each cell of the table: the first (upper) entries have been calculated with the proposed algorithm, while the second (lower) entries are the exact values of the respective signature vector. As we examine the same cases as those considered by Triantafyllou and Koutras (2011), the exact entries in Table 14.1 have been reproduced from their Table I (p. 318). Based on Table 14.1, one may readily deduce that the simulation-based outcomes are very close to the exact results in all cases considered. For instance, suppose that a 2-within-consecutive-3-out-of-7: F structure is implemented. Under this model, the exact nonzero signature coordinates, namely the second, third, and fourth coordinates of the respective vector, equal 52.4%, 44.8%, and 2.8%, respectively, while the simulation-based values produced by the proposed procedure equal 52.3%, 44.8%, and 2.9%, respectively.
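The simulation procedure of Steps 0–4 can be sketched in a few lines of code. The chapter's own computations were carried out in MATLAB; the Python version below is only an illustrative sketch, and the function names (`failure_rank`, `simulate_signature`), the uniform lifetimes, and the fixed seed are assumptions made for this example.

```python
import random


def failure_rank(lifetimes, r, k):
    """Return W, the rank of the ordered component failure at which the
    r-within-consecutive-k-out-of-n: F system fails (1 <= r <= k <= n)."""
    n = len(lifetimes)
    failed = [False] * n
    # Components fail in increasing order of their lifetimes.
    order = sorted(range(n), key=lambda j: lifetimes[j])
    for rank, j in enumerate(order, start=1):
        failed[j] = True
        # Only windows of k consecutive positions that contain j can change.
        first = max(0, j - k + 1)
        last = min(j, n - k)
        for start in range(first, last + 1):
            if sum(failed[start:start + k]) >= r:
                return rank
    raise ValueError("design parameters must satisfy 1 <= r <= k <= n")


def simulate_signature(r, k, n, m=10_000, seed=0):
    """Monte Carlo estimate of (s_1, ..., s_n) following Steps 0-4."""
    rng = random.Random(seed)
    w = [0] * n                                     # Step 0
    for _ in range(m):                              # repeat Steps 1-4 m times
        sample = [rng.random() for _ in range(n)]   # Step 1: U(0, 1) lifetimes
        w[failure_rank(sample, r, k) - 1] += 1      # Steps 3-4
    return [count / m for count in w]


if __name__ == "__main__":
    # 2-within-consecutive-3-out-of-4: F system
    print(simulate_signature(2, 3, 4))
```

For the 2-within-consecutive-3-out-of-4: F system the estimates should lie close to the exact values 0.833 and 0.167 reported for the second and third coordinates in Table 14.1. Because only the ordering of the lifetimes matters, any continuous lifetime distribution could replace the uniform one.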
TABLE 14.1 Exact and simulation-based signatures of the 2-within-consecutive-k-out-of-n: F structure
n | k | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | i=7 | i=8 | i=9 | i=10
2 | 2 | 0 | 1 / 1 | - | - | - | - | - | - | - | -
3 | 2 | 0 | 0.669 / 0.667 | 0.331 / 0.333 | - | - | - | - | - | - | -
3 | 3 | 0 | 1 / 1 | 0 | - | - | - | - | - | - | -
4 | 2 | 0 | 0.499 / 0.500 | 0.501 / 0.500 | 0 | - | - | - | - | - | -
4 | 3 | 0 | 0.834 / 0.833 | 0.166 / 0.167 | 0 | - | - | - | - | - | -
5 | 2 | 0 | 0.401 / 0.400 | 0.501 / 0.500 | 0.098 / 0.100 | 0 | - | - | - | - | -
5 | 3 | 0 | 0.697 / 0.700 | 0.303 / 0.300 | 0 | 0 | - | - | - | - | -
6 | 2 | 0 | 0.334 / 0.333 | 0.466 / 0.467 | 0.200 / 0.200 | 0 | 0 | - | - | - | -
6 | 3 | 0 | 0.602 / 0.600 | 0.398 / 0.400 | 0 | 0 | 0 | - | - | - | -
7 | 2 | 0 | 0.288 / 0.286 | 0.429 / 0.428 | 0.258 / 0.257 | 0.025 / 0.029 | 0 | 0 | - | - | -
7 | 3 | 0 | 0.523 / 0.524 | 0.448 / 0.448 | 0.029 / 0.028 | 0 | 0 | 0 | - | - | -
8 | 2 | 0 | 0.249 / 0.250 | 0.392 / 0.393 | 0.289 / 0.286 | 0.070 / 0.071 | 0 | 0 | 0 | - | -
8 | 3 | 0 | 0.467 / 0.464 | 0.462 / 0.464 | 0.071 / 0.071 | 0 | 0 | 0 | 0 | - | -
9 | 2 | 0 | 0.220 / 0.222 | 0.358 / 0.360 | 0.031 / 0.030 | 0.112 / 0.111 | 0.009 / 0.007 | 0 | 0 | 0 | -
9 | 3 | 0 | 0.419 / 0.417 | 0.464 / 0.464 | 0.117 / 0.119 | 0 | 0 | 0 | 0 | 0 | -
10 | 2 | 0 | 0.199 / 0.200 | 0.332 / 0.333 | 0.300 / 0.300 | 0.144 / 0.143 | 0.025 / 0.024 | 0 | 0 | 0 | 0
10 | 3 | 0 | 0.380 / 0.378 | 0.455 / 0.456 | 0.161 / 0.162 | 0.004 / 0.004 | 0 | 0 | 0 | 0 | 0
Each cell contains the simulation-based signature (first entry) and the exact signature (second entry). A single 0 means that both entries are zero; a dash marks coordinates with i > n.

Each simulation-based numerical result displayed in Table 14.1 was obtained with the step-by-step algorithmic procedure described earlier. For instance, consider the case r = 2, k = 3, n = 4. To compute the coordinates of the signature vector of the resulting 2-within-consecutive-3-out-of-4: F system, we apply the algorithm as follows.
Step 0. Define the vector $w = (w_1, w_2, w_3, w_4)$ with initial values $w_i = 0$, i = 1, 2, 3, 4.
Step 1. Generate a random sample of size n = 4 from the uniform distribution on (0, 1). The resulting sample represents the component lifetimes of the structure.
Step 2. Determine the input parameters of the algorithm, namely r = 2, k = 3.
Step 3. Define the random variable W as the rank of the chronologically ordered component failure that results in the stoppage of the 2-within-consecutive-3-out-of-4: F model. The random quantity W ranges from 1 to 4, according to which component failure leads to the overall failure.
Step 4. Determine the value of W for the particular sample produced in Step 1. Each time the random variable W takes a specific value i (i = 1, 2, 3, 4), the respective coordinate of the vector w is increased by one, namely $w_i$ becomes $w_i + 1$.

After 10,000 replications of Steps 1–4, the vector takes the form w = (0, 8340, 1660, 0). In other words, in 8340 out of 10,000 replications the system fails upon the second (in chronological order) component failure, while in the remaining replications the structure stops when its third component fails. Indeed, with r = 2 the system cannot fail at the first failure, and any three failed components among four necessarily include two that lie within three consecutive positions, so the system cannot survive the third failure. Consequently, the 2-within-consecutive-3-out-of-4: F model fails at the chronologically second (third) component failure with probability 83.4% (16.6%), namely the signature vector is estimated as s = (0, 0.834, 0.166, 0), in agreement with the corresponding row of Table 14.1.

We next apply the algorithmic process to calculate the signatures of r-within-consecutive-k-out-of-n: F models under several designs. More precisely, we extend our computations to larger values of the parameter r, namely r > 2. Table 14.2 presents the numerical outcomes for the signature vector of the r-within-consecutive-k-out-of-n: F systems for r = 3, k > r, and n = 5, 6, …, 10. For illustration purposes, consider the 3-within-consecutive-5-out-of-7: F model, namely the design parameters r = 3, k = 5, n = 7. The resulting system breaks down upon the third (fourth) ordered component failure with probability 62.626% (34.526%), while its fifth failed component leads to the failure of the whole structure with probability 2.848%. In addition, Table 14.2 is useful for reaching some conclusions about the impact of the design parameters k and n on the performance of the corresponding model. More precisely, Fig. 14.2 depicts the signatures of all 3-within-consecutive-4-out-of-n: F systems for n = 5, 6, …, 10. As is easily observed, the larger the parameter n, the later (in terms of ordered component failures) the structure tends to fail, that is, the longer its life expectancy becomes. On the other hand, Fig.
14.3 reveals the influence of parameter k on the signatures of the r-within-consecutive-k-out-of-n: F models. More specifically, we consider all possible 3-within-consecutive-k-out-of-9: F designs, and the corresponding signatures are displayed in Fig. 14.3. Based on Fig. 14.3, we readily observe that as parameter k increases, the failure of the resulting structure tends to take place sooner. In other words, it is preferable for the practitioner to design the 3-within-consecutive-k-out-of-n: F system for a given application with the parameter k as small as possible.

TABLE 14.2 The signatures of the 3-within-consecutive-k-out-of-n: F systems under several designs
n | k | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | i=7 | i=8 | i=9 | i=10
5 | 4 | 0 | 0 | 0.69866 | 0.30134 | 0 | - | - | - | - | -
6 | 4 | 0 | 0 | 0.50008 | 0.43184 | 0.06808 | 0 | - | - | - | -
6 | 5 | 0 | 0 | 0.80184 | 0.19816 | 0 | 0 | - | - | - | -
7 | 4 | 0 | 0 | 0.37040 | 0.45962 | 0.16998 | 0 | 0 | - | - | -
7 | 5 | 0 | 0 | 0.62626 | 0.34526 | 0.02848 | 0 | 0 | - | - | -
7 | 6 | 0 | 0 | 0.85830 | 0.14170 | 0 | 0 | 0 | - | - | -
8 | 4 | 0 | 0 | 0.29118 | 0.42376 | 0.28506 | 0 | 0 | 0 | - | -
8 | 5 | 0 | 0 | 0.40338 | 0.43568 | 0.16094 | 0 | 0 | 0 | - | -
8 | 6 | 0 | 0 | 0.71216 | 0.27348 | 0.01436 | 0 | 0 | 0 | - | -
8 | 7 | 0 | 0 | 0.89350 | 0.10650 | 0 | 0 | 0 | 0 | - | -
9 | 4 | 0 | 0 | 0.22416 | 0.38844 | 0.34040 | 0.04700 | 0 | 0 | 0 | -
9 | 5 | 0 | 0 | 0.40338 | 0.43568 | 0.16094 | 0 | 0 | 0 | 0 | -
9 | 6 | 0 | 0 | 0.59602 | 0.35532 | 0.04866 | 0 | 0 | 0 | 0 | -
9 | 7 | 0 | 0 | 0.77340 | 0.21888 | 0.00772 | 0 | 0 | 0 | 0 | -
9 | 8 | 0 | 0 | 0.91760 | 0.08240 | 0 | 0 | 0 | 0 | 0 | -
10 | 4 | 0 | 0 | 0.18110 | 0.34090 | 0.36002 | 0.11284 | 0.00514 | 0 | 0 | 0
10 | 5 | 0 | 0 | 0.33362 | 0.42572 | 0.24066 | 0 | 0 | 0 | 0 | 0
10 | 6 | 0 | 0 | 0.50138 | 0.40322 | 0.09540 | 0 | 0 | 0 | 0 | 0
10 | 7 | 0 | 0 | 0.66754 | 0.30414 | 0.02832 | 0 | 0 | 0 | 0 | 0
10 | 8 | 0 | 0 | 0.81876 | 0.17704 | 0.00420 | 0 | 0 | 0 | 0 | 0
10 | 9 | 0 | 0 | 0.93268 | 0.06732 | 0 | 0 | 0 | 0 | 0 | 0
A dash marks coordinates with i > n.

FIG. 14.2 The signatures of 3-within-consecutive-4-out-of-n: F models. [Plot of $s_i(3,4,n)$ against i for n = 5 (thickest line), n = 6 (2nd thickest line), n = 7 (3rd thickest line), n = 8 (4th thickest line), n = 9 (5th thickest line), and n = 10 (dashed line).]

FIG. 14.3 The signatures of 3-within-consecutive-k-out-of-9: F models. [Plot of $s_i(3,k,9)$ against i for k = 4 (thickest line), k = 5 (2nd thickest line), k = 6 (3rd thickest line), k = 7 (4th thickest line), and k = 8 (dashed line).]

Furthermore, Table 14.3 presents the numerical outcomes for the signature vector of the r-within-consecutive-k-out-of-n: F systems for r = 4, k > r, and n = 6, 7, …, 10.

TABLE 14.3 The signatures of the 4-within-consecutive-k-out-of-n: F systems under several designs
n | k | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | i=7 | i=8 | i=9 | i=10
6 | 5 | 0 | 0 | 0 | 0.60040 | 0.39960 | 0 | - | - | - | -
7 | 5 | 0 | 0 | 0 | 0.36966 | 0.48324 | 0.14710 | 0 | - | - | -
7 | 6 | 0 | 0 | 0 | 0.71518 | 0.28482 | 0 | 0 | - | - | -
8 | 5 | 0 | 0 | 0 | 0.24276 | 0.43702 | 0.28396 | 0.03626 | 0 | - | -
8 | 6 | 0 | 0 | 0 | 0.49318 | 0.43508 | 0.07174 | 0 | 0 | - | -
8 | 7 | 0 | 0 | 0 | 0.78612 | 0.21388 | 0 | 0 | 0 | - | -
9 | 5 | 0 | 0 | 0 | 0.16892 | 0.35710 | 0.35692 | 0.11706 | 0 | 0 | -
9 | 6 | 0 | 0 | 0 | 0.35942 | 0.45180 | 0.17646 | 0.01232 | 0 | 0 | -
9 | 7 | 0 | 0 | 0 | 0.59590 | 0.36520 | 0.03890 | 0 | 0 | 0 | -
9 | 8 | 0 | 0 | 0 | 0.83208 | 0.16792 | 0 | 0 | 0 | 0 | -
10 | 5 | 0 | 0 | 0 | 0.11792 | 0.28950 | 0.35432 | 0.23826 | 0 | 0 | 0
10 | 6 | 0 | 0 | 0 | 0.26186 | 0.42074 | 0.26926 | 0.04814 | 0 | 0 | 0
10 | 7 | 0 | 0 | 0 | 0.45284 | 0.42760 | 0.11450 | 0.00506 | 0 | 0 | 0
10 | 8 | 0 | 0 | 0 | 0.66352 | 0.31262 | 0.02386 | 0 | 0 | 0 | 0
10 | 9 | 0 | 0 | 0 | 0.86710 | 0.13290 | 0 | 0 | 0 | 0 | 0
A dash marks coordinates with i > n.

For illustration purposes, consider the 4-within-consecutive-6-out-of-9: F model, namely the design parameters r = 4, k = 6, n = 9. The resulting system breaks down at the 4th ordered component failure with probability 35.942% and at the 5th ordered component failure with probability 45.18%, while its 6th and 7th failed components lead to the failure of the whole structure with probability 17.646% and 1.232%, respectively. In addition, Table 14.3 is useful for reaching some conclusions about the influence of the design parameters k and n on the
performance of the corresponding model. More precisely, Fig. 14.4 depicts the signatures of 4-within-consecutive-6-out-of-n: F systems for n = 7, 8, 9, 10. As is easily observed, the larger the parameter n, the longer the respective system tends to operate. On the other hand, Fig. 14.5 depicts the impact of parameter k on the signatures of the r-within-consecutive-k-out-of-n: F structures; it covers all possible 4-within-consecutive-k-out-of-9: F models. Based on Fig. 14.5, we conclude that as parameter k increases, the failure of the resulting structure tends to take place sooner. In other words, it is preferable for the practitioner to design the 4-within-consecutive-k-out-of-n: F system with the parameter k as small as possible.

FIG. 14.4 The signatures of 4-within-consecutive-6-out-of-n: F models. [Plot of $s_i(4,6,n)$ against i for n = 7 (thickest line), n = 8 (2nd thickest line), n = 9 (3rd thickest line), and n = 10 (4th thickest line).]

FIG. 14.5 The signatures of 4-within-consecutive-k-out-of-9: F models. [Plot of $s_i(4,k,9)$ against i for k = 5 (thickest line), k = 6 (2nd thickest line), k = 7 (3rd thickest line), and k = 8 (4th thickest line).]

FIG. 14.6 The reliability polynomials of 3-within-consecutive-k-out-of-10: F models. [Plot of R(p) against p for k = 4 (thickest line), k = 5 (2nd thickest line), k = 6 (3rd thickest line), k = 7 (4th thickest line), k = 8 (5th thickest line), and k = 9 (dashed line).]
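For small n, the definition $s_i(r,k,n) = A_i/n!$ given at the beginning of this section can also be evaluated directly by enumerating all n! orders in which the components may fail. The sketch below is a brute-force check under that definition, not the recursive scheme of Triantafyllou and Koutras (2011); the function names are hypothetical, and the approach is practical only for small n because the number of permutations grows factorially.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def system_failed(failed, r, k):
    """True if some k consecutive components include at least r failed ones."""
    n = len(failed)
    return any(sum(failed[s:s + k]) >= r for s in range(n - k + 1))


def exact_signature(r, k, n):
    """Exact signature (s_1, ..., s_n) as fractions A_i / n!."""
    counts = [0] * n                      # counts[i - 1] plays the role of A_i
    for order in permutations(range(n)):  # one equally likely failure order
        failed = [False] * n
        for rank, j in enumerate(order, start=1):
            failed[j] = True
            if system_failed(failed, r, k):
                counts[rank - 1] += 1
                break
    return [Fraction(c, factorial(n)) for c in counts]


if __name__ == "__main__":
    # 2-within-consecutive-3-out-of-4: F system -> 0, 5/6, 1/6, 0
    print([str(x) for x in exact_signature(2, 3, 4)])
```

The values 5/6 and 1/6 agree with the exact entries 0.833 and 0.167 listed in Table 14.1 for n = 4 and k = 3.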
14.3 Further reliability characteristics of the r-within-consecutive-k-out-of-n: F structure
The signature vector of a reliability structure is closely connected to several well-known performance characteristics, which makes it a crucial tool for investigating coherent structures. For instance, the reliability polynomial of a system can be readily expressed in terms of its signature, while stochastic relationships between structures' lifetimes can be established by comparing the corresponding signatures (see, e.g., Samaniego, 1985; Koutras et al., 2016; Kochar et al., 1999). Consider the r-within-consecutive-k-out-of-n: F structure consisting of i.i.d. components with common reliability p. Then the reliability polynomial of the system can be expressed in terms of its signature as (see, e.g., Samaniego, 2007)

\[ R(p) = \sum_{j=1}^{n} \left( \sum_{i=n-j+1}^{n} s_i(r,k,n) \right) \binom{n}{j} p^{j} (1-p)^{n-j}. \tag{14.2} \]
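As a minimal sketch of how Eq. (14.2) turns a signature into a reliability polynomial, the following Python code expands the formula into powers of p. The function names are hypothetical, and the input is the simulated signature of the 3-within-consecutive-4-out-of-9: F system taken from Table 14.2, so the resulting coefficients inherit the simulation error of that signature.

```python
from math import comb


def reliability(signature, p):
    """Evaluate R(p) of Eq. (14.2) for a given signature vector."""
    n = len(signature)
    total = 0.0
    for j in range(1, n + 1):
        tail = sum(signature[n - j:])     # s_{n-j+1} + ... + s_n
        total += tail * comb(n, j) * p**j * (1 - p) ** (n - j)
    return total


def polynomial_coefficients(signature):
    """Coefficients c[0..n] such that R(p) = sum of c[d] * p**d."""
    n = len(signature)
    coeffs = [0.0] * (n + 1)
    for j in range(1, n + 1):
        tail = sum(signature[n - j:])
        # Binomial expansion of (1 - p)^(n - j).
        for l in range(n - j + 1):
            coeffs[j + l] += tail * comb(n, j) * comb(n - j, l) * (-1) ** l
    return coeffs


# Simulated signature for r = 3, k = 4, n = 9 (Table 14.2).
s = [0, 0, 0.22416, 0.38844, 0.34040, 0.04700, 0, 0, 0]
c = polynomial_coefficients(s)

if __name__ == "__main__":
    print([round(x, 4) for x in c])
```

The leading coefficients obtained this way, about 5.922 for p^4, 19.2024 for p^5, and −70.859 for p^6, match the first row of Table 14.4.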
Combining Eq. (14.2) with the algorithmic procedure presented previously, we obtain explicit expressions for the reliability polynomial of any member of the family of r-within-consecutive-k-out-of-n: F models. Table 14.4 displays the reliability polynomials of r-within-consecutive-k-out-of-n: F structures of order n = 9 or 10 for different values of the remaining design parameters. To investigate the impact of the design parameter k on the performance of the resulting structure, we next plot the reliability polynomials of r-within-consecutive-k-out-of-n: F structures consisting of n = 10 i.i.d. components, with the design parameter r = 3, against the common reliability p of the components. Fig. 14.6 presents six different designs, namely the 3-within-consecutive-k-out-of-10: F structures for k = 4 (thickest line), k = 5 (2nd thickest line), k = 6 (3rd thickest line), k = 7 (4th thickest line), k = 8 (5th thickest line), and k = 9 (dashed line). The reliability polynomial for k = 4 exceeds the remaining polynomials displayed in Fig. 14.6. In other words, the reliability of the 3-within-consecutive-k-out-of-10: F structure decreases as the design parameter k increases. Similar conclusions can be reached from the reliability polynomials of the structures of order n = 9 displayed in Table 14.4.

We next study the residual lifetime of the r-within-consecutive-k-out-of-n: F structure consisting of n i.i.d. components. More precisely, signature-based formulae for the mean residual lifetime (MRL, hereafter) and the conditional mean residual lifetime (CMRL, hereafter) of the r-within-consecutive-k-out-of-n: F model are discussed, and an application is presented for illustration.
The MRL function of the r-within-consecutive-k-out-of-n: F structure can be written as

\[ m_{r,k,n}(t) = E(T - t \mid T > t) = \frac{1}{P(T > t)} \int_{t}^{\infty} P(T > x)\,dx, \tag{14.3} \]

where T is the structure's lifetime. In other words, the MRL function gives the expected additional survival time of a structure of age t. As the system's reliability can be viewed as a mixture of the reliabilities of i-out-of-n: F structures (see Samaniego, 1985), we deduce that

\[ P(T > t) = \sum_{i=1}^{n} s_i(r,k,n)\, P(T_{i:n} > t). \tag{14.4} \]
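Eqs. (14.3) and (14.4) can be evaluated numerically once a component lifetime distribution is fixed. The sketch below rests on several assumptions that are not part of the chapter: the components are taken to be exponential with rate 1, the order-statistic survival function is computed from the standard binomial expression $P(T_{i:n} > t) = \sum_{j=0}^{i-1} \binom{n}{j} F(t)^j (1 - F(t))^{n-j}$ for i.i.d. lifetimes with distribution function F, and the integral is truncated and approximated by the trapezoidal rule. All function names are hypothetical.

```python
from math import comb, exp


def order_stat_survival(i, n, F_t):
    """P(T_{i:n} > t): fewer than i of the n i.i.d. components failed by t."""
    return sum(comb(n, j) * F_t**j * (1 - F_t) ** (n - j) for j in range(i))


def system_survival(signature, t, cdf):
    """Eq. (14.4): signature-weighted mixture of order-statistic survivals."""
    n = len(signature)
    F_t = cdf(t)
    return sum(s_i * order_stat_survival(i, n, F_t)
               for i, s_i in enumerate(signature, start=1))


def mean_residual_life(signature, t, cdf, horizon=40.0, steps=20_000):
    """Eq. (14.3) with the integral truncated at t + horizon (trapezoidal rule)."""
    h = horizon / steps
    values = [system_survival(signature, t + q * h, cdf)
              for q in range(steps + 1)]
    integral = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return integral / values[0]


def exp_cdf(t):
    """Distribution function of an exponential lifetime with rate 1 (assumed)."""
    return 1.0 - exp(-t)


# Simulated signature for r = 3, k = 5, n = 7 (Table 14.2).
s = [0, 0, 0.62626, 0.34526, 0.02848, 0, 0]

if __name__ == "__main__":
    print(mean_residual_life(s, 0.0, exp_cdf))   # mean lifetime at age 0
    print(mean_residual_life(s, 0.5, exp_cdf))   # MRL at age 0.5
```

A series system, with signature (1, 0, ..., 0), gives a simple check: for exponential components its MRL equals 1/n at every age.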
TABLE 14.4 Reliability polynomial of r-within-consecutive-k-out-of-n: F structures for different values of n, r, k
n | r | k | Reliability polynomial
9 | 3 | 4 | 5.922p^4 + 19.2024p^5 − 70.859p^6 + 74.1427p^7 − 33.1279p^8 + 5.71984p^9
9 | 3 | 5 | 20.2784p^5 − 30.9977p^6 + 7.3224p^7 + 6.23448p^8 − 1.83764p^9
9 | 3 | 6 | 6.13116p^5 + 9.40968p^6 − 29.016p^7 + 14.2783p^8 + 0.19864p^9
9 | 3 | 7 | 0.97272p^5 + 15.1435p^6 − 15.2669p^7 − 9.78768p^8 + 9.93832p^9
9 | 3 | 8 | 6.9216p^6 + 15.2352p^7 − 42.2352p^8 + 21.0784p^9
9 | 4 | 5 | 9.83304p^3 + 0.72342p^4 − 46.3957p^5 + 65.6897p^6 − 37.4227p^7 + 9.74484p^8 − 1.17236p^9
9 | 4 | 6 | 1.03488p^3 + 17.577p^4 − 22.6951p^5 − 21.6871p^6 + 45.9389p^7 − 21.1302p^8 + 1.96168p^9
9 | 4 | 7 | 4.9014p^4 + 26.4096p^5 − 70.6524p^6 + 40.4856p^7 + 9.8406p^8 − 9.9848p^9
9 | 4 | 8 | 21.1579p^5 − 0.63168p^6 − 89.0525p^7 + 104.368p^8 − 34.8421p^9
10 | 3 | 4 | 10.794p^4 − 23.3755p^5 + 65.0622p^6 − 138.554p^7 + 144.217p^8 − 70.4432p^9 + 13.3463p^10
10 | 3 | 5 | 50.5386p^6 − 122.189p^7 + 108.335p^8 − 42.2576p^9 + 6.573p^10
10 | 3 | 6 | 20.034p^6 − 20.3016p^7 − 14.2992p^8 + 19.3672p^9 − 3.8004p^10
10 | 3 | 7 | 5.9472p^6 + 16.1064p^7 − 39.0024p^8 + 15.8968p^9 + 2.052p^10
10 | 3 | 8 | 0.882p^6 + 18.2208p^7 − 14.9544p^8 − 18.2816p^9 + 15.1332p^10
10 | 3 | 9 | 8.0784p^7 + 20.7468p^8 − 55.7648p^9 + 27.9216p^10
10 | 4 | 5 | 50.0346p^4 − 150.877p^5 + 189.105p^6 − 128.338p^7 + 53.6382p^8 − 14.504p^9 + 1.94124p^10
10 | 4 | 6 | 10.1094p^4 + 19.3284p^5 − 93.2736p^6 + 97.6224p^7 − 33.1506p^8 − 0.77p^9 + 1.134p^10
10 | 4 | 7 | 1.0626p^4 + 23.7535p^5 − 19.803p^6 − 59.5752p^7 + 89.0694p^8 − 35.3444p^9 + 1.83708p^10
10 | 4 | 8 | 6.01272p^5 + 40.5972p^6 − 102.516p^7 + 48.8376p^8 + 27.4204p^9 − 19.3519p^10
10 | 4 | 9 | 27.909p^6 + 8.364p^7 − 147.546p^8 + 168.364p^9 − 56.091p^10
Combining formulae (14.3) and (14.4), we may express the MRL function as

$$m_{r,k,n}(t) = \frac{\sum_{i=1}^{n} s_i(r,k,n)\, P(T_{i:n} > t)\, m_{i:n}(t)}{\sum_{i=1}^{n} s_i(r,k,n)\, P(T_{i:n} > t)} \tag{14.5}$$
where $m_{i:n}(t)$ corresponds to the MRL function of an i-out-of-n: F model consisting of components with i.i.d. lifetimes $T_1, T_2, \ldots, T_n$ (1 ≤ i ≤ n) and can be written as

$$m_{i:n}(t) = \frac{1}{P(T_{i:n} > t)} \int_{t}^{\infty} P(T_{i:n} > x)\,dx \tag{14.6}$$

As the components are assumed to be i.i.d., the probability appearing in the above integral can be determined as (see, e.g., David and Nagaraja (2003))

$$P(T_{i:n} > t) = 1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} P(T_{1:j} \le t)$$

where $P(T_{1:j} \le t) = 1 - P(T_1 > t, \ldots, T_j > t) = 1 - \bar{F}_j(t, \ldots, t)$, while $\bar{F}_j(t_1, \ldots, t_j) = P(T_1 > t_1, \ldots, T_j > t_j)$ corresponds to the joint survival function of lifetimes $T_1, T_2, \ldots, T_j$ picked out from the i.i.d. random lifetimes $T_1, T_2, \ldots, T_n$. Therefore, under the i.i.d. assumption, the following holds true

$$P(T_{i:n} > t) = 1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} \left(1 - \bar{F}_j(t, \ldots, t)\right). \tag{14.7}$$
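A quick numerical check of Eq. (14.7) can be sketched as follows. The function takes the diagonal joint survival function as a callable `fbar_diag(j)`, which is a hypothetical interface chosen for this illustration, and the result is compared with the direct binomial expression that holds for independent components with common reliability p.

```python
from math import comb


def order_stat_survival(i, n, fbar_diag):
    """P(T_{i:n} > t) via Eq. (14.7).

    fbar_diag(j) must return the joint survival function of j lifetimes
    evaluated at (t, ..., t).
    """
    total = 0.0
    for j in range(n - i + 1, n + 1):
        sign = (-1) ** (j - n + i - 1)
        total += sign * comb(j - 1, n - i) * comb(n, j) * (1.0 - fbar_diag(j))
    return 1.0 - total


def binomial_survival(i, n, p):
    """Independent case: fewer than i of the n components have failed."""
    return sum(comb(n, m) * (1 - p) ** m * p ** (n - m) for m in range(i))


n, p = 5, 0.7
for i in range(1, n + 1):
    # for independent components the diagonal joint survival is p**j
    print(i, order_stat_survival(i, n, lambda j: p**j), binomial_survival(i, n, p))
```

Passing a different callable, for instance the Pareto diagonal survival function introduced later in this section, reuses the same routine unchanged.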
On the other hand, the so-called CMRL, namely the average value of the lifetime T − t under the restriction that $T_{z:n} > t$, corresponds to the expected remaining lifetime of a structure of age t given that at least n − z + 1 components of the structure are still working at that time (1 ≤ z ≤ n). Consequently, the CMRL function of the r-within-consecutive k-out-of-n: F structure is given as

$$m_{r,k,n}(t; z) = E(T - t \mid T_{z:n} > t) = \int_{0}^{\infty} P(T > t + x \mid T_{z:n} > t)\,dx \tag{14.8}$$

As

$$P(T > s \mid T_{z:n} > t) = \sum_{i=1}^{n} s_i(r,k,n)\, P(T_{i:n} > s \mid T_{z:n} > t) \quad \text{for } t < s,$$

it is straightforward that the CMRL function can be computed as

$$\begin{aligned} m_{r,k,n}(t; z) &= \sum_{i=1}^{n} s_i(r,k,n) \int_{0}^{\infty} P(T_{i:n} > t + x \mid T_{z:n} > t)\,dx \\ &= \frac{1}{P(T_{z:n} > t)} \sum_{i=1}^{n} s_i(r,k,n) \int_{0}^{\infty} P(T_{i:n} > t + x,\ T_{z:n} > t)\,dx \end{aligned} \tag{14.9}$$
A similar argumentation for the determination of the MRL and CMRL functions of other reliability models has been followed by Triantafyllou and Koutras (2014), Eryilmaz et al. (2011) and Triantafyllou (2020c). For illustration purposes, we next implement the above expressions under the Pareto model. More precisely, we assume that the random vector $(T_1, T_2, \ldots, T_n)$ follows a multivariate Pareto distribution, namely

$$\bar{F}_n(t_1, \ldots, t_n) = \left(\sum_{i=1}^{n} t_i - n + 1\right)^{-a}, \quad t_i > 1, \ i = 1, 2, \ldots, n$$

where a is a positive parameter. Under the Pareto distribution, we have

$$\bar{F}_j(t, t, \ldots, t) = (j(t-1)+1)^{-a}, \quad t > 1$$

while the corresponding MRL function of an i-out-of-n: F model can now be expressed as

$$m_{i:n}(t) = \frac{\displaystyle\int_{t}^{\infty} \left[1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} \left(1 - (j(x-1)+1)^{-a}\right)\right] dx}{\displaystyle 1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} \left(1 - (j(t-1)+1)^{-a}\right)}.$$

Consequently, the MRL and the CMRL function of the r-within-consecutive k-out-of-n: F model can now be viewed as

$$m_{r,k,n}(t) = \frac{\displaystyle\sum_{i=1}^{n} s_i(r,k,n) \left[1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} \left(1 - (j(t-1)+1)^{-a}\right)\right] m_{i:n}(t)}{\displaystyle\sum_{i=1}^{n} s_i(r,k,n) \left[1 - \sum_{j=n-i+1}^{n} (-1)^{j-n+i-1} \binom{j-1}{n-i} \binom{n}{j} \left(1 - (j(t-1)+1)^{-a}\right)\right]}$$

and

$$E(T_{r,k,n} - t \mid T_{1:n} > t) = \frac{1 + n(t-1)}{a-1} \sum_{i=1}^{n} s_i(r,k,n) \sum_{j=0}^{i-1} \sum_{l=0}^{j} (-1)^{l} \binom{n}{j} \binom{j}{l} \frac{1}{n-j+l}, \quad a > 1,$$
respectively. Table 14.5 presents the MRL and CMRL values of the r-within-consecutive k-out-of-n: F structure for different values of the design parameters r, n, k, a and t > 0.

TABLE 14.5 The MRL and CMRL of the r-within-consecutive k-out-of-n: F model under the multivariate Pareto model.

| n | (r, k) | t | MRL (a = 1.2) | CMRL (a = 1.2) | MRL (a = 1.8) | CMRL (a = 1.8) | MRL (a = 2.4) | CMRL (a = 2.4) | MRL (a = 3) | CMRL (a = 3) |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | (3, 4) | 2 | 7.9058 | 28.6071 | 1.91571 | 7.15178 | 1.07267 | 4.08673 | 0.740779 | 2.86071 |
| | | 3 | 12.8201 | 53.1275 | 3.14708 | 13.2819 | 1.77835 | 7.58965 | 1.2362 | 5.31275 |
| | | 4 | 17.7896 | 77.6479 | 4.39072 | 19.412 | 2.48984 | 11.0926 | 1.73481 | 7.76479 |
| | | 5 | 22.774 | 102.168 | 5.63752 | 25.5421 | 3.20275 | 14.5955 | 2.23416 | 10.2168 |
| | | 6 | 27.7645 | 126.689 | 6.8856 | 31.6722 | 3.91622 | 18.0984 | 2.73379 | 12.6689 |
| | | 7 | 32.7581 | 151.209 | 8.13432 | 37.8023 | 4.62997 | 21.6013 | 3.23354 | 15.1209 |
| | | 8 | 37.7535 | 175.73 | 9.38341 | 43.9324 | 5.34388 | 25.1042 | 3.73337 | 17.573 |
| | | 9 | 42.7501 | 200.25 | 10.6327 | 50.0625 | 6.05788 | 28.6071 | 4.23324 | 20.025 |
| | | 10 | 47.7474 | 224.77 | 11.8822 | 56.1926 | 6.77194 | 32.11 | 4.73315 | 22.477 |
| 6 | (3, 5) | 2 | 7.37351 | 23.8952 | 1.78967 | 5.9738 | 1.00264 | 3.4136 | 2.07215 | 2.38952 |
| | | 3 | 12.307 | 44.3768 | 3.02466 | 11.0942 | 1.7095 | 6.33954 | 4.96993 | 4.43768 |
| | | 4 | 17.2834 | 64.8584 | 4.26946 | 16.2146 | 2.42126 | 9.26549 | 8.73908 | 6.48584 |
| | | 5 | 22.2713 | 85.34 | 5.51682 | 21.335 | 3.13427 | 12.1914 | 13.2579 | 8.534 |
| | | 6 | 27.2639 | 105.822 | 6.76522 | 26.4554 | 3.84779 | 15.1174 | 18.45 | 10.5822 |
| | | 7 | 32.259 | 126.303 | 8.01415 | 31.5758 | 4.56156 | 18.0433 | 24.2613 | 12.6303 |
| | | 8 | 37.2555 | 146.785 | 9.26338 | 36.6962 | 5.27548 | 20.9693 | 30.6503 | 14.6785 |
| | | 9 | 42.2528 | 167.266 | 10.5128 | 41.8166 | 5.98949 | 23.8952 | 37.5841 | 16.7266 |
| | | 10 | 47.2507 | 187.748 | 11.7624 | 46.937 | 6.70357 | 26.8211 | 45.0352 | 18.7748 |
| 6 | (4, 5) | 2 | 8.96488 | 40.243 | 2.13917 | 10.0607 | 1.18343 | 5.749 | 0.809355 | 4.0243 |
| | | 3 | 13.7817 | 74.737 | 3.34749 | 18.6843 | 1.87703 | 10.6767 | 1.29697 | 7.4737 |
| | | 4 | 18.715 | 109.231 | 4.58283 | 27.3078 | 2.58422 | 15.6044 | 1.79281 | 10.9231 |
| | | 5 | 23.6805 | 143.725 | 5.85236 | 35.9313 | 3.29493 | 20.5321 | 2.29073 | 14.3725 |
| | | 6 | 28.6594 | 178.219 | 7.07083 | 44.5548 | 4.00706 | 25.4599 | 2.78948 | 17.8219 |
| | | 7 | 33.6452 | 212.713 | 8.3178 | 53.1783 | 4.71991 | 30.3153 | 3.28865 | 21.2713 |
| | | 8 | 38.635 | 247.207 | 9.56562 | 61.8018 | 5.43316 | 35.3153 | 3.78806 | 24.7207 |
| | | 9 | 43.6273 | 281.701 | 10.014 | 70.4252 | 6.14667 | 40.243 | 4.28761 | 28.1701 |
| | | 10 | 48.6213 | 316.195 | 12.0627 | 79.0488 | 6.86036 | 45.1707 | 4.78727 | 31.6195 |
| 7 | (3, 4) | 2 | 7.51198 | 28.9434 | 1.81981 | 7.23584 | 1.01796 | 4.13476 | 0.701869 | 2.89434 |
| | | 3 | 12.4364 | 54.2688 | 3.05258 | 13.5672 | 1.72361 | 7.75268 | 1.19663 | 5.42688 |
| | | 4 | 17.4095 | 79.5942 | 4.29658 | 19.8986 | 2.43494 | 11.3706 | 1.69485 | 7.95942 |
| | | 5 | 22.3957 | 104.92 | 5.54352 | 26.2299 | 3.14773 | 14.9885 | 2.19396 | 10.492 |
| | | 6 | 27.3872 | 130.245 | 6.79168 | 32.5613 | 3.86111 | 18.6064 | 2.69343 | 13.0245 |
| | | 7 | 32.3816 | 155.571 | 8.04044 | 38.8926 | 4.57479 | 22.2244 | 3.19307 | 15.5571 |
| | | 8 | 37.3775 | 180.896 | 9.28955 | 45.224 | 5.28865 | 25.8423 | 3.69281 | 18.0896 |
| | | 9 | 42.3745 | 206.221 | 10.5359 | 41.5553 | 6.00261 | 29.4602 | 4.19262 | 20.6221 |
| | | 10 | 47.3721 | 231.547 | 11.7884 | 57.8867 | 6.71664 | 33.0781 | 4.69247 | 23.1547 |
| 7 | (3, 5) | 2 | 7.11198 | 24.4981 | 1.72957 | 6.12452 | 0.970326 | 3.49973 | 0.670584 | 2.44981 |
| | | 3 | 12.0572 | 45.9339 | 2.96726 | 11.4835 | 1.67856 | 6.56199 | 1.16701 | 4.59339 |
| | | 4 | 17.0378 | 67.3697 | 4.21301 | 16.8424 | 2.39079 | 9.62425 | 1.66582 | 6.73697 |
| | | 5 | 22.028 | 88.8056 | 5.46086 | 22.2014 | 3.10405 | 12.6865 | 2.16523 | 8.88056 |
| | | 6 | 27.022 | 110.241 | 6.70957 | 27.5603 | 3.81772 | 15.7488 | 2.66488 | 11.0241 |
| | | 7 | 32.018 | 131.677 | 7.9587 | 32.9193 | 4.53159 | 18.811 | 3.16464 | 13.1677 |
| | | 8 | 37.0151 | 153.113 | 9.20808 | 38.2783 | 5.24558 | 21.8733 | 3.66448 | 15.3113 |
| | | 9 | 42.0129 | 174.549 | 10.4576 | 43.6372 | 5.95965 | 24.9356 | 4.16435 | 17.4549 |
| | | 10 | 47.0112 | 195.985 | 11.7072 | 48.9962 | 6.67376 | 27.9978 | 4.66425 | 19.5985 |
| 7 | (4, 5) | 2 | 8.56974 | 41.7275 | 2.05123 | 10.4319 | 1.13798 | 5.96107 | 0.780286 | 4.17275 |
| | | 3 | 13.4186 | 78.239 | 3.26763 | 19.5598 | 1.83609 | 11.177 | 1.27101 | 7.8239 |
| | | 4 | 18.3643 | 114.751 | 4.5061 | 28.6876 | 2.54496 | 16.3929 | 1.768 | 11.4751 |
| | | 5 | 23.3364 | 151.262 | 5.75012 | 37.8155 | 3.25654 | 21.6089 | 2.26653 | 15.1262 |
| | | 6 | 28.3194 | 187.774 | 6.99657 | 46.9434 | 3.96921 | 26.8248 | 2.76565 | 18.7774 |
| | | 7 | 33.3079 | 224.285 | 8.2442 | 56.0713 | 4.68242 | 32.0407 | 3.26507 | 22.4285 |
| | | 8 | 38.2997 | 260.797 | 9.4925 | 65.1992 | 5.39594 | 37.2567 | 3.76466 | 26.0797 |
| | | 9 | 43.2935 | 297.308 | 10.7412 | 74.3271 | 6.10965 | 42.4726 | 4.26435 | 29.7308 |
| | | 10 | 48.2887 | 333.82 | 11.9902 | 83.455 | 6.82349 | 47.6886 | 4.76411 | 33.382 |

Based on the above numerical results (Table 14.5), we may reach some interesting conclusions. More precisely, the MRL function of the r-within-consecutive k-out-of-n: F structure, under the multivariate Pareto model with parameter a, seems to:
• decrease with n (for fixed values of the remaining design parameters t, r, k, a)
• increase with r (for fixed values of the remaining design parameters t, n, k, a)
• increase with t (for fixed values of the remaining design parameters n, r, k, a)
• decrease with k (for fixed values of the remaining design parameters t, r, n, a)
• decrease with a (for fixed values of the remaining design parameters t, r, k, n).

In addition, Table 14.5 reveals that, under the multivariate Pareto model with parameter a, the CMRL function of the r-within-consecutive k-out-of-n: F system seems to:
• increase with n (for fixed values of the remaining design parameters t, r, k, a)
• increase with r (for fixed values of the remaining design parameters t, n, k, a)
• increase with t (for fixed values of the remaining design parameters n, r, k, a)
• decrease with k (for fixed values of the remaining design parameters t, r, n, a)
• decrease with a (for fixed values of the remaining design parameters t, r, k, n).
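The Pareto-model expressions behind Table 14.5 can be sketched in Python as follows. Several choices here are assumptions of the illustration rather than the chapter's procedure: the signature is obtained by exhaustive enumeration instead of simulation, the survival function of $T_{i:n}$ is written as a sum over "exactly j failed components" (the same inclusion-exclusion form that appears in the CMRL expression, equivalent for exchangeable lifetimes) so that the integral can be taken term by term, and all function names are hypothetical.

```python
from itertools import combinations
from math import comb


def signature(n, r, k):
    """Exact signature by enumerating failed sets (feasible for small n only)."""
    def failed(fset):
        return any(
            sum(1 for pos in range(s, s + k) if pos in fset) >= r
            for s in range(n - k + 1)
        )

    surviving = [1.0]
    for j in range(1, n + 1):
        sets = list(combinations(range(n), j))
        surviving.append(sum(1 for f in sets if not failed(set(f))) / len(sets))
    return [surviving[i - 1] - surviving[i] for i in range(1, n + 1)]


def mixture(sig, g):
    """sum_i s_i sum_{j<i} sum_{l<=j} (-1)^l C(n,j) C(j,l) g(n-j+l)."""
    n = len(sig)
    total = 0.0
    for i, s_i in enumerate(sig, start=1):
        for j in range(i):
            for l in range(j + 1):
                total += s_i * (-1) ** l * comb(n, j) * comb(j, l) * g(n - j + l)
    return total


def mrl(sig, t, a):
    """MRL under the multivariate Pareto model (needs a > 1)."""
    # the integral over (t, inf) of (m(x-1)+1)^(-a) equals (m(t-1)+1)^(1-a) / (m(a-1))
    num = mixture(sig, lambda m: (m * (t - 1) + 1) ** (1 - a) / (m * (a - 1)))
    den = mixture(sig, lambda m: (m * (t - 1) + 1) ** (-a))
    return num / den


def cmrl(sig, t, a):
    """CMRL given T_{1:n} > t, i.e. all components alive at time t (needs a > 1)."""
    n = len(sig)
    return (1 + n * (t - 1)) / (a - 1) * mixture(sig, lambda m: 1.0 / m)


sig = signature(6, 3, 4)
print(mrl(sig, 2, 3.0), cmrl(sig, 2, 1.2))
```

For n = 6, (r, k) = (3, 4) and t = 2 the printed values should be close to, though not identical with, the corresponding entries of Table 14.5 (roughly 0.74 and 28.6); a small gap is expected if the tabulated values rest on simulated rather than exact signatures.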
14.4 Signature-based comparisons among consecutive-type systems

In this section, we illustrate how signature vectors can be exploited for comparing the lifetimes of well-known reliability structures. We focus on results pertaining to the usual stochastic order. More precisely, if $T_1$ and $T_2$ denote the lifetimes of two systems with cumulative distribution functions $F_1$ and $F_2$, respectively, then $T_1$ is said to be stochastically smaller than $T_2$ in the usual stochastic order (denoted by $T_1 \le_{st} T_2$) if the following inequality holds true

$$P(T_1 > t) \le P(T_2 > t), \quad t \in (-\infty, +\infty)$$

Generally speaking, $T_1 \le_{st} T_2$ if and only if, for every t, $T_1$ is less likely than $T_2$ to take on values beyond t. Kochar et al. (1999) offered a sufficient condition for the signature-based stochastic ordering of structures' lifetimes. More specifically, denoting by $s_{1j}(n)$, $s_{2j}(n)$, j = 1, 2, ..., n the signature coordinates of two reliability systems, they proved that if

$$\sum_{j=i}^{n} s_{1j}(n) \le \sum_{j=i}^{n} s_{2j}(n) \tag{14.10}$$

for all i = 1, 2, ..., n, then $T_1 \le_{st} T_2$. It is noteworthy that the aforementioned ordering attribute has been extended by Navarro et al. (2005) to coherent structures with (possibly) dependent units. Taking advantage of the numerical experimentation carried out previously, we next compare the r-within-consecutive k-out-of-n: F system with several well-known consecutive-type structures. More precisely, we first consider six different reliability systems of order n = 8 and then compare their lifetimes stochastically with the aid of (14.10). Table 14.6 presents the stochastic orderings among the following structures:
• the 3-within-consecutive 4-out-of-8: F system
• the consecutive 3-out-of-8: F system (see, e.g., Derman et al., 1982)
• the 2-consecutive 2-out-of-8: F system (see, e.g., Eryilmaz et al., 2011)
• the (8,3,2) system (see, e.g., Triantafyllou and Koutras, 2014)
• the system (see, e.g., Triantafyllou, 2020b)
• the 3-out-of-8: F system.

By recalling (14.10) we establish several stochastic relationships between the consecutive-type structures considered. Based on Table 14.6, among the underlying structures, the 2-consecutive 2-out-of-8: F system and the consecutive 3-out-of-8: F system seem to perform better than the remaining competitors. On the other hand, the (8,3,2) system and the 3-out-of-8: F structure are stochastically worse than the remaining competitive models.
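The sufficient condition (14.10) is straightforward to check numerically. The sketch below is only an illustration under stated assumptions: signatures are computed exactly by enumerating failed sets (practical for n = 8, but not the simulation procedure of this chapter), only three of the six structures are coded, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def signature(n, is_failed):
    """Exact signature of an n-component system with the given failure rule."""
    surviving = [Fraction(1)]
    for j in range(1, n + 1):
        sets = list(combinations(range(n), j))
        alive = sum(1 for f in sets if not is_failed(set(f)))
        surviving.append(Fraction(alive, len(sets)))
    return [surviving[i - 1] - surviving[i] for i in range(1, n + 1)]


def window_rule(n, r, k):
    """Failure rule: at least r failures inside some k consecutive positions."""
    def is_failed(fset):
        return any(
            sum(1 for pos in range(s, s + k) if pos in fset) >= r
            for s in range(n - k + 1)
        )
    return is_failed


def st_leq(sig1, sig2):
    """Condition (14.10): tail sums of sig1 never exceed those of sig2."""
    return all(sum(sig1[i:]) <= sum(sig2[i:]) for i in range(len(sig1)))


n = 8
sig_within = signature(n, window_rule(n, 3, 4))          # 3-within-consecutive 4-out-of-8: F
sig_consec = signature(n, window_rule(n, 3, 3))          # consecutive 3-out-of-8: F
sig_3_of_8 = signature(n, lambda fset: len(fset) >= 3)   # 3-out-of-8: F

print(st_leq(sig_3_of_8, sig_within), st_leq(sig_within, sig_consec))
```

Exact fractions are used so that ties in the tail sums are compared without rounding error; the two printed checks correspond to the orderings of the 3-out-of-8: F and consecutive 3-out-of-8: F systems relative to the 3-within-consecutive 4-out-of-8: F system.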
TABLE 14.6 Stochastic orderings among consecutive-type structures of order n = 8 (each entry gives the ordering of the row structure relative to the column structure).

| | 3-within-consecutive-4-out-of-8: F | Consecutive-3-out-of-8: F | 2-consecutive-2-out-of-8: F | (8,3,2) system | system (Triantafyllou, 2020b) | 3-out-of-8: F |
|---|---|---|---|---|---|---|
| 3-within-consecutive-4-out-of-8: F | =st | ≤st | ≤st | ≥st | ≥st | ≥st |
| Consecutive-3-out-of-8: F | | =st | – | ≥st | ≥st | ≥st |
| 2-consecutive-2-out-of-8: F | | | =st | ≥st | ≥st | ≥st |
| (8,3,2) system | | | | =st | ≤st | ≤st |
| system (Triantafyllou, 2020b) | | | | | =st | ≥st |
| 3-out-of-8: F | | | | | | =st |

14.5 Discussion

In the present chapter, the r-within-consecutive k-out-of-n: F system with i.i.d. components ordered in a line has been studied. An algorithmic procedure for computing the signature vector of the r-within-consecutive k-out-of-n: F model has been presented in detail. The extensive numerical experimentation carried out provides the signatures of several members of the aforementioned class under specified designs. In addition, a signature-based reliability analysis of the performance of the r-within-consecutive k-out-of-n: F structures has been carried out. More precisely, the reliability function, the mean residual lifetime, and the conditional mean residual lifetime of these consecutive-type systems are studied in some detail, while several numerical and graphical results reveal the impact of the design parameters on their performance. It is concluded that the r-within-consecutive k-out-of-n: F system exhibits better performance for larger values of the design parameters r and n, while its performance weakens as the parameter k increases. Furthermore, the r-within-consecutive k-out-of-n: F system is stochastically compared with several consecutive-type reliability models of the same order and proves to be quite competitive. Finally, the reliability study of structures with two common failure criteria, which have not yet been fully covered, could be an interesting topic for future research.
References

Chang, J.G., Cui, L., Hwang, F.K., 1999. Reliabilities for systems. Stat. Probab. Lett. 43 (3), 237–242.
Chang, J.G., Cui, L., Hwang, F.K., 2000. Reliabilities of Consecutive-k Systems. Kluwer Academic Publishers, The Netherlands.
Chao, M.T., Fu, J.C., Koutras, M.V., 1995. Survey of reliability studies of consecutive-k-out-of-n: F & related systems. IEEE Trans. Reliab. 44 (1), 120–127.
Cui, L., Kuo, W., Li, J., Xie, M., 2006. On the dual reliability systems of (n,f,k) and s. Stat. Probab. Lett. 76 (11), 1081–1088.
Cui, L., Xie, M., 2005. On a generalized k-out-of-n system and its reliability. Int. J. Syst. Sci. 36, 267–274.
Dafnis, S.D., Makri, F.S., Philippou, A.N., 2019. The reliability of a generalized consecutive system. Appl. Math. Comput. 359, 186–193.
David, H.A., Nagaraja, H.N., 2003. Order Statistics, 3rd Edition. John Wiley & Sons, Hoboken, New Jersey.
Derman, C., Lieberman, G.J., Ross, S.M., 1982. On the consecutive-k-out-of-n: F system. IEEE Trans. Reliab. 31 (1), 57–63.
Eryilmaz, S., Koutras, M.V., Triantafyllou, I.S., 2011. Signature based analysis of m-consecutive k-out-of-n: F systems with exchangeable components. Nav. Res. Logist. 58 (4), 344–354.
Eryilmaz, S., Tuncel, A., 2015. Computing the signature of a generalized k-out-of-n system. IEEE Trans. Reliab. 64, 766–771.
Eryilmaz, S., Zuo, M.J., 2010. Constrained (k,d)-out-of-n systems. Int. J. Syst. Sci. 41 (3), 679–685.
Griffith, W.S., 1986. On consecutive-k-out-of-n: failure systems and their generalizations. In: Basu, A.P. (Ed.), Reliability and Quality Control. Elsevier, Amsterdam, pp. 157–165.
Kochar, S., Mukerjee, H., Samaniego, F.J., 1999. The signature of a coherent system and its application to comparison among systems. Nav. Res. Logist. 46 (5), 507–523.
Koutras, M.V., Triantafyllou, I.S., Eryilmaz, S., 2016. Stochastic comparisons between lifetimes of reliability systems with exchangeable components. Meth. Comput. Appl. Probab. 18, 1081–1095.
Kumar, A., Ram, M., 2018. Signature reliability of k-out-of-n sliding window system. In: Ram, M. (Ed.), Modeling and Simulation based Analysis in Reliability Engineering. Taylor & Francis Group, CRC Press, pp. 233–247.
Kumar, A., Ram, M., 2019. Signature of linear consecutive k-out-of-n systems. In: Ram, M., Dohi, T. (Eds.), Systems Engineering: Reliability Analysis using k-out-of-n structures. Taylor & Francis Group, CRC Press, pp. 207–216.
Kumar, A., Ram, M., Singh, S.B., 2019. Signature reliability evaluations: an overview of different systems. In: Ram, M. (Ed.), Reliability Engineering: Methods and Applications. Taylor & Francis Group, CRC Press, pp. 421–438.
Kuo, W., Zuo, M.J., 2003. Optimal Reliability Modeling: Principles and Applications. John Wiley & Sons, New Jersey.
Makri, F.S., Philippou, A.N., 1996. Exact reliability formulas for linear and circular m-consecutive-k-out-of-n:F systems. Microelectron. Reliab. 36, 657–660.
Makri, F.S., Psillakis, Z.M., 1996a. Bounds for reliability of k-within two dimensional consecutive-r-out-of-n: failure systems. Microelectron. Reliab. 36, 341–345.
Makri, F.S., Psillakis, Z.M., 1996b. On consecutive k-out-of-r-from-n:F systems: a simulation approach. Int. J. Modelling Simul. 16, 15–20.
Navarro, J., Ruiz, J.M., Sandoval, C.J., 2005. A note on comparisons among coherent systems with dependent components using signatures. Stat. Probab. Lett. 72, 179–185.
Ram, M., 2013. On system reliability approaches: a brief survey. Int. J. Syst. Assur. Eng. Manage. 4 (2), 101–117.
Samaniego, F.J., 1985. On closure of the IFR class under formation of coherent systems. IEEE Trans. Reliab. 34 (1), 69–72.
Samaniego, F.J., 2007. System Signatures and Their Applications in Engineering Reliability. Springer, New York.
Tong, Y.L., 1985. A rearrangement inequality for the longest run, with an application to network reliability. J. Appl. Probab. 22, 386–393.
Triantafyllou, I.S., 2015. Consecutive-type reliability systems: an overview and some applications. J. Qual. Reliab. Eng. 2015, Article ID 212303, 20 pages.
Triantafyllou, I.S., Koutras, M.V., 2014. Reliability properties of systems. IEEE Trans. Reliab. 63 (1), 357–366.
Triantafyllou, I.S., Koutras, M.V., 2008a. On the signature of coherent systems and applications for consecutive-k-out-of-n: F systems. In: Bedford, T., Quigley, J., Walls, L., Alkali, B., Daneshkhah, A., Hardman, G. (Eds.), Advances in Mathematical Modeling for Reliability. IOS Press, Amsterdam, pp. 119–128.
Triantafyllou, I.S., Koutras, M.V., 2008b. On the signature of coherent systems and applications. Probab. Eng. Inf. Sci. 22 (1), 19–35.
Triantafyllou, I.S., Koutras, M.V., 2011. Signature and IFR preservation of 2-within-consecutive k-out-of-n:F systems. IEEE Trans. Reliab. 60 (1), 315–322.
Triantafyllou, I.S., 2020a. On consecutive k1 and k2-out-of-n: F reliability systems. Mathematics 8, 630.
Triantafyllou, I.S., 2020b. Reliability study of systems: a generating function approach. Int. J. Math. Eng. Manag. Sci., accepted for publication.
Triantafyllou, I.S., 2020c. On the lifetime and signature of the constrained (k,d) out-of-n: F reliability systems. Int. J. Math. Eng. Manag. Sci., accepted for publication.
Zuo, M.J., Lin, D., Wu, Y., 2000. Reliability evaluation of combined k-out-of-n:F, consecutive-k-out-of-n:F and linear connected-(r,s)-out-of-(m,n):F system structures. IEEE Trans. Reliab. 49, 99–104.
Non-Print Items

Abstract
In the present chapter, we carry out a reliability study of the r-within-consecutive-k-out-of-n: F systems with independent and identically distributed components. A simulation algorithm is proposed for determining the coordinates of the signature vector of the r-within-consecutive-k-out-of-n: F structures, while several stochastic orderings among the lifetimes of several consecutive-type structures are also established. In addition, explicit signature-based expressions for the reliability function, mean residual lifetime, and conditional mean residual lifetime of the aforementioned models are provided, while several numerical results are also presented. For illustration purposes, well-known multivariate distributions for modeling the lifetimes of the components of the r-within-consecutive-k-out-of-n: F structure are considered in some detail.

Keywords
conditional mean residual lifetime; mean residual lifetime; reliability function; r-within-consecutive-k-out-of-n: F systems; Samaniego's signature; Monte-Carlo simulation; stochastic orderings
Chapter 15

Assessment of fuzzy reliability and signature of series–parallel multistate system

Akshay Kumar (a), Meenakshi Garia (b), Mangey Ram (c) and S.C. Dimri (c)
(a) Department of Mathematics, Graphic Era Hill University, Uttarakhand, India.
(b) Department of Mathematics, M.B.P.G. College, Haldwani, Nainital, Uttarakhand, India.
(c) Department of Mathematics, Computer Sciences and Engineering, Graphic Era (Deemed to be University), Uttarakhand, India.
Safety and Reliability Modeling and Its Applications. DOI: 10.1016/B978-0-12-823323-8.00006-4 Copyright © 2021 Elsevier Inc. All rights reserved.

15.1 Introduction

Availability is a very important measure for any engineering or social system, and many factors and techniques are used to analyze system availability and performance. In recent years, researchers have studied various kinds of binary and multistate systems and have evaluated their reliability under different factors using the universal generating function (UGF), the supplementary technique, Markov chain methods, and many other techniques. Levitin (2005) evaluated the availability of, and optimized, binary and multistate systems in real-world problems, and described various examples based on binary and multistate systems using the UGF technique. Levitin (2007) defined diverse systems and their solution using the UGF technique, discussed the application of the considered systems in different engineering fields such as solar systems and the defense area, and evaluated the reliability and optimization of multistate systems. Lisnianski et al. (2010) discussed the reliability and optimization analysis of multistate systems, which depend on the performance states of their components, and also explained lifetime cost analysis and decision-making problems in daily life. Yingkui and Jing (2012) reviewed reliability evaluation for multistate systems based on the performance states of the components, and discussed different methods and techniques for multistate systems in a systematic way. Liu et al. (2015) calculated the reliability of multistate systems from a linear algebra representation and also used Monte Carlo simulation for multistate dynamic fault tree analysis, which expresses the structure function of the multistate system. Sagayaraj et al. (2016) discussed the reliability of multistate systems with a systematic approach to the real-life health status of a population using performance-free failure operation. Zaitseva and Levashenko (2017) gave a new technique for evaluating multistate system reliability and structure functions using multiple-valued logic, which is a modification of Boolean algebra. Qiu and Ming (2019) determined the reliability of performance-based multistate systems under uncertainty using a belief UGF.

Samaniego (2007) examined the signature and the properties of system applications based on reliability analysis, comparison, and practical approaches; the author also calculated the signature and reliability of a communication network system and discussed its application. Samaniego et al. (2009) discussed the dynamic signature and the use of new systems on the basis of reliability analysis, computed the signature of systems having independent and identically distributed (IID) components, and compared various new and working engineering systems. Navarro and Rychlik (2010) discussed the cost analysis of coherent systems having identically and nonidentically distributed components and compared the expected component lifetimes with those obtained by other methods. Lisnianski and Frenkel (2012) discussed some major techniques and methods for the signature of multistate systems based on order statistics; they computed signature reliability and cost availability using Markov stochastic processes, random processes, etc., with applications to power generation and production systems. Da Costa Bueno (2013) determined the signature of multistate systems having exchangeable component lifetimes using Samaniego's properties. Marichal et al. (2017) evaluated the signature and joint structure of two or more components in multistate systems having continuous and IID lifetimes, and discussed the signature-based decomposition of a multistate system into several two-state systems. Kumar and Singh (2017, 2018, 2019) evaluated the signature reliability, Barlow–Proschan index, and expected cost of the lifetime components of multistate and complex sliding window coherent systems with the help of algorithms based on signature analysis and UGF techniques.

In the context of membership and nonmembership functions, Zadeh (1965) first discussed the theory of fuzzy numbers and its applications in various fields, and compared classical set theory with fuzzy set theory on the interval from zero to one on the basis of the membership and nonmembership functions. Atanassov (1983, 1986) discussed and explained some basic operations of intuitionistic fuzzy set theory, such as addition, subtraction, multiplication, and division, and also described the geometrical interpretation, fuzzy graph, and ordinary graph of intuitionistic fuzzy sets. Bustince and Burillo (1996) studied the structure of intuitionistic fuzzy relations and discussed the connection between existing relations and structure relations in intuitionistic fuzzy relations. Kumar and Yadav (2012) determined the fuzzy reliability of the considered system using various intuitionistic fuzzy numbers for the failure rates of all elements of the system, and introduced a new technique for evaluating fuzzy reliability from various fuzzy numbers based on membership and nonmembership failure rates. Kumar et al. (2013) analyzed the fuzzy reliability of series and parallel systems using triangular intuitionistic fuzzy numbers in the form of membership and nonmembership functions with time-dependent fuzzy numbers. Ejegwa et al. (2014) gave an overview of intuitionistic fuzzy set theory covering some operations, model operations, algebra, and basic definitions, together with its real-world applications. Kumar and Ram (2018, 2019) determined the fuzzy reliability of series, parallel, series–parallel, and complex systems by applying the Weibull distribution and dual hesitant and hesitant fuzzy set theory with membership and nonmembership functions. Song et al. (2018) performed sensor dynamic reliability analysis using various methods of intuitionistic fuzzy set theory and proposed a new evidence combination rule. Kumar et al. (2019, 2020) calculated the fuzzy reliability of linear and circular consecutive failure systems having nonidentical elements, and of complex systems, by introducing intuitionistic and hesitant fuzzy set theory with the Weibull distribution.

Building on the above discussion, this chapter computes the reliability function and related measures such as the signature, tail signature, mean time to failure, and cost, and also evaluates the fuzzy reliability from the Weibull distribution and triangular fuzzy numbers in the form of lower and upper membership and nonmembership functions, with the help of the system UGF of the considered series–parallel multistate system.
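15.2 Fuzzy Weibull distribution

In system reliability analysis, the Weibull distribution is a commonly used model for component lifetimes. Its probability density function is

$$f(t) = \begin{cases} \dfrac{\beta}{\theta} \left(\dfrac{t}{\theta}\right)^{\beta-1} \exp\left[-\left(\dfrac{t}{\theta}\right)^{\beta}\right], & t > 0,\ \beta > 0,\ \theta > 0 \\ 0, & \text{otherwise} \end{cases}$$

where θ is the scale parameter and β is the shape parameter. Hence, the structure function (reliability) of the system under the Weibull distribution is

$$\exp\left[-\left(\frac{x-\eta}{\theta}\right)^{\beta}\right],$$

which, with η = 0, reduces to

$$\exp\left[-\left(\frac{x}{\theta}\right)^{\beta}\right].$$

15.2.1 Fuzzy set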
In the context of fuzzy set theory, Zadeh (1965) first discussed the concept of a fuzzy set and the importance of its membership function and α-cuts. Consider a nonempty set Y = (y1, y2, …, yn), the universe of discourse. A fuzzy set Z in Y is described by a membership function $\mu_Z : Y \to [0, 1]$, where $\mu_Z(y)$ is the degree of membership of the element y in the fuzzy set Z for all y ∈ Y.

15.2.2 Intuitionistic fuzzy set

Atanassov (1983) first introduced the concept of the intuitionistic fuzzy set, in which each element has a degree of membership and a degree of nonmembership, with associated α-cut and β-cut sets. Let B be a nonempty universe; an intuitionistic fuzzy set A in B is defined as $A = \{\langle b, \mu_A(b), \nu_A(b) \rangle : b \in B\}$, where $\mu_A : B \to [0, 1]$ and $\nu_A : B \to [0, 1]$ satisfy $\mu_A(b) + \nu_A(b) \le 1$, ∀ b ∈ B.

15.2.3 Triangular fuzzy number

A triangular intuitionistic fuzzy number (TIFN) $\tilde{x}$ is an intuitionistic fuzzy set whose membership and nonmembership functions are

$$\mu_{\tilde{x}}(\varsigma) = \begin{cases} \dfrac{\varsigma - x_1}{x_2 - x_1}, & x_1 \le \varsigma \le x_2 \\ \dfrac{x_3 - \varsigma}{x_3 - x_2}, & x_2 \le \varsigma \le x_3 \\ 0, & \text{otherwise} \end{cases} \qquad \text{and} \qquad \nu_{\tilde{x}}(\varsigma) = \begin{cases} \dfrac{x_2 - \varsigma}{x_2 - x'_1}, & x'_1 \le \varsigma \le x_2 \\ \dfrac{\varsigma - x_2}{x'_3 - x_2}, & x_2 \le \varsigma \le x'_3 \\ 1, & \text{otherwise} \end{cases}$$

where $x'_1 \le x_1 \le x_2 \le x_3 \le x'_3$, and the TIFN is denoted by $\tilde{x}_{TIFN} = (x_1, x_2, x_3;\ x'_1, x_2, x'_3)$.
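The definitions of Section 15.2 can be sketched in a few lines of Python. This is an illustration only: the function and argument names are hypothetical, the breakpoints are assumed to be strictly increasing so that no division by zero occurs, and the nonmembership degree is taken to be 1 outside its support, which is the usual convention for triangular intuitionistic fuzzy numbers.

```python
import math


def weibull_reliability(x, theta, beta, eta=0.0):
    """exp(-((x - eta) / theta) ** beta) with scale theta and shape beta."""
    return math.exp(-(((x - eta) / theta) ** beta))


def tifn_membership(v, x1, x2, x3):
    """Membership degree of v for the triangle (x1, x2, x3)."""
    if x1 <= v <= x2:
        return (v - x1) / (x2 - x1)
    if x2 <= v <= x3:
        return (x3 - v) / (x3 - x2)
    return 0.0


def tifn_nonmembership(v, x1p, x2, x3p):
    """Nonmembership degree of v for the wider triangle (x1', x2, x3')."""
    if x1p <= v <= x2:
        return (x2 - v) / (x2 - x1p)
    if x2 <= v <= x3p:
        return (v - x2) / (x3p - x2)
    return 1.0  # assumed convention outside the support


# TIFN (1, 2, 3; 0.5, 2, 3.5) evaluated at 1.5
mu = tifn_membership(1.5, 1.0, 2.0, 3.0)
nu = tifn_nonmembership(1.5, 0.5, 2.0, 3.5)
print(mu, nu, mu + nu <= 1.0)
print(weibull_reliability(1.0, theta=1.0, beta=1.0))
```

Because $x'_1 \le x_1$ and $x_3 \le x'_3$, the two degrees never add up to more than one, which is the defining constraint of an intuitionistic fuzzy set.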
15.3 Evaluation of signature, tail signature, minimal signature, and cost from the structure function of the system

The signature and related measures of coherent systems with IID elements can be computed by order-statistics and reliability-function methods (Boland, 2001; Navarro et al., 2007a, 2007b). The signature is defined as

$$s_l = \frac{1}{\binom{n}{n-l+1}} \sum_{\substack{H \subseteq [n] \\ |H| = n-l+1}} \phi(H) \; - \; \frac{1}{\binom{n}{n-l}} \sum_{\substack{H \subseteq [n] \\ |H| = n-l}} \phi(H) \tag{15.1}$$

and the structure function of a system with IID components has the polynomial form

$$\bar{H}(P) = \sum_{e=1}^{n} C_e \binom{n}{e} P^{e} q^{n-e}, \qquad C_e = \sum_{i=n-e+1}^{n} s_i, \quad e = 1, 2, \ldots, n \tag{15.2}$$

The tail signature of the system is the (n + 1)-tuple $S = (S_0, \ldots, S_n)$ with

$$S_l = \sum_{i=l+1}^{n} s_i = \frac{1}{\binom{n}{n-l}} \sum_{|H| = n-l} \phi(H).$$

Transforming the reliability polynomial into $P(X) = X^{n} \bar{H}(1/X)$, the tail signature follows from the Taylor expansion about X = 1 as

$$S_l = \frac{(n-l)!}{n!} D^{l} P(1), \quad l = 0, \ldots, n \tag{15.3}$$

and the signature is obtained by the method of Marichal and Mathonet (2013) as

$$s_l = S_{l-1} - S_l, \quad l = 1, \ldots, n. \tag{15.4}$$

The expected lifetime and the expected cost of the system are determined from the reliability function (Navarro and Rubio, 2009; Eryilmaz, 2012) as $E(T) = \mu \sum_{i=1}^{n} C_i / i$ and $E(X) = \sum_{i=1}^{n} i\, s_i$, based on the minimal signature and the number of failed components of the system, where the component mean μ is taken to be one.
15.4 Algorithm for computing the system availability

The system availability is computed as follows (see Levitin, 2005):
(i) Find the u-functions $U_i(z)$ of all series components.
(ii) Compute the availability of every component i for the demanded performance rate w as

$$A_i(w) = \partial_w (U_i(z)) \tag{15.5}$$

(iii) Evaluate the system availability as

$$A = \prod_{i=1}^{n} A_i(w) = \prod_{i=1}^{n} \partial_w (U_i(z)) \tag{15.6}$$
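The three steps above can be sketched in Python as follows. The representation of a u-function as a dictionary mapping performance rate to probability, the helper names, and the numeric value p = 0.9 are all assumptions of this illustration; the element data are those of the example in Section 15.5, with parallel elements composed through the max operator.

```python
from itertools import product


def element(p, rate):
    """u-function of a two-state element: `rate` with probability p, else 0."""
    return {rate: p, 0: 1.0 - p}


def compose(u1, u2, op):
    """Combine two u-functions {performance: probability} with operator op."""
    out = {}
    for (g1, pr1), (g2, pr2) in product(u1.items(), u2.items()):
        g = op(g1, g2)
        out[g] = out.get(g, 0.0) + pr1 * pr2
    return out


def availability(u, w):
    """Step (ii): probability that the performance meets the demand w."""
    return sum(pr for g, pr in u.items() if g >= w)


p = 0.9
# step (i): u-functions of the three series components
U1 = compose(element(p, 2), element(p, 2), max)
U2 = compose(element(p, 1), element(p, 1), max)
U3 = compose(compose(element(p, 1), element(p, 1), max), element(p, 1), max)

# step (iii): system availability as the product over the series components
A_w1 = availability(U1, 1) * availability(U2, 1) * availability(U3, 1)
A_w2 = availability(U1, 2) * availability(U2, 2) * availability(U3, 2)
print(A_w1, A_w2)
```

With identical elements, `A_w1` should agree with the polynomial of Eq. (15.11) evaluated at the chosen p, and `A_w2` vanishes because components 2 and 3 cannot reach performance 2.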
15.5 Example Considered a series–parallel multistate system which is reducible in a binary system having three component in series manner. Components 1 and 2 consist of two elements in parallel and component 3 also consists three elements in parallel form, each states have two operations either working or failed. Here performance rate of every components is given as 2, 2, 1, 1, 1, 1 and 1.
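The u-functions of the elements of the proposed system are

$$u_1(z) = p_1 z^{2} + q_1 z^{0}, \qquad u_2(z) = p_2 z^{2} + q_2 z^{0},$$
$$u_3(z) = p_3 z^{1} + q_3 z^{0}, \qquad u_4(z) = p_4 z^{1} + q_4 z^{0},$$
$$u_5(z) = p_5 z^{1} + q_5 z^{0}, \qquad u_6(z) = p_6 z^{1} + q_6 z^{0}, \qquad u_7(z) = p_7 z^{1} + q_7 z^{0}.$$

Composing the elements of each component with the max operator gives the u-functions of the components of the considered system:

$$U_1(z) = u_1(z) \underset{\max}{\otimes} u_2(z) = p_1 p_2 z^{2} + p_1 q_2 z^{2} + q_1 p_2 z^{2} + q_1 q_2 z^{0}$$

$$U_2(z) = u_3(z) \underset{\max}{\otimes} u_4(z) = p_3 p_4 z^{1} + p_3 q_4 z^{1} + q_3 p_4 z^{1} + q_3 q_4 z^{0}$$

$$U_3(z) = u_5(z) \underset{\max}{\otimes} u_6(z) \underset{\max}{\otimes} u_7(z) = p_5 p_6 p_7 z^{1} + p_5 p_6 q_7 z^{1} + p_5 q_6 p_7 z^{1} + p_5 q_6 q_7 z^{1} + q_5 p_6 p_7 z^{1} + q_5 p_6 q_7 z^{1} + q_5 q_6 p_7 z^{1} + q_5 q_6 q_7 z^{0}$$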
For the demand w = 2, Eq. (15.5) gives, with every element having the same reliability p,

$$U_1(z) = 2p - p^{2}, \qquad U_2(z) = 0, \qquad U_3(z) = 0 \tag{15.7}$$

For the demand w = 1, Eq. (15.5) gives

$$U_1(z) = 2p - p^{2} \tag{15.8}$$

$$U_2(z) = 2p - p^{2} \tag{15.9}$$

$$U_3(z) = 3p - 3p^{2} + p^{3} \tag{15.10}$$

The availability of the system for the demand w = 1 then follows from Eq. (15.6) as

$$A(p) = U_1(z)\, U_2(z)\, U_3(z) = 12p^{3} - 24p^{4} + 19p^{5} - 7p^{6} + p^{7} \tag{15.11}$$
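The polynomial in Eq. (15.11) can be checked by expanding the product of Eqs. (15.8)–(15.10) by hand. The intermediate steps below are a verification sketch and are not part of the original derivation:

$$\big(2p - p^{2}\big)^{2} = 4p^{2} - 4p^{3} + p^{4}$$

$$\big(4p^{2} - 4p^{3} + p^{4}\big)\big(3p - 3p^{2} + p^{3}\big) = 12p^{3} - 12p^{4} + 4p^{5} - 12p^{4} + 12p^{5} - 4p^{6} + 3p^{5} - 3p^{6} + p^{7}$$

$$= 12p^{3} - 24p^{4} + 19p^{5} - 7p^{6} + p^{7}$$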
From Eq. (15.11), the minimal signature of the system is (0, 0, 12, −24, 19, −7, 1). The tail signature and signature corresponding to $U_1(z) = 2p - p^{2}$, obtained from Eqs. (15.3) and (15.4), are $S = (1, 1, 0)$ and $s = (0, 1)$. Similarly, the tail signature and signature for the UGF $U_2(z) = 2p - p^{2}$ are $S = (1, 1, 0)$ and $s = (0, 1)$.
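Eqs. (15.3) and (15.4) can be applied mechanically to any reliability polynomial. The following is a minimal sketch with exact rational arithmetic; the function names `tail_signature` and `signature` and the coefficient-list convention (entry k is the coefficient of p^k) are assumptions made for this illustration.

```python
from fractions import Fraction
from math import factorial


def tail_signature(coeffs):
    """Tail signature S_0..S_n via Eq. (15.3).

    coeffs[k] is the coefficient of p**k in the reliability polynomial,
    for k = 0..n.
    """
    n = len(coeffs) - 1
    # P(X) = X**n * H(1/X), so the coefficient of X**j is coeffs[n - j].
    poly = [Fraction(c) for c in reversed(coeffs)]
    tail = []
    for l in range(n + 1):
        # l-th derivative of P at X = 1: sum of j!/(j-l)! * poly[j].
        deriv_at_one = sum(
            Fraction(factorial(j), factorial(j - l)) * poly[j]
            for j in range(l, n + 1)
        )
        tail.append(Fraction(factorial(n - l), factorial(n)) * deriv_at_one)
    return tail


def signature(tail):
    """Signature via Eq. (15.4): s_l = S_{l-1} - S_l, l = 1..n."""
    return [tail[l - 1] - tail[l] for l in range(1, len(tail))]


# Component polynomial 2p - p^2 and system polynomial of Eq. (15.11).
tail_u1 = tail_signature([0, 2, -1])
tail_sys = tail_signature([0, 0, 0, 12, -24, 19, -7, 1])
print([str(x) for x in tail_u1], [str(x) for x in signature(tail_u1)])
print([str(x) for x in tail_sys], [str(x) for x in signature(tail_sys)])
```

Using `Fraction` avoids rounding, so the output can be compared directly with the fractions quoted in the text.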
Likewise, Eqs. (15.3) and (15.4) applied to the UGF $U_3(z) = 3p - 3p^{2} + p^{3}$ give $S = (1, 1, 1, 0)$ and $s = (0, 0, 1)$. Applying Eqs. (15.3) and (15.4) to Eq. (15.11) gives the tail signature and signature of the system,

$$S = \left(1, 1, \tfrac{19}{21}, \tfrac{24}{35}, \tfrac{12}{35}, 0, 0, 0\right), \qquad s = \left(0, \tfrac{2}{21}, \tfrac{161}{735}, \tfrac{12}{35}, \tfrac{12}{35}, 0, 0\right).$$

Next, the minimal signature, E(X), and E(T) are obtained from the structure function in Eq. (15.11). The minimal signature is (0, 0, 12, −24, 19, −7, 1). With $p = e^{-t}$, that is, elements with unit mean lifetime, the expected lifetime is

$$E(T) = \int_{0}^{\infty} \bar{F}(t)\, dt = \int_{0}^{\infty} \left(12e^{-3t} - 24e^{-4t} + 19e^{-5t} - 7e^{-6t} + e^{-7t}\right) dt = 0.776,$$

and the signature gives

$$E(X) = \sum_{i=1}^{7} i \, s_i = 3.933.$$

The cost of the system, obtained from E(X) and E(T) as E(X)/E(T), is 5.067.
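The three figures above can be re-derived with exact fractions. This is a sketch under two stated assumptions: the elements have unit mean lifetime (μ = 1), and the cost is taken as the ratio E(X)/E(T), which reproduces the reported value of 5.067. The variable and function names are illustrative.

```python
from fractions import Fraction

# Minimal signature: coefficients of p^1..p^7 in Eq. (15.11).
minimal_signature = [0, 0, 12, -24, 19, -7, 1]
# System signature s_1..s_7 (161/735 reduces to 23/105).
system_signature = [
    Fraction(0), Fraction(2, 21), Fraction(161, 735),
    Fraction(12, 35), Fraction(12, 35), Fraction(0), Fraction(0),
]


def expected_lifetime(min_sig, mu=1):
    """E(T) = mu * sum_i C_i / i, with C_i the minimal-signature coefficients."""
    return mu * sum(Fraction(c, i) for i, c in enumerate(min_sig, start=1))


def expected_failed(sig):
    """E(X) = sum_i i * s_i."""
    return sum(i * s for i, s in enumerate(sig, start=1))


e_t = expected_lifetime(minimal_signature)
e_x = expected_failed(system_signature)
cost = e_x / e_t  # assumed definition of the cost
print(float(e_t), float(e_x), float(cost))
```

The sum of C_i / i equals the integral of the reliability function because each term $e^{-it}$ integrates to $1/i$ over $[0, \infty)$.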
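15.5.1 Example

Consider the elements of the proposed system to have Weibull-distributed lifetimes whose scale parameter is the triangular intuitionistic fuzzy parameter $\psi = (1, 1.5, 2;\ 0.5, 1.5, 2.5)$. The α- and β-cuts in terms of the membership and nonmembership functions are $\psi = (1 + 0.5\alpha,\ 2 - 0.5\alpha;\ 1.5 - \beta,\ 1.5 + \beta)$. For all α, taking β = 0.5 and t = 5, Eqs. (15.5)–(15.8) with demands w = 1 and w = 2 give, for the components with two parallel elements,

$$R(t)(\alpha) = \left[\, 2\exp\!\left(-\left(\tfrac{t}{1+0.5\alpha}\right)^{\beta}\right) - \exp\!\left(-2\left(\tfrac{t}{1+0.5\alpha}\right)^{\beta}\right);\ \ 2\exp\!\left(-\left(\tfrac{t}{2-0.5\alpha}\right)^{\beta}\right) - \exp\!\left(-2\left(\tfrac{t}{2-0.5\alpha}\right)^{\beta}\right) \right]$$

$$R(t)(\beta) = \left[\, 2\exp\!\left(-\left(\tfrac{t}{1.5-\beta}\right)^{\beta}\right) - \exp\!\left(-2\left(\tfrac{t}{1.5-\beta}\right)^{\beta}\right);\ \ 2\exp\!\left(-\left(\tfrac{t}{1.5+\beta}\right)^{\beta}\right) - \exp\!\left(-2\left(\tfrac{t}{1.5+\beta}\right)^{\beta}\right) \right]$$

and, for the component with three parallel elements (Eq. (15.10)),

$$R(t)(\alpha) = \left[\, 3\exp\!\left(-\left(\tfrac{t}{1+0.5\alpha}\right)^{\beta}\right) - 3\exp\!\left(-2\left(\tfrac{t}{1+0.5\alpha}\right)^{\beta}\right) + \exp\!\left(-3\left(\tfrac{t}{1+0.5\alpha}\right)^{\beta}\right);\ \ 3\exp\!\left(-\left(\tfrac{t}{2-0.5\alpha}\right)^{\beta}\right) - 3\exp\!\left(-2\left(\tfrac{t}{2-0.5\alpha}\right)^{\beta}\right) + \exp\!\left(-3\left(\tfrac{t}{2-0.5\alpha}\right)^{\beta}\right) \right]$$

$$R(t)(\beta) = \left[\, 3\exp\!\left(-\left(\tfrac{t}{1.5-\beta}\right)^{\beta}\right) - 3\exp\!\left(-2\left(\tfrac{t}{1.5-\beta}\right)^{\beta}\right) + \exp\!\left(-3\left(\tfrac{t}{1.5-\beta}\right)^{\beta}\right);\ \ 3\exp\!\left(-\left(\tfrac{t}{1.5+\beta}\right)^{\beta}\right) - 3\exp\!\left(-2\left(\tfrac{t}{1.5+\beta}\right)^{\beta}\right) + \exp\!\left(-3\left(\tfrac{t}{1.5+\beta}\right)^{\beta}\right) \right]$$

In these expressions the exponent β is the Weibull shape parameter, fixed at 0.5, whereas the β in the scale bounds 1.5 ∓ β is the cut level of the nonmembership function.

Tables 15.1 and 15.2 and Figs. 15.1 and 15.2 present the fuzzy reliability of the considered system obtained from the u-functions of the components, together with the intuitionistic fuzzy membership and nonmembership grades computed from the above equations.

TABLE 15.1 α- and β-cuts of the intuitionistic fuzzy membership function.

(α, β) cut    R[α]                R[β]
0.0           (0.2023, 0.3692)    (0.0847, 0.4844)
0.1           (0.2129, 0.3626)    (0.1115, 0.4707)
0.2           (0.2231, 0.3559)    (0.1381, 0.4565)
0.3           (0.2331, 0.3491)    (0.1642, 0.4417)
0.4           (0.2429, 0.3421)    (0.1894, 0.4265)
0.5           (0.2524, 0.3349)    (0.2137, 0.4106)
0.6           (0.2616, 0.3275)    (0.2371, 0.3942)
0.7           (0.2706, 0.3200)    (0.2596, 0.3772)
0.8           (0.2794, 0.3123)    (0.2812, 0.3595)
0.9           (0.2879, 0.3044)    (0.3020, 0.3411)
1.0           (0.2962, 0.2962)    (0.3219, 0.3219)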
TABLE 15.2 Fuzzy reliability of system in upper and lower (α, β) cut form.

(α, β) cut    R[α]                R[β]
0.0           (0.2876, 0.4989)    (0.1217, 0.5664)
0.1           (0.3017, 0.4911)    (0.1581, 0.5543)
0.2           (0.3153, 0.4831)    (0.1932, 0.5415)
0.3           (0.3284, 0.4748)    (0.2266, 0.5281)
0.4           (0.3412, 0.4663)    (0.2580, 0.5139)
0.5           (0.3535, 0.4576)    (0.2876, 0.4989)
0.6           (0.3655, 0.4486)    (0.3153, 0.4831)
0.7           (0.3771, 0.4393)    (0.3412, 0.4663)
0.8           (0.3883, 0.4297)    (0.3655, 0.4486)
0.9           (0.3991, 0.4198)    (0.3883, 0.4297)
1.0           (0.4096, 0.4096)    (0.4096, 0.4096)
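The R[α] columns of Tables 15.1 and 15.2 can be reproduced with a few lines of code. The sketch below assumes a Weibull reliability $\exp(-(t/\psi)^{0.5})$ at t = 5 with the scale bounds $1 + 0.5\alpha$ and $2 - 0.5\alpha$, applied to the polynomials $2p - p^{2}$ and $3p - 3p^{2} + p^{3}$. Only the α-cut columns are covered here, and all function names are hypothetical.

```python
from math import exp

SHAPE = 0.5  # Weibull shape parameter (beta = 0.5 in the text)
T = 5.0      # mission time t


def weibull_reliability(t, scale, shape=SHAPE):
    """Element reliability exp(-(t / scale) ** shape)."""
    return exp(-((t / scale) ** shape))


def two_parallel(p):
    """Reliability polynomial of Eqs. (15.8)-(15.9)."""
    return 2 * p - p ** 2


def three_parallel(p):
    """Reliability polynomial of Eq. (15.10)."""
    return 3 * p - 3 * p ** 2 + p ** 3


def alpha_cut(structure, alpha, t=T):
    """Lower and upper reliability bounds at membership cut level alpha."""
    lower_scale = 1 + 0.5 * alpha
    upper_scale = 2 - 0.5 * alpha
    return (
        structure(weibull_reliability(t, lower_scale)),
        structure(weibull_reliability(t, upper_scale)),
    )


def alpha_table(structure):
    """Rows (alpha, lower, upper) for alpha = 0.0, 0.1, ..., 1.0."""
    return [(k / 10, *alpha_cut(structure, k / 10)) for k in range(11)]


for row in alpha_table(three_parallel):
    print("%.1f  (%.4f, %.4f)" % row)
```

Because both polynomials increase with p and p increases with the scale parameter, the smaller scale bound always yields the lower end of the reliability interval.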
FIGURE 15.1 Fuzzy reliability in (α, β) cut.
FIGURE 15.2 Fuzzy reliability membership functions in upper and lower cut.
15.6 Conclusion

In this chapter, various measures of a multistate series–parallel system were evaluated using algorithms and fundamental results based on the signature reliability of coherent systems with IID components. The possible states of the coherent system were analyzed via the signature and the minimal signature, which is based on path sets. The first example gives the signature s = (0, 2/21, 161/735, 12/35, 12/35, 0, 0) and the tail signature (1, 1, 19/21, 24/35, 12/35, 0, 0, 0) for the demands one and two, the minimal signature (0, 0, 12, −24, 19, −7, 1), and a lifetime cost of 5.067 obtained from the expected X and the expected lifetime over the possible working states of the multistate system. The second example expresses the performance states as intervals using an intuitionistic fuzzy number: Tables 15.1 and 15.2 show the α- and β-cuts of the membership and nonmembership functions for a Weibull distribution with a triangular fuzzy number, and Figs. 15.1 and 15.2 show the increasing and decreasing order of the fuzzy reliability in the form of upper and lower cuts.
References
Atanassov, K.T., 1983. Intuitionistic fuzzy sets, VI ITKR’s session, Sofia deposed in Central SciTechnical Library of Bulgaria. Acad. Sci. Bulg. 1697, 84.
Atanassov, K.T., 1986. Intuitionistic fuzzy sets. Fuzzy Sets Syst. 20 (1), 87–96.
Boland, P.J., 2001. Signatures of indirect majority systems. J. Appl. Probab. 38 (2), 597–603.
Bustince, H., Burillo, P., 1996. Structures on intuitionistic fuzzy relations. Fuzzy Sets Syst. 78 (3), 293–303.
da Costa Bueno, V., 2013. A multistate monotone system signature. Stat. Probab. Lett. 83 (11), 2583–2591.
Ejegwa, P.A., Akowe, S.O., Otene, P.M., Ikyule, J.M., 2014. An overview on intuitionistic fuzzy sets. Int. J. Sci. Technol. Res. 3 (3), 142–145.
Eryilmaz, S., 2012. The number of failed elements in a coherent system with exchangeable elements. IEEE Trans. Reliab. 61 (1), 203–207.
Kumar, A., Ram, M., 2018. System reliability analysis based on Weibull distribution and hesitant fuzzy set. Int. J. Math. Eng. Manage. Sci. 3 (4), 513–521.
Kumar, A., Ram, M., 2019. Reliability analysis for environment systems using dual hesitant fuzzy set. Advanced Fuzzy Logic Approaches in Engineering Science. IGI Global, Hershey, Pennsylvania, USA, pp. 162–173.
Kumar, A., Singh, S.B., 2017. Computations of the signature reliability of the coherent system. Int. J. Qual. Reliab. Manage. 34 (6), 785–797.
Kumar, A., Singh, S.B., 2018. Signature reliability of linear multi-state sliding window system. Int. J. Qual. Reliab. Manage. 35 (10), 2403–2413.
Kumar, A., Singh, S.B., 2019. Signature A-within-B-from-D/G sliding window system. Int. J. Math. Eng. Manage. Sci. 4 (1), 95–107.
Kumar, A., Singh, S.B., Ram, M., 2019. Reliability appraisal for consecutive-k-out-of-n: F system of non-identical components with intuitionistic fuzzy set. Int. J. Oper. Res. 36 (3), 362–374.
Kumar, A., Singh, S.B., Ram, M., 2020. Systems reliability assessment using hesitant fuzzy set. Int. J. Oper. Res. 38 (1), 1–18.
Kumar, M., Yadav, S.P., 2012. A novel approach for analyzing fuzzy system reliability using different types of intuitionistic fuzzy failure rates of components. ISA Trans. 51 (2), 288–297.
Kumar, M., Yadav, S.P., Kumar, S., 2013. Fuzzy system reliability evaluation using time-dependent intuitionistic fuzzy set. Int. J. Syst. Sci. 44 (1), 50–66.
Levitin, G., 2005. The Universal Generating Function in Reliability Analysis and Optimization. Springer, London, p. 442. doi:10.1007/1-84628-245-4.
Levitin, G., 2007. Block diagram method for analyzing multi-state systems with uncovered failures. Reliab. Eng. Syst. Safety 92 (6), 727–734.
Lisnianski, A., Frenkel, I., 2012. Recent Advances in System Reliability. Springer, Berlin. ISBN 978-1-4471-2207-4. doi:10.1007/978-1-4471-2207-4.
Lisnianski, A., Frenkel, I., Ding, Y., 2010. Multi-state System Reliability Analysis and Optimization for Engineers and Industrial Managers. Springer Science & Business Media, Springer-Verlag London. ISBN 978-1-84996-319-0.
Liu, C., Chen, N., Yang, J., 2015. New method for multi-state system reliability analysis based on linear algebraic representation. Proc. Inst. Mech. Eng. Part O: J. Risk Reliab. 229 (5), 469–482.
Marichal, J.L., Mathonet, P., 2013. Computing system signatures through reliability functions. Stat. Probab. Lett. 83 (3), 710–717.
Marichal, J.L., Mathonet, P., Navarro, J., Paroissin, C., 2017. Joint signature of two or more systems with applications to multistate systems made up of two-state components. Eur. J. Oper. Res. 263 (2), 559–570.
Navarro, J., Rubio, R., 2009. Computations of signatures of coherent systems with five components. Commun. Stat. Simul. Comput. 39 (1), 68–84.
Navarro, J., Ruiz, J.M., Sandoval, C.J., 2007a. Properties of coherent systems with dependent components. Commun. Stat. Theory Meth. 36 (1), 175–191.
Navarro, J., Rychlik, T., 2010. Comparisons and bounds for expected lifetimes of reliability systems. Eur. J. Oper. Res. 207 (1), 309–317.
Navarro, J., Rychlik, T., Shaked, M., 2007b. Are the order statistics ordered? A survey of recent results. Commun. Stat. Theory Meth. 36 (7), 1273–1290.
Qiu, S., Ming, H.X., 2019. Reliability analysis of multi-state series systems with performance sharing mechanism under epistemic uncertainty. Qual. Reliab. Eng. Int. 35 (6), 1998–2015.
Sagayaraj, M.R., Anita, A.M., Babu, A.C., Thirumoorthi, N., 2016. Reliability analysis of multi-state series system. Int. J. Math. Appl. 4 (2), 37–43.
Samaniego, F.J., 2007. System Signatures and Their Applications in Engineering Reliability, 110. Springer Science & Business Media, Springer-Verlag USA. ISBN 978-0-387-71796-8.
Samaniego, F.J., Balakrishnan, N., Navarro, J., 2009. Dynamic signatures and their use in comparing the reliability of new and used systems. Naval Res. Logistics (NRL) 56 (6), 577–591.
Song, Y., Wang, X., Zhu, J., Lei, L., 2018. Sensor dynamic reliability evaluation based on evidence theory and intuitionistic fuzzy sets. Appl. Intell. 48 (11), 3950–3962.
Yingkui, G., Jing, L., 2012. Multi-state system reliability: a new and systematic review. Proc. Eng. 29, 531–536.
Zadeh, L.A., 1965. Fuzzy sets. Inform. Control 8 (3), 338–353.
Zaitseva, E., Levashenko, V., 2017. Reliability analysis of multi-state system with application of multiple-valued logic. Int. J. Qual. Reliab. Manage. 34 (6), 862–878.
Non-Print Items

Abstract
A series–parallel multistate system is considered, which is reducible to a binary system and consists of three components connected in series, each having independent and identically distributed elements. This chapter evaluates the reliability function of each component with the help of the universal generating function for given performance rates and analyzes various measures such as the signature, tail signature, and cost. For the fuzzy assessment, an intuitionistic fuzzy number in membership and nonmembership form, with a Weibull distribution and a triangular fuzzy number for the multistate components, is introduced. Numerical problems are also discussed at the end of this chapter.

Keywords
Intuitionistic fuzzy number; Multistate system; Series–parallel system; Tail signature; Universal generating function; Weibull distribution