English Pages 425 [420] Year 2023
Youchao Sun · Longbiao Li · Dmytro Tiniakov
Reliability Engineering
Youchao Sun College of Civil Aviation Nanjing University of Aeronautics and Astronautics Nanjing, Jiangsu, China
Longbiao Li College of Civil Aviation Nanjing University of Aeronautics and Astronautics Nanjing, Jiangsu, China
Dmytro Tiniakov College of Civil Aviation Nanjing University of Aeronautics and Astronautics Nanjing, Jiangsu, China
ISBN 978-981-99-5977-8 ISBN 978-981-99-5978-5 (eBook) https://doi.org/10.1007/978-981-99-5978-5 This is a textbook for the Fourteenth Five-Year Plan of Ministry of Industry and Information Technology, China. Jointly published with Science Press The print edition is not for sale in China mainland. Customers from China mainland please order the print book from: Science Press. © Science Press 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore Paper in this product is recyclable.
Preface
This book on reliability deals with fundamental issues of modern reliability theory and practice. The continuous increase in the speed, load capacity, productivity, accuracy, and energy intensity of machinery in the twentieth century, the development of the air, space, the ocean floor, and the coastal shelf, and the concentration of large amounts of manufacturing equipment and vehicles in small territories have complicated the operation of machinery and increased the danger of man-made environmental disasters and loss of life. The volume of losses due to man-made disasters has become equal to the gross domestic product of some countries. This has demanded more and more attention to improving the reliability of machinery and has contributed to the rapid development of reliability theory in recent decades. In the future, machinery reliability will become even more important, both because of the need for high productivity and technogenic safety (increased reliability, availability, and survivability), and because of the need to conserve natural resources and preserve the environment on the earth, in the air and space, and in the oceans and seas. The main directions of development of the theory and practice of reliability assurance are creation of mathematical and physical models of reliability and approaches to their use in design, manufacture, operation, and storage of a database of emergency situations; the occurrence of defects, failures, malfunctions and emergencies, their diagnostics at all stages of the life of machinery; rationing of reliability indicators; forecasting the reliability and life span of machinery; development of methods and means of technical diagnostics; optimization and implementation of certification of the main components of the machinery. These directions of reliability assurance are reflected in the following chapters of this book. Chapter 1 describes the basic concepts of reliability theory and reliability indices.
The peculiarities of reliability determination for different operating conditions and the main mathematical dependencies are also described in this chapter. Chapter 2 describes the mathematical foundations of reliability theory, such as random events and their properties, characteristics of random events, and different types of distributions for random events.
Chapter 3 discusses the peculiarities of reliability analysis for non-repairable and repairable systems, the construction of the reliability function by the structural method, and redundancy and reliability estimation for redundant systems. Parametric and nonparametric methods of analysis of such systems are described. Chapter 4 is devoted to the methods of reliability analysis of complex systems. A brief description of the Failure Mode, Effect, and Criticality Analysis (FMECA) method is provided, together with a summary of its applications. Information on fault tree analysis (FTA) is provided, including basic concepts and qualitative and quantitative analysis approaches based on FTA. Chapter 5 deals with model-based reliability assessment. An aircraft system is used as an example to illustrate approaches to system reliability assessment. Multiphase mission reliability analysis is also explained. Chapter 6 considers some aspects of reliability prediction and allocation. System configurations and redundancy features are considered using an aircraft structure as an example, and reliability allocation is briefly explained with some real examples. Chapter 7 discusses reliability analysis for mechanical systems. The main causes that affect the reliability of mechanical systems are described and briefly analyzed. A special feature of this chapter is an example of how to compute the reliability of the aircraft fuel system based on the theory described above. Chapter 8 provides information on performing reliability testing. It describes procedures for performing such tests, some types of tests, and procedures for verifying the reliability test results. Chapter 9 discusses probabilistic risk analysis of aircraft systems. It explains the background and examples of research on probabilistic risk assessment of aircraft engine life-limited parts.
It also explains the theory and examples of risk warning technology to improve the reliability of commercial aircraft bleed air systems. This book is the result of the authors' years of teaching and scientific research in the field of reliability theory, methods, engineering, and technological applications. We thank the National Natural Science Foundation of China and the Airworthiness Engineering and Technology Research Center for Civil Aircraft Airborne Systems for their financial support.
Nanjing, China
May 2023
Youchao Sun Longbiao Li Dmytro Tiniakov
Contents
1 Introduction
1.1 Basic Concepts of Reliability Theory
1.2 Main Indicators of Reliability
1.2.1 Failure-Free Indicators
1.2.2 Durability Indicators
1.2.3 Indicators of Maintainability and Preservation
1.2.4 Complex Indicators
1.3 Base Dependencies of the Reliability Theory
1.4 Reliability During Regular Operation
1.5 Reliability During Gradual Failures
1.5.1 The Normal Distribution
1.5.2 The Truncated Normal Distribution
1.5.3 In a Log-Normal Distribution
1.5.4 The Weibull Distribution
1.6 Combined Action of Sudden and Gradual Failures
1.7 Features of the Reliability of Recoverable Products
1.8 Questions
References

2 Mathematical Basis of Reliability
2.1 Introduction
2.2 Random Events and Probabilities
2.3 Numerical Features and Distributions of Random Variables
2.3.1 Mathematical Expectation
2.3.2 Dispersion
2.3.3 Average Square-Law Deviation
2.3.4 Moments of Distribution
2.3.5 Asymmetry
2.3.6 Excess
2.3.7 Mode of Discrete Random Variable
2.3.8 Median of Random Variable
2.4 Distribution Functions of Random Variables
2.4.1 Distribution Functions for Discrete Random Variables
2.4.2 Distribution Functions for Continuous Random Variables
2.5 Questions
References

3 System Reliability Models
3.1 Introduction
3.2 Non-repair System Reliability Analysis
3.2.1 Classification of System Reliability Function
3.2.2 Analysis of the Reliability Function
3.2.3 Redundancy System Reliability
3.2.4 Methods of the Reliability Function
3.3 Repair System Reliability Analysis
3.3.1 General Information
3.3.2 System Reliability Function
3.3.3 Non-parametric Analysis Methods
3.3.4 Parametric Analysis Methods
3.3.5 Markov Chain and Its Application
3.4 Questions
References

4 Complex System Reliability Analysis Methods
4.1 Introduction
4.2 Failure Mode and Effect Critical Analysis
4.2.1 Basic Concepts
4.2.2 Types of FMECA
4.2.3 Steps for FMECA
4.2.4 Criticality Assessment
4.2.5 FMECA Application
4.3 Fault Tree Analysis
4.3.1 Basic Concepts
4.3.2 Basic Approach
4.3.3 Qualitative Analysis
4.3.4 Quantitative Analysis
4.3.5 Reliability Graph
4.3.6 FTA Application
4.4 Comparison FMECA and FTA
4.5 Questions
References

5 Model-Based Reliability Analysis Methods
5.1 Introduction
5.2 Multi-Source Reliability Data Processing System for Mechanical Parts
5.2.1 Comprehensive Reliability Assessment Model Based on Multi-Source Data
5.2.2 Case Study
5.3 Multi-Phase Mission Reliability Analysis
5.3.1 PMS System Characteristics
5.3.2 PMS Reliability Modeling Process
5.3.3 Static Reliability Analysis Method Based on BDD
5.3.4 Dynamic Reliability Analysis Method Based on Semi-Markov
5.3.5 Modular Analysis Method
5.3.6 Case Study
5.4 Conclusions
References

6 System Reliability Prediction and Allocation
6.1 Introduction
6.2 Reliability Prediction
6.2.1 Reliability Block Diagram
6.2.2 Series Systems
6.2.3 Parallel Systems
6.2.4 Combined Structure
6.2.5 k-Out-of-n Systems
6.2.6 Redundant System
6.2.7 Reliability Evaluation of Complex Systems
6.2.8 Confidence Ranges for System Reliability
6.2.9 Component Importance
6.3 Reliability Allocation
6.3.1 Criteria for Reliability Allocation
6.3.2 Equal Allocation Approach
6.3.3 ARINC Approach
6.3.4 AGREE Allocation Approach
6.3.5 Customer-Driven Allocation Approach
6.3.6 Optimal Allocation Approaches
6.4 Questions
References

7 Mechanical Reliability Analysis
7.1 Introduction
7.2 Basic Concepts
7.2.1 Types of Operational Loads
7.2.2 Specifics of Static Load
7.2.3 Specifics of Cycling Loading
7.2.4 Specifics of Temperature Effect
7.2.5 Creep and Relaxation of Stresses
7.3 Factors Affect Mechanical Reliability
7.3.1 Corrosion
7.3.2 Chemical Medium
7.3.3 Fields Environment
7.3.4 Space Medium
7.3.5 Structural Materials’ Selection
7.4 Short Observe of Structural Elements
7.5 Probabilistic Calculations of Machinery Parts
7.5.1 Some Aspects of Probability Theory for Machinery Parts
7.5.2 Specifics of Probabilistic Strength Computing of Shafts and Axles
7.5.3 Computing Examples for Various Machinery Components
7.6 Civil Aircraft Jettison Fuel Subsystem Reliability Analysis
7.6.1 Background for the Computation
7.6.2 Risks and Challenges of Aircraft Fuel Jettison Subsystem
7.6.3 Function Hazard Assessment
7.6.4 Primary System Safety Assessment Facing on Reliability Distribution
7.6.5 Reliability Analysis of Each Independent Fuel Tank for Dumping
7.6.6 Reliability Analysis of the Whole Fuel Jettison Subsystem
7.7 Conclusions
7.8 Questions
References

8 Reliability Tests
8.1 Introduction
8.1.1 Targets of Reliability and Life Tests
8.1.2 Specifics of Life Tests and Reliability
8.1.3 Test Types
8.1.4 Test Planning and Documentation
8.2 Reliability Tests Plan
8.2.1 Types of Tests
8.2.2 Test Samples
8.2.3 Test Stresses
8.2.4 Test Time
8.3 Zero-Failure Test
8.3.1 Binomial Zero-Failure Testing
8.3.2 Weibull Zero-Failure Testing
8.4 Life Tests
8.4.1 Theoretical Concept
8.4.2 Binomial Series Life Testing
8.4.3 Exponential Series Life Testing
8.4.4 Weibull Series Life Testing
8.5 Accelerated Tests
8.5.1 Principles for Accelerating Tests
8.5.2 Operational Cycles Compressing
8.5.3 Extrapolation in Time
8.5.4 Revision of Loading Range
8.5.5 Increased Frequency of Operational Cycles
8.5.6 Loading Extrapolation
8.5.7 Break Completely
8.5.8 Progressive Load Forcing
8.5.9 Alternating Modes Loading
8.5.10 Some Specifics of the Accelerated Tests
8.6 Verification of Reliability Based on Prior Information
8.7 Verification of Reliability by Means of Degradation Tests
8.8 Questions
References

9 Risk Analysis of Aircraft Structure and Systems
9.1 Introduction
9.2 Risk Assessment of Aeroengine Life-Limited Parts
9.2.1 ELLPS Airworthiness Regulations and AC Analysis
9.2.2 Determination Method of ELLPs Based on FMEA
9.2.3 Probabilistic Risk Assessment of ELLPs
9.2.4 Case Analysis
9.3 Risk Warning of Aircraft Bleed Air System
9.3.1 Risk Warning Method
9.3.2 Application of Risk Warning on the BAS
9.4 Conclusions
References

Chapter 1
Introduction
1.1 Basic Concepts of Reliability Theory

In general, reliability is the property of an object to maintain over time, within established limits, all parameters that ensure the performance of the required functions under the specified operating conditions.

The paramount importance of reliability in technology stems from the fact that the level of reliability largely determines the development of technology in its main areas: automation of production, intensification of operational processes and transport, and saving of materials and energy.

Modern technical facilities consist of many interacting mechanisms, devices, and units. For example, modern automated rolling complexes contain more than one million parts, and modern radio-controlled UAVs have tens of millions of elements, whereas the first simple cars and radios consisted of only tens or hundreds of parts. Failure of even one critical element of a complex system without redundancy can cause the entire system to fail.

Inadequate reliability of equipment leads to huge repair costs, equipment downtime, interruption of power, water, gas, and vehicle supply, failure to perform important tasks, and sometimes accidents with large economic losses, destruction of large facilities, and human casualties.

The rapid development of reliability science in the period of the scientific and technological revolution is connected with:
(1) Automation, the manifold increase in the complexity of devices, and their assembly into large complexes;
(2) Tasks of unmanned technology;
(3) Continuous improvement of devices, decrease in their metal consumption, and increase in their power, thermal, and electrical intensities.

Reliability theory is a complex discipline and consists of such sections as mathematical theory of reliability, reliability for individual physical failure criteria, calculation and prediction of reliability, measures to improve reliability, reliability
control (testing, statistical monitoring, organization of observations), technical diagnostics, recovery theory, and economics of reliability.

In reliability theory, the following objects are generalized [1–3]:
A product is a unit of production manufactured by a given enterprise, workshop, etc., e.g., a bearing, a machine tool, a car;
An element is the simplest component of a product. In reliability tasks it can consist of many parts;
A system is a set of interacting elements designed to perform specific functions independently.

The concepts of an element and a system change according to the task. For example, a piece of equipment is considered as a system consisting of separate elements (mechanisms, parts, etc.) when determining its own reliability, and as an element when studying the reliability of an automatic line.

Products fall into two categories:
Non-recoverable, which cannot be recovered by the consumer and must be replaced, e.g., screws, electronic lamps, bearings, etc.;
Recoverable, which can be repaired by the consumer, e.g., a machine tool, a car, and a radio.

A number of products classified as non-recoverable, such as bearings, are sometimes repaired, not by consumers but by specialized companies. Complex products consisting of many elements are usually recoverable, since failures are usually related to damage to one or a few elements, while the others remain functional. Simple items, especially externally purchased and mass-produced items, are often not recoverable.

The basic concepts and terms of reliability are standardized. Reliability is characterized by the following key conditions and events:

State of serviceability is the condition of the product in which it can normally perform the specified functions (with the parameters established in the technical documentation). Serviceability does not cover requirements that do not directly affect performance, such as the condition of the paint.
State of health is the state of the product in which it meets not only all basic but also all auxiliary requirements. A product in a state of health is necessarily serviceable, but a serviceable product is not necessarily in a state of health.

Failure is an event consisting of complete or partial loss of serviceability. Failures are divided into the following:
Functional failures, in which the product stops performing its functions (e.g., gear breakage).
Parametric failures, in which some parameters of the product change beyond acceptable limits (e.g., loss of machine accuracy).

The causes of failures are divided into random and systematic. Random causes are unintentional overloads, material defects and manufacturing deviations not detected by inspection, operator errors, and control system malfunctions. Examples include solid inclusions in the medium being processed, inadmissible deviations in the dimensions of workpieces or their incorrect clamping, cavities, and hardening cracks. Random factors mainly cause failures when they act in unfavorable combinations.
Systematic causes are natural phenomena that cause a gradual accumulation of damage: the influence of the environment, time, temperature, and radiation (corrosion, aging); loads and friction (fatigue, creep, wear); and functional influences (clogging, sticking, leakage).

According to these causes and the manner of development and manifestation, failures are divided into sudden (failures due to overload, seizure), gradual in development and sudden in manifestation (fatigue, lamp burnout, and short circuits due to aging of insulation), and gradual (wear, aging, corrosion, and sticking). Sudden failures are more dangerous than gradual failures because they are unexpected. Gradual failures occur when parameters drift outside the tolerance limits during operation or storage.

According to the causes of occurrence, failures can also be divided into structural, caused by design defects; manufacturing, caused by imperfection or violation of technology; and operational, caused by improper operation of products.

According to their physical nature, failures are either related to damage to parts or their surfaces (breakage, chipping, wear, corrosion, and aging) or not related to destruction (clogging of channels supplying fuel, lubricant, or working fluid in hydraulic drives, loosening of connections, contamination or weakening of electrical contacts). Accordingly, failures are corrected either by replacing parts or by adjusting or cleaning.

In terms of their consequences, failures can be minor (easily remedied), moderate (not causing destruction of other units), or major (leading to severe secondary damage and sometimes loss of life).

With regard to possible further use of the product, failures are divided into complete ones, which exclude operation of the product until they are eliminated, and partial ones, in which the product can be partially used, e.g., with incomplete power or at a reduced speed.
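The classification criteria described so far can be recorded together, for example in a failure database. The following is only a minimal sketch of such a record, assuming Python; the names `Manifestation`, `Origin`, `Severity`, `Extent`, `FailureRecord`, and `excludes_operation` are hypothetical and chosen for illustration, and the sample record is invented rather than taken from real data.

```python
from dataclasses import dataclass
from enum import Enum


class Manifestation(Enum):
    """Manner of development and manifestation of a failure."""
    SUDDEN = "sudden"                              # e.g. overload, seizure
    GRADUAL_THEN_SUDDEN = "gradual, then sudden"   # e.g. fatigue, lamp burnout
    GRADUAL = "gradual"                            # e.g. wear, corrosion


class Origin(Enum):
    """Cause of occurrence."""
    STRUCTURAL = "design defect"
    MANUFACTURING = "imperfection or violation of technology"
    OPERATIONAL = "improper operation"


class Severity(Enum):
    """Consequences of the failure."""
    MINOR = 1      # easily remedied
    MODERATE = 2   # no destruction of other units
    MAJOR = 3      # severe secondary damage


class Extent(Enum):
    """Possibility of further use of the product."""
    COMPLETE = "complete"
    PARTIAL = "partial"


@dataclass(frozen=True)
class FailureRecord:
    """Hypothetical record type combining the criteria listed in the text."""
    description: str
    manifestation: Manifestation
    origin: Origin
    severity: Severity
    extent: Extent

    def excludes_operation(self) -> bool:
        # A complete failure excludes operation until it is eliminated.
        return self.extent is Extent.COMPLETE


# Invented sample entry: a worn gear that still runs at reduced speed.
record = FailureRecord(
    description="worn gear teeth, operation at reduced speed",
    manifestation=Manifestation.GRADUAL,
    origin=Origin.OPERATIONAL,
    severity=Severity.MODERATE,
    extent=Extent.PARTIAL,
)
print(record.excludes_operation())
```

Keeping each criterion as a separate enumeration reflects the fact that the criteria are independent of one another: a gradual failure may be minor or major, complete or partial.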
According to the complexity of removal, failures are divided into those that can be removed during maintenance and those that require medium or major repairs. According to the place of removal, failures are divided into those that can be removed under operating conditions and those that require stationary conditions, which is especially important for transport vehicles. There are also self-repairing failures, e.g., in automatic workpiece feed systems on machine tools.
According to the time of occurrence, failures are divided into running-in failures, which occur in the initial period of operation and are related to insufficient running-in and to defective elements that passed inspection and reached assembly; failures during normal operation (the period before wear failures appear); and wear failures.
Failures of parts and assemblies in different equipment and under different conditions can have completely different consequences. The consequences of a failure of multipurpose equipment that the enterprise has in several units, and for which a repair shop is available, can be eliminated by the enterprise itself, with the functions distributed among other equipment during the repair. Failure of equipment built into an automatic line, or of special equipment installed at the factory as a single unit, additionally causes large losses related to the downtime of much other equipment and to non-fulfillment of the plan by the shop and the plant. Failure of aircraft parts can cause fatal accidents.
The reliability of products is determined by their failure-free performance, durability, maintainability, and preservation. Thus, reliability is characterized by properties that appear in operation and make it possible to judge how well the product will meet the expectations of its manufacturers and consumers.
Failure-free operation (or reliability in a narrower sense) is the ability of an object to continuously maintain its operability for a given time or operating time. This property is particularly important for equipment whose failure endangers human life or interrupts the operation of a large complex of equipment, stopping automated production.
Durability is the property of a product to maintain its serviceability up to the limit state under an established system of maintenance and repairs. The limit state of a product is characterized by the impossibility of its further operation or by a decrease in efficiency or safety. For non-recoverable products, the concepts of durability and failure-free operation practically coincide.
Maintainability is the property of a product that makes it possible to prevent and detect the causes of failure or damage and to maintain and restore serviceability through maintenance and repair. As systems grow more complex, it becomes very hard to find the causes of failures and the failed elements. For example, in complex electro-hydraulic systems of equipment, the search for the causes of failure can take more than 50% of the total recovery time. Therefore, means that simplify the search for failed elements are built into the structure of new complex automatic systems. The importance of equipment maintainability is determined by the huge costs of equipment repair in the national economy.
Preservation is the property of a product to retain the values of its failure-free operation, durability, and maintainability indicators after storage and transportation. The practical role of this property is particularly great for instruments. For example, according to American sources, during the Second World War about 50% of the radio-electronic equipment for military needs and its spare parts failed during storage.
1.2 Main Indicators of Reliability
Reliability indicators are classified by the reliability property they characterize into indicators of failure-free operation, durability, maintainability, and preservation. According to the recoverability of products, they are classified into indicators for recoverable and non-recoverable products [4]. Both single indicators, which characterize individual properties of equipment, and complex indicators, which characterize the general level of reliability, are used. The reliability of products can be assessed by some or all of the reliability indicators, depending on their type.
1.2.1 Failure-Free Indicators
Failure-free probability is the probability that a failure will not occur within a given operating time.
Mean Time to Failure (MTTF) is the mathematical expectation of the time to failure of a non-recoverable product.
Uptime is the duration or amount of work performed by an object.
Mean Time Between Failures (MTBF) is the ratio of the operating time of a recoverable item to the mathematical expectation of the number of failures during that operating time.
Failure rate is an indicator of the reliability of non-recoverable products, equal to the ratio of the average number of items failing per unit of time (or of operating time in other units) to the number of items that remain operational. This indicator is more sensitive than the probability of failure-free operation, especially for high-reliability products.
Failure frequency is an indicator of the reliability of recoverable products, equal to the ratio of the average number of failures of the recoverable item during an arbitrarily small operating time to the value of this operating time (it corresponds to the failure rate for non-recoverable products, but includes repeated failures).
1.2.2 Durability Indicators
Technical resource (abbreviated as resource) is the operating time of an object from the beginning of its operation, or from the resumption of operation after repair, to the limit state. The resource is expressed in units of operating time (usually hours), path length (kilometers), or units of product output. For non-recoverable objects, the concepts of technical resource and operating time to failure coincide.
Service life is the calendar duration of operation to the limit state. It is usually expressed in years.
For equipment parts, the technical resource is used as the criterion of durability. The technical resource is also used for equipment that operates under varying conditions and has a more accurate indicator than the calendar life (in particular, mileage for transport vehicles and engine hours for engines). For other equipment, the service life is used.
Durability indicators are divided into gamma-percentile, mean, mean before depreciation, and full. Gamma-percentile indicators are those that are reached or exceeded, on average, by the specified percentage γ of products of a given type. They characterize the durability of products with a given probability of maintaining their performance. In particular, the gamma-percentile resource is the main design indicator for rolling bearings. The main advantages of this indicator are that it can be determined before the testing of all samples is completed, that it gives a good quantitative characterization of early failures, etc. For mass-produced products, especially rolling bearings, the 90% resource is most often used. For bearings of very critical products, the γ resource is chosen at 95% or more. If the failure is dangerous to human life, γ is taken close to 100%.
1.2.3 Indicators of Maintainability and Preservation
Mean recovery time is the mathematical expectation of the time needed to restore the object to an operational state after a failure.
The probability of recovering an operational state in a given time is the probability that the time needed to restore the operational state of the object will not exceed the given value.
The mean shelf life is the mathematical expectation of the shelf life of products of the type in question.
For the γ-percent shelf life, γ ≥ 90% is typical.
1.2.4 Complex Indicators
They are mainly used for automatic complexes and complex systems.
The technical utilization factor is the ratio of the mathematical expectation of the time spent in an operating state over a certain period of operation to the sum of that mathematical expectation and all downtimes for repairs and maintenance.
The availability factor is the probability that an object will be in an operational state at an arbitrary point in time, excluding planned periods during which its use is not intended. The factor is given as the ratio of the mathematical expectation of the time spent in an operational state to the mathematical expectation of the sum of that time and the time spent on unplanned repairs.
1.3 Base Dependencies of the Reliability Theory
The significant scatter of the key reliability parameters makes it necessary to consider them probabilistically. Reliability parameters are used in a statistical interpretation for condition assessment and in a probabilistic interpretation for prediction. The former are expressed in discrete numbers; in probability theory and the mathematical theory of reliability, they are called estimators. With a sufficiently large number of tests, they are taken as the true reliability characteristics [5].
Consider tests, or operation, of a significant number N of elements during time t. Let $N_p$ operable (non-failed) elements and n failed ones remain by the end of the test or of the service life. Then the relative number of failures is Q(t) = n/N.
If the tests are carried out on a sample, Q(t) can be considered a statistical estimator of the probability of failure or, if N is very large, the probability of failure itself. In cases where it is necessary to emphasize the difference between the estimator and the true probability value, the estimator is additionally marked with an asterisk, e.g., Q*(t).
The probability of failure-free operation is estimated by the relative number of operable elements
$$P(t) = \frac{N_p}{N} = 1 - \frac{n}{N}. \tag{1.1}$$
Since failure-free operation and failure are mutually opposite events, the sum of their probabilities is equal to one:
$$P(t) + Q(t) = 1. \tag{1.2}$$
The same follows from the above dependencies. For t = 0, n = 0, Q(t) = 0, and P(t) = 1. For t = ∞, n = N, Q(t) = 1, and P(t) = 0.
The distribution of failures over time is characterized by the distribution density function f(t) of the operating time to failure. Statistically,
$$f(t) = \frac{\Delta n}{N\,\Delta t} = \frac{\Delta Q(t)}{\Delta t},$$
and in a probabilistic interpretation
$$f(t) = \frac{dQ(t)}{dt}.$$
Here, Δn and ΔQ are the increments in the number of failed objects and, correspondingly, in the probability of failure during the time Δt.
The probabilities of failure and of failure-free operation are expressed through the density function f(t) by the dependencies
$$Q(t) = \int_0^t f(t)\,dt; \qquad Q(t) = \int_0^t f(t)\,dt = 1 \ \text{for}\ t = \infty; \qquad P(t) = 1 - Q(t) = 1 - \int_0^t f(t)\,dt = \int_t^\infty f(t)\,dt. \tag{1.3}$$
The failure rate λ(t), in contrast to the distribution density, refers to the number of objects $N_p$ that remain operable, and not to the total number of objects. Accordingly, in the statistical interpretation,
$$\lambda(t) = \frac{\Delta n}{N_p\,\Delta t}, \tag{1.4}$$
and in the probabilistic interpretation, taking into account that $N_p/N = P(t)$,
$$\lambda(t) = \frac{f(t)}{P(t)}. \tag{1.5}$$
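The statistical forms of f(t) and λ(t) can be illustrated with a short sketch. The failure times and the sample size below are hypothetical, and taking $N_p$ as the number of items still operable at the start of the interval is an assumption of this sketch; the function name is illustrative only.

```python
def interval_estimates(failure_times, n_total, t_start, t_end):
    """Statistical f(t) and lambda(t) on [t_start, t_end), cf. Eqs. (1.4)-(1.5).

    Assumption: N_p is counted at the start of the interval.
    """
    dt = t_end - t_start
    failed_before = sum(1 for t in failure_times if t < t_start)
    dn = sum(1 for t in failure_times if t_start <= t < t_end)
    n_p = n_total - failed_before          # items still operable
    f = dn / (n_total * dt)                # density: referred to all N items
    lam = dn / (n_p * dt)                  # failure rate: referred to N_p items
    return f, lam


# Hypothetical test of N = 100 items; times (h) of the failures observed so far
failure_times = [120, 340, 560, 610, 880, 905, 950]
N = 100

f_est, lam_est = interval_estimates(failure_times, N, 500, 1000)
Q_est = sum(1 for t in failure_times if t < 1000) / N   # Q(t) = n/N
P_est = 1 - Q_est                                       # Eq. (1.2)
print(f_est, lam_est, Q_est, P_est)
```

Because $N_p < N$ once failures have occurred, the estimated failure rate is always at least as large as the estimated density on the same interval.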
We now obtain an expression for the probability of no-failure operation as a function of the failure rate. To do this, substitute $f(t) = -dP(t)/dt$ into the previous expression, separate the variables, and integrate:
$$\frac{dP(t)}{P(t)} = -\lambda(t)\,dt; \qquad \ln P(t) = -\int_0^t \lambda(t)\,dt; \qquad P(t) = e^{-\int_0^t \lambda(t)\,dt}. \tag{1.6}$$
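As a numerical illustration of Eq. (1.6), the sketch below integrates a given failure rate with the trapezoidal rule and exponentiates the result. The two failure-rate functions and their numerical values are assumptions chosen only so that the integral equals 0.1 in both cases.

```python
import math


def reliability_from_hazard(hazard, t, steps=10_000):
    """P(t) = exp(-integral_0^t hazard(u) du), trapezoidal rule."""
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * h)
    return math.exp(-total * h)


# Constant rate: the integral is lambda * t = 1e-4 * 1000 = 0.1
p_const = reliability_from_hazard(lambda u: 1e-4, 1000.0)

# Linearly growing rate: the integral is 1e-7 * t**2 = 0.1 at t = 1000
p_linear = reliability_from_hazard(lambda u: 2e-7 * u, 1000.0)

print(p_const, p_linear)
```

Both cases give the same P(t), which shows that only the accumulated integral of λ(t) over the interval matters, not its shape.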
This relation is one of the basic equations of the mathematical theory of reliability.
The most important general dependencies of reliability include the dependence of the reliability of systems on the reliability of their elements. Let us consider the simplest computational model, a system of series-connected elements (Fig. 1.1), in which the failure of any element causes the system to fail and the failures of the elements are assumed to be independent. We use the well-known probability multiplication theorem, according to which the probability of the product, i.e., the joint occurrence, of independent events is equal to the product of the probabilities of these events. Consequently, the probability of no-failure operation of the system is equal to the product of the probabilities of no-failure operation of the elements, i.e.,
$$P_{sys}(t) = P_1(t) \cdot P_2(t) \cdots P_n(t); \qquad \text{if } P_1(t) = P_2(t) = \dots = P_n(t), \text{ then } P_{sys}(t) = P_1^n(t). \tag{1.7}$$
Fig. 1.1 Sequential system diagram
Therefore, the reliability of complex systems is low. For example, if the system consists of 10 elements, each with a probability of failure-free operation of 0.9 (as in rolling bearings), then the overall probability is $0.9^{10} \approx 0.35$.
Usually, the probability of failure-free operation of elements is quite high; therefore, expressing $P_1(t), P_2(t), \dots, P_n(t)$ in terms of the failure probabilities and using approximate calculation, we get
$$P_{sys}(t) = [1 - Q_1(t)]\,[1 - Q_2(t)] \cdots [1 - Q_n(t)] \approx 1 - [Q_1(t) + Q_2(t) + \dots + Q_n(t)],$$
since the products of small quantities can be neglected. For $Q_1(t) = Q_2(t) = \dots = Q_n(t)$, $P_{sys}(t) \approx 1 - nQ_1(t)$. For example, let a system consist of six identical series-connected elements with $P_1(t) = 0.99$. Then $Q_1(t) = 0.01$ and $P_{sys}(t) \approx 0.94$.
It must be possible to determine the probability of failure-free operation for any period of time. By the probability multiplication theorem,
$$P(T + t) = P(T) \cdot P(t) \quad \text{or} \quad P(t) = \frac{P(T + t)}{P(T)}, \tag{1.8}$$
where P(T) and P(T + t) are the probabilities of failure-free operation during the times T and T + t, respectively; P(t) is the conditional probability of no-failure operation for time t (the term “conditional” is introduced here because the probability is determined on the assumption that the products did not fail before the start of the time interval or operating time).
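The series-system product, its first-order approximation, and the conditional probability of Eq. (1.8) can be checked with a few lines of code. This is a minimal sketch; the function names are illustrative, and the values 0.81 and 0.9 in the last call are hypothetical.

```python
import math


def series_reliability(element_probs):
    """Exact P_sys for independent series-connected elements, Eq. (1.7)."""
    return math.prod(element_probs)


def series_reliability_approx(element_probs):
    """First-order approximation 1 - sum(Q_i), valid for small Q_i."""
    return 1.0 - sum(1.0 - p for p in element_probs)


def conditional_reliability(p_T, p_T_plus_t):
    """P(t) = P(T + t) / P(T), Eq. (1.8)."""
    return p_T_plus_t / p_T


p_ten = series_reliability([0.9] * 10)                # ten elements with P = 0.9
p_six_exact = series_reliability([0.99] * 6)          # six elements with P = 0.99
p_six_approx = series_reliability_approx([0.99] * 6)
p_cond = conditional_reliability(p_T=0.9, p_T_plus_t=0.81)  # hypothetical values

print(round(p_ten, 3), round(p_six_exact, 4), round(p_six_approx, 4), round(p_cond, 3))
```

For the six-element example, the exact product and the approximation differ only in the fourth decimal place, which is why the approximation is acceptable when the element failure probabilities are small.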
1.4 Reliability During Regular Operation
During this period, gradual failures have not yet appeared, and reliability is characterized by sudden failures. These failures are caused by an unfavorable coincidence of many circumstances and therefore have a constant rate that does not depend on the age of the product:
λ(t) = λ = const,
where $\lambda = 1/m_t$ and $m_t$ is the mean time to failure (usually in hours). Then λ is expressed as the number of failures per hour and, as a rule, is a small fraction.
The probability of no-failure operation (Eq. 1.6) is then written as
$$P(t) = e^{-\int_0^t \lambda\,dt} = e^{-\lambda t}. \tag{1.9}$$
It conforms to the exponential distribution law of uptime and is the same for any equal period of time during normal operation. The graph of the probability of no-failure operation of the facility is shown in Fig. 1.2.
Fig. 1.2 The graph of the probability of no-failure operation
The exponential distribution law can approximate the time of failure-free operation of a wide range of objects (products): especially critical facilities operated in the period before the significant development of gradual failures; elements of electronic equipment; equipment with sequential replacement of failed parts; electrical and hydraulic equipment and control systems, etc.; complex objects consisting of many elements (in this case, the uptime of each element need not be distributed exponentially; it is only necessary that the failures of one element that does not conform to this law do not dominate the others).
An essential advantage of the exponential distribution is its simplicity: it has only one parameter.
If, as is usual, λ·t ≤ 0.1, the equation for the probability of no-failure operation is simplified by expanding it into a series and discarding the small terms:
$$P(t) = 1 - \lambda t + \frac{(\lambda t)^2}{2!} - \frac{(\lambda t)^3}{3!} + \dots \approx 1 - \lambda t. \tag{1.10}$$
The distribution density (in the general case) is
$$f(t) = -\frac{dP(t)}{dt} = \lambda e^{-\lambda t}. \tag{1.11}$$
The values of the probability of no-failure operation (Fig. 1.3, Table 1.1) depend on $\lambda(t) \cdot t \approx t/m_t$.
Fig. 1.3 Functions of the probability of failure-free operation P(t) curve 1, density probability f (t) curve 2, and failure rate λ(t) curve 3 of exponential distribution
Table 1.1 The values of the probability of no-failure operation
λ(t)·t    1       0.1    0.01    0.001    0.0001
P(t)      0.368   0.9    0.99    0.999    0.9999
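The entries of Table 1.1 can be reproduced by comparing the exact exponential law (Eq. 1.9) with the linear approximation (Eq. 1.10). The sketch below is only a check of the arithmetic; the variable names are illustrative.

```python
import math

# Exact value e^(-x) and linear approximation 1 - x for x = lambda * t
table = {}
for x in (1, 0.1, 0.01, 0.001, 0.0001):
    exact = math.exp(-x)
    approx = 1 - x
    table[x] = (exact, approx)
    print(f"lambda*t = {x:<7} exact P = {exact:.5f}  1 - lambda*t = {approx:.5f}")
```

At λ·t = 0.1 the exact value is about 0.905, which the table rounds to 0.9; for smaller λ·t the exact and approximate values agree to the number of digits shown in the table. At λ·t = 1 the approximation is no longer usable.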
Since at $t/m_t = 1$ the probability P(t) ≈ 0.37, 63% of failures occur during the time $t < m_t$ and only 37% later. From the given values, it follows that to provide the required probability of no-failure operation of 0.9 or 0.99, only a small fraction of the mean time to failure (0.1 and 0.01, respectively) can be used.
If the product operates under different modes and, consequently, with different failure rates $\lambda_1$ (for time $t_1$) and $\lambda_2$ (for time $t_2$), then
$$P(t) = e^{-(\lambda_1 t_1 + \lambda_2 t_2)}. \tag{1.12}$$
This dependence follows from the probability multiplication theorem.
To determine the failure rate on the basis of experiments, the mean time to failure is estimated as
$$m_t \approx \frac{1}{N}\sum t_i, \tag{1.13}$$
where N is the total number of events. Then $\lambda = 1/m_t$.
A graphical method can also be used (Fig. 1.4): the experimental points are plotted in the coordinates t and −lg P(t). The minus sign is chosen because P(t) < 1 and, therefore, lg P(t) is negative. Taking the logarithm of the expression for the probability of no-failure operation, lg P(t) = −λ·t·lg e = −0.4343λ·t, we conclude that the tangent of the slope angle of the straight line drawn through the experimental points is tan α = 0.4343λ, whence λ = 2.3 tan α. With this method, it is not necessary to complete the test of all samples.
Fig. 1.4 Graphical determination of the probability of failure-free operation from the results of experiments
For a system, $P_{sys}(t) = e^{-\sum \lambda_i t}$. If $\lambda_1 = \lambda_2 = \dots = \lambda_n$, then $P_{sys}(t) = e^{-n\lambda_1 t}$.
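Both ways of estimating λ can be sketched in code. The first function applies Eq. (1.13); the second replaces the hand-drawn straight line of the graphical method with a least-squares line through the origin, which is an assumption of this sketch rather than a procedure given in the text. All data values are hypothetical.

```python
import math


def rate_from_mean(times_to_failure):
    """lambda = 1 / m_t, with m_t estimated by Eq. (1.13)."""
    return len(times_to_failure) / sum(times_to_failure)


def rate_from_slope(points):
    """Estimate lambda from (t, P(t)) points.

    Fits y = -lg P(t) against t with a least-squares line through the
    origin; its slope plays the role of tan(alpha) = 0.4343 * lambda.
    """
    num = sum(t * -math.log10(p) for t, p in points)
    den = sum(t * t for t, _ in points)
    tan_alpha = num / den
    return tan_alpha / math.log10(math.e)   # lambda = 2.3 * tan(alpha)


lam_mean = rate_from_mean([100.0, 200.0, 300.0])     # hypothetical times, h

# Hypothetical "experimental" points generated from lambda = 2e-4 1/h
points = [(t, math.exp(-2e-4 * t)) for t in (500.0, 1000.0, 2000.0)]
lam_fit = rate_from_slope(points)

print(lam_mean, lam_fit)
```

With noise-free points the fitted value reproduces the rate used to generate them; with real test data the points scatter around the line and the slope gives an averaged estimate.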
Thus, the probability of failure-free operation of a system consisting of elements whose probability of failure-free operation follows an exponential law also follows an exponential law, and the failure rates of the individual elements add up.
Using the exponential distribution law, it is easy to determine the average number of products n that will fail by a given time and the average number of products $N_p$ that will remain operational. For λ·t ≤ 0.1,
$$n \approx N\lambda t; \qquad N_p \approx N(1 - \lambda t).$$
Example 1 Estimate the probability P(t) of the absence of sudden failures of the facility during t = 10 000 h, if the failure rate is $\lambda = 1/m_t = 10^{-8}\ \text{h}^{-1}$.
Solution Since $\lambda t = 10^{-8} \cdot 10^{4} = 10^{-4} < 0.1$, we use the approximate dependence $P(t) = 1 - \lambda t = 1 - 10^{-4} = 0.9999$. Calculation by the exact dependence $P(t) = e^{-\lambda t}$ gives the same result to four decimal places.
1.5 Reliability During Gradual Failures
Gradual failures require distribution laws of the service life whose density is low at first, then rises to a maximum, and then decreases as the number of operable elements decreases. Because of the variety of causes and conditions of failures during this period, several distribution laws are used to describe reliability; they are established by approximating the results of tests or of observations in operation.
1.5.1 The Normal Distribution
It is the most versatile, convenient, and widely used distribution for practical calculations (Fig. 1.5).
Fig. 1.5 Probability density function and cumulative probability function of normal distribution
A distribution conforms to the normal law whenever many roughly equivalent factors influence the change in the random variable. The operating time to failure of many recoverable and non-recoverable products, the dimensions of parts, measurement errors, etc., are subject to the normal distribution. The distribution density is
$$f(t) = \frac{1}{S\sqrt{2\pi}}\, e^{-\frac{(t - m_t)^2}{2S^2}}. \tag{1.14}$$
The distribution has two independent parameters: the mathematical expectation $m_t$ and the standard deviation S. Their values are estimated from the test results by the equations
$$m_t \approx \bar{t} = \frac{\sum t_i}{N}, \qquad S \approx s = \sqrt{\frac{1}{N - 1}\sum (t_i - \bar{t})^2}, \tag{1.15}$$
where $\bar{t}$ and s are the estimates of the mathematical expectation and the standard deviation. The estimates converge to the parameters as the number of tests increases. Sometimes, it is more convenient to operate with the variance $D = S^2$.
The mathematical expectation determines the position of the curve on the graph (see Fig. 1.6), and the standard deviation determines its width. The smaller S is, the sharper and higher the distribution density curve.
Fig. 1.6 The main characteristics of the normal distribution for different values of the standard deviation: a is the probability density f (t); b is the probability of failure-free operation P(t); c is failure rate λ(t)
The curve starts at t = −∞ and extends to t = ∞. This is not a significant disadvantage, especially if $m_t \ge 3S$, since the area under the branches of the density curve extending to infinity, which expresses the corresponding probability of failure, is very small. Thus, the probability of failure before $m_t - 3S$ is only 0.135% and is usually not taken into account in calculations. The probability of failure before $m_t - 2S$ is 2.275%. The largest ordinate of the distribution density curve is 0.399/S.
The Cumulative Distribution Function (CDF) has the form
$$F(t) = \int_{-\infty}^{t} f(t)\,dt. \tag{1.16}$$
The probability of failure and the probability of failure-free operation are, respectively,
$$Q(t) = F(t), \qquad P(t) = 1 - F(t). \tag{1.17}$$
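The tail probabilities and the peak ordinate quoted above can be verified with the `NormalDist` class from Python's standard `statistics` module. This is a minimal sketch; the parameter values $m_t$ = 4·10⁴ h and S = 10⁴ h are arbitrary illustrative choices.

```python
from statistics import NormalDist


def normal_reliability(t, m_t, s):
    """P(t) = 1 - F(t) for a normally distributed time to failure, Eq. (1.17)."""
    return 1.0 - NormalDist(mu=m_t, sigma=s).cdf(t)


m_t, s = 4.0e4, 1.0e4            # illustrative parameters, h
dist = NormalDist(mu=m_t, sigma=s)

q_3s = dist.cdf(m_t - 3 * s)     # probability of failure before m_t - 3S
q_2s = dist.cdf(m_t - 2 * s)     # probability of failure before m_t - 2S
peak_times_s = dist.pdf(m_t) * s # largest ordinate of f(t), multiplied by S

print(f"{q_3s:.5f} {q_2s:.5f} {peak_times_s:.4f}")
```

The three printed values do not depend on the particular $m_t$ and S chosen, because they are properties of the normalized distribution.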
The calculation of the integrals is replaced by the use of tables. Tables of the normal distribution as a function of $(t - m_t)$ and S would be cumbersome, since they would have two independent parameters. It is sufficient to use small tables for the normal distribution with $m_x = 0$ and $S_x = 1$. For this distribution, the density function
$$f_0(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \tag{1.18}$$
has one variable x. The quantity x is centered, since $m_x = 0$, and normalized, since $S_x = 1$. The distribution density function is written in relative coordinates with the origin on the axis of symmetry of the curve. The distribution function is the integral of the distribution density:
$$F_0(x) = \int_{-\infty}^{x} f_0(x)\,dx. \tag{1.19}$$
It follows from this equation that $F_0(x) + F_0(-x) = 1$, hence $F_0(-x) = 1 - F_0(x)$.
To use the tables, the substitution $x = (t - m_t)/S$ is applied. Here x is called the quantile of the normalized normal distribution and is usually denoted by $u_p$.
The distribution density, the probability of failure, and the probability of failure-free operation are, respectively, $f(t) = f_0(x)/S$, $Q(t) = F_0(x)$, and $P(t) = 1 - F_0(x)$, where $f_0(x)$ and $F_0(x)$ are taken from tables. For example, Table 1.2 gives the tabulated values depending on $x = u_p = (t - m_t)/S$ in the usable range.
Table 1.2 Density values and distribution functions for normal distribution depending on x
x         0        1        2        3        4
f_0(x)    0.3989   0.2420   0.0540   0.0044   0.0001
F_0(x)    0.5      0.8413   0.9772   0.9986   0.9999
In reliability studies, the Laplace function is often used instead of the CDF $F_0(x)$:
$$\Phi(x) = \int_0^x f_0(x)\,dx = \frac{1}{\sqrt{2\pi}} \int_0^x e^{-\frac{x^2}{2}}\,dx.$$
It is obvious that
$$F_0(x) = \int_{-\infty}^{0} f_0(x)\,dx + \int_0^x f_0(x)\,dx = 0.5 + \Phi(x). \tag{1.20}$$
The probability of failure and the probability of failure-free operation, expressed in terms of the Laplace function, have the form
$$Q(t) = 0.5 + \Phi\left(\frac{t - m_t}{S}\right), \qquad P(t) = 0.5 - \Phi\left(\frac{t - m_t}{S}\right). \tag{1.21}$$
Comparing products with the same mean time to failure and different standard deviations S, it should be emphasized that, although with a large S some specimens have great durability, the smaller S is, the better the product.
In addition to the problem of estimating the probability of no-failure operation for a given time or operating time, there is an inverse problem: determining the time (or operating time) that corresponds to a given probability of no-failure operation. The values of this operating time are determined using the quantiles of the normalized normal distribution:
$$t = m_t + u_p S.$$
Quantile values are given in Table 1.3 depending on the required probability, in particular on the probability of no-failure operation.
Table 1.3 Quantile values depending on the required probability
P(t)    0.5   0.8      0.85     0.9      0.95     0.99     0.999    0.9999
u_p     0     −0.842   −1.036   −1.282   −1.645   −2.326   −3.090   −3.719
Operations with the normal distribution are simpler than with others, so it is often used in place of other distributions. For small coefficients of variation $S/m_t$, the normal distribution is a good substitute for the binomial, Poisson, and log-normal distributions.
The distribution of the sum of independent random variables U = X + Y + Z, called the composition of distributions, is also a normal distribution if the terms are normally distributed.
The mathematical expectation and variance of the composition are determined by the equations
$$m_u = m_x + m_y + m_z, \qquad S_u^2 = S_x^2 + S_y^2 + S_z^2, \tag{1.22}$$
where $m_x$, $m_y$, and $m_z$ are the mathematical expectations of the random variables X, Y, and Z; $S_x^2$, $S_y^2$, and $S_z^2$ are the variances of the same variables.
Example 2 Estimate the probability P(t) of no-failure operation during $t = 1.5 \cdot 10^4$ h of a wearing movable joint if the wear life has a normal distribution with parameters $m_t = 4 \cdot 10^4$ h, $S = 10^4$ h.
Solution Find the quantile: $u_p = (1.5 \cdot 10^4 - 4 \cdot 10^4) \cdot 10^{-4} = -2.5$. From the tables of the normalized normal distribution, $P(t) = 1 - F_0(-2.5) = 0.9938$.
Example 3 Estimate the 80% life $t_{0.8}$ of a movable joint if its durability is limited by wear and the life has a normal distribution with parameters $m_t = 10^4$ h, $S = 6 \cdot 10^3$ h.
Solution For P(t) = 0.8, $u_p = -0.84$; then $t_{0.8} = m_t + u_p S = 10^4 - 0.84 \cdot 6 \cdot 10^3 \approx 5 \cdot 10^3$ h.
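Examples 2 and 3 can be reproduced without tables by using `NormalDist` from Python's standard `statistics` module for $F_0$ and its inverse. This is a sketch of the arithmetic only; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()            # normalized distribution: m_x = 0, S_x = 1

# Example 2: direct problem, P(t) for a given operating time
u_p = (1.5e4 - 4.0e4) / 1.0e4        # quantile x = (t - m_t) / S
p_example2 = 1.0 - std_normal.cdf(u_p)

# Example 3: inverse problem, operating time for a given P(t) = 0.8
u_08 = std_normal.inv_cdf(1.0 - 0.8) # quantile with F_0(u_p) = 1 - P(t)
t_08 = 1.0e4 + u_08 * 6.0e3          # t = m_t + u_p * S

print(round(u_p, 2), round(p_example2, 4), round(u_08, 3), round(t_08))
```

The computed 80% life is slightly below 5·10³ h because the quantile is not rounded to two decimals as in the hand calculation.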
1.5.2 The Truncated Normal Distribution
It is obtained from the normal distribution by limiting the interval of variation of the random variable. In particular, it makes reliability calculations more precise than the normal distribution at large values of the coefficient of variation $V = S/m_t$.
The distribution density function is written in the same way as the normal distribution density, but with the proportionality factor c:
$$f(t) = \frac{c}{S\sqrt{2\pi}}\, e^{-\frac{(t - t_0)^2}{2S^2}}, \tag{1.23}$$
where $t_0$ is the value of the random variable corresponding to the maximum of f(t), called the mode.
Using the function $F_0$ of the normalized and centered normal distribution, we can write
$$c = \left[F_0\left(\frac{b - t_0}{S}\right) - F_0\left(\frac{a - t_0}{S}\right)\right]^{-1}, \tag{1.24}$$
where a and b are the limits of the interval of variation.
Table 1.4 Values of c depending on $t_0 S^{-1}$
t_0 S^−1    1       2       3
c           1.189   1.023   1.001
The main application of the truncated normal distribution is with the parameters a = 0 and b = ∞. In reliability tasks, it reflects the impossibility of failures at negative time values. Then
$$c = \frac{1}{F_0(t_0 S^{-1})}. \tag{1.25}$$
The values of c are given in Table 1.4. Thus, for $t_0 > 2S$, the factor c is very close to 1.
The probability of failure-free operation is
$$P(t) = c\,F_0\left((t_0 - t) S^{-1}\right). \tag{1.26}$$
The average durability is
$$m_t = t_0 + S f^{*}(t_0 S^{-1}), \tag{1.27}$$
where $f^{*}$ is a statistical function. An example of a truncated distribution is the distribution of a product quality parameter after a part of the products has been rejected by this parameter.
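The factor c of Eqs. (1.24) and (1.25) and the reliability of Eq. (1.26) can be computed directly. In this sketch `NormalDist().cdf` from the standard library stands in for the tabulated $F_0$, and the function names and the default limits a = 0, b = ∞ are illustrative choices.

```python
from statistics import NormalDist

F0 = NormalDist().cdf   # CDF of the normalized, centered normal distribution


def truncation_factor(t0, s, a=0.0, b=float("inf")):
    """Proportionality factor c of Eq. (1.24); defaults give Eq. (1.25)."""
    upper = 1.0 if b == float("inf") else F0((b - t0) / s)
    return 1.0 / (upper - F0((a - t0) / s))


def truncated_reliability(t, t0, s):
    """P(t) = c * F0((t0 - t) / S) for a = 0, b = infinity, Eq. (1.26)."""
    return truncation_factor(t0, s) * F0((t0 - t) / s)


for ratio in (1.0, 2.0, 3.0):   # values of t0 / S
    print(ratio, round(truncation_factor(ratio, 1.0), 3))
```

For a = 0 the factor reduces to $1/F_0(t_0/S)$ because $F_0(-x) = 1 - F_0(x)$, and the resulting P(0) equals exactly 1, as required for a distribution with no failures at negative times.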
1.5.3 The Log-Normal Distribution
In a log-normal distribution, the logarithm of the random variable is distributed according to the normal law. As a distribution of positive values, it is somewhat more accurate than the normal distribution in describing the operating time to failure of parts, in particular with respect to fatigue. It is successfully used to describe the operating time of rolling bearings, electronics, and other products.
The log-normal distribution is convenient for random variables that are the product of a significant number of random initial variables, just as the normal distribution is convenient for the sum of random variables.
Fig. 1.7 The main specifics of the log-normal distribution for different parameters: a-probability density f (t); b-probability of no failure P(t); c-failure rate λ(t)
The distribution density (Fig. 1.7) is described by the dependence
$$f(t) = \frac{1}{t\,S\sqrt{2\pi}}\, e^{-\frac{(\ln t - \mu)^2}{2S^2}}, \tag{1.28}$$
where μ and S are parameters determined from test results. When N products are tested to failure,
$$\mu \approx \mu^{*} = \frac{\sum \ln t_i}{N}, \qquad S \approx s = \sqrt{\frac{1}{N - 1}\sum (\ln t_i - \mu^{*})^2}, \tag{1.29}$$
where $\mu^{*}$ and s are the estimates of the parameters μ and S.
The probability of no-failure operation can be determined from the tables for the normal distribution, depending on the value of the quantile $u_p = (\ln t - \mu) S^{-1}$.
The mean time to failure is
$$m_t = e^{\mu + 0.5 S^2}. \tag{1.30}$$
The standard deviation is
$$S_t = \sqrt{e^{2\mu + S^2}\left(e^{S^2} - 1\right)}. \tag{1.31}$$
The variation coefficient is
$$V_t = \frac{S_t}{m_t} = \sqrt{e^{S^2} - 1}. \tag{1.32}$$
For $V_t \le 0.3$, one can take $V_t \approx S$; the error in this case is smaller than 1%.
Decimal logarithms are often used to express the dependencies for the log-normal distribution. Accordingly, the distribution density is
$$f(t) = \frac{0.4343}{t\,S\sqrt{2\pi}}\, e^{-\frac{(\lg t - \lg t_0)^2}{2S^2}}. \tag{1.33}$$
The estimates of the parameters $\lg t_0$ and S are determined from test results:
$$\lg t_0 \approx \lg t_0^{*} = \frac{\sum \lg t_i}{N}, \qquad S \approx s = \sqrt{\frac{1}{N - 1}\sum (\lg t_i - \lg t_0^{*})^2}. \tag{1.34}$$
The mean time to failure $m_t$, the standard deviation $S_t$, and the variation coefficient $V_t$ of the operating time to failure are determined by the equations
$$m_t = t_0\, e^{2.651 S^2}, \qquad S_t = m_t \sqrt{\left(\frac{m_t}{t_0}\right)^2 - 1}, \qquad V_t = \sqrt{\left(\frac{m_t}{t_0}\right)^2 - 1}; \tag{1.35}$$
for $V_t \le 0.3$, it is possible to take $V_t \approx 2.3S$.
For probabilities of failure-free operation P(t) ≤ 0.99 and for $V_t \le 0.3$, the log-normal law can be replaced by a normal law with parameters $m_t$ and $S_t$ and density
$$f(t) = \frac{1}{S_t\sqrt{2\pi}}\, e^{-\frac{(t - m_t)^2}{2S_t^2}}.$$
The probability of no-failure operation can be found using special tables for this distribution or tables for the normal distribution.
Example 4 Estimate the probability P(t) of the absence of shaft fatigue damage during $t = 10^4$ h, if the durability is distributed by the log-normal law with the parameters $\lg t_0 = 4.5$; S = 0.25.
Solution
$$P(t) = 1 - F_0\left(\frac{\lg t - \lg t_0}{S}\right) = 1 - F_0\left(\frac{\lg 10^4 - 4.5}{0.25}\right) = 1 - F_0(-2) = F_0(2) = 0.9772.$$
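Example 4 can be checked numerically, together with the mean time to failure and variation coefficient from Eq. (1.35) for the same parameters. In this sketch `NormalDist().cdf` from the standard library replaces the table of $F_0$; the function name is illustrative.

```python
import math
from statistics import NormalDist


def lognormal_reliability(t, lg_t0, s):
    """P(t) = 1 - F0((lg t - lg t0) / S) with decimal logarithms."""
    u_p = (math.log10(t) - lg_t0) / s
    return 1.0 - NormalDist().cdf(u_p)


lg_t0, s = 4.5, 0.25
p_example4 = lognormal_reliability(1.0e4, lg_t0, s)   # quantile u_p = -2

# Eq. (1.35): mean time to failure and variation coefficient
t0 = 10.0 ** lg_t0
m_t = t0 * math.exp(2.651 * s ** 2)
v_t = math.sqrt((m_t / t0) ** 2 - 1.0)

print(round(p_example4, 4), round(m_t), round(v_t, 3))
```

For these parameters the variation coefficient is well above 0.3, so the replacement of the log-normal law by a normal law described above would not be appropriate here.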
1.5.4 The Weibull Distribution
It is quite universal: by varying its parameters, it covers a wide range of cases of changing probabilities. Along with the log-normal distribution, it satisfactorily describes the operating time of parts to fatigue failure and the operating time to failure of bearings and electronics. It is used to assess the reliability of machine parts and assemblies, and also to assess reliability with respect to running-in failures.
Fig. 1.8 The main specifics of the Weibull distribution for different parameters t 0 and m: a-probability density f (t); b-probability of no failure P(t); c-failure rate λ(t)
The distribution is characterized by the following function of the probability of failure-free operation (see Fig. 1.8):
$$P(t) = e^{-\frac{t^m}{t_0}}. \tag{1.36}$$
The failure rate is
$$\lambda(t) = \frac{m}{t_0}\, t^{m-1}. \tag{1.37}$$
The density function is
$$f(t) = \frac{m}{t_0}\, t^{m-1} e^{-t^m t_0^{-1}}. \tag{1.38}$$
The Weibull distribution has two parameters: the shape factor m > 0 and the scale factor $t_0 > 0$. The mathematical expectation and the standard deviation are, respectively,
$$m_t = b_m t_0^{1/m}, \qquad S_t = c_m t_0^{1/m}, \tag{1.39}$$
where $b_m$ and $c_m$ are factors that depend on m.
If no failures occur during the time $t^{*}$, the equations for the reliability parameters are slightly modified. Thus, the probability of failure-free operation is
$$P(t) = e^{-\frac{(t - t^{*})^m}{t_0}}. \tag{1.40}$$
The options and variety of purposes of the Weibull distribution can be seen from the following explanations (see Fig. 1.8). For m < 1, λ(t) and f (t) from operating time to failure are decreasing functions. For m = 1, distribution becomes exponential with the λ(t) = const and f (t) is a decreasing function.
1.5 Reliability During Gradual Failures
21
For m > 1, f (t) is a single-humped function, λ(t) is the continuously increasing function for 1 < m < 2 with upward convexity, and for m > 2, it is with downward convexity. For m = 2, λ(t) is a linear function, and the Weibull distribution turns into the so-called Rayleigh distribution. At m = 3.3, the Weibull distribution is close to normal. The graphical processing of test results for the Weibull distribution is carried out as follows: we logarithm the equation for P(t) lg P(t) = −0.4343t m t0−1 . We introduce the notation y = − lg P(t). Logarithm lg y = m lg t − A, where A = lg t0 + 0.362. We can find m = tanα; lgt 0 = A − 0.362 by putting the test results on the graph for coordinates lgt–lgy (Fig. 1.9) and drawing a straight line through the obtained points. α is the angle of inclination of the straight line to the abscissa axis and A is a segment cutoff by a straight line on the ordinate axis. The reliability of a system of serially connected similar elements that have the Weibull distribution has also the Weibull distribution. Example 5 Estimate the probability of failure-free operation P(t) of roller bearings for t = 104 h, if the bearing durability is described by the Weibull distribution with the parameters t 0 = 107 h, m = 1.5.
Fig. 1.9 Graphical determination of the parameters of the Weibull distribution
Solution
$$P(t) = e^{-t^m t_0^{-1}} = e^{-(10^4)^{1.5} \cdot 10^{-7}} = e^{-0.1} = 0.905.$$
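The calculation in Example 5 and the graphical method described above can be checked numerically. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, the straight line is fitted by ordinary least squares instead of being drawn by hand, and the "test results" are synthetic points generated from the same Weibull law.

```python
import math

def weibull_reliability(t, m, t0, t_star=0.0):
    """P(t) = exp(-(t - t_star)**m / t0), Eq. (1.40); P = 1 up to t_star."""
    if t <= t_star:
        return 1.0
    return math.exp(-((t - t_star) ** m) / t0)

def fit_weibull_loglog(times, reliabilities):
    """Fit lg y = m*lg t - A with y = -lg P(t); return (m, t0)."""
    xs = [math.log10(t) for t in times]
    ys = [math.log10(-math.log10(p)) for p in reliabilities]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    a = slope * x_mean - y_mean  # the fitted line is y = slope*x - a
    lg_t0 = a + math.log10(math.log10(math.e))  # lg t0 = A - 0.362
    return slope, 10 ** lg_t0

# Example 5: t = 1e4 h, m = 1.5, t0 = 1e7
p_example5 = weibull_reliability(1e4, m=1.5, t0=1e7)

# Synthetic "test results" (assumed data) to exercise the log-log fit
sample_times = [1e3, 3e3, 1e4, 3e4]
sample_p = [weibull_reliability(t, 1.5, 1e7) for t in sample_times]
m_fit, t0_fit = fit_weibull_loglog(sample_times, sample_p)
```

With noise-free points the fit returns the original parameters; with real test data the slope and intercept are only estimates.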
1.6 Combined Action of Sudden and Gradual Failures
According to the probability multiplication theorem, the probability of failure-free operation of a product for the period $t$, if it has already worked for the time $T$, is
$$P(t) = P_s(t) \cdot P_g(t), \tag{1.41}$$
where $P_s(t) = e^{-\lambda t}$ and $P_g(t) = P_g(T + t)/P_g(T)$ are the probabilities of the absence of sudden and gradual failures, respectively. For a system of series-connected elements, the probability of failure-free operation for the period $t$ is
$$P_{sys}(t) = e^{-t \sum \lambda_i} \prod \frac{P_{ni}(T + t)}{P_{ni}(T)}, \tag{1.42}$$
where $P_{ni}$ is the probability of the absence of gradual failures of the $i$-th element. For new products, $T = 0$ and $P_{ni}(T) = 1$. Figure 1.10 shows the curves of the probability of the absence of sudden failures, of gradual failures, and the curve of the probability of no-failure operation under the combined action of sudden and gradual failures. Initially, while the rate of gradual failures is low, the combined curve follows the $P_s(t)$ curve, and then it drops sharply. During the period of gradual failures, their intensity is, as a rule, far higher than that of sudden ones.
Fig. 1.10 Combined action of sudden and gradual failures, where the upper curve P(t) is for sudden failures
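Equation (1.41) can be illustrated with a short calculation. This is a sketch under an explicit assumption that the text does not make at this point: gradual (wear) failures are taken to follow a normal distribution of operating time to failure with mean `mean` and standard deviation `std`. All function names and numbers are hypothetical.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of the normal law."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def p_gradual(total_time, mean, std):
    """Probability of no gradual failure up to the given total operating time
    (assumed normal distribution of time to wear-out)."""
    return 1.0 - normal_cdf(total_time, mean, std)

def p_combined(t, T, lam, mean, std):
    """Eq. (1.41): P(t) = exp(-lam*t) * P_g(T + t) / P_g(T)."""
    p_sudden = math.exp(-lam * t)
    p_grad = p_gradual(T + t, mean, std) / p_gradual(T, mean, std)
    return p_sudden * p_grad

# Hypothetical product: lam = 1e-5 1/h, wear life 3e4 h +/- 1e4 h, new (T = 0)
p_new = p_combined(t=1e4, T=0.0, lam=1e-5, mean=3e4, std=1e4)
# The same product after it has already worked 2e4 h
p_aged = p_combined(t=1e4, T=2e4, lam=1e-5, mean=3e4, std=1e4)
```

The aged product gives a noticeably lower value for the same period `t`, which is the sharp drop of the combined curve once gradual failures begin to dominate.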
1.7 Features of the Reliability of Recoverable Products
For non-recoverable products, only primary failures are considered; for recoverable products, both primary and repeated failures are considered. All considerations and terms for non-recoverable items apply to the primary failures of recoverable items. For recoverable products, the operating and operation schedules are indicative. The first shows the periods of operation, repair, and maintenance (inspections), and the second shows only the periods of operation. Over time, the operation periods between repairs become shorter, and the repair and maintenance periods become longer. For recoverable products, the reliability properties are characterized by $m(t)$, the average number of failures over the period $t$:
$$m(t) = \frac{1}{N} \sum n_i, \tag{1.43}$$
where $N$ is the number of tested items and $n_i$ is the number of failures of the $i$-th item during the time $t$. Statistically interpreted, the failure frequency $\Lambda(t)$ characterizes the average number of failures expected in a short time interval:
$$\Lambda(t) = \frac{\Delta m(t)}{\Delta t}, \tag{1.44}$$
where $\Delta m(t)$ is the increment in the average number of failures over the time $\Delta t$, i.e., the average number of failures between time $t$ and time $t + \Delta t$.
As is known, in the case of sudden product failures, the distribution law of the operating time to failure is exponential with intensity $\lambda$. If a failed product is replaced with a new one (recoverable product), a flow of failures is formed whose rate $\Lambda(t)$ does not depend on $t$, i.e., $\Lambda(t) = \Lambda = \text{const}$, and is equal to the intensity $\lambda$. The flow of sudden failures is assumed to be:
(1) Stationary, that is, the average number of failures per unit of time is constant;
(2) Ordinary, in which no more than one failure occurs at the same time;
(3) Without aftereffect, which means the mutual independence of the appearance of failures in different (non-overlapping) time intervals.
For a stationary, ordinary flow of failures, $\Lambda(t) = \Lambda = 1/T$, where $T$ is the mean time between failures. Gradual failures of recoverable products are of independent interest because the recovery time after gradual failures is usually significantly longer than after sudden ones. With the combined action of sudden and gradual failures, the parameters of the failure flows add up.
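Equations (1.43) and (1.44) translate directly into a few lines of code. The sketch below uses made-up failure counts for four items observed at two moments; the function names and the data are assumptions for illustration only.

```python
def mean_failures(failures_per_item):
    """Eq. (1.43): m(t) = (1/N) * sum(n_i)."""
    return sum(failures_per_item) / len(failures_per_item)

def failure_frequency(m_start, m_end, dt):
    """Eq. (1.44): Lambda(t) ~ delta m(t) / delta t."""
    return (m_end - m_start) / dt

# Hypothetical cumulative failure counts of N = 4 items
counts_at_1000h = [2, 1, 3, 2]
counts_at_1200h = [3, 1, 3, 3]

m_1000 = mean_failures(counts_at_1000h)
m_1200 = mean_failures(counts_at_1200h)
rate = failure_frequency(m_1000, m_1200, dt=200.0)  # failures per hour

# For a stationary, ordinary flow, T = 1 / Lambda
mean_time_between_failures = 1.0 / rate
```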
The flow of gradual (wear) failures becomes stationary when the operating time $t$ is much greater than the average value $T$. With a normal distribution of the operating time to failure, the failure rate increases monotonically (see Fig. 6c), while the failure frequency $\Lambda(t)$ initially increases and then begins to oscillate, with the oscillations decaying to the level $1/T$. The observed maxima of $\Lambda(t)$ correspond to the mean time to failure of the first, second, third, etc., generations. In complex products (systems), the failure frequency is defined as the sum of the failure frequencies of the component flows. Component flows can be considered by nodes or by types of devices, e.g., mechanical, hydraulic, electrical, electronic, and others: $\Lambda(t) = \Lambda_1(t) + \Lambda_2(t) + \dots$. Accordingly, the mean time between product failures (during normal operation) is $T = 1/\Lambda$, or $1/T = 1/T_1 + 1/T_2 + \dots$. The probability of no-failure operation from time $T$ to $T + t$ has an exponential distribution:
$$P(t) = e^{-\Lambda t}. \tag{1.45}$$
For a series-connected system,
$$P_{sys} = e^{-t \sum \Lambda_i}. \tag{1.46}$$
One of the main complex indicators of the reliability of a recoverable product is the utilization factor
$$\eta = \frac{T_o}{T_o + T_{t\text{-}o} + T_r}, \tag{1.47}$$
where $T_o$, $T_{t\text{-}o}$, and $T_r$ are the average durations of operation, time-out, and repair, respectively.
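The indicators of this section for a recoverable system can be computed together. The sketch below assumes three component failure flows with hypothetical rates and hypothetical average durations; none of these numbers come from the text.

```python
import math

def total_failure_frequency(component_rates):
    """Lambda = Lambda_1 + Lambda_2 + ... for the component failure flows."""
    return sum(component_rates)

def p_no_failure(t, component_rates):
    """Eq. (1.46): P_sys = exp(-t * sum(Lambda_i))."""
    return math.exp(-t * total_failure_frequency(component_rates))

def utilization_factor(t_operation, t_timeout, t_repair):
    """Eq. (1.47): eta = T_o / (T_o + T_t-o + T_r)."""
    return t_operation / (t_operation + t_timeout + t_repair)

# Hypothetical mechanical, hydraulic and electrical flows, in 1/h
rates = [1e-4, 2e-4, 5e-4]
big_lambda = total_failure_frequency(rates)
mtbf = 1.0 / big_lambda              # T = 1 / Lambda
p_100h = p_no_failure(100.0, rates)
eta = utilization_factor(900.0, 60.0, 40.0)
```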
1.8 Questions
1. What is reliability?
2. The reliability of what objects is investigated in the theory of reliability?
3. What objects are called non-recoverable from the point of view of the theory of reliability?
4. Give a definition of the serviceable state of an object.
5. Give a definition of object failure. What types of failures do you know?
6. What is the difference between random and systematic causes of failure?
7. What is the difference between sudden and gradual failures?
8. Give a definition of the failure-free state.
9. What is durability?
10. Give a definition of maintainability.
11. What is preservation?
12. List and describe the failure-free indicators.
13. List and describe the durability indicators.
14. What are complex reliability indicators?
15. Write an equation to calculate the probability of failure-free operation.
16. Write an equation to calculate the failure rate.
17. How is the probability of no-failure operation determined for a series-connected system?
18. What are the specifics of determining reliability during regular operation?
19. What are the specifics of determining reliability during gradual failures?
20. What are the specifics of determining reliability during the combined action of sudden and gradual failures?
21. Describe the features of the reliability of recoverable products.
22. Estimate the probability $P(t)$ of no-failure operation during $t = 2.5 \times 10^4$ h of a wearing movable joint if the wear duration has a normal distribution with parameters $m_t = 3 \times 10^4$ h, $S = 10^4$ h.
References
1. Bertsche B. Reliability in Automotive and Mechanical Engineering [M]. Berlin: Springer, 2008.
2. Kurenkov V, Volocuev V. Reliability of Products and Systems of Rocket and Space Industries [M]. Samara: SSASU, 2010.
3. Kapur K C, Pecht M. Reliability Engineering [M]. New York: John Wiley & Sons, 2013.
4. Choi S K, Ramana V, Grandhi R, Canfield A. Reliability-based Structural Design [M]. London: Springer, 2007.
5. Porter A. Accelerated Testing and Validation [M]. New York: Elsevier, 2004.
Chapter 2
Mathematical Basis of Reliability
2.1 Introduction
The first record of probability dates back to the sixteenth century, when Girolamo Cardano wrote a gambler's manual [1]. In this manual, Cardano studied several topics in probability. In the seventeenth century, Blaise Pascal and Pierre Fermat independently and correctly solved the problem of dividing the winnings in a game of chance. Also in the seventeenth century, Christiaan Huygens wrote the first treatise on probability, which was based on the Pascal-Fermat correspondence. Boolean algebra also has an important place in the contemporary theory of probability [1]. It is named after G. Boole, who in the nineteenth century published a pamphlet titled “The Mathematical Analysis of Logic, Being an Essay towards a Calculus of Deductive Reasoning”. This chapter presents the mathematical foundations related to reliability.
2.2 Random Events and Probabilities
Probability theory comprises mathematically based theories and approaches for studying random events. Formally, a random event occurs in connection with a random test. A random test is characterized by two features: (1) Repetitions of the test, even when carried out under similar conditions, generally have different results. (2) The possible results of the test are known. Therefore, the result of a random test cannot be predicted with confidence. However, if random tests are repeated sufficiently often under similar conditions, stochastic or statistical regularities can be found. Examples of random tests are:
© Science Press 2024 Y. Sun et al., Reliability Engineering, https://doi.org/10.1007/978-981-99-5978-5_2
(1) Counting the number of cars visiting a petrol station in a day.
(2) Counting the number of shooting stars during a fixed time interval. The possible results are, as in the previous random test, non-negative integers.
(3) Registering the daily maximum wind velocity at a given place.
(4) Registering the lifespans of technical systems.
As these cases show, the term “experiment” has a wider meaning here than in the usual sense.
A random variable is a number associated with each result of a test. For example, the outcome $n$ of rolling a die once, $1 \le n \le 6$, and a measurement $x$ of the ambient temperature are random variables. When the test is repeated, the random variable can take a different value on every attempt. There are two types of random variable: a discrete variable takes values in the set of integers (as in the case of the die), and a continuous variable takes values in the set of real numbers (as in the case of the temperature measurement). In many applications, the interest lies in whether the random variable falls in a specific set of values (in the case of the die, that the result of a roll is 2, or that it lies between 1 and 4). Such collections of values of a random variable are called events. For a repeated test, the probability of an event is defined as the limit of the relative frequency of occurrence of that event as the number of attempts becomes large:
$$P(\varepsilon) \triangleq \lim_{N_t \to \infty} \frac{N_\varepsilon}{N_t}, \tag{2.1}$$
where $N_\varepsilon$ is the number of attempts in which the event occurs and $N_t$ is the total number of attempts. This equation shows that the probability of any event always lies between 0 and 1:
$$0 \le P(\varepsilon) \le 1. \tag{2.2}$$
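The limiting-frequency definition of Eq. (2.1) can be observed in a simulation. The following sketch rolls a simulated fair die and counts how often a chosen event occurs; the function name, the fixed seed, and the trial counts are illustrative assumptions.

```python
import random

def relative_frequency(event, trials, seed=0):
    """Estimate P(event) as N_event / N_t for a simulated fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return hits / trials

def roll_exceeds_four(n):
    """The event n > 4, whose exact probability is 2/6."""
    return n > 4

freq_small = relative_frequency(roll_exceeds_four, trials=100)
freq_large = relative_frequency(roll_exceeds_four, trials=100_000)
```

The estimate from 100 000 rolls is expected to lie much closer to 1/3 than the estimate from 100 rolls, and both necessarily lie between 0 and 1, in agreement with Eq. (2.2).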
Two events are non-crossing (mutually exclusive) if the sets of values of the random variable that they contain do not overlap (Fig. 2.1). In the die rolling test, the events $n = 3$ and $n > 4$ are non-crossing. The probability of the sum of two non-crossing events $\varepsilon_1$ and $\varepsilon_2$ is
$$P(\varepsilon_1 \cup \varepsilon_2) = P(\varepsilon_1) + P(\varepsilon_2), \quad \varepsilon_1 \cap \varepsilon_2 = \varnothing. \tag{2.3}$$
For non-crossing events $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_N$ whose sum is the certain event (the collection of all possible values of the random variable), the sum of their probabilities is 1:
$$\sum_{i=1}^{N} P(\varepsilon_i) = 1. \tag{2.4}$$
Fig. 2.1 Sum of two non-crossing random events
Fig. 2.2 Sum of non-crossing events probabilities
In the die rolling test, the events $n = 1$, $n = 2$, …, $n = 6$ are non-crossing and cover all possible values of $n$. The sum of their probabilities equals 1 (Fig. 2.2). For a fair die, each of the six events has probability 1/6. For a continuous random variable $x$, the probability of any event can be determined from the Probability Density Function (PDF) [2]. The PDF evaluated at $X$ is the limit of the normalized probability that $x$ lies in the short interval $[X, X + \Delta X]$:
$$f_x(X) \triangleq \lim_{\Delta X \to 0} \frac{1}{\Delta X} P(X \le x \le X + \Delta X), \tag{2.5}$$
where the lowercase $x$ is the random variable and the uppercase $X$ is a specific value of this variable. The probability of a random event $\varepsilon$ specified by a random variable $x$ is obtained by integrating the PDF over the set of values of $x$ that define the event:
$$P(\varepsilon) = \int_{\varepsilon} f_x(X)\, dX. \tag{2.6}$$
When $x$ lies between two numbers $X_1$ and $X_2$, the probability of this event is
$$P(X_1 \le x \le X_2) = \int_{X_1}^{X_2} f_x(X)\, dX. \tag{2.7}$$
Applying Eq. (2.7) with $X_1 = X$, $X_2 = X + \Delta X$ and letting $\Delta X$ tend to zero recovers the PDF definition, Eq. (2.5).
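Equations (2.5) and (2.7) can be checked numerically for a concrete density. The sketch below assumes an exponential PDF with a hypothetical rate of 0.5 and integrates it with the trapezoidal rule; the function names are illustrative only.

```python
import math

def exp_pdf(x, lam=0.5):
    """Exponential density, used here only as a convenient test case."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def prob_between(pdf, x1, x2, steps=10_000):
    """Trapezoidal approximation of Eq. (2.7): integral of the PDF from x1 to x2."""
    h = (x2 - x1) / steps
    total = 0.5 * (pdf(x1) + pdf(x2))
    for k in range(1, steps):
        total += pdf(x1 + k * h)
    return total * h

# Eq. (2.7): P(1 <= x <= 3), to be compared with the closed form
p_numeric = prob_between(exp_pdf, 1.0, 3.0)
p_exact = math.exp(-0.5) - math.exp(-1.5)

# Eq. (2.5): the probability of a short interval divided by its length
dx = 1e-4
density_estimate = prob_between(exp_pdf, 1.0, 1.0 + dx, steps=10) / dx
```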
The PDF is commonly (but not always) a continuous function of $X$ if the random variable is continuous valued. If the random variable is discrete valued, the PDF is a weighted sum of impulses located at integer values of $X$:
$$f_n(X) = \sum_{i=-\infty}^{\infty} P_i\, \delta(X - i), \quad \text{with } P_i \triangleq P(n = i). \tag{2.8}$$
In this case, a sum can replace the integral in Eq. (2.7):
$$P(N_1 \le n \le N_2) = \sum_{i=N_1}^{N_2} P_i. \tag{2.9}$$
The PDF must have the following properties (for both discrete-valued and continuous-valued random variables):
$$f_x(X) \ge 0, \quad \int_{-\infty}^{\infty} f_x(X)\, dX = 1. \tag{2.10}$$
2.3 Numerical Features and Distributions of Random Variables
2.3.1 Mathematical Expectation
For a discrete random variable $X$ taking a finite number of values $x_i$ with probabilities $P_i$, the mathematical expectation is the sum
$$M(X) = x_1 P_1 + x_2 P_2 + x_3 P_3 + \dots + x_n P_n. \tag{2.11}$$
For a continuous random variable $X$, the mathematical expectation is the integral of the product of the values $x$ of the random variable and its density function $f(x)$:
$$M(X) = \int_{-\infty}^{\infty} x f(x)\, dx. \tag{2.12}$$
The improper integral in Eq. (2.12) is assumed to converge absolutely (otherwise, the mathematical expectation $M(X)$ does not exist).
The mathematical expectation determines the average value of the random variable $X$. The dimension of the mathematical expectation coincides with the dimension of the random variable. Properties of mathematical expectation:
$$\begin{aligned} M(cX) &= cM(X), && c \in R, \\ M(X + Y) &= M(X) + M(Y), && X \text{ and } Y \in E, \\ M(XY) &= M(X)M(Y), && \text{for independent random variables } X \text{ and } Y. \end{aligned} \tag{2.13}$$
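The properties in Eq. (2.13) can be verified exactly for a fair die by using rational arithmetic. This is a small illustrative sketch; the helper name `expectation` and the choice of two independent dice are assumptions, not part of the text.

```python
from fractions import Fraction
from itertools import product

def expectation(values, probs):
    """Eq. (2.11): M(X) = sum of x_i * P_i."""
    return sum(x * p for x, p in zip(values, probs))

faces = list(range(1, 7))
p_face = [Fraction(1, 6)] * 6
m_die = expectation(faces, p_face)

# M(cX) = c M(X) with c = 3
m_scaled = expectation([3 * x for x in faces], p_face)

# Two independent dice: every pair of faces has probability 1/36
pairs = list(product(faces, repeat=2))
p_pair = [Fraction(1, 36)] * len(pairs)
m_sum = expectation([a + b for a, b in pairs], p_pair)
m_product = expectation([a * b for a, b in pairs], p_pair)
```

Because `Fraction` keeps the arithmetic exact, the three identities hold with equality and not just to rounding error.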
2.3.2 Dispersion
The dispersion of a random variable $X$ is the number
$$D(X) = M\left[(X - M(X))^2\right] = M(X^2) - [M(X)]^2. \tag{2.14}$$
The dispersion characterizes the spread of the values of the random variable $X$ about its average value $M(X)$. The dispersion has the dimension of the square of the random variable. Using the equations for the mathematical expectation, Eq. (2.11), and the dispersion, Eq. (2.14), similar equations for the dispersion are obtained for discrete and continuous random variables:
$$D(X) = M\left[(X - m)^2\right] = \begin{cases} \displaystyle\sum_{i=1}^{n} (x_i - m)^2 p_i, & \text{discrete random variables}, \\ \displaystyle\int_{-\infty}^{\infty} (x - m)^2 f(x)\, dx, & \text{continuous random variables}. \end{cases} \tag{2.15}$$
Here $m = M(X)$. Properties of dispersion:
$$\begin{aligned} D(cX) &= c^2 D(X), && c \in R, \\ D(X + Y) &= D(X) + D(Y), && \text{for independent random variables } X \text{ and } Y. \end{aligned} \tag{2.16}$$
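Both forms of Eq. (2.14) and the properties in Eq. (2.16) can be checked on the same fair-die example. The helper names below are hypothetical, and the second die is assumed to be independent of the first.

```python
from fractions import Fraction

def expectation(values, probs):
    return sum(x * p for x, p in zip(values, probs))

def dispersion(values, probs):
    """D(X) = M[(X - m)^2], the discrete branch of Eq. (2.15)."""
    m = expectation(values, probs)
    return expectation([(x - m) ** 2 for x in values], probs)

def dispersion_from_moments(values, probs):
    """D(X) = M(X^2) - [M(X)]^2, the second form of Eq. (2.14)."""
    return expectation([x * x for x in values], probs) - expectation(values, probs) ** 2

faces = list(range(1, 7))
p_face = [Fraction(1, 6)] * 6
d_die = dispersion(faces, p_face)
d_scaled = dispersion([2 * x for x in faces], p_face)   # D(2X)

# Sum of two independent dice: 36 equally likely pairs
sums = [a + b for a in faces for b in faces]
d_sum = dispersion(sums, [Fraction(1, 36)] * 36)

sigma_die = float(d_die) ** 0.5   # square root of the dispersion
```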
2.3.3 Average Square-Law Deviation
The average square-law (standard) deviation has the same dimension as the random variable. For this reason, it is used as a measure of spread more often than the dispersion:
$$\sigma = \sqrt{D(X)}. \tag{2.17}$$
2.3.4 Moments of Distribution
Both the mathematical expectation and the dispersion are particular instances of moments of distribution. Moments of distribution are general characteristics of random variables. They are mathematical expectations of certain simple functions of the random variable. Thus, the $k$-th order moment about a point $x_0$ is the mathematical expectation $M[(X - x_0)^k]$. Moments about the origin $x = 0$ are called initial moments and are denoted
$$\alpha_k = M(X^k). \tag{2.18}$$
The center of distribution of the random variable is the initial moment of the first order:
$$\alpha_1 = M(X) = m. \tag{2.19}$$
Central moments are moments about the center of distribution $x = m$ and are denoted
$$\mu_k = M\left[(X - m)^k\right]. \tag{2.20}$$
It follows from Eq. (2.13) that the first-order central moment is always equal to zero:
$$\mu_1 = M(X - m) = M(X) - M(m) = m - m = 0. \tag{2.21}$$
Central moments do not depend on the choice of origin: shifting the values of the random variable by a constant $C$ shifts the center of distribution by the same $C$, so the deviation from the center does not change:
$$X - m = (X - C) - (m - C).$$
It follows that the second-order central moment is the dispersion:
$$D(X) = M\left[(X - m)^2\right] = \mu_2. \tag{2.22}$$
2.3.5 Asymmetry
The central moment of the third order
$$\mu_3 = M\left[(X - m)^3\right] \tag{2.23}$$
characterizes the asymmetry of the distribution. If the distribution is symmetric about the point $x = m$, the third-order central moment is zero (as are all odd-order central moments). Hence, if the third-order central moment is not equal to zero, the distribution is not symmetric. The magnitude of the asymmetry is evaluated by the dimensionless coefficient of asymmetry
$$C_s = \frac{\mu_3}{(\mu_2)^{3/2}} = \frac{M\left[(X - m)^3\right]}{[\sigma(X)]^3} = \frac{M\left[(X - m)^3\right]}{\sqrt{[D(X)]^3}}. \tag{2.24}$$
The type of asymmetry (right-hand or left-hand sided asymmetry) depends on sign of asymmetry coefficient Eq. (2.24) (Fig. 2.3).
Fig. 2.3 Types of asymmetries of distributions
Fig. 2.4 Distributions’ graphs with different grade of excess
2.3.6 Excess
The central moment of the fourth order
$$\mu_4 = M\left[(X - m)^4\right] \tag{2.25}$$
characterizes the so-called excess, which determines how sharply peaked the distribution curve is near its center in comparison with the normal distribution curve [3]. Relative to the normal distribution, the excess is
$$E = \frac{\mu_4}{[\sigma(X)]^4} - 3. \tag{2.26}$$
Distribution curves with different excesses are shown in Fig. 2.4. For the normal distribution, $E = 0$. Curves with sharper peaks than the normal distribution have $E > 0$, and curves with flatter peaks have $E < 0$. Moments of higher order are not commonly used for engineering purposes.
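The central moments, the coefficient of asymmetry of Eq. (2.24), and the excess of Eq. (2.26) can be computed for any discrete distribution with a few helper functions. The sketch below is illustrative: the function names and the two example distributions are assumptions.

```python
def central_moment(values, probs, k):
    """mu_k = M[(X - m)^k] for a discrete random variable, Eq. (2.20)."""
    m = sum(x * p for x, p in zip(values, probs))
    return sum((x - m) ** k * p for x, p in zip(values, probs))

def asymmetry(values, probs):
    """Eq. (2.24): C_s = mu_3 / mu_2^(3/2)."""
    return central_moment(values, probs, 3) / central_moment(values, probs, 2) ** 1.5

def excess(values, probs):
    """Eq. (2.26): E = mu_4 / sigma^4 - 3, with sigma^4 = mu_2^2."""
    return central_moment(values, probs, 4) / central_moment(values, probs, 2) ** 2 - 3.0

# Symmetric case: a fair die
faces = [1, 2, 3, 4, 5, 6]
p_face = [1.0 / 6.0] * 6
cs_die = asymmetry(faces, p_face)
e_die = excess(faces, p_face)

# Hypothetical distribution with a long right tail
tail_values = [0, 1, 10]
tail_probs = [0.5, 0.4, 0.1]
cs_tail = asymmetry(tail_values, tail_probs)
```

The fair die gives zero asymmetry and a negative excess (a flatter shape than the normal curve), while the long right tail gives a positive coefficient of asymmetry.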
2.3.7 Mode of Discrete Random Variable
The most probable value of a discrete random variable is its mode. The mode of a continuous random variable is the value at which the density function is maximum (Fig. 2.3). If a distribution curve has only one maximum, the distribution is unimodal. If it has two or more maxima, it is polymodal. Some distribution curves have no maximum but do have a minimum; such distributions are antimodal. In the general case, the mode and the mathematical expectation of a random variable do not coincide. In the special case of a symmetric unimodal distribution that has a mathematical expectation, the expectation coincides with the mode and with the center of symmetry of the distribution.
2.3.8 Median of Random Variable
The median $Me$ is the value for which the random variable $X$ is equally likely to be less than or greater than it, i.e., $P\{X < Me\} = P\{X > Me\}$. In other words, the median is the abscissa of the point that bisects the area under the distribution curve (Fig. 2.3). For a symmetric unimodal distribution, the mode, median, and mathematical expectation coincide.
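The mode and the median of a discrete random variable can be found as follows. This is a sketch with two stated assumptions: for a discrete variable the exact equality $P\{X < Me\} = P\{X > Me\}$ may have no solution, so the code takes the smallest value whose cumulative probability reaches 0.5 (a common convention, not one given in the text), and both example distributions are hypothetical.

```python
def mode(values, probs):
    """Most probable value of a discrete random variable."""
    best = max(range(len(values)), key=lambda i: probs[i])
    return values[best]

def median(values, probs):
    """Smallest value whose cumulative probability reaches 0.5 (assumed convention)."""
    cumulative = 0.0
    for x, p in sorted(zip(values, probs)):
        cumulative += p
        if cumulative >= 0.5:
            return x
    raise ValueError("probabilities must sum to 1")

def mean(values, probs):
    return sum(x * p for x, p in zip(values, probs))

# Symmetric distribution: mode, median and expectation coincide
sym_values, sym_probs = [0, 1, 2], [0.25, 0.5, 0.25]

# Skewed distribution: the three characteristics differ
skew_values, skew_probs = [1, 2, 3, 10], [0.4, 0.3, 0.2, 0.1]
```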
2.4 Distribution Functions of Random Variables
2.4.1 Distribution Functions for Discrete Random Variables
The distribution function of a random variable $X$ can be determined from its probability function $f(x)$. For all $x$ in $(-\infty, \infty)$,
$$F(x) = P(X \le x) = \sum_{u \le x} f(u), \tag{2.27}$$
where the sum is taken over all values $u$ taken on by $X$ for which $u \le x$. If $X$ takes only a finite number of values $x_1, x_2, \dots, x_n$, then the distribution function is
$$F(x) = \begin{cases} 0, & -\infty < x < x_1, \\ f(x_1), & x_1 \le x < x_2, \\ f(x_1) + f(x_2), & x_2 \le x < x_3, \\ \quad\vdots & \quad\vdots \\ f(x_1) + \dots + f(x_n), & x_n \le x < \infty. \end{cases} \tag{2.28}$$
Example 2.1 Suppose that a coin is flipped twice, so that the sample space is S = {HH, HT, TH, TT}. Let X be the number of heads that occur. To every outcome we can assign a value of X (Table 2.1). (1) Find the distribution function for the random variable X; (2) Obtain its graph.
Table 2.1 X values for Example 2.1
Case | HH | HT | TH | TT
X | 2 | 1 | 1 | 0
Solution
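The probability function for the random variable $X$ follows from $P(HH) = 1/4$, $P(HT) = 1/4$, $P(TH) = 1/4$, $P(TT) = 1/4$. The probability function is presented in Table 2.2.
(1) The distribution function has the form
$$F(x) = \begin{cases} 0, & -\infty < x < 0, \\ 1/4, & 0 \le x < 1, \\ 3/4, & 1 \le x < 2, \\ 1, & 2 \le x < \infty. \end{cases}$$
[…]
$$f(t) = \frac{b\, t^{b-1}}{\theta^b}\, e^{-(t/\theta)^b}, \quad t \ge 0,\ b > 0,\ \theta > 0, \tag{2.46}$$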
where $b$ and $\theta$ are the shape and scale parameters. Applying Eqs. (2.28) and (2.46), we obtain the following cumulative distribution function (CDF):
$$F(t) = 1 - e^{-(t/\theta)^b}. \tag{2.47}$$
Substitution of Eq. (2.46) into Eq. (2.30) gives
$$E(t) = \theta\, \Gamma\left(1 + \frac{1}{b}\right). \tag{2.48}$$
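Equations (2.47) and (2.48) are straightforward to evaluate with the standard library. The sketch below uses hypothetical parameter values and function names; it also checks the mean for $b = 1$ and $b = 2$, where it reduces to $\theta$ and to $\theta\sqrt{\pi}/2$ respectively.

```python
import math

def weibull_cdf(t, b, theta):
    """Eq. (2.47): F(t) = 1 - exp(-(t/theta)**b) for t >= 0."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / theta) ** b))

def weibull_mean(b, theta):
    """Eq. (2.48): E(t) = theta * Gamma(1 + 1/b)."""
    return theta * math.gamma(1.0 + 1.0 / b)

theta = 500.0  # hypothetical scale parameter, hours

mean_b1 = weibull_mean(1.0, theta)   # b = 1
mean_b2 = weibull_mean(2.0, theta)   # b = 2

# At t = theta the CDF equals 1 - 1/e regardless of the shape parameter
f_at_theta = [weibull_cdf(theta, b, theta) for b in (0.5, 1.0, 2.0, 3.3)]
```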
For $b = 1$ and $b = 2$, the exponential and Rayleigh distributions are special cases of this distribution.
6. Log-normal distribution
The log-normal distribution is widely used in probabilistic analysis for engineering purposes. This is because many engineering quantities cannot, by their nature, take negative values. The distribution is suitable for describing fatigue failure, failure rates, and other phenomena involving a wide range of data, for example, time to failure, strength of material, and load variables.
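A situation may arise in reliability analysis where a random variable $x$ is the product of several random variables $x_i$: $x = \prod_{i=1}^{n} x_i$. Taking the natural logarithm of both sides, we get
$$\ln x = \ln x_1 + \ln x_2 + \dots + \ln x_n. \tag{2.49}$$
Since $\ln x$ is a sum of random variables, $y = \ln x$ is approximately normally distributed, with mean $\mu_Y$ and standard deviation $\sigma_Y$. Thus, the probability density function of $y$ is determined by the equation
$$f_y = \frac{1}{\sigma_Y \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right], \quad -\infty < y < \infty. \tag{2.50}$$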
Applying Eqs. (2.28) and (2.50), we obtain the following CDF:
$$F_x(x) = \frac{1}{\sigma_Y \sqrt{2\pi}} \int_{0}^{x} \frac{1}{u} \exp\left[-\frac{(\ln u - \mu_Y)^2}{2\sigma_Y^2}\right] du. \tag{2.51}$$
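The integral in Eq. (2.51) has no elementary closed form, but it can be expressed through the error function, because $\ln x$ is normally distributed. The sketch below compares that expression with a direct trapezoidal evaluation of the integral. The function names, the parameter values, and the small positive lower limit used in place of 0 are assumptions.

```python
import math

def lognormal_pdf(x, mu_y, sigma_y):
    """Integrand of Eq. (2.51), including the 1/(sigma*sqrt(2*pi)) factor."""
    z = (math.log(x) - mu_y) / sigma_y
    return math.exp(-0.5 * z * z) / (x * sigma_y * math.sqrt(2.0 * math.pi))

def lognormal_cdf(x, mu_y, sigma_y):
    """CDF via the normal law of y = ln x, written with the error function."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu_y) / sigma_y
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_cdf_numeric(x, mu_y, sigma_y, steps=20_000):
    """Trapezoidal evaluation of Eq. (2.51), starting just above zero."""
    lower = 1e-9
    h = (x - lower) / steps
    total = 0.5 * (lognormal_pdf(lower, mu_y, sigma_y) + lognormal_pdf(x, mu_y, sigma_y))
    for k in range(1, steps):
        total += lognormal_pdf(lower + k * h, mu_y, sigma_y)
    return total * h

cdf_closed = lognormal_cdf(2.0, mu_y=0.0, sigma_y=0.5)
cdf_numeric = lognormal_cdf_numeric(2.0, mu_y=0.0, sigma_y=0.5)
```

At $x = e^{\mu_Y}$ the CDF equals 0.5, so $e^{\mu_Y}$ is the median of the log-normal variable.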
2.5 Questions
1. What did Girolamo Cardano write? When did it happen?
2. What are the specifics of random tests?
3. Give examples of random tests.
4. What is a random variable?
5. Write down an equation for determining the probability of a random event based on the limits.
6. What events are called non-crossing events?
7. What is the sum of the probabilities of the full complex of non-overlapping random events?
8. What is the mathematical expectation?
9. Write an equation for the dispersion.
10. Write an equation for the average square-law deviation.
11. What are the moments of distribution?
12. Write an equation for the asymmetry.
13. Write an equation for the excess.
14. What is the mode of a discrete random variable?
15. What is the median of a random variable?
16. What are the differences between distribution functions for discrete random variables and continuous random variables?
17. Describe the specifics of the Binomial distribution.
18. Describe the specifics of the Poisson distribution.
19. Describe the specifics of the Exponential distribution.
20. Describe the specifics of the Normal distribution.
21. Describe the specifics of the Weibull distribution.
22. Describe the specifics of the log-normal distribution.
References
1. Dhillon D. Reliability, Quality, and Safety for Engineers [M]. London: CRC, 2004.
2. Whitesitt E. Boolean Algebra and its Applications [M]. London: Addison-Wesley, 1962.
3. Sahoo P. Probability and Mathematical Statistics [M]. Louisville: Department of Mathematics University of Louisville, 2013.
4. Murphy E. Skewness and asymmetry of distributions [J]. Metamedicine, 1982, 3: 87-99.
Chapter 3
System Reliability Models
3.1 Introduction
When the end-of-life condition occurs, a system is either repaired or scrapped. In this context, technical systems are divided into recoverable and non-recoverable. A technical system is called non-recoverable (non-repairable) if its failure leads to irreversible consequences and the system can no longer be used for its intended purpose. Operation of a non-recoverable system after failure is considered impossible or impractical. Rolling bearings, semiconductor products, gears, and other components are non-recoverable elements of a technical system. A recoverable (repairable) system is a system that can continue to perform its functions once the failure that stopped its functioning has been repaired. Here, recovery of the system means not only the repair of certain elements, but also the complete replacement of failed elements with new ones. Objects consisting of many elements, such as a machine tool, a car, or electronic equipment, are usually recoverable, since their failure is related to damage to one or a few elements, which can be replaced. Sometimes the same object can be considered repairable or non-repairable, depending on its characteristics, stage of operation, or purpose.
© Science Press 2024 Y. Sun et al., Reliability Engineering, https://doi.org/10.1007/978-981-99-5978-5_3
3.2 Non-repair System Reliability Analysis
3.2.1 Classification of System Reliability Function
Ensuring a high level of reliability of modern complex systems and devices, especially products of the aviation industry, relies on methods for quantitatively assessing the reliability of systems and devices from the known reliability values of their elements. A system is understood as any device consisting of parts (subsystems) or elements whose reliability is known. Complex systems are divided into subsystems. From the point of view of reliability, systems can be series, parallel, or combined. The system reliability is calculated on the basis of the system reliability function, which is the algorithm for calculating the quantitative value of the system reliability from the reliability of the elements. In the general case, the reliability function can take into account the way the elements are connected, the possible types of failures or possible ways of normal functioning, the dependence of the elements in terms of reliability, the sequence of element failures, and the time of occurrence of failures. There are various ways of compiling reliability functions. The complexity of the compilation methods increases with the number of factors considered by the system reliability function (Fig. 3.1).
A series connection of elements for reliability is a connection in which the necessary and sufficient condition for failure-free operation of the system is the failure-free operation of all its elements (logical “and”).
A parallel connection of elements for reliability is a connection in which the necessary and sufficient condition for failure-free operation of the system is the failure-free operation of at least one element (logical “or”); the system fails only when all its elements fail.
A mixed (combined) connection of elements for reliability is a connection in which some elements are connected in series and others in parallel.
Fig. 3.1 Reliability functions of non-recoverable systems
Fig. 3.2 Diagrams of failure-free operation: a for series events (elements); b for parallel events (elements); c for combined events (elements)
Connections of elements for reliability are represented by structural diagrams, or diagrams of failure-free operation (Fig. 3.2). The block diagram is a graphical representation of the system that displays its structural properties, i.e., the ways in which elements or events are connected. The concept of a connection in structural diagrams differs from the similar concept in electrical circuits, schematics, and wiring diagrams: it does not reflect a real connection, only a conditional one. It is therefore necessary to consider the possible types of failures, which are the events in this case.
3.2.2 Analysis of the Reliability Function
The approach of block diagrams is the simplest. The real system is represented by a block diagram consisting of events of failure-free operation of elements (a failure-free operation diagram), made up of series and parallel connections of elements. To determine the system's reliability from the structural diagram, i.e., to write down the system reliability function, the diagram is divided into parts in which the elements are connected only in series or only in parallel. Then it is determined how these parts are connected to each other (in series or in parallel), and the parts are combined into larger ones until the whole system has been analyzed. The reliability of a part consisting of elements connected only in series or only in parallel is easily determined if the reliability of its elements is known. The conditions for determining reliability by the approach of structural diagrams are as follows:
(1) The events presented in the system failure-free operation diagram must be independent.
(2) One and the same event can be displayed on the failure-free operation diagram only once in the form of one link, i.e., the ordinariness of the links must be observed.
(3) The system is considered to consist only of single-failure elements.
(4) There should be no event in the failure-free operation diagram that is the negation of another event.
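The reduction of a structural diagram into series and parallel parts can be sketched in a few lines. The example assumes independent elements, uses the standard product rules for series and parallel groups, and evaluates a hypothetical combined diagram; the element reliabilities are made up for illustration.

```python
from math import prod

def series(*reliabilities):
    """All elements must work: the product of the element reliabilities."""
    return prod(reliabilities)

def parallel(*reliabilities):
    """At least one element must work: 1 minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical combined diagram:
# element A (0.99) in series with a parallel pair B, C (0.90 each),
# followed in series by a parallel group of three elements (0.80 each)
pair_bc = parallel(0.90, 0.90)
group_of_three = parallel(0.80, 0.80, 0.80)
r_system = series(0.99, pair_bc, group_of_three)
```

Each call collapses one purely series or purely parallel part into a single equivalent element, which mirrors the step-by-step merging of parts described above.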
A typical procedure for determining reliability by the approach of structural diagrams follows this sequence:
(1) Determination of the main functions of the system and analysis of dependencies between the elements (when studying the schematic and wiring diagrams).
(2) Drawing up a structural (design) diagram of the event of failure-free operation of the system and its elements.
(3) Drawing up calculation equations.
(4) Collection and processing of information about failures of system elements.
(5) Quantification of the probability of failure-free operation of the system.
(6) Conclusions and recommendations.
1. Series system
A system consisting of independent elements, functionally connected in such a way that the failure of any one of them causes a system failure, is represented by a structural diagram of failure-free operation with series-connected events of failure-free operation of the system elements (Fig. 3.2a). The probability of failure-free operation of such a system is determined by the equation (based on the multiplication theorem for compatible and independent events)
$$R_s = P_s(t) = \prod_{i=1}^{n} P_i(t) = \prod_{i=1}^{n} R_i(t), \tag{3.1}$$
where $n$ is the number of elements or subsystems, and $P_i(t) = R_i(t)$ is the probability of failure-free operation (reliability) of the $i$-th element or subsystem. If the probability of failure-free operation of the $i$-th element follows an exponential law (i.e., the elements have constant failure rates) and is determined by the equation
$$P_i(t) = e^{-\lambda_i t}, \tag{3.2}$$
then, taking into account Eqs. (3.1) and (3.2), the system's reliability also follows an exponential law
$$R_s = P_s(t) = e^{-\sum_{i=1}^{n} \lambda_i t} = e^{-\lambda_s t}, \tag{3.3}$$
i.e., when elements are connected in series, the failure rates add up
$$\lambda_s = \sum_{i=1}^{n} \lambda_i = \text{const}, \tag{3.4}$$
and the reliability of the system never exceeds the reliability of the least reliable element. The mean time between failures $T_{0s}$ of the system is defined as
$$T_{0s} = \int_{0}^{\infty} e^{-\sum_{i=1}^{n} \lambda_i t}\, dt = \left(\sum_{i=1}^{n} \lambda_i\right)^{-1}. \tag{3.5}$$
The resulting equation shows that the mean time of failure-free operation of a system with a series connection of elements is the reciprocal of the sum of the failure rates of the separate elements. If all elements of the system have the same reliability function $P_i(t) = P(t)$, then $P_s(t) = [P(t)]^n$. For the exponential law, $\lambda_s = n\lambda$ and $T_{0s} = T_0/n$, where $T_{0s}$ is the mean time between failures of the system and $T_0 = 1/\lambda$ is that of a single element.
Example 3.1 For the operation of a system (with a series connection of elements) at full load, two different types of pumps are required. The pumps have constant failure rates of $\lambda_1 = 0.0001$ h$^{-1}$ and $\lambda_2 = 0.0002$ h$^{-1}$, respectively. Determine the mean time of failure-free operation of the system and the probability of its failure-free operation for 100 h.
Solution Using Eq. (3.3), we find the probability of failure-free operation $R_s$ of the system for 100 h:
$$R_s(t) = e^{-(\lambda_1 + \lambda_2)t}, \quad R_s(100) = e^{-(0.0001 + 0.0002) \cdot 100} = e^{-0.03} = 0.9704.$$
Using Eq. (3.5), we obtain
$$T_{0s} = \frac{1}{\lambda_1 + \lambda_2} = \frac{1}{0.0001 + 0.0002} = 3333.3 \text{ h}.$$
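The numbers of Example 3.1 can be reproduced with a direct implementation of Eqs. (3.3) and (3.5). The function names are hypothetical; the failure rates are those given in the example.

```python
import math

def series_reliability(t, failure_rates):
    """Eq. (3.3): R_s(t) = exp(-sum(lambda_i) * t) for constant failure rates."""
    return math.exp(-sum(failure_rates) * t)

def series_mtbf(failure_rates):
    """Eq. (3.5): T_0s = 1 / sum(lambda_i)."""
    return 1.0 / sum(failure_rates)

pump_rates = [0.0001, 0.0002]          # 1/h, from Example 3.1
r_100h = series_reliability(100.0, pump_rates)
t_0s = series_mtbf(pump_rates)

# The system is no more reliable than its least reliable pump
r_weakest = math.exp(-max(pump_rates) * 100.0)
```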
2. Parallel system
A system consisting of independent elements, functionally connected so that the system fails only when all elements fail, is represented by a structural diagram with parallel-connected events (Fig. 3.2b). In this case, all elements of the system are functioning and under load, and the failures of the elements are statistically independent. The failure-free operation (reliability) $R_s(t) = P_s(t)$ of a system with a parallel connection of dissimilar elements is defined as
$$R_s(t) = 1 - \prod_{i=1}^{n} \left(1 - P_i(t)\right) = 1 - \prod_{i=1}^{n} \left(1 - R_i(t)\right), \tag{3.6}$$
where $n$ is the number of parallel-connected elements, and $P_i(t) = R_i(t)$ is the reliability of the $i$-th element or subsystem. If all elements have the same reliability ($P_i(t) = R$), then Eq. (3.6) takes the form
$$R_s = 1 - (1 - R)^n. \tag{3.7}$$
Thus, the reliability of a system of parallel-connected elements with independent failures is always higher than that of its most reliable element. If the failure rates are constant ($\lambda_i = \text{const}$), then substituting Eq. (3.2) into Eq. (3.6), we obtain

$$R_s(t) = 1 - \prod_{i=1}^{n}\left(1 - e^{-\lambda_i t}\right). \tag{3.8}$$
Therefore, even if the reliability of each element follows an exponential law, the reliability of the system does not. The mean time of system failure-free operation (the mean time between failures of the system) $T_{0s}$ is determined by integrating Eq. (3.8) over the interval $[0, \infty)$. For identical elements ($\lambda_i = \lambda$),

$$T_{0s} = \int_0^{\infty} R_s(t)\,dt = \int_0^{\infty}\left[1 - \left(1 - e^{-\lambda t}\right)^n\right]dt. \tag{3.9}$$
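Let’s replace the variables

$$1 - e^{-\lambda t} = x, \quad t = \frac{1}{\lambda}\ln\frac{1}{1-x}, \quad dt = \frac{dx}{\lambda(1-x)}, \quad \text{for } t = 0,\ x = 0;\ \ t = \infty,\ x = 1,$$

then

$$T_{0s} = \frac{1}{\lambda}\int_0^1 \frac{1-x^n}{1-x}\,dx = \frac{1}{\lambda}\int_0^1 \left(1 + x + \cdots + x^{n-1}\right)dx = \frac{1}{\lambda}\left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) = \frac{1}{\lambda}\sum_{i=1}^{n}\frac{1}{i}. \tag{3.10}$$

For large n

$$T_{0s} \approx \frac{1}{\lambda}(\ln n + 0.577). \tag{3.11}$$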
Thus, the mean time between failures of the system is always greater than the mean time between failures of its elements.

Example 3.2 Two engines of the same type operate in a system with redundancy: if one of them fails, the other is able to carry the full system load. Find the mean time between failures and the system reliability over 400 h (the duration of the task), provided that the failure rate of the engines is constant and equal to $\lambda = 0.0005\ \text{h}^{-1}$, the engine failures are statistically independent, and both engines start to operate at time t = 0.
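Solution In the case of identical elements, Eq. (3.8) with n = 2 takes the form

$$R_s(t) = 2e^{-\lambda t} - e^{-2\lambda t}.$$

Since $\lambda = 0.0005\ \text{h}^{-1}$ and t = 400 h, then

$$R_s(400) = 2e^{-0.0005\cdot 400} - e^{-2\cdot 0.0005\cdot 400} = 2e^{-0.2} - e^{-0.4} = 0.9671.$$

The mean time between failures is determined by Eq. (3.10):

$$T_{0s} = \frac{1}{\lambda}\left(1 + \frac{1}{2}\right) = \frac{3}{2}\cdot\frac{1}{\lambda} = \frac{3}{2\cdot 0.0005} = 3000\ \text{h}.$$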
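Equations (3.8), (3.10), and (3.11) can be evaluated with a few lines of code. This is a minimal sketch under the assumptions of the text (independent elements, constant failure rates); the function names are illustrative, and the constant 0.577 of Eq. (3.11) is written with more digits as the Euler-Mascheroni constant.

```python
import math

EULER_GAMMA = 0.5772156649  # the constant rounded to 0.577 in Eq. (3.11)

def parallel_reliability(failure_rates, t):
    """R_s(t) = 1 - prod(1 - exp(-lambda_i * t)), Eq. (3.8)."""
    q = 1.0
    for lam in failure_rates:
        q *= 1.0 - math.exp(-lam * t)
    return 1.0 - q

def parallel_mean_time(lam, n):
    """Exact T_0s for n identical elements, Eq. (3.10)."""
    return sum(1.0 / i for i in range(1, n + 1)) / lam

def parallel_mean_time_approx(lam, n):
    """Large-n approximation of T_0s, Eq. (3.11)."""
    return (math.log(n) + EULER_GAMMA) / lam

lam = 0.0005  # engine failure rate from Example 3.2, 1/h
print(round(parallel_reliability([lam, lam], 400.0), 4))  # 0.9671
print(round(parallel_mean_time(lam, 2), 1))               # 3000.0
for n in (2, 10, 100):
    # exact harmonic sum versus logarithmic approximation
    print(n, round(parallel_mean_time(lam, n)), round(parallel_mean_time_approx(lam, n)))
```

The loop shows how the approximation of Eq. (3.11) approaches the exact sum as n grows, and how slowly the mean time increases with each added parallel element.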
3. Combined system The reliability of a system consisting of combined-connected independent elements (Fig. 3.2c) is determined by the reliability of its two series-connected parts: the first part consists of series-connected elements 1, 2, 3, and the second part consists of parallel-connected elements 4, 5, 6. The reliability of the first part is $R_{1p} = R_1 R_2 R_3$, and the reliability of the second part is $R_{2p} = 1 - (1 - R_4)(1 - R_5)(1 - R_6)$. The reliability of the whole system is

$$R_s = R_{1p} R_{2p} = R_1 R_2 R_3\,[1 - (1 - R_4)(1 - R_5)(1 - R_6)].$$

Combined systems are systems with partial redundancy.
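A combined diagram can be evaluated by composing two small helper functions, one for series and one for parallel connection. The sketch below is only an illustration: the helper names and the element reliabilities are hypothetical values chosen for the example, not data from the text.

```python
from math import prod

def series(*reliabilities):
    """Reliability of independent elements connected in series."""
    return prod(reliabilities)

def parallel(*reliabilities):
    """Reliability of independent elements connected in parallel."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical element reliabilities R1..R6 for the layout of Fig. 3.2c
r = {1: 0.99, 2: 0.98, 3: 0.97, 4: 0.90, 5: 0.85, 6: 0.80}

r_first = series(r[1], r[2], r[3])      # series part, elements 1-3
r_second = parallel(r[4], r[5], r[6])   # parallel part, elements 4-6
r_system = series(r_first, r_second)    # the two parts are in series
print(round(r_system, 4))
```

Because the helpers accept any number of arguments and return plain numbers, more deeply nested series-parallel diagrams can be evaluated by nesting the calls in the same way.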
3.2.3 Redundancy System Reliability 1. General description Redundancy is the use of additional tools and/or capabilities to maintain the operational state of an object in the event of failure of one or more of its elements. Redundancy of elements, units, and subsystems of an object is a powerful means of increasing the reliability and fail-safety of any designed systems, including aircraft. For redundancy, elements are divided into main and redundant. The main element is an element of the object’s structure, the minimum necessary element for the object to perform the specified functions. Redundant element (reserve) is an element designed to ensure the operability of an object in case of failure of the main element. Depending on the mode of operation of the redundant element, there are loaded redundancy, reduced redundancy, and inactive redundancy.
The redundancy multiplicity is the ratio of the number of redundant elements of the object to the number of main elements they back up, expressed as an irreducible fraction. Duplication is one-to-one redundancy. The concepts of loaded, reduced, and inactive redundancy are used to distinguish redundant elements of the same type in terms of their reliability, durability, and preservation. A loaded redundancy contains one or more redundant elements that operate in the same mode as the main element. The elements of the loaded reserve therefore have the same level of reliability, durability, and preservation as the main elements of the object. With loaded redundancy, the division of elements or subsystems into main and redundant is conditional. For example, in the case of constant duplication, as long as both identical subsystems are in operation, it is impossible to say which is the main one and which is the redundant one. In this case, the subsystems are simply given separate serial numbers or other designations (green, blue, etc.). Reduced redundancy includes one or more redundant elements that operate in a less loaded mode than the main element. The elements of reduced redundancy therefore have a higher level of reliability, durability, and preservation. In inactive redundancy, the elements remain in the inactive mode until they start to perform the functions of the main element. It is conventionally assumed that the elements of inactive redundancy, while in this state, never fail and do not reach the limit state. Redundancy can be applied to components or to the whole system. To evaluate these types of redundancy, consider several cases of parallel-series connection of events in the structural diagram.

Example 3.3 There are m parallel lines with n identical events in each (Fig. 3.3). We set P(A) = P.
The probability of failure-free operation of each line is $P_l = P^n$, so that $Q_l = 1 - P^n$, and for the whole system $P_s = 1 - (1 - P^n)^m = 1 - Q_l^m$. For m → ∞, $P_s → 1$, i.e., an increase in the number of parallel lines or elements increases the reliability of the whole system. For n → ∞, $P_s → 0$. For n → ∞ and m → ∞ simultaneously (e.g., m = n), $P_s → 0$.
Fig. 3.3 Fail-safe diagram for m parallel lines of n series events (whole system redundancy)
Fig. 3.4 Fail-safe diagram for n series groups of m parallel events (component redundancy)
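The two layouts of Figs. 3.3 and 3.4 use the same number of elements, n·m, so they can be compared directly. The following minimal sketch uses the closed forms $P_s = 1 - (1 - P^n)^m$ for whole system redundancy and $P_s = [1 - (1 - P)^m]^n$ for component redundancy; the function names and the element reliability P = 0.9 are assumptions made for illustration.

```python
def whole_system_redundancy(p, n, m):
    """m parallel lines, each of n series elements (Fig. 3.3)."""
    return 1.0 - (1.0 - p ** n) ** m

def component_redundancy(p, n, m):
    """n series groups, each of m parallel elements (Fig. 3.4)."""
    return (1.0 - (1.0 - p) ** m) ** n

p = 0.9  # hypothetical reliability of a single element
for n, m in [(2, 2), (5, 2), (5, 3), (10, 3)]:
    print(n, m,
          round(whole_system_redundancy(p, n, m), 4),
          round(component_redundancy(p, n, m), 4))
```

For n = m = 2 the sketch gives 0.9639 for whole system redundancy and 0.9801 for component redundancy, and the gap widens as n grows.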
Example 3.4 There are n groups, each of which consists of m parallel-connected events (Fig. 3.4). The probability of failure-free operation of each group is $P_g = 1 - (1 - P)^m$, and for the whole system $P_s = P_g^n = [1 - (1 - P)^m]^n$. For m → ∞, $P_s → 1$. For n → ∞, $P_s → 0$. For m → ∞ and n → ∞ simultaneously (e.g., m = n), $P_s → 1$. Hence, we can conclude that, as the number of redundant elements increases, component redundancy improves the failure-free state of the system more than whole system redundancy does. Redundancy can be structural, functional, temporal, and load. Important systems whose failure affects safety are sometimes backed up by emergency systems, which are switched on only in case of failure of the major system. Emergency systems (for emergency lighting, landing gear, etc.) are designed for one-time operation and have a simpler structure and lower weight than the major systems. The set of major and emergency subsystems, together with a device that switches on the emergency subsystem after the failure of the major one, is a redundant replacement.

2. System failure-free state for various types of failures Different types of failures of the same element have different effects on the reliability of the system. In the case of a parallel functional connection of two electrical elements, a short-circuit failure in either element results in the failure of the entire system, whereas an open-circuit failure in one of the elements causes the failure of only that element and the system remains operational. When two elements are connected in series, either type of failure (short circuit or open circuit) in any element will cause the system to fail. To find the analytical dependence of the failure-free state of the system on the failure-free state of the elements for different types of their failures, the following method is used. Each element that has two types of failure is represented as two functional elements, each of which has only one type of failure. Then a system of two parallel-connected elements can be represented by a failure-free operation diagram consisting of parallel-connected events for open circuit and series-connected events for short circuit (Fig. 3.5).
Fig. 3.5 Diagram of the failure-free operation of a system of two redundant elements with two types of failure: open circuit and short circuit
3. Inactive redundancy In this type of redundancy, one element is under load and the remaining n elements are used as inactive redundancy. Unlike a system with elements connected in parallel, where all elements are operating, the redundant elements here are inactive. The probability of failure-free operation of a system consisting of n + 1 elements (one is operating and the n others are in a state of inactive redundancy until the operating element fails) is determined as

$$R(t) = \sum_{i=0}^{n}\frac{(\lambda t)^i e^{-\lambda t}}{i!}. \tag{3.12}$$
This equation is valid under the following conditions:
(1) The switching device is ideal.
(2) All elements are identical.
(3) The failure rates of the elements are constant.
(4) Elements under redundancy have the same performance as redundant elements.
(5) Element failures are statistically independent.
Example 3.5 The system consists of two identical devices, one operating and one in inactive redundancy. The failure rates of both devices are constant. It is required to determine the probability of failure-free operation of the system for 100 h, provided that the failure rate of the devices is $\lambda = 0.001\ \text{h}^{-1}$.

Solution From Eq. (3.12) with n = 1,

$$R(100) = (1 + \lambda t)e^{-\lambda t} = (1 + 0.001\cdot 100)e^{-0.1} = 0.9953.$$
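Equation (3.12) can be evaluated directly for any number of inactive spares. The sketch below is illustrative (the function name is an assumption) and relies on the five conditions listed above, in particular an ideal switching device. For comparison it also evaluates two loaded parallel elements with the same failure rate.

```python
import math

def standby_reliability(lam, t, n_spares):
    """Eq. (3.12): one operating element plus n_spares inactive spares."""
    x = lam * t
    return math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(n_spares + 1))

lam, t = 0.001, 100.0  # data of Example 3.5
r_standby = standby_reliability(lam, t, 1)
r_loaded = 1.0 - (1.0 - math.exp(-lam * t)) ** 2  # two loaded parallel elements, Eq. (3.7)
print(round(r_standby, 4))  # 0.9953
print(round(r_loaded, 4))   # 0.9909
```

Under these idealized assumptions the inactive spare gives a higher reliability than the loaded one, because it does not age while waiting.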
4. System with k serviceable elements of n elements This type of redundancy is usually implemented when the system can operate only if at least a certain number of devices remain operational. Particular cases of this system for k = n and k = 1 are, respectively, systems with series and parallel connection of elements. The probability of failure-free operation of such a system is found using the binomial distribution. For a system that remains operational as long as at least k of its n independent and identical elements operate, it has the form

$$R_{k/n} = \sum_{i=k}^{n} C_n^i R^i (1 - R)^{n-i}. \tag{3.13}$$
With a constant failure rate λ of elements, this equation takes the form

$$R_{k/n}(t) = \sum_{i=k}^{n} C_n^i \left(e^{-\lambda t}\right)^i \frac{\left(1 - e^{-\lambda t}\right)^n}{\left(1 - e^{-\lambda t}\right)^i}. \tag{3.14}$$
Example 3.6 Suppose that k = 2, n = 3 and $\lambda = 0.001\ \text{h}^{-1}$. Then the probability of failure-free operation of the system for 200 h is

$$R_{k/n}(200) = 3e^{-0.001\cdot 200\cdot 2}\cdot\frac{\left(1 - e^{-0.001\cdot 200}\right)^3}{\left(1 - e^{-0.001\cdot 200}\right)^2} + e^{-0.001\cdot 200\cdot 3} = 3e^{-0.001\cdot 200\cdot 2} - 2e^{-0.001\cdot 200\cdot 3} = 0.9133.$$
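The binomial sum of Eq. (3.13) is straightforward to compute. The following minimal sketch (the function name is an assumption) evaluates a 2-out-of-3 system with element reliability $R = e^{-0.001\cdot 200}$ and checks the two limiting cases mentioned in the text.

```python
import math

def k_out_of_n_reliability(k, n, r):
    """Eq. (3.13): at least k of n independent identical elements operate."""
    return sum(math.comb(n, i) * r ** i * (1.0 - r) ** (n - i)
               for i in range(k, n + 1))

r = math.exp(-0.001 * 200.0)  # element reliability at t = 200 h
print(round(k_out_of_n_reliability(2, 3, r), 4))  # 0.9133

# Limiting cases: k = n is a series system, k = 1 is a parallel system
print(math.isclose(k_out_of_n_reliability(3, 3, r), r ** 3))
print(math.isclose(k_out_of_n_reliability(1, 3, r), 1.0 - (1.0 - r) ** 3))
```

Both comparisons print True, which confirms that the series and parallel formulas are special cases of the k-out-of-n model.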
3.2.4 Methods of the Reliability Function 1. Structural-logical approach When computing the reliability of a system requires taking into account not only the type of connection but also the possible modes of failure-free operation (or failure), the approach of structural-logical diagrams is used to compile the reliability function of the system. This approach takes into account how the elements are connected with each other (series, parallel, combined), i.e., the structure of the system, as well as the peculiarities of its operation and its possible failure-free states, described by means of mathematical logic. The approach of structural-logical diagrams creates a mathematical model of the reliability of a real operating system. The structural diagram takes into account the fact that a real system can have several ways of failure-free operation, which cannot be ignored in calculations if reality is to be reflected adequately. To illustrate the principle of drawing one type of structural-logical diagram, let us consider an example from the field of spacecraft design: hot separation of rocket stages. Let two explosive bolts (explosive bolts 1 and 2) connect two parts of the rocket (the first and second rocket stages). At the moment of separation of the first stage from the second, an electric current pulse (+, –) is applied to the explosive bolts to destroy them by means of the gunpowder located in the body of each bolt (Fig. 3.6) [1]. Either explosive bolt can fail, and both can fail together. The operating engines of the second stage create a force N, which with a certain probability can destroy one failed explosive bolt or both failed explosive bolts.
Fig. 3.6 Rocket separation system: 1, first stage; 2, second stage; 3, first explosive bolt; 4, second explosive bolt; 5, electric current source
It is necessary to find out how to account for this additional probability of destruction of the explosive bolts by external forces when determining the reliability of stage separation. It should be noted that the same explosive bolts turn out to be connected in parallel in terms of reliability if we consider their operation before stage separation, when they perform the function of joint elements. In that case, the failure of both explosive bolts is necessary for the failure of the joint between the two stages. Let us denote by $Y_{re}$ the random event that the rocket separation system operates without failure. By $A_1$ and $A_2$ we denote the random events that the first and second explosive bolts, respectively, operate normally by means of the electric current (they are destroyed when an electric current pulse is applied). By $B_1$ and $B_2$ we denote the random events that explosive bolts 1 and 2, respectively, are destroyed by external forces P–N. The structural-logical diagram of the rocket separation is shown in Fig. 3.7, where the structural part shows the type of connection of the explosive bolts in terms of reliability ($A_1$ and $A_2$ in series), and the logical part introduces some phantom elements into the diagram ($A_1$, $A_2$, $B_1$, $B_2$) and indicates how they are connected with each other and in the overall scheme.

Fig. 3.7 Structural-logical diagram of the rocket separation system

Then we have

$$Y_{re} = (A_1 \cap A_2) \cup (\bar{A}_1 \cap A_2 \cap B_1) \cup (A_1 \cap \bar{A}_2 \cap B_2) \cup (\bar{A}_1 \cap \bar{A}_2 \cap B_1 \cap B_2), \tag{3.15}$$
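where $\bar{A}_1$, $\bar{A}_2$ are random events consisting in the failure of the first and second explosive bolts, respectively, when an electric current pulse is applied. Since the four events in Eq. (3.15) are mutually exclusive, we obtain

$$\begin{aligned} R_{re} = P(Y_{re}) &= P(A_1 \cap A_2) + P(\bar{A}_1 \cap A_2 \cap B_1) + P(A_1 \cap \bar{A}_2 \cap B_2) + P(\bar{A}_1 \cap \bar{A}_2 \cap B_1 \cap B_2) \\ &= P(A_1)P(A_2|A_1) + P(\bar{A}_1)P(A_2|\bar{A}_1)P(B_1|\bar{A}_1 \cap A_2) + P(A_1)P(\bar{A}_2|A_1)P(B_2|A_1 \cap \bar{A}_2) \\ &\quad + P(\bar{A}_1)P(\bar{A}_2|\bar{A}_1)P(B_1|\bar{A}_1 \cap \bar{A}_2)P(B_2|\bar{A}_1 \cap \bar{A}_2 \cap B_1). \end{aligned}$$

Considering only elements that are independent in terms of reliability, we find

$$\begin{aligned} &P(A_2|A_1) = P(A_2) = R_2; \quad P(A_2|\bar{A}_1) = P(A_2) = R_2; \quad P(B_1|\bar{A}_1 \cap A_2) = P(B_1) = r_1; \\ &P(\bar{A}_2|A_1) = P(\bar{A}_2) = \bar{R}_2; \quad P(B_2|A_1 \cap \bar{A}_2) = P(B_2) = r_2; \quad P(\bar{A}_2|\bar{A}_1) = P(\bar{A}_2) = \bar{R}_2; \\ &P(B_1|\bar{A}_1 \cap \bar{A}_2) = P(B_1) = r_1; \quad P(B_2|\bar{A}_1 \cap \bar{A}_2 \cap B_1) = P(B_2) = r_2, \end{aligned}$$

where $\bar{R}_i = 1 - R_i$. Then

$$R_{re} = R_1 R_2 + \bar{R}_1 R_2 r_1 + R_1 \bar{R}_2 r_2 + \bar{R}_1 \bar{R}_2 r_1 r_2. \tag{3.16}$$

If $R_1 = R_2 = R$; $\bar{R}_1 = \bar{R}_2 = \bar{R}$; $r_1 = r_2 = r$, then

$$R_{re} = R^2 + 2R\bar{R}r + \bar{R}^2 r^2. \tag{3.17}$$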
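Equation (3.16) can be checked by brute force: enumerate all combinations of the events $A_1$, $A_2$, $B_1$, $B_2$ and sum the probabilities of those in which each bolt is destroyed either by the current pulse or by the external force. The sketch below is illustrative (function names are assumptions) and uses the same independence assumption as the derivation.

```python
from itertools import product

def separation_reliability(R1, R2, r1, r2):
    """Closed form of Eq. (3.16) for independent events."""
    return (R1 * R2 + (1 - R1) * R2 * r1
            + R1 * (1 - R2) * r2 + (1 - R1) * (1 - R2) * r1 * r2)

def separation_reliability_enumerated(R1, R2, r1, r2):
    """Sum over all outcomes in which both bolts are destroyed."""
    total = 0.0
    for a1, a2, b1, b2 in product((True, False), repeat=4):
        p = 1.0
        for occurred, prob in ((a1, R1), (a2, R2), (b1, r1), (b2, r2)):
            p *= prob if occurred else 1.0 - prob
        # each bolt must be destroyed by the pulse (A) or by the force (B)
        if (a1 or b1) and (a2 or b2):
            total += p
    return total

print(separation_reliability(0.5, 0.5, 0.5, 0.5))             # 0.5625
print(separation_reliability_enumerated(0.5, 0.5, 0.5, 0.5))  # 0.5625
```

The enumeration also shows that Eq. (3.16) factors as $(R_1 + \bar{R}_1 r_1)(R_2 + \bar{R}_2 r_2)$, i.e., two series-connected bolts, each with two ways of being destroyed.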
These equations make it possible to evaluate the reliability of the system while accounting for the various ways in which the rocket separation system can operate normally.

Example 3.7 Let R = r = 0.5, then $R_{re} = (0.5)^2 + 2 \times 0.5 \times 0.5 \times 0.5 + (0.5)^2 \times (0.5)^2 = 0.5625$. As can be seen, taking into account the possibility that the separation system is also actuated by external forces acting on the stages increases the reliability of the system more than twofold (from $R^2 = 0.25$ to 0.5625). Since r < 1 and $R^2 + 2R\bar{R} + \bar{R}^2 = (R + \bar{R})^2 = 1$, the last equation always gives a reliability value less than one. Obviously, explosive bolts can be considered independent elements from the point of view of reliability only if their operation by an electric current pulse is considered. If the operation of explosive bolts under the action of external forces is considered, the assumption of their independence in terms of reliability is too rough. In fact, if one explosive bolt is destroyed by
an electric current or by an external force, the reliability of the second one will change significantly (it will increase, because it is easier to destroy it by external forces than when both explosive bolts are intact; the destruction of the explosive bolt is in this case its normal function). However, the approach of structural-logical diagrams cannot always account for the dependence of elements in terms of reliability. To do this, it is necessary to use other approaches to compiling the system reliability function.

2. Phantom element approach This method makes it possible to take into account how the elements are joined into the system, the various causes of normal operation of the system, and the dependence between the elements in terms of reliability. The first two factors are taken into account in the same way as in the approach of structural-logical diagrams, using the same equations, with additional consideration of the dependence between the elements in terms of reliability. The phantom element approach is so called because both the calculation and the structural diagrams include phantom elements that represent certain random events. Phantom elements are structural elements that reflect one or another aspect of real elements. In the example considered, such elements were random events, some of which reflected the property of real elements (explosive bolts) to be triggered by an electric current, and others by external loads. In general, a phantom element is a real structural element that is considered in only one limit state (with only one type of failure). If the same element can have a different type of failure, another phantom element is introduced that is in a different limit state. Thus, a real element is replaced by as many phantom elements as it has possible limit states.
All phantom elements in the system are connected in series, since the failure of any phantom element leads to the failure of the real element. When a system is built from phantom elements and real external forces are applied to it, the dependence between the phantom elements in terms of reliability is determined. In this way, the system reliability function can be assembled and the calculation can be performed. Figure 3.8 shows the layout for two members: two butt-welded stringers. A compressive force N is applied to this system of two members; each individual element can lose local stability, and the entire system of both elements can lose overall stability. To the right of the real system in Fig. 3.8 is a system of three phantom elements: 1pe and 2pe, which can lose local stability, and 3pe, which can lose overall stability.

Fig. 3.8 System for two butt-welded stringers: 1, first stringer; 2, second stringer; 3, welding joint; 4, first phantom element; 5, second phantom element; 6, third phantom element
In order to determine the probability of failure-free operation of the system, it is necessary to calculate the conditional probabilities that reflect the dependence between phantom elements in terms of reliability. It is rather difficult to determine the values of conditional probabilities, since it is necessary to know the correlation moments and correlation factors of the random variables. They can be determined by the usual rules of probability theory. In general, the correlation factor can be determined theoretically by exact equations, by approximate equations based on linearization of functions, or by the method of statistical tests (Monte Carlo). Approximate approaches are convenient to use: with their help, the correlation factor of random variables can be determined with any required degree of approximation.
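As an illustration of the statistical test (Monte Carlo) route, the sketch below estimates the correlation factor between the safety margins of two phantom elements that carry the same random force. Everything in it is a hypothetical assumption made for the example: the normal distributions, their parameters, and the definition of the margin as strength minus load. For this simple linear case the exact value is known, $\rho = \sigma_N^2 / \sqrt{(\sigma_{S1}^2 + \sigma_N^2)(\sigma_{S2}^2 + \sigma_N^2)}$, which allows the estimate to be checked.

```python
import math
import random

def correlation(xs, ys):
    """Sample correlation factor of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulate_margins(n_samples, rng):
    """Margins (strength - load) of two elements under a common load."""
    m1, m2 = [], []
    for _ in range(n_samples):
        load = rng.gauss(100.0, 10.0)              # common force N (hypothetical)
        m1.append(rng.gauss(150.0, 10.0) - load)   # element 1: strength minus load
        m2.append(rng.gauss(150.0, 10.0) - load)   # element 2: strength minus load
    return m1, m2

rng = random.Random(1)
m1, m2 = simulate_margins(20000, rng)
rho_mc = correlation(m1, m2)
rho_exact = 10.0 ** 2 / math.sqrt((10.0 ** 2 + 10.0 ** 2) * (10.0 ** 2 + 10.0 ** 2))
print(round(rho_mc, 2), rho_exact)
```

The strengths are independent, yet the margins are correlated because both contain the same load. This is the kind of dependence between phantom elements that the conditional probabilities have to capture.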
3.3 Repair System Reliability Analysis
3.3.1 General Information
A system is a set of interrelated elements connected to fulfill one or more given purposes. In general, a system consists of
(1) Elements as functional parts.
(2) The properties of the elements.
(3) The relationships between elements and properties.
A system is intended not only to perform its given functions under its operating conditions and constraints, but also to meet given requirements on its characteristics and properties. A system as a whole shows some specific behavior that its sub-elements and sub-units can never show on their own. Any system can be a component of a higher level system and, vice versa, the elements of a system can themselves be lower level systems. Every system has a function for which its elements, properties, and relationships have been intended. Everything outside the system being studied is called the medium. The system receives incoming signals from the medium (as data and/or material and/or energy) and sends outgoing signals to the medium, which can differ in nature from the incoming signals. Inside the system, the elements interact through incoming and outgoing signals; the output of one element is the input to others. Functionability is the inherent ability of an element (or system) to provide the required functions with given features and properties during its defined application. This concept should be distinguished from functionality: functionability takes into account the level of features reached, whereas functionality describes only the function provided. At the beginning of its life, a system is functional. Nevertheless, even for an ideal system created with the highest level of technology and the best structural materials, certain irreversible changes will appear during use due to
the interaction of various processes, such as corrosion, deformation, overheating, and fatigue. These processes are the main factors behind changes in the future performance of the system. A failure results when this performance deviates from the specified one. Therefore, the failure of a system can be defined as an event that results in the loss of ability to provide necessary functions or the loss of ability to fulfill certain requirements (i.e., features and/or properties). A failure causes the system to change from a functional state to a failure state or to a state with unacceptable properties, regardless of the cause of the change. For many systems, the transition to an unsatisfactory or failure state means the end of their life. Such technical systems are called non-repairable or non-maintained systems. The functionality of such systems cannot be restored with a reasonable expenditure of time, resources, and means. For example, at present most space rockets are non-repairable systems that are launched only once. Other examples of non-repairable systems are electronic components, batteries, and electric bulbs. Nevertheless, for many systems the functional ability can be restored by maintenance, which is a set of processes ranging from complex ones, such as a major repair, to simple ones, such as replacement, adjustment, or even cleaning. Examples of repairable systems are cars, laptops, airplanes, and industrial equipment. For example, if a battery is damaged, a laptop may not start. In this case, the battery is a non-repairable element, and replacing it eliminates the problem.
Another example of a repairable system is radio equipment, which in the event of failure can be restored to a specified state by simply replacing the damaged electronic component and/or adjusting its settings. In reality, a system alternates between satisfactory and unsatisfactory states during its service life until a decision is made to dispose of it. The length of time during which the system remains functionable depends on its inherent characteristics, its structure, and the way it is used by the customer. The most important inherent characteristics are reliability, maintainability, and supportability. The frequency of failures, the complexity of maintenance, and the ease of supporting the maintenance process depend directly on them. The utilization characteristics are determined by the customer's operational scenario and the maintenance policy adopted, together with the supply processes that support the required operation and maintenance tasks. In other words, the functionability profile of a system depends on the inherent characteristics of its design and on the way the system is used. The functionability profile of an element or system is characterized by its availability. Availability is a highly relevant and helpful indicator for repairable systems, especially when the user has to choose among several options with different indicators of reliability, maintainability, and supportability. Availability and functionality together show how good the system is. This characteristic is called the technical effectiveness of the system and
reflects the inherent characteristics of the system. Clearly, the greatest opportunity to influence the system's characteristics is at the design stage, when adjustments and changes can be made at almost minimal cost. Therefore, one of the most difficult problems for designers, researchers, and maintenance engineers is to evaluate the effect of the design on maintainability at this preliminary stage, on the basis of their own practical experience, management, planning, and analysis. Determination of reliability indicators is not the only task of repairable system analysis. Most complex technical systems, such as airplanes, trains, cars, engines, electronic devices, communication systems, and medical equipment, are repaired in case of failure. Once a system is in service, its failures are affected by three factors: operation, maintenance, and supply. They depend on the customer's plans. Under these conditions, it is important to determine reliability and other characteristics. The tasks include evaluating the expected number of failures during the warranty period; providing a certain level of reliability for a given time; determining the wear-out point, i.e., when to repair or replace a system (or subsystem); and, of course, reducing the life-cycle cost of a technical system. Data analysis for regular or accelerated reliability tests (parametric or nonparametric) assumes a truly random sample taken from a population and independent and identically distributed reliability data received from the test departments. These assumptions can also be valid for the first failures of several similar devices that are produced by identical design and manufacturing procedures and operate in an environment that is assumed or known to be similar.
The life data of such elements usually consist of the single failure time of each element (or the earliest failure for repairable elements); observations that end without a failure are called suspensions or censored data. The reliability literature is sufficient to cover all tasks of reliability data analysis in which the time to failure is modeled by a certain life distribution. However, in repairable systems there are usually multiple failures of a single system, and they often violate the independent and identically distributed assumption. Therefore, it is not surprising that the analysis approaches required for repairable systems differ from those applicable to the statistical analysis of non-repairable elements. To determine the reliability characteristics of complex repairable technical systems, a stochastic process is often applied instead of a distribution. For a repairable system, the time to the next failure depends on two factors: the life distribution (the probability distribution of the time to the first failure) and the effect of the maintenance procedures applied after the first failure. The Power Law Process (PLP) is a widely used and effective process model. This model is effective for several reasons [2]. It has a practical basis in minimal repair: the case in which the repair of a failed system is just sufficient to restore its operability by repairing or replacing the failed elements. If the time to the first failure follows the Weibull distribution, the PLP repair model covers each successive failure and correctly models minimal repair. Thus, for a repairable system, the first failure is described by the Weibull distribution and each successive failure is described by the PLP model. Therefore, the PLP is considered to be a generalization of the Poisson process and the Weibull distribution. Moreover, the
PLP is usually simple to compute and yields practical and useful results that are generally accepted and understood by management. The usual assumption about a major repair is that it returns the system to “like new” condition. This assumption is not accurate in real life, because a repair cannot return the reliability of the system to the state it had when the system was new. However, researchers agree that a repair significantly increases the reliability of the system compared with its state before the repair. A non-repairable system has only one life cycle, and customers are concerned with its reliability characteristics as the system ages during its functional life. Conversely, for systems that are repaired several times during their operational life, researchers are interested in the reliability characteristics of the system as it ages through its cycles. The age of the system is counted from the beginning of each cycle, and each cycle starts from a new zero time. As explained above, a repairable system is a system that is restored to its operating condition after a failure by any process other than the replacement of the entire system. The scope of the repair depends on various factors, such as the importance of the failed element, the operational status of the system, and the risk level. Consequently, the customer decides on the scope of repair for the system. There are two extreme cases of repair: major repair and minimal repair. A repair is a major repair if the system is restored to an “as good as new” state (as if it were replaced with a new one). A major repair normally includes the replacement of critical elements that may compromise the functional capability of the system, its safety, and/or the safety of the personnel working with it. Conversely, if a system is restored to an operating condition that is “as bad as old”, the repair is a minimal repair.
This type of repair is carried out when the system is urgently needed for a limited period of time, or when it will soon go to maintenance or be scrapped. Any repair other than a major or a minimal repair is a general repair. Most of the repairs encountered in everyday systems are general repairs: the system is restored neither to an “as good as new” state nor to an “as bad as old” state. The three types of repair are shown in Fig. 3.9. For a major repair, the system is restored to an “as good as new” state and its lifetime starts from zero, meaning that all problems are fixed and the system performance is fully recovered. After a minimal repair, the system has the same age as before the failure, and there is no recovery. After a general repair, some of the system's life is recovered, and the system begins to operate in a state somewhere between “as good as new” and “as bad as old”. Figure 3.10 shows the reliability analysis approaches for repairable and non-repairable elements. A restoration (renewal) process is a process in which the times between events are random, independent, and identically distributed with a certain life distribution. In a restoration process, a single distribution specifies the time between failures (TBF), and the frequency of repairs is constant. There is a non-renewal
3.3 Repair System Reliability Analysis
61
Fig. 3.9 Types of repairs (the time in the figure is the normalized time)
Fig. 3.10 Different approaches for reliability analysis
specific, if the repairs’ frequency grows (system’s degradation) or reduces (system’s advancement), it will affect the related maintenance expenses. A series of statistically independent and unique distributed exponential random variables are depicted by the homogeneous Poisson process (HPP). Inversely, a series of random variables that are not statistically independent and unique distributed are depicted by the nonhomogeneous Poisson process (NHPP). For simulation of repairable systems that are under minimal repair, the NHPP is widely applied. The general renewal process (GRP) is within two extremities: “as good as new” repair till the “same as old repair”. It is especially helpful for simulation of the failure behavior of a certain unit. For understanding the effects of the process of repair on the lifetime of this unit, it is also applicable. The GRP is particularly relevant, for example, for a system to that is repaired after a failure and its repair does not provide the system to an “as good as new” or an “as bad as old” states, but alternatively partially regenerates this system.
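The difference between a constant and a changing repair frequency can be illustrated with a small simulation. The sketch below is an illustration only and is not taken from the text: it assumes a power-law cumulative intensity Λ(t) = (t/η)^β, and the names `simulate_plp`, `eta`, and `beta` are hypothetical. Event times of the NHPP are obtained by mapping the arrival times of a unit-rate HPP through the inverse of Λ. With β = 1 the process is an HPP, β > 1 corresponds to degradation, and β < 1 to improvement.

```python
import random

def simulate_plp(eta, beta, horizon, rng):
    """Event times in (0, horizon] of an NHPP with Lambda(t) = (t / eta) ** beta.

    Assumed model for illustration: power-law intensity, minimal repair.
    """
    times = []
    gamma = 0.0  # running arrival time of a unit-rate HPP
    while True:
        gamma += rng.expovariate(1.0)       # next unit-rate HPP arrival
        t = eta * gamma ** (1.0 / beta)     # map back through the inverse of Lambda
        if t > horizon:
            return times
        times.append(t)

rng = random.Random(42)
degrading = simulate_plp(eta=100.0, beta=2.0, horizon=1000.0, rng=rng)
stable = simulate_plp(eta=100.0, beta=1.0, horizon=1000.0, rng=rng)
# Mean numbers of events are Lambda(1000) = 100 and 10, respectively.
print(len(degrading), len(stable))
```

For the degrading case the gaps between successive events shrink on average as the system ages, whereas for β = 1 they have the same exponential distribution throughout.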
3 System Reliability Models
It is therefore important to recognize that an analysis that ignores the actual nature of the data can underestimate or overestimate technical indicators. A systematic analysis of failure and repair processes requires statistical data of sufficient quality, the ability to handle complex mathematical formulations, and verification of the distributional assumptions. Moreover, such formulations often cannot be solved analytically and require iterative procedures or special software. Parametric methods also demand considerable computing power and are not intuitive for a novice researcher. Regardless of the type of system, the analysis of events should progress from non-parametric methods to general-purpose parametric ones, preferably with graphical visualization. Of course, the choice of approach depends on the statistical data available and on the questions the researcher wishes to answer. A researcher works with various models based on Poisson and restoration processes and their combinations. For example, a restoration process, which describes a sequence of instantaneous major repairs of a system, is a special case of the general repair model. The reliability performance of repairable systems under general repair is of great importance for long-term system operation. Minimal repair is a special case of general repair in which the system is returned to the condition it was in just before the failure. The process of instantaneous minimal repair is described by the NHPP. Poisson processes are widely applied in the reliability analysis of technical systems. The NHPP has the independent increments property, which is in fact restrictive, particularly in applications, because in many real problems the stochastic processes do not satisfy this assumption.
Hence, this assumption can be relaxed, which leads to the generalized Polya process (GPP). The GPP has dependent increments: the probability of an event occurring in a short time interval depends on the number of events that occurred before. On the one hand, this is realistic in many engineering situations; on the other hand, the process remains analytically tractable. A key feature of engineering practice is that all elements operate in a variable (random) environment. There are several ways to model this. One of them is to represent the effect of the variable environment by external impacts (shocks) that occur according to a certain random point process. The term “impact” here means an instantaneous and potentially harmful event (e.g., a lightning strike, a high-level electrical impulse, an extreme dynamic load, or an earthquake). Impact models are widely applied in theoretical and practical reliability problems, and they are also applicable in other scientific fields. Many impact models have been studied in recent years. In reliability, impact models mostly deal with the survival characteristics of engineering systems that are subject to external point events. Hence, under various assumptions and conditions, several lifetime models will be considered for systems operating under such impact processes (restoration, NHPP, GPP).
3.3.2 System Reliability Function
1. Definition and main properties
Restoration theory has engineering roots: it describes the number of renewals performed during the lifetime of a repairable element. The main feature of the process is that, at every failure, the element is replaced by another (identical) one. The classical task is therefore to evaluate the average number of spare parts needed to ensure long-term operation of an engineering system, or the probability that the stock of spare parts is sufficient for a mission of limited duration. Later, this approach developed into a general theory of random point processes. Let $\{X_i\}_{i \ge 1}$ denote a sequence of independent and identically distributed lifetime random variables with the common CDF $F(t)$. Hence, the $X_i$, $i \ge 1$, are copies of a generic $X$. The corresponding arrival times are defined as
$$T_0 \equiv 0, \qquad T_n \equiv \sum_{i=1}^{n} X_i, \quad n = 1, 2, \ldots,$$
where the $X_i$ can be seen as the cycles between successive restorations. Evidently, this setting corresponds to major, instantaneous repair. The corresponding counting process is defined as
$$N(t) = \sup\{n : T_n \le t\} = \sum_{n=1}^{\infty} I(T_n \le t),$$
where the indicator $I(T_n \le t)$ equals 1 if $T_n \le t$ and 0 otherwise. A point process is a random process whose realizations are collections of isolated points in time or in space. On the one hand, when the points occur in time, a point process is described by the sequence of arrival points in $(0, \infty)$. On the other hand, a counting process describes the random occurrence of points through the number of points observed in time intervals. For convenience, the terms “point process” and “counting process” are used interchangeably here. Therefore, the restoration process can be specified in terms of $N(t)$ or in terms of the arrival times $\{T_n, n = 0, 1, 2, \ldots\}$, depending on which is more convenient. Definition 3.1 The counting process $\{N(t), t \ge 0\}$ and the point process $\{T_n, n = 1, 2, 3, \ldots\}$ described above are both called restoration processes. Alternatively, a restoration process is specified by the sequence of independent and identically distributed random variables $\{X_i\}_{i \ge 1}$. Therefore, a restoration process is specified either by $N(t)$ or by the arrival times (cycles). The random intensity of a general orderly (multiple happenings are
not present) point process is the following random process:
$$\lambda_t = \lim_{\Delta t \to 0} \frac{E[N(t, t+\Delta t) \mid H_{t-}]}{\Delta t} = \lim_{\Delta t \to 0} \frac{P(N(t, t+\Delta t) = 1 \mid H_{t-})}{\Delta t}, \tag{3.18}$$
where $N(t, t+\Delta t)$ is the number of points that occur in $[t, t+\Delta t)$ and $H_{t-} = \{N(s) : 0 \le s < t\}$ is the internal history of the point process on $[0, t)$, i.e., the collection of all point occurrences in $[0, t)$. For the restoration process, Eq. (3.18) takes the form
$$\lambda_t = \lambda(t - T_{N(t-)}), \quad t \ge 0, \tag{3.19}$$
where $\lambda(t)$ is the failure rate corresponding to the CDF $F(t)$ and $T_{N(t-)}$ is the (random) time of the last restoration in $(0, t)$. It is assumed here that the density $f(t) = F'(t)$ exists. Equation (3.19) can be rewritten as
$$\lambda_t = \sum_{n \ge 0} \lambda(t - t_n)\, I(t_n < t \le t_{n+1}), \quad t \ge 0,$$
where $t_n$ is a realization of the arrival time $T_n$, $n = 0, 1, 2, \ldots$. Therefore, the history of the restoration process reduces to the time elapsed since the last restoration, and earlier restorations do not affect the times of subsequent ones. Despite this “simplification”, the probabilistic description of the process, even in the simplest cases, is not trivial and requires care. The starting point is the relation $N(t) \ge n \Leftrightarrow T_n \le t$. Hence, the probability of exactly $n$ points in $(0, t]$ is
$$P(N(t) = n) = P(N(t) \ge n) - P(N(t) \ge n+1) = P(T_n \le t) - P(T_{n+1} \le t) = F_n(t) - F_{n+1}(t), \tag{3.20}$$
where $F_n(t)$ is the $n$-fold convolution of $F(t)$ with itself and, by convention, $F_0(t) \equiv 1$ and $F_1(t) \equiv F(t)$. This follows from the fact that the distribution of a sum of independent and identically distributed random variables is given by the corresponding convolution. For example, the density of the sum of $n$ such random variables is obtained from the density of the sum of $n - 1$ of them by the following expression:
$$f_n(t) = \int_0^t f(x)\, f_{n-1}(t - x)\, dx,$$
where $f_n(t)$ is the probability density function corresponding to $F_n(t)$.
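The relation $N(t) \ge n \Leftrightarrow T_n \le t$ behind Eq. (3.20) is easy to check on a simulated path. The following minimal Python sketch is an illustration rather than part of the original text: the Weibull lifetime distribution, its parameters, and the helper names `arrival_times` and `count` are assumptions made only for this example.

```python
import random
from itertools import accumulate

def arrival_times(lifetimes):
    """T_n = X_1 + ... + X_n for n = 1, 2, ..."""
    return list(accumulate(lifetimes))

def count(arrivals, t):
    """N(t) = number of arrival times T_n with T_n <= t."""
    return sum(1 for tn in arrivals if tn <= t)

rng = random.Random(1)
# Assumed lifetimes: Weibull with scale 100 and shape 1.5 (illustrative values).
lifetimes = [rng.weibullvariate(100.0, 1.5) for _ in range(50)]
arrivals = arrival_times(lifetimes)

t = 1000.0
n_t = count(arrivals, t)
# N(t) >= n holds exactly when T_n <= t.
assert all((n_t >= n) == (arrivals[n - 1] <= t) for n in range(1, 51))
```

Repeating the simulation many times and tabulating `n_t` would give an empirical estimate of $P(N(t) = n)$, which Eq. (3.20) expresses through the convolutions $F_n(t)$.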
The function that gives the average number of restorations in $(0, t]$ is the main object of study in restoration theory. Definition 3.2 The restoration function is defined as
$$H(t) = E[N(t)]. \tag{3.21}$$
This function also plays an important role in applications, as it describes the average number of equipment repairs or overhauls in $(0, t]$. Specifically, when $F(t) = 1 - \exp\{-\lambda t\}$ is an exponential lifetime distribution, $T_n$ follows an Erlang distribution (i.e., a gamma distribution with a positive integer shape parameter) and Eq. (3.20) becomes
$$P(N(t) = n) = \exp\{-\lambda t\}\frac{(\lambda t)^n}{n!}, \quad n = 0, 1, 2, \ldots, \tag{3.22}$$
which is the Poisson distribution. From Eqs. (3.20) and (3.21), it follows that $H(t)$ is an infinite sum of convolutions:
$$H(t) = E[N(t)] = \sum_{n=1}^{\infty} n\, P(N(t) = n) = \sum_{n=1}^{\infty} F_n(t). \tag{3.23}$$
Indeed, from Eq. (3.20), the sum telescopes:
$$E[N(t)] = \sum_{n=1}^{\infty} n\,(F_n(t) - F_{n+1}(t)) = \sum_{n=1}^{\infty} F_n(t).$$
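For the exponential case, Eqs. (3.20), (3.22), and (3.23) can be verified numerically. The sketch below is only a check under stated assumptions: the values of `lam` and `t` are arbitrary, the function names are hypothetical, and the Erlang CDF is written in its standard closed form $F_n(t) = 1 - \sum_{k=0}^{n-1} e^{-\lambda t}(\lambda t)^k / k!$.

```python
import math

def erlang_cdf(n, lam, t):
    """F_n(t): CDF of the sum of n exponential(lam) lifetimes, with F_0 = 1."""
    if n == 0:
        return 1.0
    return 1.0 - sum(math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
                     for k in range(n))

def poisson_pmf(n, lam, t):
    """Right-hand side of Eq. (3.22)."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

lam, t = 0.5, 4.0  # illustrative values
for n in range(6):
    lhs = erlang_cdf(n, lam, t) - erlang_cdf(n + 1, lam, t)  # Eq. (3.20)
    assert abs(lhs - poisson_pmf(n, lam, t)) < 1e-12         # Eq. (3.22)

# Eq. (3.23): the (truncated) sum of F_n(t) reproduces H(t) = lam * t.
H = sum(erlang_cdf(n, lam, t) for n in range(1, 60))
assert abs(H - lam * t) < 1e-9
```

The truncation at 59 terms is an arbitrary choice that is more than sufficient here because $F_n(t)$ decays very quickly in $n$ for fixed $t$.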
Assume, as before, that $F(t)$ is absolutely continuous, so that the density $f(t)$ exists. Denote by
$$H^*(s) = \int_0^\infty \exp\{-st\}\, H(t)\, dt \quad \text{and} \quad f^*(s) = \int_0^\infty \exp\{-st\}\, f(t)\, dt$$
the Laplace transforms of $H(t)$ and $f(t)$, respectively. The rate of an orderly point process is defined by the following limit:
$$\lim_{\Delta t \to 0} \frac{P(N(t+\Delta t) - N(t) = 1)}{\Delta t} = \lim_{\Delta t \to 0} \frac{E[N(t+\Delta t) - N(t)]}{\Delta t} = \frac{dE[N(t)]}{dt} = H'(t). \tag{3.24}$$
In the specific case of a restoration process, this rate is called the restoration density function and is denoted by $h(t)$. Therefore, $h(t) = H'(t)$ and
$$H(t) = \int_0^t h(u)\, du.$$
Moreover, $h(t)dt$ can be interpreted as the probability that a restoration (not necessarily the first) occurs in $(t, t + dt]$. This basic interpretation is important and is used repeatedly in what follows. Differentiating both sides of Eq. (3.23) gives
$$h(t) = \sum_{n=1}^{\infty} f_n(t), \tag{3.25}$$
where $f_n(t) = dF_n(t)/dt$. Applying the Laplace transform to both sides of Eq. (3.25) and using the fact that the Laplace transform of a convolution of two functions is the product of their Laplace transforms, we arrive at
$$h^*(s) = \sum_{k=1}^{\infty} \left(f^*(s)\right)^k = \frac{f^*(s)}{1 - f^*(s)}. \tag{3.26}$$
By the properties of the Laplace transform, $h^*(s) = sH^*(s) - H(0) = sH^*(s)$. Hence, $H^*(s)$ can be expressed as
$$H^*(s) = \frac{f^*(s)}{s(1 - f^*(s))} = \frac{F^*(s)}{1 - f^*(s)}, \tag{3.27}$$
where $F^*(s) = f^*(s)/s$ is the Laplace transform of $F(t)$.
As the Laplace transform uniquely determines the corresponding function, Eq. (3.27) means that the restoration function is uniquely determined by the underlying distribution $F(t)$ via the Laplace transform of its density. Specifically, when $F(t) = 1 - \exp\{-\lambda t\}$ is an exponential lifetime distribution, it follows from the properties of the Poisson distribution that $H(t) = \lambda t$ and $h(t) = \lambda$. Therefore, for the Poisson process, the restoration function and the restoration density function are simple and exact. For an arbitrary inter-arrival distribution, however, this is not the case, and the whole of restoration theory was developed to deal with it. Clearly, a restoration process does not in general possess the Markov property, as its history affects future restoration times, and so its increments are not independent. There are, however, Markovian points in the restoration process.
Those are the restoration points, after which the process restarts. This fact allows restoration-type reasoning to be used in the analytical description of the main restoration characteristics. Specifically, owing to the existence of the restoration points, the following integral equations can be written for the functions $H(t)$ and $h(t)$:
$$H(t) = F(t) + \int_0^t H(t - x)\, f(x)\, dx, \tag{3.28}$$
$$h(t) = f(t) + \int_0^t h(t - x)\, f(x)\, dx. \tag{3.29}$$
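Equation (3.28) rarely has a closed-form solution, so it is usually solved on a grid. The sketch below is one possible discretization and not a method prescribed by the text: it writes the integral in Stieltjes form, $\int_0^t H(t-x)\,dF(x)$, and applies a trapezoid-type rule on a uniform grid, which yields a simple forward recursion. The function name `renewal_function` and the grid parameters are hypothetical, and $F(0) = 0$ is assumed.

```python
import math

def renewal_function(cdf, horizon, steps):
    """Approximate H(t) of Eq. (3.28) at t = k * horizon / steps, k = 0..steps.

    Discretization (an assumed scheme):
    H_k = F_k + sum_{j=1..k} 0.5 * (H_{k-j} + H_{k-j+1}) * (F_j - F_{j-1}),
    solved for H_k, which also appears in the j = 1 term.
    """
    dt = horizon / steps
    F = [cdf(k * dt) for k in range(steps + 1)]
    dF = [0.0] + [F[j] - F[j - 1] for j in range(1, steps + 1)]
    H = [0.0] * (steps + 1)
    for k in range(1, steps + 1):
        total = F[k] + 0.5 * H[k - 1] * dF[1]   # known half of the j = 1 term
        for j in range(2, k + 1):
            total += 0.5 * (H[k - j] + H[k - j + 1]) * dF[j]
        H[k] = total / (1.0 - 0.5 * dF[1])      # move the unknown H_k to the left
    return H

# Check against the exponential case, where H(t) = lambda * t exactly.
H_exp = renewal_function(lambda t: 1.0 - math.exp(-t), horizon=5.0, steps=500)
print(H_exp[-1])  # should be close to 5.0
```

The cost grows quadratically with the number of grid points, which is acceptable for a few thousand points; for finer grids a transform-based method may be preferable.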
Let us prove Eq. (3.28) by conditioning on the time of the first restoration, i.e.,
$$\begin{aligned} H(t) &= \int_0^\infty E[N(t) \mid X_1 = x]\, f(x)\, dx = \int_0^t E[N(t) \mid X_1 = x]\, f(x)\, dx \\ &= \int_0^t [1 + H(t - x)]\, f(x)\, dx = F(t) + \int_0^t H(t - x)\, f(x)\, dx. \end{aligned}$$
If the first restoration occurs at time $x \le t$, the process simply restarts, and the expected number of restorations after the first one in the interval $(x, t]$ is $H(t - x)$; if $x > t$, there are no restorations in $(0, t]$. Note that Eq. (3.27) can also be obtained by applying the Laplace transform to both sides of Eq. (3.28). Therefore, the solution can be found in terms of the Laplace transform, which can then be inverted (analytically or numerically). Similarly, Eq. (3.29) is obtained by considering two mutually exclusive cases: the first event occurs at $t$, or it occurs before $t$, in which case we condition on its time of occurrence. By definition, $h(t)dt$ is the probability that a restoration occurs in $(t, t + dt]$. Thus, the right-hand side of Eq. (3.29), by the law of total probability, simply “gathers” the probabilities of the corresponding events. First, the probability that the first event occurs in $(t, t + dt]$ is $f(t)dt$. Second, if the first event occurred in $(x, x + dx]$ (with probability $f(x)dx$), the probability that some event occurs in $(t, t + dt]$ is $h(t - x)dt$; the product arises because the process restarts. Finally, the integration is with respect to the time of the first event. Similar restoration-type reasoning is used extensively in the rest of this chapter. For example, the same heuristic argument gives the distribution of the time of the last arrival before $t$, i.e., $T_{N(t)}$. So, for $x < t$
$$P(T_{N(t)} \le x) = \bar{F}(t) + \int_0^x h(y)\, \bar{F}(t - y)\, dy, \tag{3.30}$$
where $\bar{F}(t) = 1 - F(t)$ denotes the survival function.
The first term on the right-hand side is the probability that there are no arrivals before $t$ other than the trivial one, $T_0 = 0$. The integrand corresponds to the last event before $t$ occurring in $(y, y + dy]$: $h(y)dy$ is the probability that some event occurs in this interval, whereas the survival probability $1 - F(t - y)$ is the probability that no further event occurs before $t$. The integral covers the case $T_{N(t)} \le x$ when there is at least one renewal in $(0, x]$. Obtaining the restoration function and the restoration density function on a finite interval thus requires solving the corresponding equations, which in many cases must be done numerically even when the Laplace transform is applied. In practice, however, asymptotic solutions for large $t$ are often of interest. The following subsection briefly reviews some asymptotic properties of a restoration process. The most useful result, applied repeatedly in what follows, is the Key Renewal Theorem.
2. Limiting properties
Denote by $\mu$ the mean of the baseline inter-arrival time $X$ with CDF $F(t)$ and assume that it is finite, i.e., $\mu \equiv E[X] = \int_0^\infty (1 - F(u))\, du < \infty$. Assume also that $X$ is continuous, so that its distribution has no atoms. The next result combines two asymptotic properties.
Theorem 3.1 With probability 1,
$$\frac{N(t)}{t} \to \frac{1}{\mu} \quad \text{as } t \to \infty, \tag{3.31}$$
and, moreover,
$$\frac{H(t)}{t} \to \frac{1}{\mu} \quad \text{as } t \to \infty. \tag{3.32}$$
Equation (3.32) is usually called the Elementary Renewal Theorem, and its intuitive meaning is clear: by the strong law of large numbers, asymptotically as $t \to \infty$, the mean restoration cycle is approximately $t$ divided by the total number of restoration cycles in $(0, t]$. The approximation arises from the last, unfinished cycle, whose duration differs from that of the preceding cycles. When the inter-arrival times are exponentially distributed, $H(t) = \lambda t$ and the relation $H(t)/t = 1/\mu$ is exact for all $t$. The following theorem gives the next term of the asymptotic expansion in Eq. (3.32). Its proof uses the corresponding Laplace transforms.
Theorem 3.2 Let $E[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$. Then the following asymptotic relationship holds as $t \to \infty$:
$$H(t) = \frac{t}{\mu} + \frac{\sigma^2 - \mu^2}{2\mu^2} + o(1). \tag{3.33}$$
Proof The Laplace transform of the density $f(x)$ can be expanded in a Taylor series:
$$f^*(s) = E[\exp\{-sX\}] = E\left[1 - sX + \frac{1}{2}(sX)^2 - \cdots\right] = 1 - s\mu + \frac{s^2}{2}(\sigma^2 + \mu^2) + o(s^2). \tag{3.34}$$
Substituting this expansion into the expression for the Laplace transform of the renewal function, Eq. (3.27), we obtain after algebraic transformations, as $s \to 0$,
$$H^*(s) = \frac{1}{\mu s^2} + \frac{\sigma^2 - \mu^2}{2\mu^2 s} + o\!\left(\frac{1}{s}\right). \tag{3.35}$$
Inversion of this equation for $t \to \infty$ results in
$$H(t) = \frac{t}{\mu} + \frac{\sigma^2 - \mu^2}{2\mu^2} + \upsilon(t), \tag{3.36}$$
where $\upsilon(t)$ is a “residual” term. Denote by $\upsilon^*(s)$ the Laplace transform of $\upsilon(t)$. Then, using a Tauberian-type theorem in the form
$$\lim_{t \to \infty} \upsilon(t) = \lim_{s \to 0} s\,\upsilon^*(s),$$
the following limit is obtained:
$$\lim_{s \to 0} s\left(H^*(s) - \frac{1}{\mu s^2} - \frac{\sigma^2 - \mu^2}{2\mu^2 s}\right) = 0.$$
Therefore, Eq. (3.33) holds. For $t \to \infty$, the asymptotic Eqs. (3.32) and (3.33) can be written more conveniently as
$$H(t) = \frac{t}{\mu}(1 + o(1)), \tag{3.37}$$
$$H(t) = \frac{t}{\mu} + \frac{\sigma^2 - \mu^2}{2\mu^2}(1 + o(1)), \tag{3.38}$$
where Eq. (3.37) gives the first term of the asymptotic expansion and Eq. (3.38) gives the first two terms. Definition 3.3 Let $A(t) = t - T_{N(t)}$ and $B(t) = T_{N(t)+1} - t$
denote the random age and the residual (excess) lifetime, at chronological time $t$, of an item operating in accordance with a restoration process. Thus, $A(t)$ is the time elapsed since the last restoration, whereas $B(t)$ is the time to the next restoration. The following important limiting theorem specifies the corresponding distributions and states that they are asymptotically equal as $t \to \infty$.
Theorem 3.3 Let $X$ be a continuous random variable with a finite mean, $\mu \equiv E[X] < \infty$, and let $\bar{F}(u) = 1 - F(u)$. Then
$$\lim_{t \to \infty} P(A(t) \le x) = \lim_{t \to \infty} P(B(t) \le x) = \frac{\int_0^x \bar{F}(u)\, du}{\mu}. \tag{3.39}$$
This is a remarkable result. First, it establishes the asymptotic equality of the distributions of the age and the residual lifetime. Second, it defines the equilibrium distribution
$$F_{eq}(x) = \frac{\int_0^x \bar{F}(u)\, du}{\mu}, \tag{3.40}$$
which is widely used in applications, as will be demonstrated later. Most importantly, it allows the asymptotic properties in Theorem 3.1 to be viewed from a different angle. The equilibrium density, $(1 - F(x))/\mu$, has the Laplace transform $(1 - f^*(s))/(\mu s)$. Hence, dividing by $s$ as for $F^*(s) = f^*(s)/s$ in Eq. (3.27), the Laplace transform of the equilibrium distribution is
$$F_{eq}^*(s) = \frac{1 - f^*(s)}{\mu s^2}. \tag{3.41}$$
Now consider the delayed restoration process (all cycles except the first are independent and identically distributed, whereas the first is independent of the others but has a different distribution) in which the first cycle has the equilibrium distribution $F_{eq}(x)$. This specific delayed process is often called the equilibrium restoration process. Denote the restoration function of the delayed process by $H_D(t)$.
Theorem 3.4 For the equilibrium restoration process, the following equality holds:
$$H_D(t) = \frac{t}{\mu}, \tag{3.42}$$
which means that the asymptotic relation for the ordinary renewal process, Eq. (3.32), becomes an equality for the equilibrium renewal process.
Proof Similarly to Eq. (3.27), it is easy to show for the equilibrium restoration process that the Laplace transform of the corresponding restoration function is
$$H_D^*(s) = \frac{F_{eq}^*(s)}{1 - f^*(s)}. \tag{3.43}$$
Then from Eqs. (3.41) and (3.43), it follows that
$$H_D^*(s) = \frac{1}{\mu s^2}. \tag{3.44}$$
Inverting $1/(\mu s^2)$ and taking into account the uniqueness of the transform gives exactly Eq. (3.42). It can also be proved that the equilibrium process has stationary increments and is therefore a stationary process. The result in Theorem 3.4 is significant because it gives a simple description of the restoration function and the restoration density function, instead of the integral equations or infinite sums needed for the ordinary restoration process. The intuitive reasoning behind the equality is as follows. Consider, at $t = 0$, an ordinary restoration process that started at $t = -\infty$. The corresponding delayed restoration process starting at $t = 0$ then has its first cycle distributed according to the equilibrium distribution, Eq. (3.40), and is described by Eq. (3.42) for $t \ge 0$. In other words, the point $t = 0$ is equivalent to the point at infinity of an ordinary restoration process, where it is already stationary. To formulate the next limiting result, we need the following conditions, which are sufficient for $\theta(t)$ to be directly Riemann integrable:
$$\theta(t) \ge 0 \ \text{for } t \ge 0, \qquad \theta(t) \ \text{is non-increasing}, \qquad \int_0^\infty \theta(u)\, du < \infty.$$
The next theorem is called the Key Renewal Theorem, and its importance in restoration theory and applications can hardly be overestimated.
Theorem 3.5 Let $F(x)$ be the distribution of the continuous inter-arrival time $X$ in the ordinary restoration process. Assume that $\theta(t)$ is directly Riemann integrable. Then
$$\lim_{t \to \infty} \int_0^t \theta(t - u)\, dH(u) = \lim_{t \to \infty} \int_0^t h(u)\, \theta(t - u)\, du = \frac{1}{\mu} \int_0^\infty \theta(u)\, du,$$
where $\mu = \int_0^\infty (1 - F(u))\, du < \infty$ is the mean cycle length in the ordinary restoration process and, in accordance with Eqs. (3.23) and (3.25),
$$H(t) = \sum_{n=1}^{\infty} F_n(t), \qquad h(t) = \sum_{n=1}^{\infty} f_n(t).$$
First, note that Theorem 3.5 is a limiting result that establishes an important property as $t \to \infty$. Second, it enables derivations, in various settings describing repairable items, that dramatically simplify the results. Indeed, the complicated restoration density function “vanishes” as $t \to \infty$, and only the mean cycle duration and the integral of $\theta(t)$ remain. The power of this theorem in applications lies in the fact that $\theta(t)$ can differ from setting to setting, which makes it possible to consider a variety of models.
3. Alternating restoration and restoration reward processes
The ordinary restoration process was defined assuming that replacement of the failed item is instantaneous. This is usually not the case in practice, although in many examples the mean time to failure is much larger than the mean repair time, and instantaneous repair can then be adopted as a plausible model. When this approximation is not justified, restoration processes with non-instantaneous repair should be considered. The simplest processes of this kind that are often used in practice are alternating restoration processes. They are still processes of major repair, in which an element is “as good as new” after repair. Let an element’s consecutive operation times be $\{X_i\}$, $i \ge 1$ (independent and identically distributed with distribution $F(x)$ and density $f(x)$), and let the corresponding repair times be $\{Y_i\}$, $i \ge 1$ (independent and identically distributed with distribution $G(x)$ and density $g(x)$). Assume that these sequences are independent of each other and that the corresponding random variables are continuous. Then the process $\{X_i + Y_i \equiv Z_i\}$, $i \ge 1$, is an ordinary restoration process with an underlying distribution function $C(x)$ that is the convolution of $F(x)$ and $G(x)$:
$$C(x) = P(Z_i \le x) = \int_0^x F(x - u)\, g(u)\, du = \int_0^x G(x - u)\, f(u)\, du.$$
Denote the corresponding means by $\mu_X$, $\mu_Y$, and $\mu_Z$, and let the state of an element (system) be given by the binary variable $\Omega(t)$: $\Omega(t) = 1$ if the element is operating at time $t$, and $\Omega(t) = 0$ if it is in the state of failure (repair). Definition 3.4 The process defined above is called the alternating restoration process. In reliability, the first characteristic of interest for a system that operates and is repaired in accordance with the alternating restoration process is the availability, i.e., the probability that the system is operating at time $t$:
$$A(t) = P(\Omega(t) = 1) = E[\Omega(t)]. \tag{3.45}$$
In particular, the stationary (limiting) availability is the one most used in applications:
$$A = \lim_{t \to \infty} A(t).$$
The following theorem gives an intuitively expected expression for $A$; the non-stationary availability $A(t)$ is derived in the proof. Theorem 3.6 For an element operating in accordance with the described alternating restoration process, the stationary availability is given by
$$A = \lim_{t \to \infty} A(t) = \frac{\mu_X}{\mu_X + \mu_Y}. \tag{3.46}$$
Proof In accordance with the law of total probability, we “gather” the events (and their probabilities) that result in the element operating at time $t$. Therefore,
$$A(t) = \bar{F}(t) + \int_0^t h_Z(u)\, \bar{F}(t - u)\, du, \tag{3.47}$$
where $\bar{F}(t) = 1 - F(t)$ is the probability that the element has operated without failure up to $t$. The integrand is the probability that a restoration (governed by the restoration density function $h_Z(u)$ of the ordinary renewal process $\{Z_i\}$, $i \ge 1$) occurred in $(u, u + du]$ and was the last renewal in $[0, t]$ (hence the multiplication by the survival probability $\bar{F}(t - u)$). So, Eq. (3.47) defines the non-stationary availability, which can be obtained, for instance, via the corresponding Laplace transform. To obtain the stationary availability, we apply the Key Renewal Theorem and arrive at Eq. (3.46). Indeed, the first term on the right-hand side of Eq. (3.47) vanishes as $t \to \infty$. The renewal density function $h_Z(t)$ corresponds to the ordinary restoration process with mean cycle duration $\mu_X + \mu_Y$. The function $\bar{F}(x)$ is directly Riemann integrable, as $\int_0^\infty \bar{F}(u)\, du = \mu_X < \infty$. Hence, the theorem applies and yields Eq. (3.46). Equation (3.47) defines the time-dependent availability, which in practice usually has to be obtained numerically. For the simplest cases, however, an explicit solution exists. For example, when both distributions are exponential, i.e.,
$$F(t) = 1 - \exp\{-\lambda_X t\}, \ \lambda_X = \frac{1}{\mu_X}; \qquad G(t) = 1 - \exp\{-\lambda_Y t\}, \ \lambda_Y = \frac{1}{\mu_Y},$$
a well-known expression for the non-stationary availability can be obtained by applying the Laplace transform:
$$A(t) = \frac{\mu_X}{\mu_X + \mu_Y} + \frac{\mu_Y}{\mu_X + \mu_Y} \exp\{-(\lambda_X + \lambda_Y)t\},$$
which converges to the stationary availability, Eq. (3.46), as $t \to \infty$. It is worth mentioning that, in this case, the availability is the same as the probability of being in the “on” state for the corresponding two-state Markov chain.
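The closed-form availability for exponential operation and repair times is simple to evaluate. The short sketch below only restates that formula in code; the function name `availability` and the numerical values (mean time to failure 100, mean repair time 5, in arbitrary time units) are assumptions made for illustration.

```python
import math

def availability(t, mu_x, mu_y):
    """Non-stationary availability A(t) for exponential up and down times.

    mu_x: mean operating time, mu_y: mean repair time (both > 0).
    The element is assumed to be operating at t = 0.
    """
    lam_x, lam_y = 1.0 / mu_x, 1.0 / mu_y
    stationary = mu_x / (mu_x + mu_y)                 # Eq. (3.46)
    transient = mu_y / (mu_x + mu_y) * math.exp(-(lam_x + lam_y) * t)
    return stationary + transient

for t in (0.0, 5.0, 20.0, 100.0):
    print(t, round(availability(t, mu_x=100.0, mu_y=5.0), 4))
```

With these values the transient term decays at rate $\lambda_X + \lambda_Y = 0.21$ per time unit, so $A(t)$ is already very close to the stationary value $100/105$ after a few tens of time units.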
Definition 3.5 Given a sequence of random variables $\{X_n\}$, $n \ge 1$, an integer-valued random variable $N$ is called a stopping time for $\{X_n\}$, $n \ge 1$, if for all $n = 1, 2, \ldots$ the event $\{N = n\}$ is independent of $X_{n+1}, X_{n+2}, \ldots$. Assume now that $N$ is a stopping time for a restoration process. Then the process can be observed sequentially, with $N$ the number of events observed before stopping. If $N = n$, observation stops after $X_1, \ldots, X_n$ and before $X_{n+1}, X_{n+2}, \ldots$. Clearly, the events $\{N \le n\}$ and $\{N > n\}$ are determined by $X_1, \ldots, X_n$ only.
Theorem 3.7 (Wald’s Equation) If $N$ is a stopping time with $E[N] < \infty$ for a restoration sequence of random variables with finite means, then
$$E\left[\sum_{n=1}^{N} X_n\right] = E[N]\,E[X]. \tag{3.48}$$
Proof Let $I_n = 1$ for $N \ge n$ and $I_n = 0$ for $N < n$. Then
$$E\left[\sum_{n=1}^{N} X_n\right] = E\left[\sum_{n=1}^{\infty} I_n X_n\right] = \sum_{n=1}^{\infty} E[I_n X_n] = E[X] \sum_{n=1}^{\infty} E[I_n] = E[X] \sum_{n=1}^{\infty} P(N \ge n) = E[N]\,E[X],$$
where the third equality holds because $N$ is a stopping time and, thus, the event $\{N \ge n\}$ is determined by $X_1, \ldots, X_{n-1}$ and is independent of $X_n$. Assume now that each time a restoration occurs in the ordinary restoration process $\{N(t), t \ge 0\}$ (with mean inter-arrival time $\mu$), a random reward is assigned. Denote the reward after the $n$-th cycle by $R_n$, $n \ge 1$. Assume that these random variables are independent and identically distributed with $E[R] \equiv E[R_n]$, $n \ge 1$, and that the pairs $(X_n, R_n)$, $n \ge 1$, are independent. The total reward in $(0, t]$ is then given by the random reward process
$$R(t) = \sum_{n=1}^{N(t)} R_n.$$
Theorem 3.8 (Renewal Reward Theorem) Under the given assumptions,
$$\lim_{t \to \infty} \frac{E[R(t)]}{t} = \frac{E[R]}{\mu}. \tag{3.49}$$
Proof Taking the expectation of both sides of $R(t) = \sum_{n=1}^{N(t)} R_n$ and using Wald’s equation gives
$$E[R(t)] = H(t)\,E[R]. \tag{3.50}$$
Dividing both sides of Eq. (3.50) by $t$ and using the Elementary Renewal Theorem, Eq. (3.32), as $t \to \infty$, we arrive at Eq. (3.49). Thus, $E[R]/\mu$ has the meaning of the long-run reward rate (reward per unit of time). Note that the reward can be negative, in which case it represents the cost incurred at the time of the $n$-th restoration. Then Eq. (3.49) can be interpreted as the long-run cost rate, i.e., the expected cost incurred in one cycle divided by the expected duration of the restoration cycle. There can also be cases with both costs and rewards. For example, in the various optimal maintenance problems considered throughout this book, the conventional assumption is that costs are positive whereas rewards are negative.
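The long-run rate in Eq. (3.49) can be checked by simulation. The sketch below is illustrative only: the exponential cycle lengths (mean $\mu = 50$), the uniformly distributed cost per restoration (mean $E[R] = 100$), and the function name `long_run_reward_rate` are all assumptions. It estimates $R(t)/t$ from a single long simulated path, which for large $t$ should be close to $E[R]/\mu = 2$.

```python
import random

def long_run_reward_rate(sample_cycle, sample_reward, horizon, rng):
    """Estimate R(t) / t at t = horizon from one simulated path.

    A reward is collected at the end of every cycle completed by the horizon.
    """
    t, total = 0.0, 0.0
    while True:
        t += sample_cycle(rng)
        if t > horizon:
            return total / horizon
        total += sample_reward(rng)

rng = random.Random(7)
rate = long_run_reward_rate(
    lambda r: r.expovariate(1.0 / 50.0),  # assumed cycle length, mean 50
    lambda r: r.uniform(80.0, 120.0),     # assumed cost per restoration, mean 100
    horizon=500_000.0,
    rng=rng,
)
print(rate)  # should be close to E[R] / mu = 2.0
```

The same function can be reused with any other cycle and reward samplers, which is convenient for comparing the long-run cost rates of alternative maintenance policies.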
3.3.3 Non-parametric Analysis Methods
Many elements undergo repeated repairs, which calls for specific statistical methods and models. The elements are usually regarded as stochastically independent, but the times between successive failure or repair events within a system unit are neither necessarily independent nor identically distributed. The data are usually such that the system units have different end-of-observation times. A distribution analysis may still be applicable to systems that experience sequences of failures and repairs, either for their first observed failure or when their times between failures show no particular trend. However, if the sequence of failure or repair events, occurring successively in time, is itself the subject of analysis, then the order in which the events occur matters, and disregarding it would lead to wrong analysis and conclusions. Since the random variables considered in such systems evolve over time, their behavior is better modeled by a process than by a distribution. In reliability, such systems are called repairable systems. They are returned to their normal operating condition by some minimal or major repair. Therefore, when analyzing and evaluating the reliability of a system, it is always important to distinguish between repairable and non-repairable systems in order to choose an appropriate method. For customers, the questions of interest can be:
(1) The number of repairs, on average, for the whole system at a specific operating time.
(2) The expected time to the first repair, to subsequent repairs, etc.
(3) Trends in repair duration or cost: increasing, decreasing, or essentially stable.
(4) How to make decisions on running-in (burn-in) requirements and on maintenance or retirement.
(5) Would running-in be beneficial? How long should it last, and what would it cost?
(6) How to compare different versions, designs, or the performance of systems operating in different environments or regions, etc.
Each of the above questions, and others, can be addressed by the approach presented in this chapter. This section explains the Mean Cumulative Function (MCF). It is based on a non-parametric graphical method (simple, yet powerful and informative) for handling the failure events of systems that undergo a sequence of failures and repairs, where the time needed to carry out a repair is assumed to be negligible. This is a reasonable assumption, since the operating times of an element are normally much longer than its repair times. The MCF method is simple: the data are easy to interpret, organize, and prepare. The MCF model is non-parametric in the sense that its estimation makes no assumption about the form of the mean function or about the process generating the system histories. Moreover, this graphical approach allows system recurrences to be monitored with statistical rigor but without overly complex stochastic techniques. The non-parametric MCF analysis provides similar information to probability plots in a conventional life data analysis. In particular, the plot of the non-parametric estimate of the population MCF yields most of the information sought, and the plot is as informative as a probability plot is for lifetimes and other univariate data. The sample MCF can be estimated and plotted for a single unit or for a whole group of units in a population. It can be plotted for all failure events, for outages, for system failures due to specified failure modes, etc. It can also be used to track field recurrences and to identify recurrence trends, anomalous systems, unusual behavior, the influence of different factors (such as maintenance procedures, operating conditions, or environmental conditions) on failures, etc.
In some cases, it serves as a precursor to a more advanced parametric analysis; in others, it is the only analysis needed.

1. Mean Cumulative Function

The cumulative number of failures or repairs that have occurred by time t (also called the age of the system), N(t), is widely used as a reliability index of a repairable system. Here, time (or age) means any measure of an element's usage, e.g., mileage, kilometers, cycles, months, hours, and so on. Some elements in a sample may have experienced no failure, while others may have experienced one or more failures before their suspension time. Evidently, the number of events occurring by a given time is random. For non-parametric analysis of event (failure or repair) data, each unit of the population can be described by a cumulative history function for its number of failures. The population mean of the cumulative number of events (failures or repairs) at time t is called the MCF, M(t) = E(N(t)). Its derivative, m(t) = dM(t)/dt, is assumed to exist and is called the intensity function, the recurrence rate, or the instantaneous population repair rate at time t. The rate can be increasing, decreasing, or constant. It is expressed in events per
3.3 Repair System Reliability Analysis
collection element per unit time. The cumulative history function of a single unit is a staircase function with a unit step at the age of each event; it tracks the accumulated number of events up to a given time t. The MCF is the average of the staircase functions of all units in the population. Figure 3.11 shows an example of the cumulative history function for a single system, i.e., the plot of the cumulative number of failures versus the system age t. This plot can be regarded as a single observation from a family of possible curves. The following example uses such plots to show the behavior of three identical systems that operate in different environments or under different maintenance conditions.

Example 3.8 [2] Consider three repairable systems, each observed until its 12th failure, with the following failure times.
System 1: 9, 20, 65, 88, 104, 107, 138, 143, 149, 186, 208, and 227.
System 2: 45, 76, 113, 129, 152, 174, 193, 199, 210, 219, 224, and 227.
System 3: 3, 9, 20, 25, 41, 50, 69, 91, 128, 151, 182, and 227.
Figure 3.12a–c, respectively, shows the N(t_i) versus t_i plots. The plots in Fig. 3.12 show how differently the three similar systems behave: their repair rates show linear (stable system), increasing (deteriorating system), and decreasing (improving system) trends, respectively.

In the above, the observed number of events at time t provides an unbiased estimate of the population mean number of events (failures or repairs) per system, M(t). This holds for the case where only one system is available at the time of analysis; other systems would be built at later stages, once the design of this single system has been accepted and has met the specified requirements. In most instances, however, the interest lies in the general behavior and characteristics of several systems produced by the same process and from the same components. The times at which such systems enter service vary, so they have different failure histories and operating times.
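The three histories of Example 3.8 can be compared numerically with a short sketch. Counting failures in the first and second half of the observation window is a crude check chosen here for illustration only, not a method from the text, and the function names are hypothetical.

```python
# Failure ages from Example 3.8; each system is observed until its 12th failure (t = 227).
systems = {
    "System 1": [9, 20, 65, 88, 104, 107, 138, 143, 149, 186, 208, 227],
    "System 2": [45, 76, 113, 129, 152, 174, 193, 199, 210, 219, 224, 227],
    "System 3": [3, 9, 20, 25, 41, 50, 69, 91, 128, 151, 182, 227],
}


def cumulative_count(failure_ages, t):
    """N(t): number of failures that occurred at or before age t."""
    return sum(1 for age in failure_ages if age <= t)


def half_counts(failure_ages):
    """Failures in the first and in the second half of the observation window."""
    end = failure_ages[-1]
    first = cumulative_count(failure_ages, end / 2)
    return first, len(failure_ages) - first


for name, ages in systems.items():
    print(name, half_counts(ages))
```

Roughly equal counts in both halves suggest a stable system, more failures in the second half suggest deterioration, and fewer suggest improvement.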
Figure 3.13 shows an example of the sample MCF for a population of randomly selected identical repairable systems. The points at which the staircase functions intersect a vertical line drawn at age t_i form a distribution of the cumulative number of events: a fraction of the population has accumulated no failures, another fraction one failure, yet another fraction two failures, and so on. This distribution differs from age to age and has mean MCF(t_i) at age t_i.
Fig. 3.11 MCF example [2]
Fig. 3.12 The graphs of N(t_i) versus t_i for a stable system (a), a deteriorating system (b), and an improving system (c)
Fig. 3.13 Records and distribution of failures measured at years t [2]
Therefore, the MCF plot is the pointwise average of all population curves crossing the vertical line at each age t_i, as can be seen in Fig. 3.13. Besides exact failure records of the above type, data may be available in large quantity or in a form in which grouping is practical. Grouping is standard practice when the sample size is reasonably large, because no significant information is lost, in a statistical sense, when the data are grouped into intervals of equal or unequal width.

2. Drawing of MCF Plot and Confidence Bounds

1) Process of MCF drawing for Exact Age Data. The sample MCF can be estimated and plotted in three simple steps:
(1) Pool the failure and suspension ages of the n systems randomly sampled from the population, and rank them from shortest to longest. If a failure age of an element equals its suspension age, the failure age is ranked first. If several elements share the same recurrence or suspension age, they are ordered arbitrarily.
(2) For each age t_i, calculate the number of elements r_i that have reached age t_i just before the failure event occurs, or the number of elements remaining after a suspension occurs, i.e.,

$$
r_i = \begin{cases}
r_{i-1}, & \text{if } t_i \text{ is the age of an event recurrence}, \\
r_{i-1} - 1, & \text{if } t_i \text{ is the age of a suspension}, \\
n, & \text{for the first observed age}.
\end{cases}
$$

(3) The MCF is determined at each sample repair age t_i as

$$
\mathrm{MCF}_i = \frac{1}{r_i} + \mathrm{MCF}_{i-1}, \qquad \mathrm{MCF}_0 = \frac{1}{r_0}.
$$
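The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sample_mcf` and the three unit histories are hypothetical, and each unit is assumed to end with exactly one suspension age.

```python
def sample_mcf(histories):
    """Sample MCF for exact age data.

    histories: one (failure_ages, suspension_age) pair per system.
    Returns a list of (age, r_i, MCF_i), one entry per failure age.
    """
    events = []
    for failure_ages, suspension_age in histories:
        events += [(age, 0) for age in failure_ages]  # 0 marks a failure
        events.append((suspension_age, 1))            # 1 marks a suspension
    events.sort()  # step (1): rank by age; failures precede suspensions on ties

    at_risk = len(histories)  # r = n at the first observed age
    mcf = 0.0
    rows = []
    for age, kind in events:
        if kind == 1:
            at_risk -= 1          # step (2): a suspension removes one unit
        else:
            mcf += 1.0 / at_risk  # step (3): MCF_i = 1/r_i + MCF_{i-1}
            rows.append((age, at_risk, mcf))
    return rows


# Hypothetical histories (hours) for three units, for illustration only.
histories = [([100, 400], 900), ([250], 300), ([], 700)]
for age, r, m in sample_mcf(histories):
    print(f"t = {age:4d}  r = {r}  MCF = {m:.4f}")
```

The running total starts at zero, so the first failure contributes 1/r with r = n, which matches the starting value of the recursion above.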
The above process is illustrated by the following example.

Example 3.9 [2] A company maintains centrifuges and wants to forecast the expected cumulative number of failures for 100 similar centrifuges after 4 years (35,040 h) of operation. Failure and suspension data, in hours, were recorded for a random sample of 22 centrifuges. Each time a centrifuge fails, the staff repairs it and returns it to service. Table 3.1 gives the failure and suspension ages for each centrifuge, where the "+" mark indicates a suspension age. The MCF is estimated by following the steps above; the calculation is given in columns 4–6 of Table 3.2, and the MCF plot is shown in Fig. 3.14. The calculation of the confidence bounds given in the last columns of Table 3.2 is explained later. From the MCF graph in Fig. 3.14, the expected number of failures per centrifuge after 4 years of operation is M(35,040) = 0.3658. Hence, out of 100 centrifuges, about 37 failures are expected.
Table 3.1 Centrifuge failure data of Example 3.9

| Machine ID# | t_i/h | Machine ID# | t_i/h |
|---|---|---|---|
| 1 | 9424; 35,659+ | 12 | 29,981+ |
| 2 | 37; 18,412+ | 13 | 25,761+ |
| 3 | 64; 1920; 39,779+ | 14 | 28,780+ |
| 4 | 707; 34,213+ | 15 | 24,901+ |
| 5 | 18,980; 29,016+ | 16 | 31,360+ |
| 6 | 1851; 28,177+ | 17 | 23,940+ |
| 7 | 28,535+ | 18 | 26,009+ |
| 8 | 29,168+ | 19 | 32,236+ |
| 9 | 6792; 24,304+ | 20 | 30,472+ |
| 10 | 28,921+ | 21 | 23,792+ |
| 11 | 27,853+ | 22 | 30,183+ |
Fig. 3.14 MCF graph of Example 3.9
If a smooth curve is drawn through the MCF graph, its derivative is seen to decrease, i.e., the repair rate decreases as the equipment ages, so the equipment improves with use. This behavior is typical of a product whose manufacturing defects are gradually repaired, or it may reflect effective maintenance or the learning curve of the staff.

2) Confidence limits on MCF for Exact Age Data. Assuming that the cumulative number of events at any recurrence age follows a log-normal distribution, the lower and upper confidence limits of the MCF at any recurrence age t_i can be calculated by

$$
\mathrm{MCF}(t_i)_L = \mathrm{MCF}(t_i)\exp\left(-\frac{K_\delta\sqrt{\mathrm{Var}(\mathrm{MCF}(t_i))}}{\mathrm{MCF}(t_i)}\right)
$$

and

$$
\mathrm{MCF}(t_i)_U = \mathrm{MCF}(t_i)\exp\left(+\frac{K_\delta\sqrt{\mathrm{Var}(\mathrm{MCF}(t_i))}}{\mathrm{MCF}(t_i)}\right),
$$

respectively.
Here, 0.50 < δ < 1 is the confidence level, K_δ is the δ-th standard normal percentile, and Var(MCF(t_i)) is the variance of the MCF at recurrence age t_i. The variance is calculated by

$$
\mathrm{Var}(\mathrm{MCF}(t_i)) = \mathrm{Var}(\mathrm{MCF}(t_{i-1})) + \frac{1}{r_i^2}\left[\sum_{j\in R_{t_i}}\left(d_{ji}-\frac{1}{r_i}\right)^2\right],
$$

where R_{t_i} is the set of units that have not been suspended by t_i, and d_{ji} is defined as follows: d_{ji} = 1 if the j-th unit had an event recurrence at age t_i; d_{ji} = 0 if the j-th unit did not have an event recurrence at age t_i. Based on the above equations, the calculated variance and the upper and lower bounds are provided in the last three columns of Table 3.2. Figure 3.15 shows the plot of the Example 3.9 data with confidence bounds on the MCF. The estimated 90% upper confidence bound on the cumulative number of failures after 4 years of operation for 100 centrifuges is 0.5695 × 100 ≈ 57 failures. This information can be used to estimate repair costs or to plan the stock of spare parts. The next part presents the non-parametric estimate of the sample MCF from grouped data.

3) Drawing of MCF graph and confidence limits for grouped data. The steps to calculate the MCF for grouped data are as follows:
(1) If the data are not available in grouped form, determine the number of intervals and the interval width from the data range. The well-known Sturges' rule can be applied to determine the number of groups into which the observations should be sorted: the number of groups or classes is 1 + 3.3 log n (base-10 logarithm), where n is the number of observations.
(2) Count the number of recurrences and suspensions in every interval and tabulate them.
(3) Enter the number of elements N_i entering each interval, taking into account the suspensions C_i, starting from the initial sample size N_i for i = 0. The following values are calculated as N_i = N_{i−1} − C_{i−1}.
(4) Calculate the MCF, MCF(t_i) = MCF(t_{i−1}) + m(t_i), where the average number of recurrences per sample element over interval i (the increment in MCF(t_i)) is given by

$$
m(t_i) = \frac{R_i}{N_i - 0.5\,C_i}.
$$

Here R_i denotes the number of recurrences in the i-th group, and the denominator is the average number of elements at risk in that group.
(5) Plot the MCF against the interval ages.
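Steps (3) and (4) for grouped data can be sketched as follows. The function name `grouped_mcf` and the three-interval data set are hypothetical and serve only to illustrate the bookkeeping.

```python
def grouped_mcf(recurrences, suspensions, initial_size):
    """MCF for grouped data.

    recurrences[i] is R_i and suspensions[i] is C_i for interval i.
    Returns a list of (N_i, m_i, MCF_i), one entry per interval.
    """
    rows = []
    entering = initial_size  # number of elements entering the first interval
    mcf = 0.0
    for r_i, c_i in zip(recurrences, suspensions):
        m_i = r_i / (entering - 0.5 * c_i)  # increment over the interval
        mcf += m_i                          # MCF(t_i) = MCF(t_{i-1}) + m(t_i)
        rows.append((entering, m_i, mcf))
        entering -= c_i                     # N_i = N_{i-1} - C_{i-1}
    return rows


# Hypothetical grouped data: 100 units, three intervals.
for n_i, m_i, total in grouped_mcf([4, 2, 3], [0, 10, 90], 100):
    print(f"N = {n_i:3d}  m = {m_i:.4f}  MCF = {total:.4f}")
```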
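Table 3.2 MCF and confidence calculations [2]

| Equipment no | t_i/h | State | r_i | 1/r_i | MCF(t_i) | Var(MCF(t_i)) | MCF(t_i) UB | MCF(t_i) LB |
|---|---|---|---|---|---|---|---|---|
| 2 | 37 | F | 22 | 0.0455 | 0.0455 | 0.0020 | 0.1590 | 0.0130 |
| 3 | 64 | F | 22 | 0.0455 | 0.0909 | 0.0039 | 0.2204 | 0.0375 |
| 4 | 707 | F | 22 | 0.0455 | 0.1364 | 0.0059 | 0.2810 | 0.0662 |
| 6 | 1,851 | F | 22 | 0.0455 | 0.1818 | 0.0079 | 0.3400 | 0.0972 |
| 3 | 1,920 | F | 22 | 0.0455 | 0.2273 | 0.0099 | 0.3979 | 0.1298 |
| 9 | 6,792 | F | 22 | 0.0455 | 0.2727 | 0.0118 | 0.4547 | 0.1636 |
| 1 | 9,424 | F | 22 | 0.0455 | 0.3182 | 0.0138 | 0.5107 | 0.1982 |
| 2 | 18,412 | S | 21 | | | | | |
| 5 | 18,980 | F | 21 | 0.0476 | 0.3658 | 0.0160 | 0.5695 | 0.2350 |
| 21 | 23,792 | S | 20 | | | | | |
| 17 | 23,940 | S | 19 | | | | | |
| 9 | 24,304 | S | 18 | | | | | |
| 15 | 24,901 | S | 17 | | | | | |
| 13 | 25,761 | S | 16 | | | | | |
| 18 | 26,009 | S | 15 | | | | | |
| 11 | 27,853 | S | 14 | | | | | |
| 6 | 28,177 | S | 13 | | | | | |
| 7 | 28,535 | S | 12 | | | | | |
| 14 | 28,780 | S | 11 | | | | | |
| 10 | 28,921 | S | 10 | | | | | |
| 5 | 29,016 | S | 9 | | | | | |
| 8 | 29,168 | S | 8 | | | | | |
| 12 | 29,981 | S | 7 | | | | | |
| 22 | 30,183 | S | 6 | | | | | |
| 20 | 30,472 | S | 5 | | | | | |
| 16 | 31,360 | S | 4 | | | | | |
| 19 | 32,236 | S | 3 | | | | | |
| 4 | 34,213 | S | 2 | | | | | |
| 1 | 35,659 | S | 1 | | | | | |
| 3 | 39,779 | S | 0 | | | | | |

F: failure; S: suspension. See Fig. 3.14.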
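The failure rows of Table 3.2 can be recomputed from the ages in Table 3.1 with a short sketch. It assumes a one-sided 90% level (K_δ ≈ 1.2816) and at most one failure per age; the function name `mcf_with_bounds` is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

# Ages in hours taken from Table 3.1 (Example 3.9).
failures = [37, 64, 707, 1851, 1920, 6792, 9424, 18980]
suspensions = [35659, 18412, 39779, 34213, 29016, 28177, 28535, 29168,
               24304, 28921, 27853, 29981, 25761, 28780, 24901, 31360,
               23940, 26009, 32236, 30472, 23792, 30183]


def mcf_with_bounds(failures, suspensions, delta=0.90):
    """Return (age, r, MCF, Var, lower, upper) for every failure age."""
    k = NormalDist().inv_cdf(delta)  # delta-th standard normal percentile
    events = sorted([(t, 0) for t in failures] + [(t, 1) for t in suspensions])
    r = len(suspensions)  # every unit ends with exactly one suspension
    mcf = var = 0.0
    rows = []
    for t, kind in events:
        if kind == 1:
            r -= 1  # a suspension removes one unit from the risk set
            continue
        mcf += 1.0 / r
        # One unit has d = 1 and the remaining r - 1 units have d = 0.
        var += ((1 - 1 / r) ** 2 + (r - 1) * (1 / r) ** 2) / r ** 2
        spread = k * sqrt(var) / mcf
        rows.append((t, r, mcf, var, mcf * exp(-spread), mcf * exp(spread)))
    return rows


for t, r, mcf, var, lower, upper in mcf_with_bounds(failures, suspensions):
    print(f"{t:6d} {r:3d} {mcf:.4f} {var:.4f} {upper:.4f} {lower:.4f}")
```

Under these assumptions the last failure row (18,980 h) agrees with the tabulated MCF and bounds to four decimals.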
4) Confidence limits on MCF for grouped data. Exact confidence limits for the MCF from interval data are complicated and cannot be calculated by a simple procedure. Instead, the naive confidence limits for exact data can be extended to grouped data in a simple way, and they do not require the individual records of every element. The approximate confidence limits of the MCF are calculated by

$$
(\mathrm{MCF}(t_i))_L = \mathrm{MCF}(t_i) - K_\delta\sqrt{\mathrm{Var}(\mathrm{MCF}(t_i))}
$$

and

$$
(\mathrm{MCF}(t_i))_U = \mathrm{MCF}(t_i) + K_\delta\sqrt{\mathrm{Var}(\mathrm{MCF}(t_i))},
$$

where

$$
\mathrm{Var}(\mathrm{MCF}(t_i)) \cong \sum_{j=1}^{i}\frac{m(t_j)}{N_j - 0.5\,C_j}.
$$

Fig. 3.15 MCF with confidence limits of Example 3.9 data: 1—MCF graph; 2—upper bound; 3—lower bound

Fig. 3.16 MCF graph of Example 3.10: 1—MCF; 2—Poly (MCF); 3—Power (MCF)
Example 3.10 [2] Table 3.3 shows field data (grouped by months in service) on replacements of defrost parts in refrigerators. Calculate the MCF estimate for the cumulative percent replaced. What would be the number of replacements over a typical 15-year life of such refrigerators? The last column of Table 3.3 shows MCF(t_i), which is plotted in Fig. 3.16. A second-order polynomial is found to fit better than a power relation. From this polynomial fit, the number of replacements per refrigerator over a period of 15 years can be estimated. However, a large number of suspensions occurred in the 12th and the 24th month. The study revealed that the first 12-month data came from refrigerators whose owners had sent in a dated purchase record card and
Table 3.3 Data for Example 3.10

| Month i | Number of replacements, R_i | Number entered, N_i | Number suspended, C_i | m(t_i) | MCF(t_i) |
|---|---|---|---|---|---|
| 1 | 83 | 22,914 | | 0.0036 | 0.0036 |
| 2 | 35 | 22,914 | | 0.0015 | 0.0051 |
| 3 | 23 | 22,914 | | 0.0010 | 0.0062 |
| 4 | 15 | 22,914 | | 0.0007 | 0.0068 |
| 5 | 22 | 22,914 | | 0.0010 | 0.0078 |
| 6 | 16 | 22,914 | 3 | 0.0007 | 0.0085 |
| 7 | 13 | 22,911 | 36 | 0.0006 | 0.0090 |
| 8 | 12 | 22,875 | 24 | 0.0005 | 0.0096 |
| 9 | 15 | 22,851 | 29 | 0.0007 | 0.0102 |
| 10 | 15 | 22,822 | 37 | 0.0007 | 0.0109 |
| 11 | 24 | 22,785 | 40 | 0.0011 | 0.0119 |
| 12 | 12 | 22,745 | 20,041 | 0.0009 | 0.0129 |
| 13 | 7 | 2,704 | 14 | 0.0026 | 0.0155 |
| 14 | 11 | 2,690 | 17 | 0.0041 | 0.0196 |
| 15 | 15 | 2,673 | 13 | 0.0056 | 0.0252 |
| 16 | 6 | 2,660 | 28 | 0.0023 | 0.0275 |
| 17 | 8 | 2,632 | 22 | 0.0031 | 0.0305 |
| 18 | 9 | 2,610 | 27 | 0.0035 | 0.0340 |
| 19 | 9 | 2,583 | 64 | 0.0035 | 0.0375 |
| 20 | 5 | 2,519 | 94 | 0.0020 | 0.0395 |
| 21 | 6 | 2,425 | 119 | 0.0025 | 0.0421 |
| 22 | 6 | 2,306 | 118 | 0.0027 | 0.0447 |
| 23 | 6 | 2,188 | 138 | 0.0028 | 0.0476 |
| 24 | 5 | 2,050 | 1,188 | 0.0034 | 0.0510 |
| 25 | 7 | 862 | 17 | 0.0082 | 0.0592 |
| 26 | 5 | 845 | 28 | 0.0060 | 0.0652 |
| 27 | 5 | 817 | 99 | 0.0065 | 0.0717 |
| 28 | 6 | 718 | 128 | 0.0092 | 0.0809 |
| 29 | 3 | 590 | 590 | 0.0102 | 0.0911 |
thus had a known installation date; this date and the date of each part replacement were used to calculate the refrigerator's age at each replacement. The data for months 13 through 24 came from refrigerators whose owners had extended the warranty for another year for all parts and labor. The data for months 25 through 29 came from refrigerators whose owners had bought the warranty for yet another year. Therefore, the second- and third-year data are representative of subpopulations with a high replacement rate, not of the population as a whole; consequently,
the MCF plot consists of segments of the MCFs of three subpopulations, which gives the appearance of an increasing replacement rate, whereas the population rate was essentially constant. This can be verified by taking the data of the first 12-month period only.
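The first-year check can be sketched with the grouped-data formulas applied to months 1–12 of Table 3.3. Two assumptions are made here: the blank suspension cells of months 1–5 are read as zero, and a one-sided 90% level is used for K_δ in the approximate limits.

```python
from math import sqrt
from statistics import NormalDist

# Months 1-12 of Table 3.3; blank suspension cells are assumed to be zero.
replacements = [83, 35, 23, 15, 22, 16, 13, 12, 15, 15, 24, 12]
suspended = [0, 0, 0, 0, 0, 3, 36, 24, 29, 37, 40, 20041]
entering = 22914  # refrigerators entering month 1

k = NormalDist().inv_cdf(0.90)  # assumed one-sided 90% level
mcf = var = 0.0
table = []
for month, (r_i, c_i) in enumerate(zip(replacements, suspended), start=1):
    at_risk = entering - 0.5 * c_i  # average number at risk in the month
    m_i = r_i / at_risk
    mcf += m_i
    var += m_i / at_risk            # approximate variance for grouped data
    half_width = k * sqrt(var)
    table.append((month, m_i, mcf, mcf - half_width, mcf + half_width))
    entering -= c_i                 # number entering the next month

for month, m_i, total, lower, upper in table:
    print(f"{month:2d}  m = {m_i:.4f}  MCF = {total:.4f}  [{lower:.4f}, {upper:.4f}]")
```

After the first month, the monthly increments m(t_i) stay between about 0.0005 and 0.0015 with no steady upward trend, in line with the statement that the population rate was essentially constant.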
3.3.4 Parametric Analysis Methods

Reliability analysis based on lifetime distributions (for example, the exponential or Weibull distribution) is generally preferred for non-repairable systems, since the time to failure (TTF) random variables considered are independent and identically distributed. For repairable systems, this assumption may be violated, and models such as the restoration process, the non-homogeneous Poisson process, and Markov models have been extensively applied; the first two model the cases in which the system is brought to the "as good as new" and "as bad as old" states, respectively. However, a repairable system can also end up in states other than these two extremes, viz., better than old but worse than new, better than new, and worse than old. To obtain more accurate analyses and forecasts, the generalized restoration process can be of great value, because it reduces the modeling uncertainty that results from the above repair assumptions. This section introduces some basic terminology related to repairable systems, followed by the modeling and analysis of repairable systems by means of different processes. Markov models are discussed in Sect. 3.3.5. The following definitions and terms pertaining to repairable systems are useful.
(1) Point process. A point process is a stochastic model (i.e., a model with inherent randomness) that describes the occurrence of events in time. These occurrences are thought of as points on the time continuum. In general, the times between occurrences need be neither independent nor identically distributed. In our case, the "occurrences in time" are the failure times of a repairable system.
(2) Counting Random Variable. Let N(t) be the counting random variable that denotes the number of failures occurring in the interval [0, t]. When N has an interval as its argument, such as N(a, b), the result is the number of failures in that interval.
The number of failures in the interval (a, b) is defined as N(a, b) = N(b) − N(a). Suppose N(t_1) = k_1, N(t_2) = k_2, …, N(t_n) = k_n.
(3) Mean Function of Point Process. It is the expected value of the counting random variable N(t), i.e., the expected number of failures up to time t, Λ(t) = E(N(t)).
(4) Rate of Occurrence of Failures. When Λ(t) is differentiable, the rate of occurrence of failures (ROCOF) is defined as
Fig. 3.17 Bathtub-shaped intensity function
$$
u(t) = \frac{d}{dt}\Lambda(t).
$$

(5) Intensity Function. The intensity function of a point process is the probability of a failure in a short interval divided by the length of the interval, i.e.,

$$
u(t) = \lim_{\Delta t \to 0}\frac{P(N(t, t+\Delta t] \ge 1)}{\Delta t}.
$$

(6) Complete Intensity Function (Fig. 3.17). For some models, it is more appropriate to consider the conditional probability given the failure history of the process. Let H_t denote the entire history of the failure process up to time t. This history can be represented by the set of failure times {t_i : i = 1, 2, …, N(t)}. Hence, the complete intensity function is

$$
u(t) = \lim_{\Delta t \to 0}\frac{P(N(t, t+\Delta t] \ge 1 \mid H_t)}{\Delta t}.
$$
1. Restoration process

A restoration process is an idealized stochastic model for events that occur randomly in time. Its basic mathematical assumption is that the times between successive arrivals of events are independent and identically distributed. In the present case, if a repairable system in service can be repaired to an "as good as new" state after every failure, so that the probability density function of the Time Between Failures (TBF) does not change from one failure to the next, then the failure process is called a restoration process. A special case is the Homogeneous Poisson Process (HPP), which is a Poisson process with a constant intensity function u. In other words, if the TBFs X_1, X_2, … are independent and identically distributed exponential random variables, then N(t) corresponds to an HPP with a constant intensity function u. The expected number of failures in the time interval [0, t] is

$$
E[N(t)] = \int_0^t u(t)\,dt = ut. \tag{3.51}
$$
Since its intensity function is constant, the HPP cannot be used to model systems that deteriorate or improve, and it should be applied with care. For such cases, a Poisson process with a non-constant intensity function can be a viable alternative.

2. Non-Homogeneous Poisson Process (NHPP)

The NHPP is a Poisson process whose intensity function is non-constant. To understand the NHPP model, let N(t) be the cumulative number of failures observed in cumulative test time t, and let u(t) be the failure intensity. Under the NHPP model, u(t)Δt is the probability of a failure occurring in the interval [t, t + Δt] for small Δt. Therefore, the expected number of failures experienced over the test interval [0, t] is given by

$$
E[N(t)] = \int_0^t u(t)\,dt. \tag{3.52}
$$

The NHPP model assumes that u(t) can be approximated by the Power Law Model, i.e.,

$$
u(t) = abt^{b-1}, \quad a > 0,\ b > 0, \tag{3.53}
$$

where a is called the scale parameter because it depends on the unit of measurement chosen for t, while b is the shape parameter, which determines the shape of the graph of the intensity function and hence the system behavior. For b = 1, u(t) = a, which indicates a stable system. For b > 1, u(t) is increasing, which indicates a deteriorating system, whereas for b < 1, u(t) is decreasing, which indicates an improving system. The power law model has a sound practical basis in the concept of minimal repair and is widely applicable for two reasons. Firstly, it models the case in which the repair of a failed system is just enough to get the system operational again. Secondly, if the time to first failure follows the Weibull distribution, then every subsequent failure is governed by the power law process (PLP) model. From this point of view, the PLP model is an extension of the Weibull distribution. The expected number of failures for this case becomes

$$
E[N(t)] = \int_0^t u(t)\,dt = at^b. \tag{3.54}
$$
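Equations (3.53) and (3.54) can be checked numerically with a small sketch. The parameter values a = 0.05 and b = 1.2 are hypothetical, and the midpoint rule is used only as a simple way to confirm that the integral of the intensity equals a t^b.

```python
def intensity(t, a, b):
    """Power law intensity u(t) = a*b*t**(b - 1), Eq. (3.53)."""
    return a * b * t ** (b - 1)


def expected_failures(t, a, b):
    """Expected number of failures E[N(t)] = a*t**b, Eq. (3.54)."""
    return a * t ** b


def trend(b):
    """Classify system behavior from the shape parameter b."""
    if b > 1:
        return "deteriorating"
    if b < 1:
        return "improving"
    return "stable"


def integrate_intensity(t, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of u over [0, t]."""
    h = t / steps
    return sum(intensity((k + 0.5) * h, a, b) for k in range(steps)) * h


a, b = 0.05, 1.2  # hypothetical parameter values, for illustration only
print(trend(b))
print(expected_failures(550.0, a, b), integrate_intensity(550.0, a, b))
```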
This form comes from the assumption that the inter-arrival times between successive failures follow a conditional Weibull probability distribution, i.e., the arrival of the i-th failure is conditional on the cumulative operating time up to the (i − 1)-th failure. This conditionality arises because the system remains in the "as bad as old" state after the (i − 1)-th repair; the repair process does not add any life to the element or system. To obtain the model parameters, consider the following definition of conditional probability (see Fig. 3.18):

Fig. 3.18 Conditional probability of failures happening

$$
P(T \le t \mid T > t_1) = \frac{F(t) - F(t_1)}{R(t_1)} = 1 - \frac{R(t)}{R(t_1)}, \tag{3.55}
$$

where F(t) and R(t) are the failure probability and the reliability at the respective times. So

$$
F(t_i) = 1 - \exp\left(a(t_{i-1})^b - a(t_i)^b\right). \tag{3.56}
$$

With the help of Eqs. (3.53) and (3.56), the probability density function of the i-th failure, given that the (i − 1)-th failure occurred at time t_{i−1}, can be obtained as

$$
f(t_i \mid t_{i-1}) = abt_i^{b-1}\exp\left(-a\left((t_i)^b - (t_{i-1})^b\right)\right). \tag{3.57}
$$
The parameters of the NHPP model can be estimated using the Maximum Likelihood Estimation (MLE) approach, as given below. The likelihood function is defined as

$$
L = \prod_{i=1}^{n} f(t_i \mid t_{i-1}),
$$

where n is the number of failures. Using the probability density function given in Eq. (3.57), the likelihood function becomes

$$
L = a^n b^n e^{-at^{*b}}\prod_{i=1}^{n} t_i^{b-1}, \tag{3.58}
$$

where

$$
t^{*} = \begin{cases}
t_n, & \text{if the test is failure terminated}, \\
t > t_n, & \text{if the test is time terminated}.
\end{cases}
$$

Taking the natural logarithm of both sides of Eq. (3.58) gives

$$
\ln L = n\ln a + n\ln b - at^{*b} + (b-1)\sum_{i=1}^{n}\ln t_i. \tag{3.59}
$$

Differentiating Eq. (3.59) with respect to a and setting the result to zero gives

$$
\frac{\partial(\ln L)}{\partial a} = \frac{n}{a} - t^{*b} = 0, \qquad \hat{a} = \frac{n}{t^{*b}}. \tag{3.60}
$$

Similarly, taking the first derivative of Eq. (3.59) with respect to b and setting it to zero gives

$$
\frac{\partial(\ln L)}{\partial b} = \frac{n}{b} - at^{*b}\ln t^{*} + \sum_{i=1}^{n}\ln t_i = 0, \qquad \hat{b} = \frac{n}{n\ln t^{*} - \sum_{i=1}^{n}\ln t_i}. \tag{3.61}
$$
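The above expressions can easily be extended to a population as well. Let there be a population of K similar systems, l = 1, 2, …, K, where the l-th system has n_l failures at times t_{l,i}. Then

$$
L = \prod_{l=1}^{K}\left[a^{n_l} b^{n_l} e^{-at^{*b}}\prod_{i=1}^{n_l} t_{l,i}^{\,b-1}\right],
$$

$$
\ln L = \ln a\sum_{l=1}^{K} n_l + \ln b\sum_{l=1}^{K} n_l - Kat^{*b} + (b-1)\sum_{l=1}^{K}\sum_{i=1}^{n_l}\ln t_{l,i}. \tag{3.62}
$$

Differentiating Eq. (3.62) with respect to a and b, respectively, and setting the results to zero, we get

$$
\hat{a} = \frac{\sum_{l=1}^{K} n_l}{Kt^{*b}}, \tag{3.63}
$$

$$
\hat{b} = \frac{\sum_{l=1}^{K} n_l}{\sum_{l=1}^{K}\sum_{i=1}^{n_l}\ln\left(\dfrac{t^{*}}{t_{l,i}}\right)}. \tag{3.64}
$$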
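Equations (3.63) and (3.64) translate directly into code. The sketch below applies them to the engine failure times of Table 3.4 (Example 3.12), assuming every engine is time terminated at t* = 550 h; the function name `power_law_mle` is hypothetical. With a single system (K = 1) the same function reduces to Eqs. (3.60) and (3.61).

```python
from math import log


def power_law_mle(failure_times_per_system, t_star):
    """MLEs of the power law parameters for K systems observed up to t_star."""
    k = len(failure_times_per_system)
    n_total = sum(len(times) for times in failure_times_per_system)
    log_sum = sum(log(t_star / t)
                  for times in failure_times_per_system for t in times)
    b_hat = n_total / log_sum                 # Eq. (3.64)
    a_hat = n_total / (k * t_star ** b_hat)   # Eq. (3.63)
    return a_hat, b_hat


# Failure times (hours) of the 18 engines listed in Table 3.4.
engines = [[324, 399, 531], [342], [287, 317], [531], [426], [321, 337, 495],
           [48, 408], [325], [349], [451], [414], [102], [164, 176],
           [160, 461], [123], [299], [318], [203, 521]]

a_hat, b_hat = power_law_mle(engines, 550.0)
u_550 = a_hat * b_hat * 550.0 ** (b_hat - 1)  # intensity at t = 550 h
print(f"b = {b_hat:.2f}, a = {a_hat:.6f}, operating MTBF = {1 / u_550:.0f} h")
```

The estimates come out close to the values quoted in Example 3.12; small differences in the last digits are possible because of rounding in the printed data.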
Example 3.11 [2] Engines used in aircraft deliver high thrust to enable steep climbs and to sustain high "G" loads during maneuvers. They are also designed to prevent surge and stall due to back pressure associated with turbulence in front of the engine. These engines are exposed to high aerodynamic and thermal loads and are therefore subject to frequent failures. A picture of such an engine is shown in Fig. 3.19. The failure times, in hours, of such an engine with a time between repairs of 550 h are given below:
Fig. 3.19 Typical aviation engine
203, 477, 318, 536, 494, 213, 303, 525, 345, 299, 154, 230, 132, 321, 123, 351, 188, 49. 02, 267, 548, 380, 61, 160, 375, 550, 174, 176, 257, 102, 81, 541, 518, 533, 547, 299, 208, 326, 451, 349, 152, 509, 249, 325, 261, 328, 48, 19, 142, 200, 426, 90, 522, 446, 338, 55, 549, 84, 342, 162, 250, 368, 96, 431, 14, 207, 324, and 546.

The maintenance staff needs to find out whether the engine is wearing out over a period of 550 h. This will also help them to evaluate the effectiveness of the current preventive maintenance process. The staff is also interested in determining the service life of the engine. Since the engine is a repairable system, apply the NHPP model to estimate the scale and shape parameters. Plot the intensity function and suggest corrective measures.

Solution Applying Eqs. (3.60) and (3.61), we get

$$
b = \frac{67}{57.6453} = 1.16, \qquad a = \frac{67}{550^{1.16}} = 0.0444.
$$
Applying Eq. (3.53) gives the intensity function, whose curve is shown in Fig. 3.20. The increase in the intensity function is due to wear-out in the system. The maintenance staff can consider reviewing their current maintenance process and applying the laid-down standard maintenance procedures more efficiently, or revising the current maintenance process. In any case, attention should also be paid to the quality of repair.

Fig. 3.20 Intensity function graph for Example 3.11

Example 3.12 [2] The failure times of 18 aviation engines with a time between repairs of 550 h are given in Table 3.4.
Table 3.4 Time to failure data of engines of Example 3.12

| Engine no | Times to failure/h | Engine no | Times to failure/h |
|---|---|---|---|
| 1 | 324; 399; 531 | 10 | 451 |
| 2 | 342 | 11 | 414 |
| 3 | 287; 317 | 12 | 102 |
| 4 | 531 | 13 | 164; 176 |
| 5 | 426 | 14 | 160; 461 |
| 6 | 321; 337; 495 | 15 | 123 |
| 7 | 48; 408 | 16 | 299 |
| 8 | 325 | 17 | 318 |
| 9 | 349 | 18 | 203; 521 |
Estimate the scale and shape parameters using the NHPP model. Plot the intensity function. Also, estimate the operating MTBF at t = 550 h.

Solution Applying Eqs. (3.63) and (3.64), we get

$$
b = \frac{27}{17.2718} = 1.56, \qquad a = \frac{27}{18 \times 550^{1.56}} = 0.000080, \qquad u(t) = 0.000080 \times 1.56\,t^{0.56}.
$$
Operating MTBF = (u(t))^{−1}. MTBF(t = 550 h) = 234 h. The intensity function curve is shown in Fig. 3.21. The wear-out in these engines appears more pronounced than in the previous example. The maintenance staff can consider reviewing their current maintenance process and applying the laid-down standard maintenance procedures more efficiently, or revising the current maintenance process. The quality of repair should also be taken into account.

Fig. 3.21 Intensity function graph for Example 3.12

3. Generalized Restoration Process (GRP)

The GRP approach explained here is based on two broad classes of general repair models, Arithmetic Reduction of Age (ARA) and Arithmetic Reduction of Intensity (ARI). In ARA models, the effect of repair is expressed as a reduction in the actual age of the system, which yields an age called the virtual age. ARA-based Kijima virtual age models are the most widely cited and used models in the literature. In the ARI approach, the effect of repair is expressed by the change it induces in the failure intensity before and after failure. A short introduction to ARA and ARI models is given in this subsection.

1) ARA model

The principle of this type of model is that repair rejuvenates the system. The actual age of a system is its operating time t, and the virtual age is defined as a positive function of its actual age, possibly depending on past failures. The principle that repair reduces the age of the system is the basis of Kijima's virtual age models. To introduce the Kijima virtual age models, consider a repairable system observed from time t_0 = 0, and denote by t_1, t_2, … the successive failure times. Let the times between failures be denoted by X_n = t_n − t_{n−1}. The idea of virtual age is then used to express the effectiveness of repair in the following way. Let q be the repair effectiveness and V_n the virtual age of the system after the n-th repair, with V_0 = 0. The Kijima type-1 model assumes that the n-th repair can remove only the damage incurred during the time between the (n − 1)-th and n-th failures, which gives the virtual age

$$
V_n = V_{n-1} + qX_n, \tag{3.65}
$$

$$
V_i = V_{i-1} + qX_i, \tag{3.66}
$$

$$
V_i = q\sum_{j=1}^{i} X_j, \tag{3.67}
$$

where the distribution of X_n is determined by

$$
\Pr\{X_n < x \mid V_{n-1} = y\} = \frac{F(x+y) - F(y)}{1 - F(y)}. \tag{3.68}
$$

In practice, however, the n-th repair can also reduce all the damage accumulated up to the n-th failure, which gives the Kijima type-2 model for the virtual age

$$
V_n = q(V_{n-1} + X_n), \tag{3.69}
$$

$$
V_i = q(V_{i-1} + X_i), \tag{3.70}
$$

$$
V_i = \sum_{j=1}^{i} q^{\,i-j+1} X_j, \tag{3.71}
$$

where the distribution of X_n is given by Eq. (3.68).
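The two virtual age recursions can be sketched as follows. The function name `virtual_ages` and the times between failures are hypothetical; the closed forms of Eqs. (3.67) and (3.71) are used only as a cross-check.

```python
def virtual_ages(times_between_failures, q, model=1):
    """Virtual ages V_1, ..., V_n for Kijima type-1 (model=1) or type-2 (model=2)."""
    ages = []
    v = 0.0  # V_0 = 0
    for x in times_between_failures:
        if model == 1:
            v = v + q * x    # Eq. (3.65): only the latest interval is affected
        else:
            v = q * (v + x)  # Eq. (3.69): the whole accumulated age is affected
        ages.append(v)
    return ages


# Hypothetical times between failures (hours), for illustration only.
x = [100.0, 50.0, 80.0]
print(virtual_ages(x, 0.5, model=1))
print(virtual_ages(x, 0.5, model=2))
print(virtual_ages(x, 0.0, model=1))
print(virtual_ages(x, 1.0, model=2))
```

With q = 0 every virtual age is zero, and with q = 1 both models return the actual ages, which are the two limiting repair states.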
In the Kijima virtual age models, q = 0 corresponds to the "as good as new" state after repair, which can therefore be modeled by a restoration process (RP). The assumption q = 1 means that the element is restored to the state it was in just before the repair, i.e., the "as bad as old" state, which can be modeled by an NHPP. Hence, q can be interpreted as an index of the effectiveness and quality of repair. For example, values in the interval 0 < q < 1 correspond to a system state that is better than old but worse than new. For q > 1, the system is in a state worse than old. The assumptions of the Kijima models are as follows:
(1) The Time to the First Failure (TTFF) distribution is known and can be estimated from the available data using Weibull distribution MLEs.
(2) The repair time is assumed to be negligible, so that the failures can be treated as a point process.

The parameters of the GRP model are also estimated using MLE. At present, the approach of Yanez [3] is among the most widely used for GRP parameter estimation. The inter-arrival times of failures are assumed to follow the Weibull distribution, and f(t) and F(t) of the time to the i-th failure are given by

$$
f(t_i \mid t_{i-1}) = ab(V_{i-1} + X_i)^{b-1}\exp\left\{a\left[(V_{i-1})^b - (V_{i-1} + X_i)^b\right]\right\},
$$

$$
F(t_i) = 1 - \exp\left\{a\left[(V_{i-1})^b - (V_{i-1} + X_i)^b\right]\right\}.
$$

In the above equations, V_i can come from either the Kijima type-1 or the Kijima type-2 model. The likelihood, log-likelihood, and MLEs for failure-terminated and time-terminated data (for single and multiple repairable systems) are shown below for both the Kijima type-1 and Kijima type-2 models.

2) Kijima type-1 Model

The likelihood and log-likelihood functions and the MLEs for a failure-terminated single repairable system data set are given as

$$
L = \prod_{i=1}^{n}\left[ab\left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b-1}\right]\exp\left[a\left\{\left(q\sum_{j=1}^{i-1} X_j\right)^{b} - \left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b}\right\}\right]. \tag{3.72}
$$

Taking the logarithm of both sides of Eq. (3.72), we get

$$
\ln L = n\ln b + n\ln a + (b-1)\sum_{i=1}^{n}\ln\left(q\sum_{j=1}^{i-1} X_j + X_i\right) + \sum_{i=1}^{n} a\left\{\left(q\sum_{j=1}^{i-1} X_j\right)^{b} - \left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b}\right\}. \tag{3.73}
$$
To obtain the failure-terminated MLEs, differentiate the log-likelihood function, Eq. (3.73), with respect to each of the three parameters a, b, and q, and set the results equal to zero:

$$
\frac{\partial \ln L}{\partial a} = \frac{n}{a} + \sum_{i=1}^{n}\left[\left(q\sum_{j=1}^{i-1} X_j\right)^{b} - \left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b}\right] = 0, \tag{3.74}
$$

$$
\frac{\partial \ln L}{\partial b} = \frac{n}{b} + \sum_{i=1}^{n} a\left[\left(q\sum_{j=1}^{i-1} X_j\right)^{b}\ln\left(q\sum_{j=1}^{i-1} X_j\right) - \left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b}\ln\left(q\sum_{j=1}^{i-1} X_j + X_i\right)\right] + \sum_{i=1}^{n}\ln\left(q\sum_{j=1}^{i-1} X_j + X_i\right) = 0, \tag{3.75}
$$

$$
\frac{\partial \ln L}{\partial q} = (b-1)\sum_{i=1}^{n}\frac{\sum_{j=1}^{i-1} X_j}{q\sum_{j=1}^{i-1} X_j + X_i} + a\sum_{i=1}^{n}\left[b\left(q\sum_{j=1}^{i-1} X_j\right)^{b-1}\sum_{j=1}^{i-1} X_j - b\left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b-1}\sum_{j=1}^{i-1} X_j\right] = 0. \tag{3.76}
$$
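The log-likelihood of Eq. (3.73) can be evaluated with a short function, which could then be passed to any numerical optimizer to solve Eqs. (3.74)–(3.76). The sketch below is illustrative only: the function name, the failure times, and the parameter values are hypothetical. It also checks the two limiting cases, q = 1 (the NHPP log-likelihood of Eq. (3.59) with t* = t_n) and q = 0 (independent Weibull inter-failure times).

```python
from math import log


def kijima1_log_likelihood(failure_times, a, b, q):
    """Failure-terminated log-likelihood of the Kijima type-1 model, Eq. (3.73)."""
    ll = 0.0
    previous_time = 0.0
    elapsed = 0.0  # sum of X_j for j < i
    for t in failure_times:
        x = t - previous_time  # X_i = t_i - t_{i-1}
        v_prev = q * elapsed   # V_{i-1} = q * sum_{j<i} X_j
        ll += log(a) + log(b) + (b - 1) * log(v_prev + x)
        ll += a * (v_prev ** b - (v_prev + x) ** b)
        elapsed += x
        previous_time = t
    return ll


# Hypothetical failure times (hours) and parameters, for illustration only.
times = [50.0, 120.0, 200.0, 260.0]
a, b = 0.01, 1.3
n = len(times)

nhpp = n * log(a) + n * log(b) + (b - 1) * sum(log(t) for t in times) - a * times[-1] ** b
gaps = [50.0, 70.0, 80.0, 60.0]  # times between the failures listed above
renewal = sum(log(a) + log(b) + (b - 1) * log(x) - a * x ** b for x in gaps)

print(kijima1_log_likelihood(times, a, b, 1.0), nhpp)
print(kijima1_log_likelihood(times, a, b, 0.0), renewal)
```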
The likelihood and log-likelihood functions and the MLEs for a time-terminated single repairable system data set are given as

$$
L = \prod_{i=1}^{n}\left[ab\left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b-1}\right]\exp\left\{a\left[\left(q\sum_{j=1}^{i-1} X_j\right)^{b} - \left(q\sum_{j=1}^{i-1} X_j + X_i\right)^{b}\right]\right\} \times \exp\left\{a\left[\left(q\sum_{j=1}^{n} X_j\right)^{b} - \left(T - t_n + q\sum_{j=1}^{n} X_j\right)^{b}\right]\right\}. \tag{3.77}
$$
For a device that consists of K systems (l = 1, 2, …, K), similar procedures apply. The likelihood and log-likelihood functions and the MLEs for a time-terminated multiple repairable system data set are given as

$$
L = \prod_{l=1}^{K}\left[\prod_{i=1}^{n_l}\left[ab\left(q\sum_{j=1}^{i-1} X_{l,j} + X_{l,i}\right)^{b-1}\right]\exp\left\{a\left[\left(q\sum_{j=1}^{i-1} X_{l,j}\right)^{b} - \left(q\sum_{j=1}^{i-1} X_{l,j} + X_{l,i}\right)^{b}\right]\right\} \times \exp\left\{a\left[\left(q\sum_{j=1}^{n_l} X_{l,j}\right)^{b} - \left(T - t_{l,n} + q\sum_{j=1}^{n_l} X_{l,j}\right)^{b}\right]\right\}\right]. \tag{3.78}
$$
3) Kijima type-2 Model

For a failure-terminated data set of a single repairable system, the probability function is given as

$$L = \prod_{i=1}^{n}\left[ab\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b-1}\right]\exp\left\{a\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b} - \left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b}\right]\right\}. \tag{3.79}$$
Taking the logarithm of both sides of Eq. (3.79), we get

$$\ln L = n\ln b + n\ln a + (b-1)\sum_{i=1}^{n}\ln\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right) + \sum_{i=1}^{n} a\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b} - \left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b}\right]. \tag{3.80}$$
To obtain the failure-terminated MLEs, differentiate the above logarithm of the probability function with regard to each of the three parameters a, b, and q, and equate the derivatives to zero.

$$\frac{\partial \ln L}{\partial a} = \frac{n}{a} + \sum_{i=1}^{n}\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b} - \left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b}\right] = 0, \tag{3.81}$$

$$\frac{\partial \ln L}{\partial b} = \frac{n}{b} + \sum_{i=1}^{n}\ln\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right) + a\sum_{i=1}^{n}\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b}\ln\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right) - \left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b}\ln\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)\right] = 0, \tag{3.82}$$

$$\frac{\partial \ln L}{\partial q} = (b-1)\sum_{i=1}^{n}\frac{\sum_{j=1}^{i-1}(i-j)q^{i-j-1}X_j}{\sum_{j=1}^{i-1}q^{i-j}X_j + X_i} + a\sum_{i=1}^{n}\left[b\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b-1}\left(\sum_{j=1}^{i-1}(i-j)q^{i-j-1}X_j\right) - b\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b-1}\sum_{j=1}^{i-1}(i-j)q^{i-j-1}X_j\right] = 0. \tag{3.83}$$
For a time-terminated data set of a single repairable system, the probability function is given as

$$L = \prod_{i=1}^{n}\left\{\left[ab\left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b-1}\right]\exp\left\{a\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_j\right)^{b} - \left(\sum_{j=1}^{i-1}q^{i-j}X_j + X_i\right)^{b}\right]\right\}\right\} \times \exp\left\{a\left[\left(\sum_{j=1}^{n}q^{n-j+1}X_j\right)^{b} - \left(T - t_n + \sum_{j=1}^{n}q^{n-j+1}X_j\right)^{b}\right]\right\}. \tag{3.84}$$
For a device that consists of K systems (l = 1, 2, …, K), similar procedures apply. For a time-terminated data set of multiple repairable systems, the probability function is given as

$$L = \prod_{l=1}^{K}\left\{\prod_{i=1}^{n_l}\left[ab\left(\sum_{j=1}^{i-1}q^{i-j}X_{l,j} + X_{l,i}\right)^{b-1}\right]\exp\left\{a\left[\left(\sum_{j=1}^{i-1}q^{i-j}X_{l,j}\right)^{b} - \left(\sum_{j=1}^{i-1}q^{i-j}X_{l,j} + X_{l,i}\right)^{b}\right]\right\} \times \exp\left\{a\left[\left(\sum_{j=1}^{n_l}q^{n_l-j+1}X_{l,j}\right)^{b} - \left(T - t_{l,n} + \sum_{j=1}^{n_l}q^{n_l-j+1}X_{l,j}\right)^{b}\right]\right\}\right\}. \tag{3.85}$$
The MLE equations obtained for the Kijima type-1 and Kijima type-2 models are non-linear in every case considered and cannot be solved in closed form. They can be solved numerically by means of software such as MATLAB. Nevertheless, an easier way to get the estimates is to maximize the log-probability functions directly. There are different approaches to maximizing the objective function, and any suitable software product can be used for this.

4) Virtual Age-Based Reliability Indexes

Transforming the reliability indexes from the real timescale to the virtual timescale makes the mathematical calculations easier when evaluating reliability performance. The virtual timescale can be converted back into the real timescale later. The equations for the intensity function, MTBF, reliability, availability, and expected number of failures based on the virtual timescale are given in Table 3.5.

Example 3.13 [2] Consider the failure times of the aviation engines of Example 3.10. Evaluate the scale and shape parameters and the repair effectiveness index by applying the GRP Kijima type-1 model. Draw the intensity function graph, given the mean time to repair of the engines, MTTR = 528 h. Draw the availability graph. What will the MTBF and availability of the engines be at t = 550 h?

Solution Maximizing Eq. (3.73) or solving Eqs. (3.74), (3.75), and (3.76), we obtain the following results: a = 0.00022, b = 1.35, q = 0.75, MTTR = 528 h (given).
Table 3.5 Virtual timescale age-based reliability indexes

| No. | Performance name | Equation | Eq. no. |
|---|---|---|---|
| 1 | Intensity function | $u(v_i) = a b v_i^{b-1}$ | (3.86) |
| 2 | Mean Time Between Failures (MTBF) | $\mathrm{MTBF}(v_i) = \dfrac{1}{u(v_i)}$ | (3.87) |
| 3 | Reliability | For the 1st failure: $R(v_1) = \exp\{-a(v_1)^b\}$, where $v_1 = q t_1$ | (3.88) |
| 4 | | For the next failures: $R(v_i \mid v_{i-1}) = \dfrac{R(v_1 + v_{i-1})}{R(v_{i-1})} = \exp\left[a\left((v_{i-1})^b - (v_1 + v_{i-1})^b\right)\right]$ | (3.89) |
| 5 | Availability | $A(v_i) = \dfrac{\mathrm{MTBF}(v_i)}{\mathrm{MTBF}(v_i) + \mathrm{MTTR}}$, where MTTR is the Mean Time to Repair | (3.90) |
| 6 | Expected number of failures | $E[N(t)] = \int_0^{v_i} u(v)\,dv$ | (3.91) |

for i = 1, 2, …, n
Using Eqs. (3.86), (3.87), and (3.90), the intensity function, MTBF, and availability can be evaluated. The intensity function is $u(v_i) = 0.00022 \times 1.35 \times (v_i)^{0.35}$. The intensity function graph is shown in Fig. 3.22. MTBF (t = 550 h) = 394.5602 h. The availability graph is shown in Fig. 3.23. Availability (t = 550 h) = 0.4277.
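The indexes of Table 3.5 are simple enough to evaluate directly. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the closed form a·v^b for the expected number of failures is just the integral of Eq. (3.86). It checks that the availability quoted in Example 3.13 follows from the quoted MTBF and MTTR through Eq. (3.90).

```python
def intensity(v, a, b):
    """Intensity function u(v) = a*b*v**(b-1), Eq. (3.86)."""
    return a * b * v ** (b - 1.0)


def mtbf(v, a, b):
    """MTBF(v) = 1/u(v), Eq. (3.87)."""
    return 1.0 / intensity(v, a, b)


def availability(mtbf_value, mttr):
    """A = MTBF/(MTBF + MTTR), Eq. (3.90)."""
    return mtbf_value / (mtbf_value + mttr)


def expected_failures(v, a, b):
    """Integral of u from 0 to v, Eq. (3.91); for the power law this is a*v**b."""
    return a * v ** b


if __name__ == "__main__":
    # MTBF and MTTR values quoted in Example 3.13
    print(round(availability(394.5602, 528.0), 4))  # prints 0.4277
```

Only the last step of the example is reproduced here; the virtual age at t = 550 h that leads to the quoted MTBF is not restated in the text, so it is not recomputed.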
Fig. 3.22 Intensity function graph for Example 3.13
Fig. 3.23 Availability graph for Example 3.13
Example 3.14 [2] To illustrate the general application of GRP, consider a system tested for T = 395.2 h with the 56 failure times given in Table 3.6. The first failure was detected at 0.7 h into the test, and the second failure was detected 3 h later, at 3.7 h. The last failure was detected at 395.2 h into the test, after which the system was removed from the test. This failure data is failure truncated. Find the values of a, b, and q for the given data set using the Kijima MLEs.

Solution Here, n = 56, K = 1, and T = 395.2 h. Maximizing Eq. (3.73) or solving Eqs. (3.74), (3.75), and (3.76), we get the following results for the Kijima type-1 model: b = 0.9372, a = 0.2061, q = 1.0. Maximizing Eq. (3.80) or solving Eqs. (3.81), (3.82), and (3.83), we get the following results for the Kijima type-2 model: b = 0.24725, a = 0.89442, q = 0.93.
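Table 3.6 Time to failure data (hours) of Example 3.14

| | | | | | |
|---|---|---|---|---|---|
| 0.7 | 0.63 | 125 | 244 | 315 | 366 |
| 3.7 | 72 | 133 | 249 | 317 | 373 |
| 1 | 99 | 151 | 250 | 320 | 379 |
| 1 | 99 | 163 | 260 | 324 | 389 |
| 1 | 100 | 164 | 263 | 324 | 394 |
| 2 | 102 | 174 | 273 | 342 | 395.2 |
| 4 | 112 | 177 | 274 | 350 | |
| 5 | 112 | 191 | 282 | 355 | |
| 5 | 120 | 192 | 285 | 364 | |
| 5 | 121 | 213 | 304 | 364 | |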
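The route of maximizing the log-probability (log-likelihood) function directly, as used in Examples 3.13 and 3.14, can be sketched for the failure-terminated Kijima type-1 case. This is an illustrative sketch under stated assumptions, not the procedure used in the book: for fixed b and q, Eq. (3.74) is solved for a in closed form, and b and q are then chosen by a coarse grid search. The function names, the grids, and the inter-failure times `x` are all hypothetical.

```python
import math


def virtual_ages(x, q):
    """Kijima type-1 virtual age before each interval: q * sum of the previous X_j."""
    ages, total = [], 0.0
    for xi in x:
        ages.append(q * total)
        total += xi
    return ages


def log_likelihood(x, a, b, q):
    """Failure-terminated log-likelihood of Eq. (3.73)."""
    ll = len(x) * (math.log(a) + math.log(b))
    for xi, v in zip(x, virtual_ages(x, q)):
        ll += (b - 1.0) * math.log(v + xi) + a * (v ** b - (v + xi) ** b)
    return ll


def a_hat(x, b, q):
    """Value of a that satisfies Eq. (3.74) for fixed b and q."""
    s = sum((v + xi) ** b - v ** b for xi, v in zip(x, virtual_ages(x, q)))
    return len(x) / s


def fit(x, b_grid, q_grid):
    """Coarse grid search over (b, q); returns (log-likelihood, a, b, q)."""
    best = None
    for b in b_grid:
        for q in q_grid:
            a = a_hat(x, b, q)
            ll = log_likelihood(x, a, b, q)
            if best is None or ll > best[0]:
                best = (ll, a, b, q)
    return best


if __name__ == "__main__":
    # Hypothetical inter-failure times X_i in hours (not data from the book)
    x = [120.0, 95.0, 80.0, 60.0, 75.0, 40.0, 55.0, 30.0]
    b_grid = [0.5 + 0.05 * k for k in range(51)]   # 0.5 ... 3.0
    q_grid = [0.05 * k for k in range(21)]         # 0.0 ... 1.0
    print(fit(x, b_grid, q_grid))
```

A grid search is used only to keep the sketch dependency-free; a gradient-based or simplex optimizer would normally refine the result.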
3.3.5 Markov Chain and Its Application

1. Basic information

The mathematical description of a Markov random process occurring in a system with discrete states depends on the times at which the system's transitions from state to state can occur. If transitions between states can occur only at predetermined times, the process is called a discrete-time Markov process. If transitions can occur at any time, the process is called a continuous-time Markov process. With an exponential distribution of the random residence time of the system in each of its states, the Markov process is homogeneous (the intensity of transitions between states does not depend on time).

Homogeneous Markov processes with a discrete set of states and continuous time are the main tools for studying the reliability of complex systems with recovery. This is because they make it possible to obtain analytical expressions or computational schemes for calculating various reliability indicators. Moreover, in the vast majority of cases, the initial data for the elements are either constant failure rates or average operating times before failure.

Markov reliability models are constructed as follows. Based on information about the structure and operating principles of the system under study, the set of its possible states is determined. This set is divided into two subsets: functional states and failure states. Next, a Markov transition graph is constructed, where the vertices are the states of the system and the edges are the possible transitions between states. The intensities of the transitions are determined by the reliability and maintainability characteristics of the system elements. From the transition graph, the necessary system of equations is compiled; its analytical solution yields expressions for the required reliability indicators. If the system can be solved only by numerical methods, numerical values of the reliability indicators at specified times are obtained.

2. Markov model of the reliability of a repairable element

Consider the operation of a repairable element under the following assumptions:

(1) The failure flow of the element is Poisson with parameter λ (failure rate).
(2) The recovery flow of the element is Poisson with parameter μ (recovery intensity).

These assumptions are equivalent to assuming an exponential distribution of the random failure and recovery times. An element can be in two states:

(1) Serviceable state.
(2) Failure state.

The Markov graph of element transitions between the serviceable state and the failure state is shown in Fig. 3.24. Let's denote: $P_1(t)$ is the probability of finding the element in state 1 at time t; $P_2(t)$ is the probability of finding the element in state 2 at time t. The event A of the operability of the element (being in state 1) at the time t + ∆t can occur in two ways.
Fig. 3.24 Markov transition graph for the element being recovered
Either event B occurs: the element was already in state 1 at time t and did not leave this state during the interval ∆t (no failure occurred during ∆t). Or event C occurs: at time t the element was in the failure state 2 and during the interval ∆t it passed from state 2 to state 1 (the functionality of the element was restored within ∆t). The probability of event B is equal to

$$P(B) = P_1(t)e^{-\lambda\Delta t}. \tag{3.92}$$

Expanding the exponential in a series, we have

$$e^{-\lambda\Delta t} = 1 - \lambda\Delta t + \frac{(\lambda\Delta t)^2}{2!} - \frac{(\lambda\Delta t)^3}{3!} + \cdots = 1 - \lambda\Delta t + o(\Delta t),$$

or, neglecting terms of higher order of smallness,

$$P(B) \approx P_1(t)(1 - \lambda\Delta t). \tag{3.93}$$

The probability of event C is equal to

$$P(C) = P_2(t)\left(1 - e^{-\mu\Delta t}\right) \approx P_2(t)\mu\Delta t. \tag{3.94}$$

Then the probability of event A (the operability of the element at time t + ∆t), taking into account that events B and C are mutually exclusive, is defined as

$$P(A) = P_1(t + \Delta t) = P_1(t)(1 - \lambda\Delta t) + P_2(t)\mu\Delta t. \tag{3.95}$$

If we move $P_1(t)$ to the left side of the equation, divide the resulting increment of the function by the increment of the argument, and let ∆t tend to zero, we get a differential equation with respect to the unknown probability $P_1(t)$:

$$P_1'(t) = -\lambda P_1(t) + \mu P_2(t). \tag{3.96}$$
Similarly, by reasoning, one can obtain a differential equation with respect to the probability P2 (t). Thus, we have obtained a system of ordinary differential equations for the probabilities P1 (t) and P2 (t)

$$\begin{cases} P_1'(t) = -\lambda P_1(t) + \mu P_2(t),\\ P_2'(t) = \lambda P_1(t) - \mu P_2(t). \end{cases} \tag{3.97}$$
The resulting system of differential equations is solved under the initial conditions $P_1(0)$, $P_2(0)$, which specify the probability distribution of the states at the initial time t = 0. Since for any given time the events of finding the element in one of its possible states form a complete group, the normalization condition $P_1(t) + P_2(t) = 1$ is satisfied. The transition intensities in Eq. (3.97) can be represented as a square matrix

$$T = \begin{bmatrix} 0 & \lambda \\ \mu & 0 \end{bmatrix},$$

which corresponds directly to the Markov graph. In matrix form, Eq. (3.97) can be written as

$$\left[P_1'(t),\ P_2'(t)\right] = \left[P_1(t),\ P_2(t)\right]\begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix}$$

or

$$P'(t) = P(t)\cdot\Lambda, \tag{3.98}$$

where $P'(t)$ and $P(t)$ are row vectors and Λ is the infinitesimal matrix. The elements of the matrices T and Λ are related by

$$\lambda_{ij} = \begin{cases} t_{ij}, & i \neq j,\\ -\sum\limits_{j=1}^{n} t_{ij}, & i = j. \end{cases}$$
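The relationship between T and Λ can be sketched in a few lines. This is a minimal illustration with hypothetical function names and example rates; matrices are plain nested lists, and the diagonal of T is assumed to be zero, as in the matrix above.

```python
def generator_from_intensities(T):
    """Build the infinitesimal matrix: off-diagonal entries are copied from T,
    each diagonal entry is minus the sum of the off-diagonal entries of its row."""
    n = len(T)
    L = [[T[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = -sum(T[i][j] for j in range(n) if j != i)
    return L


def derivative(P, L):
    """Right-hand side of Eq. (3.98): the row vector P multiplied by the matrix L."""
    n = len(P)
    return [sum(P[i] * L[i][k] for i in range(n)) for k in range(n)]


if __name__ == "__main__":
    lam, mu = 0.1, 2.0                      # hypothetical rates
    T = [[0.0, lam], [mu, 0.0]]
    L = generator_from_intensities(T)
    print(L)                                # [[-0.1, 0.1], [2.0, -2.0]]
    print(derivative([1.0, 0.0], L))        # [-0.1, 0.1]
```

Every row of the resulting matrix sums to zero, which is what keeps the total probability equal to one as the system evolves.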
When we build Markov models of the reliability of multi-element systems and take additional factors into account, the state space of the model naturally grows. The system of differential equations with respect to $P_i(t)$ (i = 1, 2, …, n) is generally written as

$$\begin{aligned}
P_1'(t) &= -P_1(t)\sum_{j\in g_1}\lambda_{1j} + \sum_{i\in G_1}P_i(t)\lambda_{i1},\\
&\ \,\vdots\\
P_k'(t) &= -P_k(t)\sum_{j\in g_k}\lambda_{kj} + \sum_{i\in G_k}P_i(t)\lambda_{ik},\\
&\ \,\vdots\\
P_n'(t) &= -P_n(t)\sum_{j\in g_n}\lambda_{nj} + \sum_{i\in G_n}P_i(t)\lambda_{in},
\end{aligned} \tag{3.99}$$
where $g_k$ is the set of states into which a direct transition from a given state k is possible, and $G_k$ is the set of states from which a direct transition to state k is possible. Equations of the form (3.99) for the state probabilities of a continuous-time Markov process with a discrete set of states are called Kolmogorov-Chapman equations. The product $P_i(t)\lambda_{ij}$ is called the probability flow. When composing the Kolmogorov-Chapman equations from the transition graph, it is convenient to use the following rule: the derivative of the probability of any state is equal to the sum of the probability flows that lead the system into that state, minus the sum of all the probability flows that lead the system out of that state.

3. Analytical methods for solving equations

(1) The Laplace transform method

Let's apply the Laplace transform to the system of differential equations (3.97), which describes the process of failures and recoveries of the element, under the initial conditions $P_1(0) = 1$, $P_2(0) = 0$. As a result, we obtain a system of algebraic equations

$$\begin{cases} sP_1(s) - 1 = -\lambda P_1(s) + \mu P_2(s),\\ sP_2(s) = \lambda P_1(s) - \mu P_2(s), \end{cases} \tag{3.100}$$

where $P_i(s) = \int_0^\infty P_i(t)e^{-st}\,dt$ is the Laplace transform of $P_i(t)$. The set of Eq. (3.100) is conveniently written as

$$\begin{cases} (s + \lambda)P_1(s) - \mu P_2(s) = 1,\\ \lambda P_1(s) - (\mu + s)P_2(s) = 0. \end{cases} \tag{3.101}$$
Table 3.7 lists the Laplace transforms of the main functions used in reliability calculations.

Table 3.7 Laplace transforms of the main functions used in reliability calculations

| Transform | Original function | Transform | Original function |
|---|---|---|---|
| $\alpha p_1(s) + \beta p_2(s)$ | $\alpha p_1(t) + \beta p_2(t)$ | $1/s$ | $1(t)$ |
| $s\,p(s) - p(0)$ | $p'(t)$ | $1/s^n$, n = 1, 2, … | $t^{n-1}/(n-1)!$ |
| $e^{-bs}p(s)$ | $p(t - b)$ | $1/(s - b)$ | $e^{bt}$ |
| $(1/b)\,p(s/b)$, b > 0 | $p(bt)$ | $1/(s - b)^n$, n = 1, 2, … | $e^{bt}\,t^{n-1}/(n-1)!$ |

For example, we can find a solution to the algebraic set of Eq. (3.101) using Cramer's rule, and then represent it as a sum of simple fractions to obtain the inverse Laplace transform:

$$P_1(s) = \frac{\mu + s}{s[s + (\lambda + \mu)]} = \frac{A_1}{s} + \frac{B_1}{s + (\lambda + \mu)}, \qquad P_2(s) = \frac{\lambda}{s[s + (\lambda + \mu)]} = \frac{A_2}{s} + \frac{B_2}{s + (\lambda + \mu)}. \tag{3.102}$$
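The relations $(A_1 + B_1)s + A_1(\lambda + \mu) = \mu + s$ and $(A_2 + B_2)s + A_2(\lambda + \mu) = \lambda$ allow us to find the desired values

$$A_1 = \frac{\mu}{\lambda+\mu},\quad B_1 = \frac{\lambda}{\lambda+\mu},\quad A_2 = \frac{\lambda}{\lambda+\mu},\quad B_2 = -\frac{\lambda}{\lambda+\mu}. \tag{3.103}$$

Using the equations in Table 3.7, we can find $P_1(t)$ and $P_2(t)$ as the inverse Laplace transforms of $P_1(s)$ and $P_2(s)$:

$$P_1(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}e^{-(\lambda+\mu)t}, \qquad P_2(t) = \frac{\lambda}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right). \tag{3.104}$$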
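A quick way to gain confidence in Eq. (3.104) is to check numerically that it satisfies Eq. (3.97) and the initial conditions. The sketch below does this with a central finite difference; the function names, the step size, and the rates are illustrative assumptions.

```python
import math


def p1(t, lam, mu):
    """P1(t) from Eq. (3.104)."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)


def p2(t, lam, mu):
    """P2(t) from Eq. (3.104)."""
    s = lam + mu
    return lam / s * (1.0 - math.exp(-s * t))


def residual(t, lam, mu, h=1e-5):
    """Difference between a central-difference estimate of P1'(t)
    and the right-hand side of the first equation of (3.97)."""
    dp1 = (p1(t + h, lam, mu) - p1(t - h, lam, mu)) / (2.0 * h)
    return dp1 - (-lam * p1(t, lam, mu) + mu * p2(t, lam, mu))


if __name__ == "__main__":
    lam, mu = 0.5, 2.0                       # hypothetical rates
    print(p1(0.0, lam, mu), p2(0.0, lam, mu))  # initial conditions: 1.0 0.0
    print(residual(1.0, lam, mu))              # close to zero
    print(p1(50.0, lam, mu), mu / (lam + mu))  # long-run value approaches mu/(lam+mu)
```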
The probability $P_1(t)$ is the probability of finding the repaired element in an operational state at an arbitrary time t. This probability is one of the most important indicators of the reliability of repaired systems and is called the readiness factor $C_R(t)$. The probability $P_2(t)$ is the probability of finding the repaired element in an inoperative state at an arbitrary time t. This indicator is called the downtime factor $C_D(t)$.

(2) Reduction of a set of n equations to an equation of n-th order

Let's express the unknown probability $P_1(t)$ from the second equation of (3.97):

$$P_1(t) = \frac{P_2'(t) + \mu P_2(t)}{\lambda}. \tag{3.105}$$

By substituting this expression into the first equation of the set of Eq. (3.97), we obtain a linear homogeneous differential equation of the second order

$$P_2''(t) + (\lambda + \mu)P_2'(t) = 0. \tag{3.106}$$

The characteristic equation for Eq. (3.106) has the form

$$x^2 + (\lambda + \mu)x = 0. \tag{3.107}$$

The roots of Eq. (3.107) are $x_1 = 0$ and $x_2 = -(\lambda + \mu)$. The root $x_1 = 0$ gives the solution $C_1e^{x_1t} = C_1$. The root $x_2 = -(\lambda + \mu)$ gives the solution $C_2e^{x_2t} = C_2e^{-(\lambda+\mu)t}$. The general solution of the homogeneous Eq. (3.106) has the form

$$P_2(t) = C_1e^{x_1t} + C_2e^{x_2t} = C_1 + C_2e^{-(\lambda+\mu)t}. \tag{3.108}$$

It remains to find the values of the arbitrary constants $C_1$ and $C_2$. Substituting the initial condition $P_2(0) = 0$ into the general solution gives $C_1 = -C_2$. Expressing Eq. (3.105) in terms of the general solution and using the initial condition $P_1(0) = 1$, we get

$$C_1 = \frac{\lambda}{\lambda+\mu}, \qquad C_2 = -\frac{\lambda}{\lambda+\mu}. \tag{3.109}$$
The two methods considered are not the only methods for the analytical solution of a set of differential equations describing the dynamics of the probabilities of a system being in each of its possible “reliability” states. For example, methods based on the computation of eigenvalues and eigenvectors are well known and are used successfully to solve reliability analysis problems [4]. In conclusion, it should be noted that all analytical methods are explicitly or implicitly related to the solution of the characteristic equation, which cannot be solved analytically for large-dimensional systems. Therefore, in modern software for reliability analysis on Markov models, the solution of the Kolmogorov-Chapman equations is implemented by numerical methods. Since Markov models of reliability generate stiff differential equations due to the large difference between the values of failure rates ($\lambda \approx 10^{-6}$ to $10^{-9}$) and recovery rates ($\mu \approx 1$ to $100$), special numerical methods are used. The creation of effective methods for solving stiff equations is an area of active mathematical research.

4. Reliability indicators of repairable systems based on Markov models

(1) The probability of failure-free operation

For recoverable systems, when finding the probability of failure-free operation, it is necessary to make all failure states absorbing. Formally, this means removing all branches of the Markov graph (or setting the transition intensities to zero) that correspond to the return from the failure states to the operational states. Then, for each k-th operational state of the system, the following differential equation can be written:

$$P_k'(t) = -P_k(t)\sum_{i\in g_k}\lambda_{ki} + \sum_{i\in G^+\cap G_k}P_i(t)\lambda_{ik}, \tag{3.110}$$

where $G^+$ is the set of operational states of the system. In the case of a recoverable element, the Markov graph with an absorbing failure state has the form shown in Fig. 3.25.

Fig. 3.25 A Markov graph with an absorbing failure state of an element
Then the probability of failure-free operation is $P_1(t)$, which is determined from

$$P_1'(t) = -\lambda P_1(t) \tag{3.111}$$

under the initial condition $P_1(0) = 1$. By separating the variables and integrating the left and right parts, we obtain the well-known expression for the probability of failure-free operation P(t) of an element with exponentially distributed time to failure:

$$\int\frac{P_1'(t)}{P_1(t)}\,dt = -\int\lambda\,dt \;\Rightarrow\; P_1(t) = C_1e^{-\lambda t} \;\Rightarrow\; P_1(0) = 1 \;\Rightarrow\; C_1 = 1 \;\Rightarrow\; P(t) = P_1(t) = e^{-\lambda t}. \tag{3.112}$$

(2) Mean time to failure

It is well known that the mean time to failure T and the probability of failure-free operation are linked by the relationship

$$T = \int_0^\infty P_1(t)\,dt.$$
Consequently, the mean time to failure can be obtained by integrating the set of differential equations for the “absorption” model. Consider the general case of a system with n states, where state n is absorbing (system failure):

$$\begin{aligned}
P_1'(t) &= -P_1(t)\sum_{i\in g_1}\lambda_{1i} + \sum_{i\in G^+\cap G_1}P_i(t)\lambda_{i1},\\
&\ \,\vdots\\
P_k'(t) &= -P_k(t)\sum_{i\in g_k}\lambda_{ki} + \sum_{i\in G^+\cap G_k}P_i(t)\lambda_{ik},\\
&\ \,\vdots\\
P_{n-1}'(t) &= -P_{n-1}(t)\sum_{i\in g_{n-1}}\lambda_{(n-1)i} + \sum_{i\in G^+\cap G_{n-1}}P_i(t)\lambda_{i(n-1)}.
\end{aligned} \tag{3.113}$$
The initial conditions are $P_1(0) = 1$, …, $P_k(0) = 0$, …, $P_n(0) = 0$. Let's integrate the left and right parts of Eq. (3.113) from 0 to ∞. Since in the presence of an absorbing state $P_i(\infty) = 0$ for every operational state i, the left parts become $P_k(\infty) - P_k(0)$, and we obtain

$$\begin{aligned}
-T_1\sum_{i\in g_1}\lambda_{1i} + \sum_{i\in G^+\cap G_1}T_i\lambda_{i1} &= -1,\\
&\ \,\vdots\\
-T_k\sum_{i\in g_k}\lambda_{ki} + \sum_{i\in G^+\cap G_k}T_i\lambda_{ik} &= 0,\\
&\ \,\vdots\\
-T_{n-1}\sum_{i\in g_{n-1}}\lambda_{(n-1)i} + \sum_{i\in G^+\cap G_{n-1}}T_i\lambda_{i(n-1)} &= 0,
\end{aligned} \tag{3.114}$$

where $T_i = \int_0^\infty P_i(t)\,dt$ is the mean time spent in operational state i before failure, when starting from operational state 1. The mean time to failure T is determined by summing $T_i$ over all operational states:

$$T = \sum_{i\in G^+}T_i. \tag{3.115}$$

For a single element, considering $P_1(0) = 1$, we have $-1 = -\lambda T$, whence

$$T = \frac{1}{\lambda}. \tag{3.116}$$
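The linear system (3.114) is easy to assemble and solve mechanically. The sketch below is an illustration under stated assumptions, not a method taken from the book: the rate matrix is a plain nested list with zero diagonal, the small Gaussian elimination routine and all function names are hypothetical, and the example system (two identical loaded elements, one repair team, absorbing failure state) is chosen only because its mean time to failure has the known closed form (3λ + μ)/(2λ²).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def mean_time_to_failure(rates, operational, start):
    """Assemble Eq. (3.114) and return T = sum of T_i, Eq. (3.115).

    rates[i][k] is the intensity of the transition i -> k (zero diagonal);
    operational lists the indices of the operational states;
    start is the operational state occupied at t = 0.
    """
    m = len(operational)
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for row, k in enumerate(operational):
        A[row][row] = -sum(rates[k])            # total exit intensity of state k
        for col, i in enumerate(operational):
            if i != k:
                A[row][col] = rates[i][k]       # inflow from operational state i
        if k == start:
            b[row] = -1.0                       # -P_k(0) on the right-hand side
    return sum(solve(A, b))


if __name__ == "__main__":
    lam, mu = 0.01, 1.0                         # hypothetical rates
    # States: 0 = both elements up, 1 = one element failed, 2 = system failure (absorbing)
    rates = [[0.0, 2 * lam, 0.0],
             [mu, 0.0, lam],
             [0.0, 0.0, 0.0]]
    print(mean_time_to_failure(rates, [0, 1], 0))
    print((3 * lam + mu) / (2 * lam ** 2))      # closed form for comparison
```

For a single element the same routine reduces to Eq. (3.116).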
5. Stationary indicators for Markov models of reliability

In the theory of random processes, it is proved that a homogeneous Markov process without absorbing states (states from which there is no exit) has a stationary mode, which necessarily sets in at sufficiently large times (t → ∞). The stationary mode is characterized by the fact that the probabilities $P_i$ no longer depend on time, and consequently their derivatives become zero. Therefore, in order to compute the stationary probabilities of the states of the system, it is necessary to set the derivatives on the left side of Eq. (3.99) to zero. To prevent the resulting system from degenerating, one of the equations is replaced by the normalization condition ($\sum_{i=1}^{n}P_i = 1$). As a result, we obtain the following system of n algebraic equations for the determination of stationary reliability indicators:

$$\begin{aligned}
-P_1\sum_{i\in g_1}\lambda_{1i} + \sum_{i\in G_1}P_i\lambda_{i1} &= 0,\\
&\ \,\vdots\\
-P_k\sum_{i\in g_k}\lambda_{ki} + \sum_{i\in G_k}P_i\lambda_{ik} &= 0,\\
&\ \,\vdots\\
-P_{n-1}\sum_{i\in g_{n-1}}\lambda_{(n-1)i} + \sum_{i\in G_{n-1}}P_i\lambda_{i(n-1)} &= 0,\\
P_1 + \ldots + P_k + \ldots + P_{n-1} + P_n &= 1.
\end{aligned} \tag{3.117}$$
The solution of the set of Eq. (3.117) allows us to obtain such reliability indicators as the stationary readiness factor $C_R$ (the probability of finding an object in an operational state at any sufficiently distant time) and the stationary downtime factor $C_D$ (the probability of finding an object in an inoperative state at any sufficiently distant time):

$$C_R = \sum_{i\in G^+}P_i, \tag{3.118}$$

$$C_D = \sum_{i\in G^-}P_i, \tag{3.119}$$

where $G^-$ is the set of all inoperable states of the system. Let's study the reliability of a single recoverable element (Fig. 3.24) in the stationary mode. The set of algebraic equations obtained from Eq. (3.97) for t → ∞ has the form

$$\begin{cases} -\lambda P_1 + \mu P_2 = 0,\\ P_1 + P_2 = 1. \end{cases} \tag{3.120}$$

The solutions of this set are

$$P_1 = \frac{\mu}{\lambda+\mu}, \qquad P_2 = \frac{\lambda}{\lambda+\mu}.$$
The reliability indicators of the recovered element obtained on the Markov model are summarized in Table 3.8.

Table 3.8 Equations for reliability indicators of the recovered element

| Name of the indicator | Equation |
|---|---|
| Readiness factor | $C_R(t) = \dfrac{\mu}{\lambda+\mu} + \dfrac{\lambda}{\lambda+\mu}e^{-(\lambda+\mu)t}$ |
| Downtime factor | $C_D(t) = \dfrac{\lambda}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right)$ |
| Stationary readiness factor | $C_R = \dfrac{\mu}{\lambda+\mu}$ |
| Stationary downtime factor | $C_D = \dfrac{\lambda}{\lambda+\mu}$ |
| Probability of failure-free operation | $P(t) = e^{-\lambda t}$ |
| Mean time to failure | $T = \dfrac{1}{\lambda}$ |

6. Markov model state consolidation
Analytical Markov models are a powerful and fairly universal mathematical tool for analyzing the reliability of complex systems. However, when using them, well-known dimensionality problems arise: the model state space and the number of relationships between states grow as the number of elements of the analyzed system increases. In general, the dimension of the state space of the Markov model is $K \geq \prod_{i=1}^{n}K_i$, where n is the number of elements of the system and $K_i$ is the number of states in which the i-th element of the system can be. For example, a device can be in three states (operational, and two states corresponding to two types of inactivity: failure of the malfunction type and failure of the false alarm type). If each element can be in two states, operable and inoperable, then the dimension of the Markov model is $K \geq 2^n$.

Let's compare three reliability models from the point of view of dimensionality: a block diagram (explained in Sect. 3.2), a fault tree (explained in Sect. 4.3 of this book), and a Markov graph, for a system of three parallel elements with different reliabilities that fails only when all three of its elements fail (the operability criterion “1 out of 3”). Figure 3.26 shows three different models of the reliability of this system: Markov graph (a), block diagram (b), and failure tree (c). Obviously, in this case, the logical-probabilistic models turn out to be much more compact than the Markov model. If we also take into account that computing reliability indicators on the Markov model requires creating and solving a system of differential equations, whereas the calculation on block diagrams and trees reduces in this case to a fairly simple transformation of logical functions and the replacement of logical variables by probabilistic ones, then the comparison appears not to be in favor of Markov modeling. But this is not the case.

(1) As soon as we want to remove the assumption of complete independence of the process of repairing elements (the model in Fig. 3.26 is constructed precisely under this assumption), logical-probabilistic methods stop “functioning”. Using the Markov model, we can take into account a number of features of the repair process (limitation of the number of repair teams, priorities, system shutdown during repair, etc.). Logical-probabilistic models describe only the case of unlimited, independent repair of elements performed with the system running, which is very rare in practice.

(2) For systems with recovery, logical-probabilistic models yield only point reliability indicators determined at time t, for example, the readiness factor and the downtime factor. Markov models allow us to calculate all the main reliability indicators, both point and interval, for example, the probability of failure-free operation (failure) on a time interval (0, t) and the mean time to failure.

(3) The computational power of modern computers makes it possible to find numerical solutions to the large-dimensional systems of differential and algebraic equations generated by Markov graphs. In fact, the speed, the amount of RAM, and the means of dynamic memory allocation at execution time available in modern programming languages make it easy to solve systems of equations with thousands or more unknowns even on modern laptops, not to mention large specialized mainframe computers.
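The growth of the state space discussed above can be illustrated by enumerating the states explicitly. A minimal sketch, assuming that the states of each element are simply numbered 0, 1, … and that a system state is the tuple of its element states; the function name is hypothetical.

```python
from itertools import product


def state_space(states_per_element):
    """Enumerate all combinations of element states.

    states_per_element[i] is K_i, the number of states of the i-th element.
    """
    return list(product(*[range(k) for k in states_per_element]))


if __name__ == "__main__":
    # Three two-state elements, as in the "1 out of 3" system: 2**3 = 8 states
    print(len(state_space([2, 2, 2])))
    # The count doubles with every additional two-state element
    for n in (5, 10, 15):
        print(n, len(state_space([2] * n)))
```

This enumeration gives only the product of the $K_i$; tracking additional information in a state, such as the order of the repair queue, increases the count further.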
Fig. 3.26 Reliability models of the “1 out of 3” system of elements with different reliabilities: (a) Markov model; (b) block diagram; (c) failure tree

Fig. 3.27 Markov graph of a duplicated system of elements with different reliabilities
However, the problem of dimensionality of Markov models cannot be solved completely. The ergonomic part of the problem is related to the difficulty a human has in describing the model as input and in determining its parameters. Building a Markov graph with a thousand nodes is an extremely difficult task for a human. Even the most advanced graphical editors embedded in modern reliability analysis software do not help. Therefore, when constructing Markov models, one usually does not consider the entire set of possible states of the system, but tries either to remove some states based on the conditions of the system operation, and/or to consolidate (combine) some groups of states into one. This section is therefore devoted to the study of formal rules for state consolidation in Markov models.

The consolidation of Markov process states can be either exact or approximate. Let's study the conditions for exact consolidation and demonstrate the method of combining states using a practical example. Let's construct a Markov model of the reliability of a duplicated system of elements with different reliabilities, with failure rates $\lambda_1$, $\lambda_2$ and recovery rates $\mu_1$, $\mu_2$, respectively (Fig. 3.27). Here, 0 is the fully operational state, 1 and 2 are operational states with a single failure, and 3 is the system failure state. The set of differential equations for the states of the Markov process has the form

$$\begin{aligned}
P_0'(t) &= -(\lambda_1 + \lambda_2)P_0(t) + \mu_1P_1(t) + \mu_2P_2(t),\\
P_1'(t) &= \lambda_1P_0(t) - (\lambda_2 + \mu_1)P_1(t) + \mu_2P_3(t),\\
P_2'(t) &= \lambda_2P_0(t) - (\lambda_1 + \mu_2)P_2(t) + \mu_1P_3(t),\\
P_3'(t) &= \lambda_2P_1(t) + \lambda_1P_2(t) - (\mu_1 + \mu_2)P_3(t).
\end{aligned} \tag{3.121}$$

For the initial conditions $P_0(0) = 1$, $P_1(0) = P_2(0) = P_3(0) = 0$, the Laplace transform of the set of differential equations (3.121) gives

$$\begin{aligned}
sP_0(s) - 1 &= -(\lambda_1 + \lambda_2)P_0(s) + \mu_1P_1(s) + \mu_2P_2(s),\\
sP_1(s) &= \lambda_1P_0(s) - (\lambda_2 + \mu_1)P_1(s) + \mu_2P_3(s),\\
sP_2(s) &= \lambda_2P_0(s) - (\lambda_1 + \mu_2)P_2(s) + \mu_1P_3(s),\\
sP_3(s) &= \lambda_2P_1(s) + \lambda_1P_2(s) - (\mu_1 + \mu_2)P_3(s).
\end{aligned} \tag{3.122}$$
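The roots of the characteristic equation |Q − sE| = 0, where Q is the infinitesimal matrix and E is the unit matrix, are equal to $S_1 = 0$; $S_2 = -(\lambda_1 + \mu_1)$; $S_3 = -(\lambda_2 + \mu_2)$; $S_4 = -(\lambda_1 + \mu_1 + \lambda_2 + \mu_2)$. Let's denote $\beta = \lambda_1 + \mu_1$, $\gamma = \lambda_2 + \mu_2$; then $S_1 = 0$; $S_2 = -\beta$; $S_3 = -\gamma$; $S_4 = -(\beta + \gamma)$. Further, by decomposing the expressions for $P_i(s)$ into simple fractions and inverting them according to the standard rules of the inverse Laplace transform, we get

$$\begin{aligned}
P_0(t) &= \frac{1}{\beta\gamma}\left(\mu_1\mu_2 + \lambda_1\mu_2e^{-\beta t} + \lambda_2\mu_1e^{-\gamma t} + \lambda_1\lambda_2e^{-(\beta+\gamma)t}\right),\\
P_1(t) &= \frac{1}{\beta\gamma}\left(\lambda_1\mu_2 - \lambda_1\mu_2e^{-\beta t} + \lambda_1\lambda_2e^{-\gamma t} - \lambda_1\lambda_2e^{-(\beta+\gamma)t}\right),\\
P_2(t) &= \frac{1}{\beta\gamma}\left(\lambda_2\mu_1 + \lambda_1\lambda_2e^{-\beta t} - \mu_1\lambda_2e^{-\gamma t} - \lambda_1\lambda_2e^{-(\beta+\gamma)t}\right),\\
P_3(t) &= 1 - \sum_{i=0}^{2}P_i(t) = \frac{\lambda_1\lambda_2}{\beta\gamma}\left(1 - e^{-\beta t} - e^{-\gamma t} + e^{-(\beta+\gamma)t}\right).
\end{aligned} \tag{3.123}$$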
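From Eq. (3.123), we obtain an expression for the non-stationary readiness factor of a duplicated system with two different elements:

$$C_R(t) = \sum_{i=0}^{2}P_i(t). \tag{3.124}$$

The stationary probabilities of the states are equal to

$$P_0 = \frac{\mu_1\mu_2}{\beta\gamma},\quad P_1 = \frac{\lambda_1\mu_2}{\beta\gamma},\quad P_2 = \frac{\lambda_2\mu_1}{\beta\gamma},\quad P_3 = 1 - \sum_{i=0}^{2}P_i = \frac{\lambda_1\lambda_2}{\beta\gamma}. \tag{3.125}$$

The stationary readiness factor of the system is

$$C_R = \frac{\mu_1\mu_2 + \lambda_1\mu_2 + \lambda_2\mu_1}{(\lambda_1 + \mu_1)(\lambda_2 + \mu_2)}. \tag{3.126}$$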
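The stationary probabilities (3.125) can be checked by substituting them into the right-hand sides of Eq. (3.121), which must all vanish in the stationary mode. A minimal sketch with hypothetical function names and rates:

```python
def stationary_probabilities(l1, l2, m1, m2):
    """Stationary state probabilities P0..P3 from Eq. (3.125)."""
    d = (l1 + m1) * (l2 + m2)          # beta * gamma
    return (m1 * m2 / d, l1 * m2 / d, l2 * m1 / d, l1 * l2 / d)


def residuals(p, l1, l2, m1, m2):
    """Right-hand sides of Eq. (3.121); all are zero in the stationary mode."""
    p0, p1, p2, p3 = p
    return (-(l1 + l2) * p0 + m1 * p1 + m2 * p2,
            l1 * p0 - (l2 + m1) * p1 + m2 * p3,
            l2 * p0 - (l1 + m2) * p2 + m1 * p3,
            l2 * p1 + l1 * p2 - (m1 + m2) * p3)


def readiness_factor(l1, l2, m1, m2):
    """Stationary readiness factor, Eq. (3.126)."""
    return (m1 * m2 + l1 * m2 + l2 * m1) / ((l1 + m1) * (l2 + m2))


if __name__ == "__main__":
    rates = (0.02, 0.05, 1.0, 0.5)     # hypothetical lambda1, lambda2, mu1, mu2
    p = stationary_probabilities(*rates)
    print(p, sum(p))
    print(residuals(p, *rates))        # all close to zero
    print(p[0] + p[1] + p[2], readiness_factor(*rates))
```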
Let’s try to reduce the state space of the model by combining the two states 1 and 2 into one. This corresponds to the case where one element is operating and the other has failed and is being repaired, i.e., we do not distinguish which of the elements is operating and which has failed (Fig. 3.28). But if the failure and repair rates of the elements are different, the Markov property is violated by such a consolidation. If the transition to the generalized (consolidated) state occurs when element 1 fails, then the exit from it is related to the failure of element 2 or to the repair of element 1. If it occurs when element 2 fails, the opposite holds. For a Markov process, it must not matter how the system got to the current state. Here the property of the absence of aftereffects is violated: the past is important, and it is impossible to write the intensity of the exit from the consolidated state exactly. In this case, it is possible to construct an approximate consolidated model, as shown in Fig. 3.29.
Fig. 3.28 The consolidation states of a single failure of the Markov model of a duplicated system of different elements
Fig. 3.29 Approximate consolidation of single-failure states of the Markov model of a duplicated system from different elements
Equivalent transition intensities $\lambda_e$ and $\mu_e$ of the approximate model are calculated by the equations

$$\lambda_e = \lambda_2\frac{P_1}{P_1 + P_2} + \lambda_1\frac{P_2}{P_1 + P_2}, \tag{3.127}$$

$$\mu_e = \mu_1\frac{P_1}{P_1 + P_2} + \mu_2\frac{P_2}{P_1 + P_2}. \tag{3.128}$$

Here $P_1$ and $P_2$ are calculated using Eq. (3.125); $\frac{P_1}{P_1 + P_2}$ is the stationary conditional probability that the first element is the failed one, given that exactly one of the two elements has failed; $\frac{P_2}{P_1 + P_2}$ is the stationary conditional probability that the second element is the failed one, given that exactly one of the two elements has failed. The precision with which the consolidated process approximates the initial (unconsolidated) process depends on the ratio of the failure and repair rates. The greater the difference between $\lambda_i$ and $\mu_i$, the more precise the approximation.

In the vast majority of cases, redundant systems are formed from identical elements. Consider a duplicated system with equally reliable elements ($\lambda_1 = \lambda_2$ and $\mu_1 = \mu_2$). For the model of this system (Fig. 3.30), the intensity of the exit from the consolidated state is the same and does not depend on how we entered it. In this case, the consolidation is correct and does not violate the Markov property. Thus, the exact consolidation of the states of the Markov model can be performed if the following conditions are met:
1) Transitions from each of the consolidated states are only possible to the same states, i.e., if there are transitions from one of the states being consolidated into one to a subset of other states, then there must be transitions from other states being consolidated into one to the same subset of states. 2) The intensities of the exits from the states that are consolidated into one should be the same. If conditions 1 and 2 are satisfied, the following rules should be followed: (1) The intensity of the transition to the consolidated state is equal to the sum of the intensities of the transitions to each of the consolidated states. (2) The intensity of the exit from the consolidated state is equal to the intensity of the exit from one of the consolidated states. In accordance with the above rules, let’s perform a consolidation of the states of the Markov model of a duplicated recoverable system with a backup element
Fig. 3.30 Exact consolidation of single-failure states of the Markov model of a duplicated system of identical elements
3 System Reliability Models
Fig. 3.31 Markov model of a duplicated system with the light redundancy
operating in a light mode (Fig. 3.31). The failure rate of the lightly loaded redundant element is Λ = αλ (0 ≤ α ≤ 1). The set of differential equations is

$$
\begin{aligned}
P_0'(t) &= -(\lambda + \alpha\lambda)P_0(t) + \mu P_1(t) + \mu P_2(t),\\
P_1'(t) &= \lambda P_0(t) - (\lambda + \mu)P_1(t) + \mu P_3(t),\\
P_2'(t) &= \alpha\lambda P_0(t) - (\lambda + \mu)P_2(t) + \mu P_3(t),\\
P_3'(t) &= \lambda P_1(t) + \lambda P_2(t) - 2\mu P_3(t).
\end{aligned}
\tag{3.129}
$$
In this model, states 1 and 2 can be consolidated. The set of differential equations (3.129) after consolidation has the form

$$
\begin{aligned}
P_0'(t) &= -(\lambda + \alpha\lambda)P_0(t) + \mu P_{1,2}(t),\\
P_{1,2}'(t) &= (\lambda + \alpha\lambda)P_0(t) - (\lambda + \mu)P_{1,2}(t) + 2\mu P_3(t),\\
P_3'(t) &= \lambda P_{1,2}(t) - 2\mu P_3(t).
\end{aligned}
\tag{3.130}
$$
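The equivalence of Eqs. (3.129) and (3.130) can also be checked numerically. The following Python sketch integrates both sets of equations with a simple explicit Euler scheme and compares P1(t) + P2(t) with P1,2(t). The values of λ, μ and α, the step size and all function names are illustrative assumptions rather than values from the text, and the transition from state 2 to state 3 is taken at rate λ, as the consolidated equation requires.

```python
# Hypothetical parameter values, chosen only for illustration.
LAM, MU, ALPHA = 0.01, 0.5, 0.3


def full_derivatives(p, lam=LAM, mu=MU, alpha=ALPHA):
    """Right-hand side of the four-state model, Eq. (3.129)."""
    p0, p1, p2, p3 = p
    return (
        -(lam + alpha * lam) * p0 + mu * p1 + mu * p2,
        lam * p0 - (lam + mu) * p1 + mu * p3,
        alpha * lam * p0 - (lam + mu) * p2 + mu * p3,
        lam * p1 + lam * p2 - 2 * mu * p3,
    )


def merged_derivatives(p, lam=LAM, mu=MU, alpha=ALPHA):
    """Right-hand side of the consolidated three-state model, Eq. (3.130)."""
    p0, p12, p3 = p
    return (
        -(lam + alpha * lam) * p0 + mu * p12,
        (lam + alpha * lam) * p0 - (lam + mu) * p12 + 2 * mu * p3,
        lam * p12 - 2 * mu * p3,
    )


def euler(derivatives, p, dt, steps):
    """Explicit Euler integration; adequate for a sanity check, not for production use."""
    for _ in range(steps):
        d = derivatives(p)
        p = tuple(x + dt * dx for x, dx in zip(p, d))
    return p


# Both models start in state 0 (all elements operable).
full = euler(full_derivatives, (1.0, 0.0, 0.0, 0.0), dt=0.01, steps=20000)
merged = euler(merged_derivatives, (1.0, 0.0, 0.0), dt=0.01, steps=20000)

# P1 + P2 of the full model should coincide with P1,2 of the consolidated model.
print(full[1] + full[2], merged[1])
```

Because the consolidated equation for P1,2 is the exact sum of the equations for P1 and P2, the two printed values should agree up to floating-point rounding for any choice of the parameters.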
Note that the set of Eq. (3.130) can be obtained automatically by adding the equations for states 1 and 2 (P′1,2(t) = P′1(t) + P′2(t)). The set of Eq. (3.130) is completely equivalent to Eq. (3.129) (in the sense of equality and representation of all components), which confirms the complete identity of the original and consolidated transition graphs (Fig. 3.32).

Often, a combination of loaded and unloaded redundancy is used to increase reliability. In this case, we say that hybrid redundancy is implemented in the system. Let’s build a Markov model of the reliability of a system with hybrid redundancy, in which the operating unit consists of three identical parallel elements. In addition, there are two unloaded redundant elements that are switched in to replace failed elements of the operating unit (Fig. 3.32).

Fig. 3.32 Hybrid redundancy system

Repair of the failed elements is
3.3 Repair System Reliability Analysis
carried out by two repair teams. The system is put into operation only after the complete repair of the operating unit (all three elements are repaired). The Markov graph of the system is shown in Fig. 3.33. Each state (node of the graph) is assigned a code N, K, W, where N is the number of operable elements of the operating unit, K is the number of operable unloaded redundant elements, and W is the number of elements waiting for repair. The character R indicates a serviceable state and the character F indicates a failure.

The following characteristics are considered in the constructed Markov model:
(1) Unloaded redundancy.
(2) Limitation of the number of repair teams.
(3) Specific system recovery procedure.
None of these characteristics could be considered in logical-probabilistic models. By consolidating the states with the same failure rate of the operating unit, the dimension of the model is reduced.

7. Markov model-based reliability analysis of complex recoverable systems

Let’s continue to study the reliability of systems that require Markov processes to be modeled correctly. Consider three examples:
(1) Computation of the readiness factor of a system with dependent operation of elements.
(2) Construction of a system reliability model with built-in monitoring and recovery deferred until the end of the task.
(3) Computation of the stationary failure rate and the mean time between failures for redundant structures with recovery of elements.

(1) Sequential, recoverable system with dependent operation of elements. Consider a system consisting of n elements connected in series in terms of reliability. Suppose that if any one element fails, the system shuts down, i.e., operation stops to restore that element. This is the most common operating condition. For example, an aircraft powerplant consists of a fuel system, a lubrication system, an engine mount, etc.
Failure of any of these subsystems will result in failure of the powerplant, and to restore (repair) it the powerplant must be shut down. This means that while a failed subsystem is being repaired, failures of the other subsystems are either impossible or their probability can be neglected. Then the number of states of the Markov model of the system will be n + 1, not $2^n$ (Fig. 3.34).

Fig. 3.33 Markov graph of a system with hybrid redundancy
Fig. 3.34 Markov model of a system with dependent operation of elements
The expression for the stationary readiness factor is obtained from the set of equations

$$
\begin{aligned}
\lambda_1 P_0 - \mu_1 P_1 &= 0,\\
\lambda_2 P_0 - \mu_2 P_2 &= 0,\\
&\;\;\vdots\\
\lambda_n P_0 - \mu_n P_n &= 0,\\
P_0 + P_1 + \dots + P_{n-1} + P_n &= 1.
\end{aligned}
\tag{3.131}
$$
The solution of the set of Eq. (3.131) gives us

$$
P_j = \frac{\lambda_j}{\mu_j} P_0. \tag{3.132}
$$
This means that the stationary failure probability of the j-th element (j = 1, 2, …, n) is expressed through the probability of the operational state P0. Substituting Eq. (3.132) into the last equation of the set of Eq. (3.131), we obtain an expression for the stationary readiness factor of the system:

$$
C_{Rst} = P_0 = \frac{1}{1 + \sum_{j=1}^{n} \dfrac{\lambda_j}{\mu_j}}. \tag{3.133}
$$
Let’s express the stationary readiness factor of the system CRst through the readiness factors of its elements Cj, where Cj = μj/(λj + μj). Then μj = λjCj/(1 − Cj), from which we get

$$
C_{Rst} = P_0 = \frac{1}{1 + \sum_{j=1}^{n}\left(\dfrac{1}{C_j} - 1\right)}. \tag{3.134}
$$
Let n = 3, C1 = 0.61, C2 = 0.72, C3 = 0.63. Then from Eq. (3.134) CRst = 0.38. A common mistake is to compute the readiness factor of such a system by multiplying the readiness factors of the elements: CRst = C1C2C3 = 0.28, which is incorrect. The readiness factors of the elements can be multiplied to estimate the system readiness only if the elements are independent. In this case there is a dependency: if one of the
elements fails, the others will not fail. The readiness factor must therefore be computed using Eq. (3.134), which was obtained from the Markov model.

(2) A system reliability model with built-in monitoring and recovery deferred until the end of the task

Operational built-in test (BIT) of the technical condition of elements and systems, together with monitoring of the correctness of functions, makes it possible to fully realize the possibilities of redundancy, to take timely measures for reconfiguring systems and changing operating modes, and thus to ensure the fault tolerance of the system as a whole. However, the monitoring system is not ideal. Firstly, it can itself fail, and secondly, it does not detect absolutely all failures. Therefore, in order to ensure high reliability and safety indicators, a thorough reliability analysis of systems is required that takes the monitoring characteristics into account.

One of the most important of these characteristics is the completeness of monitoring, which characterizes the proportion of failures of the object that are detected during health monitoring. In general, the quality of monitoring is determined by the list of elements (components) whose failures are detected by the monitoring. Therefore, one measure of the completeness of monitoring is the ratio of the number of monitored elements to the total number of elements of the monitored object (e.g., as a percentage). However, for the joint simulation of the reliability behavior of an object and its monitoring tools, it is desirable to define the completeness of monitoring as a probabilistic indicator, or as the ratio of the failure rate of the monitored elements to that of all elements. Such a definition is useful because, when simulating the reliability behavior of the analyzed object, the total failure rate can then be divided into two components: detected (monitored) failures and hidden failures.
The completeness of monitoring in this case can be defined as the conditional probability of a monitored failure, given that a failure has occurred:

$$
\eta = P\left(\text{monitored failure} \mid \text{failure occurs in } (0, t)\right) = \frac{1 - e^{-\int_0^t \Lambda_m(t)\,dt}}{1 - e^{-\int_0^t \Lambda(t)\,dt}}, \tag{3.135}
$$
where Λ is the total failure rate of the monitored object (monitored + hidden) and Λm is the total failure rate of the monitored failures. Averaging the failure rates over the interval (0, t) yields

$$
\eta = \frac{1 - e^{-\lambda_{m\,av}\, t}}{1 - e^{-\lambda_{av}\, t}} \approx \frac{\lambda_{m\,av}}{\lambda_{av}}, \tag{3.136}
$$

where $\lambda_{av} = \frac{1}{t}\int_0^t \lambda(t)\,dt$ and, for real high-reliability systems, $\lambda_{av} t \ll 1$ […]

[…] index(y) (5.18)
where ∆ denotes an arbitrary Boolean logic operation and index() denotes the variable ordering.

2. PMS-BDD method

In a PMS, the stage dependency of components cannot be handled by simply combining the single-stage failure models, so the BDDs are generated gradually by progressive combination; this is the PMS-BDD method [11]. The complete PMS-BDD method consists of six parts: fault tree to BDD transformation, BDD merging method, BDD simplification method, variable ordering method, phase correlation processing method, and mission reliability calculation for each phase.

(1) Transformation of fault tree to BDD. A BDD is generally transformed from a fault tree model rather than constructed directly. The top event of a fault tree can be represented by a Boolean expression and expanded into a Shannon decomposition tree by Shannon decomposition. The ROBDD is obtained by ordering the variables in the fault tree and simplifying the Shannon decomposition tree; it has fewer nodes and a cleaner graph than the Shannon decomposition tree. In the ROBDD model, the minimum cut sets of the fault tree can be obtained by traversing the nodes from the root node in top-down order and finding the paths that end at the terminal node 1. Figure 5.8 shows a simple fault tree and its corresponding ROBDD, where the variables are ordered as x1 < x2 < x3 < x4 < x5.

5.3 Multi-Phase Mission Reliability Analysis

Fig. 5.8 A simple fault tree and the corresponding ROBDD

The expressions for G1, G2 and T in the figure are

$$
\begin{aligned}
G_1 &= x_2 + x_3 + x_4 = \mathrm{ite}(x_2, 1, 0) + \mathrm{ite}(x_3, 1, 0) + \mathrm{ite}(x_4, 1, 0)\\
&= \mathrm{ite}(x_2, 1, \mathrm{ite}(x_3, 1, 0)) + \mathrm{ite}(x_4, 1, 0)\\
&= \mathrm{ite}(x_2, 1, \mathrm{ite}(x_3, 1, \mathrm{ite}(x_4, 1, 0))),
\end{aligned}
\tag{5.19}
$$
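$$
\begin{aligned}
G_2 &= x_2 + x_5 = \mathrm{ite}(x_2, 1, 0) + \mathrm{ite}(x_5, 1, 0)\\
&= \mathrm{ite}(x_2, 1, \mathrm{ite}(x_5, 1, 0)),
\end{aligned}
\tag{5.20}
$$

$$
\begin{aligned}
T &= x_1 \cdot G_1 \cdot G_2 = \mathrm{ite}(x_1, 1, 0) \cdot \mathrm{ite}(x_2, 1, \mathrm{ite}(x_3, 1, \mathrm{ite}(x_4, 1, 0))) \cdot G_2\\
&= \mathrm{ite}(x_1, \mathrm{ite}(x_2, 1, \mathrm{ite}(x_3, 1, \mathrm{ite}(x_4, 1, 0))), 0) \cdot \mathrm{ite}(x_2, 1, \mathrm{ite}(x_5, 1, 0))\\
&= \mathrm{ite}(x_1, \mathrm{ite}(x_2, 1, \mathrm{ite}(x_3, f_1, \mathrm{ite}(x_4, f_1, 0))), 0).
\end{aligned}
\tag{5.21}
$$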
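The final ite form of the top event can be checked by brute force. The Python sketch below compares the Boolean expression T = x1 · (x2 + x3 + x4) · (x2 + x5) with the nested ite expression, using f1 = ite(x5, 1, 0), over all 32 assignments of the five variables. The function names are illustrative assumptions.

```python
from itertools import product


def ite(x, f, g):
    """If-then-else operator: returns f when x is 1, otherwise g."""
    return f if x else g


def top_event(x1, x2, x3, x4, x5):
    """Top event written directly as a Boolean expression."""
    g1 = x2 or x3 or x4
    g2 = x2 or x5
    return int(bool(x1 and g1 and g2))


def top_event_ite(x1, x2, x3, x4, x5):
    """Top event in the nested ite form for the ordering x1 < x2 < ... < x5."""
    f1 = ite(x5, 1, 0)
    return ite(x1, ite(x2, 1, ite(x3, f1, ite(x4, f1, 0))), 0)


# Enumerate every assignment and collect those on which the two forms differ.
mismatches = [bits for bits in product((0, 1), repeat=5)
              if top_event(*bits) != top_event_ite(*bits)]
print(len(mismatches))  # prints: 0
```

An empty mismatch list confirms that the nested ite expression encodes the same Boolean function as the fault tree.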
Here, f1 = ite(x5, 1, 0). To transform the fault tree model into the BDD model, the bottom events of the fault tree are first taken as the input nodes of the BDD, and the nodes are then connected progressively according to the logical “or” and logical “and” rules until all bottom events are expressed in the BDD structure. If two bottom events are joined by a logical “and” in the Boolean expression, the nodes are connected through the 1 branch when converting them to a BDD; if they are joined by a logical “or”, the nodes are connected through the 0 branch. The BDD connection methods for the logical “or” and logical “and” relationships are shown in Fig. 5.9.

(2) Multi-stage BDD merging. When merging the BDD models of a PMS, the rule of sequential merging is used. The BDD of the earlier phase is taken as the primary BDD and the BDD of the later phase is taken as the secondary BDD. If the BDDs of the two phases are connected by a logical “and”, the secondary
5 Model-Based Reliability Analysis Methods
Fig. 5.9 BDD connection method for logical “and” a and logical “or” b relations
Fig. 5.10 BDD merging method
BDD is connected to each end node of the primary BDD marked with 1. If the BDDs of the two phases are connected by a logical “or”, the secondary BDD is connected to each end node of the primary BDD marked with 0. The result of merging the two BDDs shown in Fig. 5.9, where G1 is the primary BDD and G2 is the secondary BDD, is shown in Fig. 5.10.

(3) BDD model simplification method. When converting the fault tree into a BDD, duplicate nodes may appear. To avoid contradictory states of a duplicated event, the following rules can be used for path simplification.
(1) The first occurrence of an event in a path fixes the state of the duplicated variable. A node that represents a second occurrence of the event is replaced with the structure below its 1 branch or its 0 branch, depending on the state of the event at its first occurrence in the path. If the first occurrence of the event is passed through the 1 branch of the BDD node, the node of the second occurrence is replaced with the BDD structure under its 1 branch; conversely, if the first occurrence is passed through the 0 branch, the node of the second occurrence is replaced with the BDD structure under its 0 branch.
(2) If the BDD structures under the 1 and 0 branches of a node are the same, the node is irrelevant and is replaced with the structure under either branch. In other words, if the state of the system does not depend on the occurrence of a basic event, the corresponding node must be deleted.
Fig. 5.11 BDD simplified model
(3) Node sharing. If the paths to the end nodes of the two BDDs (logical “and” input for end node 1 of the two BDDs and logical “or” input for end node 0 of the two BDDs) cross the same branch of the duplicate event, the same copy of the second BDD is connected to the two end nodes.
According to the above rules, the BDD in Fig. 5.10 is simplified, and the simplified model is shown in Fig. 5.11.

(4) BDD variable ordering. The variable ordering has a significant impact on the size of the final multi-stage BDD. Theoretically, if there are n variables in a Boolean expression, the number of corresponding BDD nodes lies in the interval $[n + 2,\ 2^{n} - 1]$. With the optimal ordering, only one node is generated for each variable, whereas with the worst ordering every node generates two further nodes. Improper ordering can therefore lead to an explosion in the number of generated BDD nodes. For PMS, the intra-stage variable ordering can use the structural importance ordering method or a heuristic ordering method, and the inter-stage variable ordering (the ordering between stages of the same variable) can use either the forward-phase or the backward-phase ordering method. Taking component A as an example, the ordering of A is A1, A2, …, An when the forward-phase method is used, and An, An−1, …, A1 when the backward-phase method is used. Compared with forward-phase ordering, backward-phase ordering simplifies the Boolean expressions of the same component at different phases to the maximum extent, and the generated BDD is smaller.

(5) Stage correlation processing. Components that appear in several stages can be processed by the micro-component method or the stage algebra method, so as to eliminate the influence of the stage correlation of components.
(1) Micro-component method. Assume that a non-repairable PMS component A in stage j can be replaced by a set of mutually independent micro-components $\{a_i\}_{i=1}^{j}$ in series. Figure 5.12 shows the reliability block diagram and fault tree representation of this method.
Fig. 5.12 Micro-component method
The probability of failure of component A at stage j (j = 1, 2, …, n), considering the stage dependence, is calculated as follows:

$$
F_{A,j}(t) = \left[1 - \prod_{i=1}^{j-1}\left(1 - p_{A,i}(T_i)\right)\right] + \prod_{i=1}^{j-1}\left(1 - p_{A,i}(T_i)\right) p_{A,j}(t), \tag{5.22}
$$

where $p_{A,i}(t)$ denotes the probability of failure of component A at stage i, $T_i$ denotes the duration of stage i, and $0 < t < T_j$.

(2) Stage algebra. For cross-stage components, assuming that i < j, the stage algebra rules are as follows:

$$
\begin{cases}
\bar{C}_i \cdot \bar{C}_j \to \bar{C}_j\\
C_i \cdot C_j \to C_i\\
C_i \cdot \bar{C}_j \to 0
\end{cases},
\qquad
\begin{cases}
\bar{C}_i + \bar{C}_j \to \bar{C}_i\\
C_i + C_j \to C_j\\
\bar{C}_i + C_j \to 1
\end{cases},
\tag{5.23}
$$

where $C_i$ indicates that the component fails at stage i, $C_j$ indicates that the component fails at stage j, and an overbar denotes the complementary event (the component has not failed).

(6) Calculation of task reliability at each stage. From the BDD, all the sets of disjoint paths (SDP) that make the system operate normally are found, and the occurrence probability of each SDP is calculated; the sum of the probabilities of the SDPs is the mission reliability of each stage [12].
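The stage algebra rules of Eq. (5.23) can be verified by enumeration. The Python sketch below assumes that the component is non-repairable and interprets C_i cumulatively, i.e., C_i is true when the component has failed by the end of stage i; the failure stage k and all function names are illustrative assumptions. Under this reading, each of the six rules must hold for every possible failure stage and every pair i < j.

```python
def failed_by(stage, failure_stage):
    """True if a component that fails in `failure_stage` has failed by the end of `stage`."""
    return failure_stage <= stage


def check_stage_algebra(n_stages):
    """Check the six rules of Eq. (5.23) for all i < j and all failure stages."""
    # failure_stage = n_stages + 1 encodes "never fails during the mission".
    for failure_stage in range(1, n_stages + 2):
        for i in range(1, n_stages + 1):
            for j in range(i + 1, n_stages + 1):
                ci = failed_by(i, failure_stage)
                cj = failed_by(j, failure_stage)
                rules = [
                    ((not ci) and (not cj)) == (not cj),  # not Ci . not Cj -> not Cj
                    (ci and cj) == ci,                    # Ci . Cj -> Ci
                    (ci and (not cj)) is False,           # Ci . not Cj -> 0
                    ((not ci) or (not cj)) == (not ci),   # not Ci + not Cj -> not Ci
                    (ci or cj) == cj,                     # Ci + Cj -> Cj
                    ((not ci) or cj) is True,             # not Ci + Cj -> 1
                ]
                if not all(rules):
                    return False
    return True


print(check_stage_algebra(4))  # prints: True
```

All six rules follow from one fact: for a non-repairable component, having failed by stage i implies having failed by every later stage j.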
5.3.4 Dynamic Reliability Analysis Method Based on Semi-Markov

The Markov method based on the state space model can perform reliability analysis of PMS containing complex dynamic failure behaviors, but the Markov method can
only analyze systems in which the component lifetimes obey the exponential distribution. The Semi-Markov model has the advantage of not being limited by the type of component lifetime distribution, so this section investigates the application of the Semi-Markov model to the reliability analysis of dynamic PMS [13].

1. Semi-Markov basic theory

(1) Markov renewal theory. Assume that the state of a multi-state system at any moment can be represented by a stochastic process {X, S} = {Xn, Sn; n ∈ [1, M]}. If {X, S} satisfies [14]

$$
P\{X_{n+1} = j,\ S_{n+1} - S_n \le t \mid X_n = i, \dots, X_0;\ S_n, \dots, S_0\} = P\{X_{n+1} = j,\ S_{n+1} - S_n \le t \mid X_n = i\}, \tag{5.24}
$$

then {X, S} is called a Markov renewal process, where Sn denotes the moment at which the working state of the system changes, Xn denotes the state of the system at time Sn, and M denotes the number of working states of the system. If, for all n ≥ 1, the conditional transfer probability Qi,j(t) of a Markov renewal process satisfies

$$
\begin{aligned}
Q_{i,j}(t) &= P(X_{n+1} = j,\ S_{n+1} - S_n \le t \mid X_n = i)\\
&= P(X_1 = j,\ S_1 - S_0 \le t \mid X_0 = i)\\
&= P(X_1 = j,\ S_1 \le t \mid X_0 = i),
\end{aligned}
\tag{5.25}
$$

then the Markov renewal process is time-homogeneous, and Q(t) = [Qi,j(t)] denotes the kernel matrix of this Markov renewal process.

(2) Semi-Markov process. If a stochastic process Y = {Yt; t ≥ 0} satisfies

$$
Y_t = X_{N(t)} = X_n, \quad t \ge 0,\ S_n < t < S_{n+1}, \tag{5.26}
$$

where N(t) is the counting process and denotes the N(t)-th change of the stochastic process Yt, then the stochastic process is a Semi-Markov process. The Semi-Markov process changes state at the times Sn and is memoryless only at those times. For a Semi-Markov process, if the initial state vector P(0) and the kernel matrix Q(t) are known, the state probabilities of the system at each time point can be found. θi,j(t) denotes the conditional probability of the system moving from state i to state j in the time interval [0, t] and satisfies
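$$
\theta_{i,j}(t) = \sigma_{i,j}\left(1 - \sum_{k=1}^{M} Q_{i,k}(t)\right) + \sum_{k=1}^{M} \int_0^t q_{i,k}(\tau)\,\theta_{k,j}(t - \tau)\,d\tau, \tag{5.27}
$$

where

$$
q_{i,k}(t) = \frac{dQ_{i,k}(t)}{dt}, \qquad
\sigma_{i,j} = \begin{cases} 1, & i = j\\ 0, & i \ne j \end{cases}.
$$

The state vector P(t) of the system at any moment is given by

$$
P(t) = P(0)\,\theta(t). \tag{5.28}
$$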
2. Semi-Markov method

When the Semi-Markov method alone is used to solve the dynamic PMS reliability, the stage correlation is handled as follows: the probability that the system is working normally at the end of the previous stage is taken as the initial normal-state probability of the next stage, while the initial failure-state probability of the next stage is set to 0. The steps are as follows.

Step 1. Analyze all the possible states of the system according to its working principle and structure, and establish the state space of the system. The state space can be expressed as S = {S1, S2, …, Sn}; the set of normal working states and the set of failure states can be expressed as W = {S1, S2, …, Sl} and F = {Sl+1, Sl+2, …, Sn}, respectively.

Step 2. Define the stochastic process {X(t), t > 0}, where X(t) = j denotes that the system is in state Sj at time t; its probability can be expressed as

$$
P_{S_j}(t) = P\{X(t) = j\}, \quad S_j \in S,\ t \ge 0. \tag{5.29}
$$
After that, the state transfer diagram of the system is drawn according to the above definition of the state transfer process.

Step 3. Determine the expressions of the kernel matrix Q(t) and the state transfer probability matrix θ(t) of the system based on the state transfer diagram.

Step 4. Based on the state probability vector Pi(0) at the initial moment of stage i, obtain the probability of each state at any moment of stage i.

Step 5. Let the probability of each normal state at the end of stage i be the initial probability of that state in stage i + 1, and set the probability of the failure states at the initial moment of stage i + 1 to 0, so as to obtain the probability of each state at the end of stage i + 1. Finally, sum all the normal-state probabilities to obtain the reliability that accounts for the correlation between stage i and stage i + 1.
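The hand-off between stages in Steps 4 and 5 can be sketched in a few lines of Python. The sketch assumes that all stages share the same state space with the normal states listed first, and that the matrix θ evaluated at the end of each stage is already available; the two 3 × 3 matrices used here are hypothetical numbers chosen only to illustrate the bookkeeping, and the function names are assumptions.

```python
def propagate(p0, theta):
    """State vector at the end of a stage, P = P(0) * theta, as in Eq. (5.28)."""
    n = len(p0)
    return [sum(p0[i] * theta[i][j] for i in range(n)) for j in range(n)]


def next_stage_initial(p_end, n_working):
    """Step 5: keep the normal-state probabilities and set the failure states to 0."""
    return [p if idx < n_working else 0.0 for idx, p in enumerate(p_end)]


def mission_reliability(p0, thetas, n_working):
    """Chain the stages and sum the normal-state probabilities at the end."""
    p = list(p0)
    for theta in thetas:
        p = propagate(next_stage_initial(p, n_working), theta)
    return sum(p[:n_working])


# Hypothetical end-of-stage matrices for a system with two normal states
# and one failure state (each row sums to 1).
theta_stage1 = [[0.90, 0.08, 0.02], [0.0, 0.95, 0.05], [0.0, 0.0, 1.0]]
theta_stage2 = [[0.80, 0.15, 0.05], [0.0, 0.90, 0.10], [0.0, 0.0, 1.0]]

reliability = mission_reliability([1.0, 0.0, 0.0], [theta_stage1, theta_stage2], n_working=2)
print(round(reliability, 3))  # prints: 0.927
```

Because the failure-state probability is discarded at each stage boundary, the state vector is no longer normalized after the first stage; the sum over the normal states is the probability that the system has survived every stage so far.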
5.3.5 Modular Analysis Method

For large-scale complex PMS, using the BDD method or the Semi-Markov method alone has many disadvantages: the PMS-BDD method can only analyze the reliability of static PMS, while the Semi-Markov method inevitably encounters the state space explosion problem when modeling the PMS as a whole. To fully utilize the advantages of both methods while avoiding their disadvantages, this section proposes a modular analysis method that combines the PMS-BDD method and the Semi-Markov method. It consists of five steps.

Step 1. Establish a multi-stage Dynamic Fault Tree (DFT) model of the PMS, and partition all bottom events according to the modular partitioning principle to obtain multiple mutually independent Module Basic Events (MBE). The mutually independent MBEs should satisfy the following two conditions:
(1) Each module is a combination of several bottom events.
(2) A static module contains only bottom events with static fault relationships, and a dynamic module contains only bottom events with dynamic fault relationships. Multiple dynamic modules may be constructed for the convenience of subsequent analysis, and each dynamic module should be as simple as possible.

Step 2. Use the MBEs obtained above as the bottom events of the Modularized Fault Tree (MFT) model and build the MFT model of the PMS according to the PMS fault logic.

Step 3. Order the MBEs in the MFT model using a top-down, left-to-right backward ordering method, and transform the MFT model of each stage into a BDD model of that stage according to this variable ordering. The BDD models of the stages are then merged step by step to obtain a BDD model that considers the stage correlation. All the SDPs that make the system task successful are found top-down in the BDD model, which gives the equation for calculating the reliability of each stage of the system.

Step 4. Based on the dynamic and static characteristics of each module and its failure parameters, select a suitable reliability calculation model and solve the reliability of each module separately; the dynamic failure modules can be solved by the Semi-Markov method and the static failure modules by conventional methods.

Step 5. Substitute the module reliabilities obtained in Step 4 into the system-level task reliability equation for each stage given in Step 3, and solve the task reliability for each stage.
5.3.6 Case Study

Figure 5.13 shows the DFT model of a hypothetical PMS, which does not capture the full dynamic failure characteristics of a complex PMS. The focus of this section
Fig. 5.13 Multi-stage DFT model for hypothetical systems
is to detail the process of applying the modular analysis method and to verify its effectiveness. The model consists of three task phases.

The components working in phase 1 are A, B, C, D, E, F and G. A and B form a static unit with a logical “and” fault relationship, and C and D form a dynamic unit with a “priority-and gate (PAND)” fault relationship. E, F and G constitute a dynamic working unit whose fault relationship contains a “functional dependence gate (FDEP)”: the failure of E directly causes the failure of F and G. The phase 1 system fails when any of the following conditions is satisfied: A and B fail simultaneously; C fails before D; E fails before F and G fail; E fails after F fails; or E fails after G fails.

The components operating in phase 2 are A, B, E, F, G and J. The fault logic relationships of A and B and of E, F and G are the same as in phase 1, and J is an independent working unit. The phase 2 system fails when any of the following conditions is met: A and B fail simultaneously; E fails before F and G fail; E fails after F fails; E fails after G fails; or J fails.

The components working in phase 3 are C, D, H and I. The fault logic relationship between C and D is the same as in phase 1, and H and I form a dynamic working unit with a fault relationship containing a “cold spare gate (CSP)”, which fails only when H fails first and then I fails.

Assume that the failure probability of each component in Fig. 5.13 obeys the Weibull distribution (Eq. (2.47)), that the failure parameters of each component are as shown in Table 5.6, and that the task times of the three phases are 10 h, 5 h and 20 h, respectively.
Fig. 5.14 MFT model for multi-stage missions
The steps of the multi-stage task reliability analysis of the system, based on the modular analysis method, are as follows.

Step 1. Partition the bottom events of the DFT model in Fig. 5.13 into five independent modules according to the modular partitioning principle: M1 = {A, B}, M2 = {C, D}, M3 = {E, F, G}, M4 = {H, I}, M5 = {J}, where M1 and M5 are static modules and the rest are dynamic modules. The sets of modules in each phase are T1 = {M1, M2, M3}, T2 = {M1, M3, M5}, T3 = {M2, M4}.

Step 2. Based on the set of modules in each phase, the MFT model of the PMS is constructed as shown in Fig. 5.14.

Step 3. First, the MFT of each stage is transformed into the BDD model of that stage, as shown in Fig. 5.15. After that, the PMS-BDD method is applied to gradually merge the BDD models of the first i (i = 2, 3) stages to obtain the system BDD models of the first i (i = 2, 3) stages, as shown in Fig. 5.16.
Fig. 5.15 Example of BDD model calculations for each phase: a T 1 stage failure; b T 2 stage failure; c T 3 stage failure
Fig. 5.16 System BDD Model: a T 1 and T 2 merge; b T 1 , T 2 and T 3 merge
According to the BDD models of the system shown in Figs. 5.15a and 5.16, the SDPs of the system in the first i phases are obtained as

$$
\begin{aligned}
P_{T_1} &= \mathrm{M1}_1\, \mathrm{M2}_1\, \mathrm{M3}_1,\\
P_{T_2} &= \mathrm{M1}_2\, \mathrm{M2}_1\, \mathrm{M3}_2\, \mathrm{M5}_2,\\
P_{T_3} &= \mathrm{M1}_2\, \mathrm{M2}_3\, \mathrm{M3}_2\, \mathrm{M4}_3\, \mathrm{M5}_2,
\end{aligned}
\tag{5.30}
$$

where $P_{T_i}$ (i = 1, 2, 3) denotes the system SDP of the first i stages. The system reliability is obtained as follows:

$$
R(t) =
\begin{cases}
R_{M1}(t)\,R_{M2}(t)\,R_{M3}(t), & 0 < t \le T_1\\
R_{M1}(t)\,R_{M2}(T_1)\,R_{M3}(t)\,R_{M5}(t - T_1), & T_1 < t \le T_1 + T_2\\
R_{M1}(T_1 + T_2)\,R_{M2}(t)\,R_{M3}(T_2)\,R_{M4}(t - T_1 - T_2)\,R_{M5}(T_2), & T_1 + T_2 < t \le T
\end{cases}
\tag{5.31}
$$

where $T_i$ (i = 1, 2, 3) denotes the mission time of stage i and $T = T_1 + T_2 + T_3$.

Step 4. According to the dynamic and static characteristics of each module and its failure parameters, a suitable reliability calculation model is selected. M1 and M5 are both static modules and are solved by conventional methods. The remaining modules are dynamic modules whose life parameters obey the Weibull distribution, so the Semi-Markov method can be used to calculate their reliability in each phase. The following is a detailed description of the Semi-Markov solution for the reliability of module M4.

The module M4 consists of two components, H and I, where H is the main working unit and I is the cold backup unit; I will be put into operation only when
Fig. 5.17 M4 Module state transfer diagram
H fails, and the probability of failure during the period when I is not working is 0. The state space of the module M4 is {S1, S2, S3}, where {S1, S2} is the set of normal states and {S3} is the set of failure states. S1 indicates that H is working and I is in the cold backup state, S2 indicates that H has changed from the normal to the failed state and I has changed from the backup to the working state, and S3 indicates that I has changed from the working to the failed state. Based on the above analysis, the state transfer diagram of the M4 module is drawn as shown in Fig. 5.17, where $X$ and $\bar{X}$ (X = H, I) indicate that the part is in a normal state and a failed state, respectively. After that, the expressions for the kernel matrix Q(t) and the state transition matrix θ(t) of the M4 module transfer process are determined as

$$
Q(t) = \begin{bmatrix} 0 & Q_{1,2}(t) & 0\\ 0 & 0 & Q_{2,3}(t)\\ 0 & 0 & 0 \end{bmatrix}, \tag{5.32}
$$

$$
\theta(t) = \begin{bmatrix} \theta_{1,1}(t) & \theta_{1,2}(t) & \theta_{1,3}(t)\\ 0 & \theta_{2,2}(t) & \theta_{2,3}(t)\\ 0 & 0 & \theta_{3,3}(t) \end{bmatrix}. \tag{5.33}
$$
Based on the dynamic fault logic relationships within the M4 module, the elements of the kernel matrix Q(t) and their first derivatives are expressed as

$$
Q_{1,2}(t) = P\{T_{1,2} \le t\} = \int_0^t dF_H(\tau), \tag{5.34}
$$

$$
Q_{2,3}(t) = P\{T_{2,3} \le t\} = \int_0^t dF_I(\tau), \tag{5.35}
$$

$$
q_{1,2}(t) = \frac{dQ_{1,2}(t)}{dt} = f_H(t), \tag{5.36}
$$

$$
q_{2,3}(t) = \frac{dQ_{2,3}(t)}{dt} = f_I(t), \tag{5.37}
$$
where $F_H(t)$ and $F_I(t)$ denote the failure time distributions of H and I, respectively, and $f_H(t)$ and $f_I(t)$ are the corresponding probability density functions. Substituting the calculated Q(t) and $q_{i,j}(t)$ gives the corresponding set of equations for the conditional probabilities θ(t):

$$
\begin{cases}
\theta_{1,1}(t) = 1 - Q_{1,2}(t),\\
\theta_{2,2}(t) = 1 - Q_{2,3}(t),\\
\theta_{3,3}(t) = 1,\\
\theta_{1,2}(t) = \displaystyle\int_0^t q_{1,2}(\tau)\,\theta_{2,2}(t - \tau)\,d\tau,\\
\theta_{1,3}(t) = \displaystyle\int_0^t q_{1,2}(\tau)\,\theta_{2,3}(t - \tau)\,d\tau,\\
\theta_{2,3}(t) = \displaystyle\int_0^t q_{2,3}(\tau)\,\theta_{3,3}(t - \tau)\,d\tau.
\end{cases}
\tag{5.38}
$$
Then, the integrals in the above system of equations are solved; here the trapezoidal rule is used for the approximate solution. Taking $\theta_{1,2}(t)$ as an example, its trapezoidal approximation can be expressed as

$$
\theta_{1,2}(t) = \int_0^t q_{1,2}(\tau)\,\theta_{2,2}(t - \tau)\,d\tau
\approx \sum_{k=1}^{L} \frac{1}{2}\left[q_{1,2}(\tau_k)\,\theta_{2,2}(t - \tau_k) + q_{1,2}(\tau_{k+1})\,\theta_{2,2}(t - \tau_{k+1})\right] (\tau_{k+1} - \tau_k). \tag{5.39}
$$
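The trapezoidal approximation of Eq. (5.39) can be sketched in Python as follows. The Weibull shape and scale values below are hypothetical placeholders (the actual parameters of Table 5.6 are not reproduced here), the two-parameter Weibull form with shape ≥ 1 is an assumption about Eq. (2.47), and all function names are illustrative. The third state probability is obtained from the fact that each row of θ(t) sums to 1.

```python
import math


def weibull_cdf(t, shape, scale):
    """Two-parameter Weibull distribution function (assumed form, shape >= 1)."""
    x = max(t, 0.0) / scale
    return 1.0 - math.exp(-(x ** shape))


def weibull_pdf(t, shape, scale):
    """Two-parameter Weibull density (assumed form, shape >= 1)."""
    x = max(t, 0.0) / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))


def theta_12(t, pdf_h, cdf_i, steps=2000):
    """Trapezoidal approximation of Eq. (5.39) on a uniform grid of `steps` intervals."""
    h = t / steps
    total = 0.0
    for k in range(steps):
        left, right = k * h, (k + 1) * h
        g_left = pdf_h(left) * (1.0 - cdf_i(t - left))     # q12(tau_k) * theta22(t - tau_k)
        g_right = pdf_h(right) * (1.0 - cdf_i(t - right))  # same integrand at tau_{k+1}
        total += 0.5 * (g_left + g_right) * h
    return total


def m4_state_probabilities(t, pdf_h, cdf_h, cdf_i, steps=2000):
    """First row of theta(t) for the cold-spare module, starting from state S1."""
    th11 = 1.0 - cdf_h(t)
    th12 = theta_12(t, pdf_h, cdf_i, steps)
    th13 = 1.0 - th11 - th12  # the row of theta(t) sums to 1
    return th11, th12, th13


# Hypothetical Weibull parameters for H and I; stage duration of 20 h.
probs = m4_state_probabilities(
    20.0,
    pdf_h=lambda u: weibull_pdf(u, 1.5, 150.0),
    cdf_h=lambda u: weibull_cdf(u, 1.5, 150.0),
    cdf_i=lambda u: weibull_cdf(u, 1.2, 100.0),
)
print(probs, "reliability:", probs[0] + probs[1])
```

With exponential lifetimes the convolution has a closed form, which gives a convenient way to check the accuracy of the chosen number of intervals before switching to Weibull parameters.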
The M4 module works only in stage 3, so its initial state probability vector P(0) is [1 0 0]. Based on the above calculation, the state probability vector P of the M4 module at the end of stage 3 is [0.9632 0.0363 0.0005], which means that the reliability of M4 is 0.9995.

Step 5. The reliability results of each module are substituted into Eq. (5.31), and the reliability of each phase of the task is obtained as 0.9655, 0.9318, and 0.9515, respectively.

When solving the PMS shown in Fig. 5.13, the modular approach requires solving the Semi-Markov equations for at most 4 states. In contrast, when the traditional Semi-Markov model approach is used for the overall modeling, assuming that each
component has two states, the whole system of ten components (A to J) has at most $2^{10}$ states, and the solution process is very complicated; due to the lack of relevant software support, the Semi-Markov equations for this overall model are not solved here. It can be seen that the modular analysis approach is more efficient for solving the reliability of PMS with dynamic fault behavior, which is its most significant advantage over either single approach.
5.4 Conclusions

On the basis of the multi-source data processing demands of the reliability assessment of civil aircraft, this chapter proposes a reliability processing system for machinery parts to tackle the multi-source data problem in calculating the reliability level of the entire airline industry. A comprehensive reliability assessment model was established to support reliability improvement and continuous airworthiness. The major results are summarized as follows.

a. The reliability processing system includes providing basic multi-source data from airlines or maintenance bases, forming an appropriate data set by the Pauta criterion and linear interpolation, sorting the multi-source data by fusion requirements, and assessing the comprehensive reliability level for a given index.
b. The fusion requirements are related to the weight evaluation criteria of the multi-source data. The importance of different data sources is decomposed into a subjective index weight (determined by the analytic hierarchy process and Euclidean distance) and an objective index weight (determined by the discrete degree of the index weight).
c. An example is presented to illustrate the proposed modeling method. According to the part’s statistical failure times from the airline, the maintenance base and the flight test laboratory, the trend between comprehensive reliability and service time is simulated and conforms to the actual reliability change; the method is simplified and validated as practical through case analysis.
d. The comprehensive method considering the data sources obtains a more accurate reliability of mechanical parts and provides new insights for assessing the reliability, maintenance and safe operation of civil aircraft.
In the multi-stage mission reliability analysis of complex airborne systems, the traditional fault tree, reliability block diagram model and Markov model are difficult to model and solve for reliability due to a large number of components, the dynamic correlation of component failures, and the existence of multiple distribution types of component lifetimes. In this chapter, the BDD method and Semi-Markov method with stronger reliability modeling and analysis capabilities are investigated, and the advantages and disadvantages of these two methods, their applicability and the application process in PMS reliability analysis are analyzed. Then a modular analysis method that combines the advantages of both is proposed. The effectiveness of the method is verified by a dynamic PMS reliability analysis arithmetic example. Compared with
the overall method of building a state transfer model, the modular analysis method can alleviate the state space explosion problem and reduce the difficulty of solving complex PMS reliability.
References
1. Wang L. Study on the reliability data screening and imputation of civil aircraft [D]. Nanjing: Nanjing University of Aeronautics and Astronautics, 2010.
2. McGough J, Reibman A, Trivedi K. Markov reliability models for digital flight control systems [J]. Journal of Guidance, Control, and Dynamics, 1989, 12(2): 209–219.
3. Li X Y, Huang H Z, Li Y F. Reliability analysis of phased mission system with non-exponential and partially repairable components [J]. Reliability Engineering & System Safety, 2018, 175: 119–127.
4. Zhai Q, Xing L, Peng R, Yang J. Aggregated combinatorial reliability model for non-repairable parallel phased-mission systems [J]. Reliability Engineering & System Safety, 2018, 176: 242–250.
5. Chew S, Dunnett S, Andrews J. Phased mission modelling of systems with maintenance-free operating periods using simulated Petri nets [J]. Reliability Engineering & System Safety, 2008, 93(7): 980–994.
6. Yan H, Gao L, Qi L, Wan P. Simplified Markov model for reliability analysis of phased-mission system using states merging method [J]. Journal of Shanghai Jiaotong University, 2018, 23: 418–422.
7. Yang X S, Wu X Y, Wu X Y. Automated generation of mission reliability simulation model for space tracking, telemetry and control system by extensible markup language and extended object-oriented Petri net [J]. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability, 2014, 228(4): 397–408.
8. Esary J, Ziehms H. Reliability Analysis of Phased Missions [M]. Naval Postgraduate School Monterey CA, 1975.
9. Mo Y. New insights into the BDD-based reliability analysis of phased-mission systems [J]. IEEE Transactions on Reliability, 2009, 58(4): 667–678.
10. Zang X, Sun N, Trivedi K S. A BDD-based algorithm for reliability analysis of phased-mission systems [J]. IEEE Transactions on Reliability, 1999, 48(1): 50–60.
11. Xing L, Levitin G. Combinatorial analysis of systems with competing failures subject to failure isolation and propagation effects [J]. Reliability Engineering & System Safety, 2010, 95(11): 1210–1215.
12. Li X T, Tao L M, Jia M. A Bayesian networks approach for event tree time-dependency analysis on phased-mission system [J]. Eksploatacja i Niezawodnosc - Maintenance and Reliability, 2015, 17(2): 273–281.
13. Ou Y, Dugan J B. Modular solution of dynamic multi-phase systems [J]. IEEE Transactions on Reliability, 2004, 53(4): 499–508.
14. Li X Y, Huang H Z, Li Y F, Xiong X. A Markov regenerative process model for phased mission systems under internal degradation and external shocks [J]. Reliability Engineering & System Safety, 2021, 215: 107796.
Chapter 6
System Reliability Prediction and Allocation
6.1 Introduction

As noted earlier, a system is a set of elements that are related or linked so as to form an integrated whole. From an engineering point of view, a system is a collection of independent and interrelated elements that are coordinated according to a specified structure in order to achieve a specific performance and reliability while meeting environmental, safety, and legal regulations. From a hierarchical point of view, a system consists of subsystems that may be further divided into lower-level subsystems, depending on the aim of the system analysis. Elements are the lowest-level components of a system. For example, an aircraft is a typical complex system. It contains an airframe, a powerplant, and hydraulic, electrical and other subsystems. The powerplant subsystem consists of the engine, gearbox, fuel subsystem, etc., which are themselves very complex and can be divided further into lower-level subsystems.

A reliability goal is normally set for a whole product, which can be regarded as a system. To achieve the overall reliability, it is important to allocate it among the subsystems that comprise the product, particularly when suppliers take part in the product manufacturing process. The reliability allocated to a subsystem becomes its goal, and the producer should ensure that this goal is achieved. In the aircraft example, the overall reliability goal for the aircraft should be allocated to the airframe, powerplant and other subsystems. The reliability allocated to the powerplant is further apportioned to the engine, gearbox and other subsystems. The allocation process continues until the unit level is reached. The aircraft suppliers should then deliver the reliability of the units they are contracted to produce. In this chapter, we present various reliability allocation approaches.
An integrated reliability program usually requires estimating product reliability at the design and development stages for different purposes, for example the choice of structural materials and components, the comparison of design alternatives, and reliability prediction and improvement. Once the design of a system or subsystem is finished, its reliability must be estimated and compared with the reliability goal that has been defined or allocated. If the goal is not met, the design must be revised and its reliability re-estimated. This process continues until the desired reliability level is reached. In the aircraft example, the reliability of the aircraft should be determined after the designs of the airframe and systems are completed and the reliabilities of the units are available. The process is usually iterated several times and can even lead to a reallocation of reliability if the goals of some subsystems are unachievable.

This chapter describes approaches for estimating the reliability of systems with various configurations, including series, parallel, series–parallel, and k-out-of-n voting systems. Approaches for determining confidence intervals for system reliability are shown, and measures of component importance are also presented. Because knowledge of system structure is a prerequisite for reliability allocation, it is presented first in the chapter.

© Science Press 2024 Y. Sun et al., Reliability Engineering, https://doi.org/10.1007/978-981-99-5978-5_6
6.2 Reliability Prediction

6.2.1 Reliability Block Diagram

A reliability block diagram is a graphical representation of the logical connections of elements in a system. The basic logical connections are series and parallel, from which more complex system structures can be built, such as series–parallel and k-out-of-n voting systems. In a reliability block diagram, elements are represented by rectangular blocks that are linked by lines according to their logical connections. Depending on the purpose of the system analysis, a block can be a lowest-level element, a module, or a subsystem. It is treated as a black box whose internal details are not shown and need not be known. The reliability of the element that a block represents is the only input to the system reliability estimation. The following example shows the construction of reliability block diagrams at various levels of a system.

Example 6.1 Figure 6.1 shows the structure of an aircraft that contains an airframe, a powerplant, and subsystems [1]. Every subsystem is further divided into multiple lower-level subsystems. From a reliability point of view, the aircraft is a series system that fails if one or more subsystems fail. Figure 6.2 shows the reliability block diagram of the aircraft, in which the blocks represent the first-level subsystems, assuming that their reliabilities are known. Figure 6.3 is a diagram whose blocks represent second-level subsystems. Comparing Figs. 6.2 and 6.3 shows that the complexity of a reliability block diagram increases with the level of subsystem that the blocks represent. The reliability block diagram of a typical aircraft contains more than a hundred thousand blocks if every block represents an element or part.

When drawing a reliability block diagram, note that physical connections in series or parallel do not necessarily correspond to the same logical connections in terms of reliability.
For example, an aircraft piston engine can have six cylinders that are physically connected in parallel. From a reliability standpoint, the six cylinders are in series because the engine is considered to have failed if one or more cylinders fail. Drawing a reliability block diagram for a complex system is time-consuming. Fortunately, this process can now be eased by using dedicated software. A reliability block diagram is a versatile and fundamental tool for system reliability analysis.

Fig. 6.1 Common structure of a typical aircraft [1]

Fig. 6.2 Reliability block diagram with blocks representing first-level subsystems

Fig. 6.3 Reliability block diagram with blocks representing second-level subsystems
6.2.2 Series Systems

A system is called a series system if the failure of one or more elements leads to failure of the whole system. In other words, all elements of the system must operate for the system to operate. Figures 6.2 and 6.3 show the aircraft as a series system at two hierarchical levels. The reliability of a general series system can be determined as follows. Suppose that a series system contains n mutually independent elements. Here, mutual independence means that the failure of one element does not affect the operation of the other elements. By definition, successful operation of the system requires all elements to be operational. From probability theory, the system reliability is

$$R = \Pr(E) = \Pr(E_1 E_2 \cdots E_n),$$

where R is the system reliability, E is the event that the system is operational, and $E_i$ is the event that element i (i = 1, 2, …, n) is operational. Because of the independence assumption, this becomes

$$R = \Pr(E_1)\Pr(E_2)\cdots\Pr(E_n) = \prod_{i=1}^{n} R_i, \tag{6.1}$$

where $R_i$ is the reliability of element i. If the n elements are identical with reliability $R_0$, the system reliability is

$$R = R_0^{\,n}. \tag{6.2}$$
Equation (6.1) shows that the system reliability is the product of the element reliabilities. This result is unfavorable, in that the system reliability is lower than the reliability of any single element. Moreover, the system reliability decreases quickly as the number of elements in the system grows. This observation supports the principle of minimizing the complexity of an engineering design.

Consider a simple case in which the times to failure of the n elements in a system are exponentially distributed. The exponential reliability function for element i is $R_i(t) = \exp(-\lambda_i t)$, where $\lambda_i$ is the failure rate of element i. Then from Eq. (6.1), the system reliability can be expressed as

$$R(t) = \exp\left(-t\sum_{i=1}^{n}\lambda_i\right) = \exp(-\lambda t), \tag{6.3}$$

where λ is the failure rate of the system and

$$\lambda = \sum_{i=1}^{n}\lambda_i. \tag{6.4}$$

The mean time to failure of the system is

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\sum_{i=1}^{n}\lambda_i}. \tag{6.5}$$
Equation (6.3) shows that the life of a system is exponentially distributed if all elements in the system are exponentially distributed, and that the failure rate of the system is the sum of the individual failure rates. Equation (6.3) is commonly applied, and often misapplied, because it is simple. For example, MIL-HDBK-217F [2] assumes that all elements have constant failure rates and applies Eq. (6.4) to determine the system failure rate.

Example 6.2 Refer to Fig. 6.2. Suppose that the lifetimes of the airframe, powerplant, and electrical and hydraulic subsystems are exponentially distributed with $\lambda_1 = 5.1 \times 10^{-4}$, $\lambda_2 = 6.3 \times 10^{-4}$, $\lambda_3 = 5.5 \times 10^{-5}$, and $\lambda_4 = 4.8 \times 10^{-4}$ failures per 1000 flight hours, respectively. Determine the reliability of the aircraft at 36,000 flight hours and the mean time to failure.

Solution Substituting the values of $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ into Eq. (6.4) gives

$$\lambda = 5.1 \times 10^{-4} + 6.3 \times 10^{-4} + 5.5 \times 10^{-5} + 4.8 \times 10^{-4} = 16.75 \times 10^{-4}\ \text{failures per 1000 flight hours}.$$

The reliability at 36,000 flight hours (t = 36 in units of 1000 flight hours) is

$$R(36{,}000) = \exp(-16.75 \times 10^{-4} \times 36) = 0.9415.$$

The mean time to failure (MTTF) is obtained from Eq. (6.5) as

$$\mathrm{MTTF} = \frac{1}{16.75 \times 10^{-4}} = 597\ \text{thousand flight hours} = 597\,000\ \text{flight hours}.$$
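The arithmetic of Example 6.2 can be checked with a few lines of Python. This is only a minimal sketch of Eqs. (6.3) to (6.5); the function names are illustrative and not part of the text, and time is expressed in units of 1000 flight hours to match the failure rates.

```python
import math

# Failure rates from Example 6.2, in failures per 1000 flight hours.
rates = [5.1e-4, 6.3e-4, 5.5e-5, 4.8e-4]

def series_failure_rate(rates):
    """Eq. (6.4): the system failure rate is the sum of the element rates."""
    return sum(rates)

def series_reliability(rates, t):
    """Eq. (6.3): R(t) = exp(-lambda * t); t uses the same time unit as the rates."""
    return math.exp(-series_failure_rate(rates) * t)

def series_mttf(rates):
    """Eq. (6.5): MTTF = 1 / lambda, in the time unit of the rates."""
    return 1.0 / series_failure_rate(rates)

lam = series_failure_rate(rates)
r_36 = series_reliability(rates, 36.0)    # 36,000 flight hours = 36 units of 1000 h
mttf_hours = series_mttf(rates) * 1000.0  # convert back to flight hours

print(f"lambda = {lam:.4e} per 1000 h")
print(f"R(36,000 h) = {r_36:.4f}")
print(f"MTTF = {mttf_hours:.0f} flight hours")
```

Keeping the time unit consistent with the unit of the failure rates is the main point of care here; mixing hours with rates per 1000 hours is an easy way to misapply Eq. (6.3).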
Now consider another case, in which the times to failure of the n elements in a system follow the Weibull distribution. The Weibull reliability function for element i is

$$R_i(t) = \exp\left[-\left(\frac{t}{\theta_i}\right)^{b_i}\right],$$

where $\theta_i$ and $b_i$ are, respectively, the characteristic life (scale parameter) and the shape parameter of element i. From Eq. (6.1) the system reliability is

$$R(t) = \exp\left[-\sum_{i=1}^{n}\left(\frac{t}{\theta_i}\right)^{b_i}\right]. \tag{6.6}$$

Then the failure rate h(t) of the system is

$$h(t) = \sum_{i=1}^{n}\frac{b_i}{\theta_i}\left(\frac{t}{\theta_i}\right)^{b_i-1}. \tag{6.7}$$

Equation (6.7) shows that, as in the exponential case, the failure rate of the system is the sum of the individual failure rates. When $b_i = 1$, Eq. (6.7) reduces to Eq. (6.4), where $\lambda_i = 1/\theta_i$. If the n elements have a common shape parameter b, the mean time to failure of the system is given by
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{\Gamma\left(\frac{1}{b}+1\right)}{\left[\sum_{i=1}^{n}\left(\frac{1}{\theta_i}\right)^{b}\right]^{1/b}}, \tag{6.8}$$

where Γ is the gamma function.

Fig. 6.4 Resonating circuit (a) and reliability block diagram for the resonating circuit (b): 1-AC power; 2-Inductor; 3-Capacitor; 4-Resistor

Example 6.3 [3] A single-tuned network contains an alternating-current supply source, a resistor, a capacitor and an inductor, as shown in Fig. 6.4a. From a reliability point of view, the circuit is a series system. The reliability block diagram is shown in Fig. 6.4b. The times to failure of the elements follow Weibull distributions with the parameters shown in Fig. 6.4b. Determine the reliability and failure rate of the circuit at $5 \times 10^4$ h.

Solution Substituting the values of the Weibull parameters into Eq. (6.6) gives

$$R(5 \times 10^4) = \exp\left[-\left(\frac{5 \times 10^4}{3.3 \times 10^5}\right)^{1.3} - \left(\frac{5 \times 10^4}{1.5 \times 10^6}\right)^{1.8} - \left(\frac{5 \times 10^4}{4.7 \times 10^6}\right)^{1.6} - \left(\frac{5 \times 10^4}{7.3 \times 10^5}\right)^{2.3}\right] = 0.913.$$

The failure rate is determined from Eq. (6.7) as

$$h(5 \times 10^4) = \frac{1.3}{3.3 \times 10^5}\left(\frac{5 \times 10^4}{3.3 \times 10^5}\right)^{1.3-1} + \frac{1.8}{1.5 \times 10^6}\left(\frac{5 \times 10^4}{1.5 \times 10^6}\right)^{1.8-1} + \frac{1.6}{4.7 \times 10^6}\left(\frac{5 \times 10^4}{4.7 \times 10^6}\right)^{1.6-1} + \frac{2.3}{7.3 \times 10^5}\left(\frac{5 \times 10^4}{7.3 \times 10^5}\right)^{2.3-1} = 2.43 \times 10^{-6}\ \text{failures per hour}.$$
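The numbers in Example 6.3 can be reproduced with a short Python sketch of Eqs. (6.6) and (6.7). The parameter pairs below are read off the worked solution (characteristic life in hours, shape parameter); the function names are illustrative assumptions.

```python
import math

# (theta_i in hours, b_i) for the four circuit elements, as used in Example 6.3.
params = [(3.3e5, 1.3), (1.5e6, 1.8), (4.7e6, 1.6), (7.3e5, 2.3)]

def weibull_series_reliability(params, t):
    """Eq. (6.6): R(t) = exp(-sum((t / theta_i) ** b_i))."""
    return math.exp(-sum((t / theta) ** b for theta, b in params))

def weibull_series_hazard(params, t):
    """Eq. (6.7): h(t) = sum((b_i / theta_i) * (t / theta_i) ** (b_i - 1))."""
    return sum((b / theta) * (t / theta) ** (b - 1) for theta, b in params)

t = 5e4  # hours
R = weibull_series_reliability(params, t)
h = weibull_series_hazard(params, t)

print(f"R({t:.0e} h) = {R:.3f}")
print(f"h({t:.0e} h) = {h:.3e} failures per hour")
```

Note that every term of the hazard sum is positive, so the system failure rate can only grow as elements are added in series.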
6.2.3 Parallel Systems

A system is called a parallel system if and only if the failure of all elements in the system leads to the failure of the whole system. In other words, a parallel system is operational if one or more elements are operational. For example, a lighting system that contains three lightbulbs in an office is a parallel system, because the office goes dark only when all three lightbulbs fail. The reliability block diagram of the lighting system is shown in Fig. 6.5. The reliability of a general parallel system is determined as follows. Suppose that a parallel system contains n mutually independent elements. By definition, all n elements must fail for the parallel system to fail. From probability theory, the system unreliability is

$$F = \Pr(\bar{E}) = \Pr(\bar{E}_1 \bar{E}_2 \cdots \bar{E}_n),$$

where F is the system unreliability (probability of failure), E is the event that the system is operational, $E_i$ is the event that element i is operational, and $\bar{X}$ is the complement of X, where X is $E_i$ or E. Because the events $\bar{E}_i$ (i = 1, 2, …, n) are mutually independent, this equation can be written as

$$F = \Pr(\bar{E}_1)\Pr(\bar{E}_2)\cdots\Pr(\bar{E}_n) = \prod_{i=1}^{n}(1 - R_i). \tag{6.9}$$

The system reliability is the complement of the system unreliability:

$$R = 1 - \prod_{i=1}^{n}(1 - R_i). \tag{6.10}$$

If the n elements are identical, then

$$R = 1 - (1 - R_0)^n. \tag{6.11}$$
If R is specified in advance as a goal, the minimum number of elements required to reach the goal is

$$n = \frac{\ln(1 - R)}{\ln(1 - R_0)}, \tag{6.12}$$

rounded up to the next integer.

Fig. 6.5 Reliability block diagram of the lighting system: 1-Bulb 1; 2-Bulb 2; 3-Bulb 3.

If the lives of the n identical elements are exponentially distributed with failure rate λ, then

$$R(t) = 1 - [1 - \exp(-\lambda t)]^n. \tag{6.13}$$

The mean time to failure of the system is calculated by

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}\sum_{i=1}^{n}\frac{1}{i}. \tag{6.14}$$

In contrast to a series system, the reliability of a parallel system grows with the number of elements in the system, as shown in Eq. (6.10). Therefore, a parallel structure is a way of improving system reliability and is widely applied in safety-critical systems such as aircraft and spacecraft. Nevertheless, its application is often limited by other considerations, such as the additional cost and weight due to the increased number of elements. For example, parallel structures are seldom used to increase car reliability because of their high cost.

Example 6.4 [3] Refer to Fig. 6.5. Suppose that the lighting system contains three identical lightbulbs and that the other elements in the system are 100% reliable. The times to failure of the lightbulbs follow the Weibull distribution with parameters b = 1.35 and θ = 35 800 h. Determine the reliability of the system after 8760 h of operation. If the system reliability goal is 99.99%, how many lightbulbs should be connected in parallel?

Solution Because the life of the lightbulbs follows the Weibull distribution, the reliability of a single lightbulb after 8760 h of operation is

$$R_0 = \exp\left[-\left(\frac{8760}{35\,800}\right)^{1.35}\right] = 0.8611.$$

Substituting the value of $R_0$ into Eq. (6.11) gives the system reliability at 8760 h as

$$R = 1 - (1 - 0.8611)^3 = 0.9973.$$

By Eq. (6.12), the minimum number of lightbulbs needed to reach 99.99% reliability is

$$n = \frac{\ln(1 - 0.9999)}{\ln(1 - 0.8611)} = 4.67,$$

which is rounded up to n = 5.
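Example 6.4 can be verified with the following minimal Python sketch of Eqs. (6.11) and (6.12). The helper names are illustrative assumptions; the ceiling in `min_parallel_elements` makes explicit that the ratio in Eq. (6.12) must be rounded up to a whole number of elements.

```python
import math

def weibull_reliability(t, theta, b):
    """Weibull reliability with characteristic life theta and shape parameter b."""
    return math.exp(-((t / theta) ** b))

def parallel_reliability(r0, n):
    """Eq. (6.11): n identical, independent elements in parallel."""
    return 1.0 - (1.0 - r0) ** n

def min_parallel_elements(r_goal, r0):
    """Eq. (6.12), rounded up to the next integer."""
    return math.ceil(math.log(1.0 - r_goal) / math.log(1.0 - r0))

r0 = weibull_reliability(8760.0, theta=35800.0, b=1.35)
r_sys = parallel_reliability(r0, 3)
n_min = min_parallel_elements(0.9999, r0)

print(f"R0 = {r0:.4f}, R(3 bulbs) = {r_sys:.4f}, n_min = {n_min}")
```

Evaluating `parallel_reliability(r0, 4)` and `parallel_reliability(r0, 5)` shows that four bulbs fall just short of the 99.99% goal while five exceed it.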
6.2.4 Combined Structure

There are cases in which series and parallel structures are combined in a system to meet operational or reliability requirements. The combinations create series–parallel and parallel-series structures.

1. Series–parallel systems

In general, a series–parallel system consists of n subsystems in series with $m_i$ (i = 1, 2, …, n) elements in parallel in subsystem i, as shown in Fig. 6.6. The structure is sometimes called the low-level redundancy structure. To determine the system reliability, first reduce every parallel subsystem to an equivalent reliability block. By Eq. (6.10), the reliability $R_i$ of block i is

$$R_i = 1 - \prod_{j=1}^{m_i}(1 - R_{ij}), \tag{6.15}$$

where $R_{ij}$ is the reliability of element j in subsystem i, i = 1, 2, …, n and j = 1, 2, …, $m_i$. The n blocks form a series system equivalent to the original system, as illustrated in Fig. 6.7. So, the system reliability R is obtained from Eqs. (6.1) and (6.15) as

$$R = \prod_{i=1}^{n}\left[1 - \prod_{j=1}^{m_i}(1 - R_{ij})\right]. \tag{6.16}$$

If all elements in the series–parallel system are identical and the number of elements in every subsystem is the same, Eq. (6.16) reduces to
$$R = [1 - (1 - R_0)^m]^n, \tag{6.17}$$

where $R_0$ is the reliability of an individual element and m is the number of elements in every subsystem.

Fig. 6.6 Common series–parallel system

Fig. 6.7 Reliability block diagram equivalent to Fig. 6.6

Fig. 6.8 Common parallel-series system

2. Parallel-series systems

A general parallel-series system contains m subsystems in parallel with $n_i$ (i = 1, 2, …, m) elements in series in subsystem i, as illustrated in Fig. 6.8. The structure is also known as the high-level redundancy structure. To determine the system reliability, first reduce every series subsystem to an equivalent reliability block. By Eq. (6.1) the reliability $R_i$ of block i is

$$R_i = \prod_{j=1}^{n_i} R_{ij}, \quad i = 1, 2, \ldots, m, \tag{6.18}$$

where $R_{ij}$ is the reliability of element j in subsystem i. The m blocks form a parallel system equivalent to the original one, as illustrated in Fig. 6.9. Substituting Eq. (6.18) into Eq. (6.10) yields the reliability of the parallel-series system as

$$R = 1 - \prod_{i=1}^{m}\left(1 - \prod_{j=1}^{n_i} R_{ij}\right). \tag{6.19}$$

If all elements in the parallel-series system are identical and the number of elements in every subsystem is the same, the system reliability can be expressed as

$$R = 1 - (1 - R_0^{\,n})^m, \tag{6.20}$$
where $R_0$ is the reliability of an individual element and n is the number of elements in every series subsystem.

Example 6.5 Suppose that a researcher is given four identical elements, each having 90% reliability at the service life. The researcher wants to select the system structure with the higher reliability from the series–parallel and parallel-series structures. The two structures are illustrated in Figs. 6.10 and 6.11. Which structure should the researcher choose from the reliability standpoint?

Solution By Eq. (6.17), the reliability of the series–parallel structure is

$$R = [1 - (1 - 0.9)^2]^2 = 0.9801.$$

By Eq. (6.20), the reliability of the parallel-series structure is

$$R = 1 - (1 - 0.9^2)^2 = 0.9639.$$

Clearly, the series–parallel design should be chosen.

In general, the reliability of a series–parallel system is higher than that of a parallel-series system if both contain the same number of elements. To illustrate this numerically, Fig. 6.12 shows the reliabilities of the two systems versus the element reliability for different combinations of the values of m and n. In this figure, S-P stands for series–parallel and P-S for parallel-series. The difference between the reliabilities is large when the element reliability is low. However, the difference decreases as the element reliability grows and becomes negligible for high values, about 0.99. Figure 6.12 also shows that, given the same number of elements, a system with n > m has a lower reliability than one with m > n.

Fig. 6.9 Reliability block diagram equivalent to Fig. 6.8

Fig. 6.10 Series–parallel structure

Fig. 6.11 Parallel-series structure

Fig. 6.12 Reliability of series–parallel (S-P) and parallel-series (P-S) structures [3]
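The comparison in Example 6.5 and the trend described for Fig. 6.12 can be explored with a small Python sketch of Eqs. (6.17) and (6.20). The function names and the sample element reliabilities are illustrative assumptions, not values taken from the figure.

```python
def series_parallel(r0, m, n):
    """Eq. (6.17): n subsystems in series, each with m identical elements in parallel."""
    return (1.0 - (1.0 - r0) ** m) ** n

def parallel_series(r0, n, m):
    """Eq. (6.20): m subsystems in parallel, each with n identical elements in series."""
    return 1.0 - (1.0 - r0 ** n) ** m

# Example 6.5: four elements with R0 = 0.9, arranged with m = n = 2.
r_sp = series_parallel(0.9, m=2, n=2)
r_ps = parallel_series(0.9, n=2, m=2)
print(f"series-parallel: {r_sp:.4f}, parallel-series: {r_ps:.4f}")

# Gap between the two structures for several element reliabilities (m = n = 2).
gaps = {r0: series_parallel(r0, 2, 2) - parallel_series(r0, 2, 2)
        for r0 in (0.5, 0.7, 0.9, 0.99)}
for r0, gap in gaps.items():
    print(f"R0 = {r0:.2f}: S-P minus P-S = {gap:.6f}")
```

For these sample values the gap stays positive and shrinks as the element reliability approaches 1, which is consistent with the behavior described for Fig. 6.12.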
6.2.5 k-Out-of-n Systems

A parallel system operates if at least one element operates. In practice, there are systems that require more than one element to operate in order for the whole system to operate. Such systems are common. A power-generating system that contains four generators operating in a derating mode may require at least two generators to operate in full mode simultaneously to deliver the required power. A web host may be set up with five servers, at least three of which must be operating so that the web service is not disrupted. In a positioning navigation system that has five detectors, a minimum of three operable detectors is needed to determine the location of an object. Systems of this type are commonly referred to as k-out-of-n:G systems, where n is the total number of elements in the system, k is the minimum number of the n elements that must operate for the system to function correctly, and G stands for “good”, meaning success. By this definition, a parallel system is a 1-out-of-n:G system and a series system is an n-out-of-n:G system. In some cases, it can be convenient to define a system in terms of failure. A system is called a k-out-of-n:F system, where F stands for “failure”, if and only if the failure of at least k elements causes the n-element system to fail. According to this definition, a parallel system is an n-out-of-n:F system, and a series system is a 1-out-of-n:F system. Clearly, a k-out-of-n:G system is equivalent to an (n − k + 1)-out-of-n:F system.

Assume that the times to failure of the n elements in a k-out-of-n:G system are independent and identically distributed. Let x be the number of operational elements in the system. Then x is a random variable that follows the binomial distribution. The probability of having exactly k operational elements is

$$\Pr(x = k) = C_n^k R_0^{\,k}(1 - R_0)^{n-k}, \quad k = 0, 1, \ldots, n, \tag{6.21}$$

where $R_0$ is the reliability of an element. Because an operable k-out-of-n:G system requires at least k elements to be operating, the system reliability R is

$$R = \Pr(x \ge k) = \sum_{i=k}^{n} C_n^i R_0^{\,i}(1 - R_0)^{n-i}. \tag{6.22}$$
For k = 1, that is, when the n elements are in parallel, Eq. (6.22) becomes $R = 1 - (1 - R_0)^n$, which is the same as Eq. (6.11). For k = n, that is, when the n elements are in series, Eq. (6.22) becomes $R = R_0^{\,n}$, which is the same as Eq. (6.2). If the time to failure is exponentially distributed, the system reliability is

$$R(t) = \sum_{i=k}^{n} C_n^i e^{-\lambda i t}\left(1 - e^{-\lambda t}\right)^{n-i}, \tag{6.23}$$

where λ is the element failure rate. The mean time to failure of the system is

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}\sum_{i=k}^{n}\frac{1}{i}. \tag{6.24}$$
Equations (6.14) and (6.24) are identical for k = 1.

Example 6.6 [3] A web host has five independent and identical servers connected in parallel. At least three of them must operate for the web service not to fail. The server life is exponentially distributed with $\lambda = 2.7 \times 10^{-5}$ failures per hour. Determine the mean time between failures (MTBF) and the reliability of the web host after one year of continuous operation.

Solution This web host is a 3-out-of-5:G system. If a failed server is repaired quickly to a good-as-new state, the MTBF is the same as the MTTF and can be determined by Eq. (6.24) as

$$\mathrm{MTBF} = \frac{1}{2.7 \times 10^{-5}}\sum_{i=3}^{5}\frac{1}{i} = 2.9 \times 10^4\ \text{hours}.$$

Substituting the given data into Eq. (6.23) gives the reliability of the web host at 8760 h (one year) as

$$R(8760) = \sum_{i=3}^{5} C_5^i\, e^{-2.7 \times 10^{-5} \times 8760\, i}\left(1 - e^{-2.7 \times 10^{-5} \times 8760}\right)^{5-i} = 0.9336.$$

Fig. 6.13 Reliability block diagram equivalent to a 2-out-of-3:G system
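The results of Example 6.6 can be reproduced with a minimal Python sketch of Eqs. (6.22) to (6.24). The function names are illustrative assumptions; `math.comb` supplies the binomial coefficient written as $C_n^i$ in the text.

```python
import math

def k_out_of_n_reliability(k, n, r0):
    """Eq. (6.22): at least k of n identical, independent elements operate."""
    return sum(math.comb(n, i) * r0 ** i * (1.0 - r0) ** (n - i)
               for i in range(k, n + 1))

def k_out_of_n_mttf(k, n, lam):
    """Eq. (6.24): MTTF for exponential elements with common failure rate lam."""
    return sum(1.0 / i for i in range(k, n + 1)) / lam

lam = 2.7e-5               # failures per hour
t = 8760.0                 # one year of continuous operation, in hours
r0 = math.exp(-lam * t)    # reliability of a single server at time t

r_web = k_out_of_n_reliability(3, 5, r0)   # Eq. (6.23) evaluated at t
mtbf = k_out_of_n_mttf(3, 5, lam)

print(f"R(8760 h) = {r_web:.4f}, MTBF = {mtbf:.0f} h")
```

Setting k = 1 or k = n in `k_out_of_n_reliability` recovers the parallel and series results of Eqs. (6.11) and (6.2), which is a convenient sanity check.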
As explained above, a 1-out-of-n:G system is a pure parallel system. In general, a k-out-of-n:G system can be transformed into a parallel system that contains $C_n^k$ paths, each with k different elements. To illustrate this transformation, consider a 2-out-of-3:G system. The equivalent parallel system has $C_3^2 = 3$ parallel paths, and every path has two elements. The reliability block diagram of the parallel system is given in Fig. 6.13. With the notation defined above, the probability of failure of the parallel system can be expressed as

$$F = \Pr\left(\overline{E_1 E_2} \cdot \overline{E_1 E_3} \cdot \overline{E_2 E_3}\right) = \Pr\left[\left(\bar{E}_1 + \bar{E}_2\right) \cdot \left(\bar{E}_1 + \bar{E}_3\right) \cdot \left(\bar{E}_2 + \bar{E}_3\right)\right].$$

Applying the rules of Boolean algebra, this equation becomes

$$F = \Pr\left(\bar{E}_1 \cdot \bar{E}_2 + \bar{E}_1 \cdot \bar{E}_3 + \bar{E}_2 \cdot \bar{E}_3\right). \tag{6.25}$$

The equation shows that the system fails if any of the three events $\bar{E}_1 \cdot \bar{E}_2$, $\bar{E}_1 \cdot \bar{E}_3$ or $\bar{E}_2 \cdot \bar{E}_3$ occurs. Each such event is called a minimal cut set. As shown in Eq. (6.25), a 2-out-of-3:G system has three minimal cut sets, each consisting of two elements. In general, a k-out-of-n:G system has $C_n^{n-k+1}$ minimal cut sets, each containing exactly n − k + 1 elements. Continuing the calculation of the failure probability, Eq. (6.25) can be expanded to

$$F = \Pr\left(\bar{E}_1 \cdot \bar{E}_2\right) + \Pr\left(\bar{E}_1 \cdot \bar{E}_3\right) + \Pr\left(\bar{E}_2 \cdot \bar{E}_3\right) - 2\Pr\left(\bar{E}_1 \cdot \bar{E}_2 \cdot \bar{E}_3\right).$$

Because $E_1$, $E_2$ and $E_3$ are mutually independent, the system reliability can be expressed as

$$R = 1 - F = 1 - (1 - R_1)(1 - R_2) - (1 - R_1)(1 - R_3) - (1 - R_2)(1 - R_3) + 2(1 - R_1)(1 - R_2)(1 - R_3). \tag{6.26}$$

If the elements are identical with reliability $R_0$, Eq. (6.26) becomes

$$R = 1 - (1 + 2R_0)(1 - R_0)^2.$$

This reliability is the same as that obtained from Eq. (6.22). Note that, unlike Eq. (6.22), Eq. (6.26) does not require the elements to be identical in order to determine the system reliability. Therefore, transforming a k-out-of-n:G system into an equivalent parallel system provides an approach for determining the system reliability when the element reliabilities are not equal.
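As a cross-check of Eq. (6.26), the sketch below computes the reliability of a k-out-of-n:G system with non-identical elements by brute-force enumeration of all up/down states and compares it with the closed form for the 2-out-of-3 case. Enumeration is a substitute technique chosen here for simplicity (it is practical only for small n), and the three element reliabilities are illustrative values, not data from the text.

```python
from itertools import product

def k_out_of_n_general(k, reliabilities):
    """Reliability of a k-out-of-n:G system with independent, possibly
    non-identical elements, by enumerating all 2**n up/down states."""
    total = 0.0
    for state in product((0, 1), repeat=len(reliabilities)):
        if sum(state) >= k:               # at least k elements are up
            p = 1.0
            for up, r in zip(state, reliabilities):
                p *= r if up else (1.0 - r)
            total += p
    return total

def two_out_of_three(r1, r2, r3):
    """Eq. (6.26) for a 2-out-of-3:G system."""
    q1, q2, q3 = 1.0 - r1, 1.0 - r2, 1.0 - r3
    return 1.0 - q1 * q2 - q1 * q3 - q2 * q3 + 2.0 * q1 * q2 * q3

rs = (0.95, 0.90, 0.85)                   # illustrative element reliabilities
r_enum = k_out_of_n_general(2, rs)
r_formula = two_out_of_three(*rs)
print(f"enumeration: {r_enum:.6f}, Eq. (6.26): {r_formula:.6f}")
```

The agreement of the two results depends on the coefficient 2 of the triple-failure term in Eq. (6.26), which comes from the inclusion-exclusion expansion of Eq. (6.25).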
6.2.6 Redundant System

A redundant system contains one or more reserve elements or subsystems in its structure. These reserve elements allow the system to continue operating when the major element fails. The system fails only when some or all of the reserve elements also fail. Therefore, redundancy is a system design approach that can improve system reliability. This approach is widely applied in critical systems. For example, a car is equipped with a spare wheel. Whenever a wheel fails, it is replaced with the spare wheel, and the car remains usable. As another example, a power plant contains n active generators and one or more reserve generators. Usually, each of the n generators operates at 100(n − 1)/n percent of its full power, and together they deliver 100% of the power to the final customers, where n − 1 generators can fully meet the power demand. When any one of the active generators fails, the other n − 1 generators make up the power loss and the output power is still 100%. In the meantime, the reserve generator is activated and ramps up to 100(n − 1)/n percent, while the other n − 1 generators ramp back down to 100(n − 1)/n percent.

If a reserve element is fully activated while the system is in use, the redundancy is called active or hot reserve. The parallel and k-out-of-n:G systems described in the preceding subsections are typical examples of active reserve systems. If a reserve element is fully activated only when the major element fails, the redundancy is called passive reserve. While the major element is operating correctly, the reserve element can be kept in reserve. Such an element is said to be in cold reserve. A cold reserve system requires a detection device to detect the failure of the major element and a servo actuator to activate the reserve element when a failure happens.
In the following, the term servo actuator is used to refer to both the detection device and the servo actuator. On the other hand, if the reserve element is partly powered during its inactive time, the redundancy is called warm reserve. A warm reserve element is normally powered at a reduced load level and can fail before it is fully activated. According to the classification above, the spare wheel and the reserve generators described earlier are in cold reserve. We will study cold reserve systems with a reliable or an unreliable servo actuator. Figure 6.14 shows a cold reserve system containing n elements and a servo actuator. In this figure, element 1 is the major element and S is the servo actuator.

Fig. 6.14 Cold reserve system

1. Cold reserve systems with a reliable servo actuator

If the servo actuator is 100% reliable, the system reliability depends only on the n elements. Let $T_i$ denote the time to failure of element i (i = 1, 2, …, n) and T that of the whole system. Then

$$T = \sum_{i=1}^{n} T_i. \tag{6.27}$$
If $T_1, T_2, \ldots, T_n$ are independent and exponentially distributed with failure rate λ, T follows a gamma distribution with parameters n and λ. The probability density function is

$$f(t) = \frac{\lambda^n}{\Gamma(n)}\, t^{n-1} e^{-\lambda t}, \tag{6.28}$$

where Γ is the gamma function. The system reliability is

$$R(t) = \int_t^{\infty} \frac{\lambda^n}{\Gamma(n)}\, \tau^{n-1} e^{-\lambda \tau}\, d\tau = e^{-\lambda t}\sum_{i=0}^{n-1}\frac{(\lambda t)^i}{i!}. \tag{6.29}$$

The mean time to failure of the system, from the gamma distribution, is

$$\mathrm{MTTF} = \frac{n}{\lambda}. \tag{6.30}$$

Alternatively, Eq. (6.30) can also be obtained from Eq. (6.27):

$$\mathrm{MTTF} = E(T) = \sum_{i=1}^{n} E(T_i) = \sum_{i=1}^{n}\frac{1}{\lambda} = \frac{n}{\lambda}.$$

If there is only one reserve element, the system reliability is obtained from Eq. (6.29) by setting n = 2:

$$R(t) = (1 + \lambda t)\, e^{-\lambda t}. \tag{6.31}$$
Example 6.7 [3] A power plant has two identical generators: one is active and the other is in cold reserve. Whenever the operating generator fails, the reserve generator is switched into operation without delay. The lives of the two generators are exponentially distributed with $\lambda = 3.6 \times 10^{-5}$ failures per hour. Determine the power plant reliability at 5000 h and the mean time to failure.

Solution Inserting the data into Eq. (6.31) gives

$$R(5000) = \left(1 + 3.6\times10^{-5}\times 5000\right) e^{-3.6\times10^{-5}\times 5000} = 0.9856.$$

Inserting n = 2 into Eq. (6.30) gives the mean time to failure:

$$\mathrm{MTTF} = \frac{2}{3.6\times10^{-5}} = 5.56\times10^{4}\ \text{hours}.$$
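The calculation in Example 6.7 can be checked with a few lines of Python. This is a minimal sketch of Eqs. (6.29) and (6.30), assuming identical exponential elements and perfect switching; the function names are illustrative and do not come from the text.

```python
import math

def cold_standby_reliability(t, lam, n):
    """Eq. (6.29): n identical exponential elements in cold reserve,
    assuming a 100% reliable servo actuator."""
    x = lam * t
    return math.exp(-x) * sum(x**i / math.factorial(i) for i in range(n))

def cold_standby_mttf(lam, n):
    """Eq. (6.30): mean time to failure of the same system."""
    return n / lam

lam = 3.6e-5  # failures per hour, from Example 6.7
print(round(cold_standby_reliability(5000, lam, 2), 4))  # 0.9856
print(round(cold_standby_mttf(lam, 2)))                  # 55556
```

With n = 1 the same function returns the reliability of a single element, so the benefit of each added reserve element can be compared directly.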
If the n elements are not identical or are not exponentially distributed, determining the system reliability is more complex. Consider the simple case of a cold reserve system with two elements. The system survives time t if either of the following two events occurs:

(1) The major element (whose life is $T_1$) does not fail by time t, that is, $T_1 \ge t$.
(2) The major element fails at time τ (τ < t), and the cold reserve element (whose life is $T_2$) takes over and does not fail in the remaining time (t − τ). In probability terms, this event is $(T_1 < t)\cdot(T_2 \ge t - \tau)$.

Because these two events are mutually exclusive, the system reliability is

$$R(t) = \Pr[(T_1 \ge t) + (T_1 < t)(T_2 \ge t-\tau)] = \Pr(T_1 \ge t) + \Pr[(T_1 < t)(T_2 \ge t-\tau)] = R_1(t) + \int_0^t f_1(\tau)\, R_2(t-\tau)\, d\tau, \tag{6.32}$$
where $R_i$ and $f_i$ are, respectively, the reliability and probability density function of element i. In general, Eq. (6.32) must be evaluated numerically. In the particular case where the two elements are identical and exponentially distributed, Eq. (6.32) reduces to Eq. (6.31).

2. Cold reserve systems with an unreliable servo actuator

A switching system contains a failure detection device and a servo actuator, and can therefore be complex in reality. In practice, it can fail. Consider a two-element cold reserve system. Modifying Eq. (6.32) gives the system reliability

$$R(t) = R_1(t) + \int_0^t R_0(\tau)\, f_1(\tau)\, R_2(t-\tau)\, d\tau, \tag{6.33}$$
where $R_0(\tau)$ is the reliability of the servo actuator at time τ. We suppose that the two elements are identical and exponentially distributed with failure rate λ, and consider two cases: $R_0(\tau)$ is static, or it is dynamic.

For some actuators, such as human operators, the reliability may not vary over time. In this case $R_0(\tau)$ is static, that is, independent of time. Let $R_0(\tau) = p_0$. Then Eq. (6.33) can be written as

$$R(t) = e^{-\lambda t} + p_0 \int_0^t \lambda e^{-\lambda\tau}\, e^{-\lambda(t-\tau)}\, d\tau = (1 + p_0 \lambda t)\, e^{-\lambda t}. \tag{6.34}$$
Note the similarity and difference between Eq. (6.31) for a reliable switching system and Eq. (6.34) for an unreliable one: Eq. (6.34) reduces to Eq. (6.31) when $p_0 = 1$. The mean time to failure of the system is

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\, dt = \frac{1 + p_0}{\lambda}. \tag{6.35}$$
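Since Eqs. (6.32) and (6.33) generally require a numerical approach, the following sketch evaluates the convolution integral with the trapezoidal rule and compares it with the closed form of Eq. (6.34). The function name, the step count and the value $p_0 = 0.9$ are assumptions made for illustration only; any element reliability and density functions can be passed in.

```python
import math

def standby_reliability(t, R1, f1, R2, R0=lambda tau: 1.0, steps=2000):
    """Trapezoidal-rule evaluation of Eq. (6.33):
    R(t) = R1(t) + integral_0^t R0(tau) f1(tau) R2(t - tau) dtau.
    With the default R0 = 1 this is Eq. (6.32)."""
    h = t / steps
    g = lambda tau: R0(tau) * f1(tau) * R2(t - tau)
    integral = 0.5 * (g(0.0) + g(t)) + sum(g(k * h) for k in range(1, steps))
    return R1(t) + h * integral

# Illustrative check: identical exponential elements, static switch reliability.
lam, p0, t = 3.6e-5, 0.9, 5000.0   # p0 = 0.9 is a hypothetical value
R_exp = lambda x: math.exp(-lam * x)
f_exp = lambda x: lam * math.exp(-lam * x)

numeric = standby_reliability(t, R_exp, f_exp, R_exp, R0=lambda tau: p0)
closed = (1 + p0 * lam * t) * math.exp(-lam * t)   # Eq. (6.34)
print(numeric, closed)
```

For non-exponential elements (for example Weibull lives) only the three callables change; the step count may then need to be increased until the result stabilizes.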
Now consider the case where $R_0(\tau)$ is dynamic, that is, dependent on time. Most modern switching systems consist of both hardware and software and have a complex structure. They can fail in various ways before the major element fails. If such a failure happens, the reserve element will never be activated to take over from the failed major element. Because switching systems degrade over time, it is realistic to expect their reliability to be a function of time. If the life of the switching system is exponentially distributed with failure rate $\lambda_0$, then from Eq. (6.33) the reliability of the whole system is

$$R(t) = e^{-\lambda t} + \int_0^t e^{-\lambda_0\tau}\, \lambda e^{-\lambda\tau}\, e^{-\lambda(t-\tau)}\, d\tau = e^{-\lambda t}\left[1 + \frac{\lambda}{\lambda_0}\left(1 - e^{-\lambda_0 t}\right)\right]. \tag{6.36}$$

The mean time to failure is

$$\mathrm{MTTF} = \int_0^{\infty} R(t)\, dt = \frac{1}{\lambda} + \frac{1}{\lambda_0} - \frac{\lambda}{\lambda_0(\lambda + \lambda_0)}. \tag{6.37}$$
As will be shown in Example 6.8, an unreliable switching system decreases the reliability and MTTF of the whole system. To see this more clearly, denote by $r_0$ the ratio of the reliability at time $\lambda^{-1}$ with an unreliable switching system to that with a reliable one, by $r_1$ the ratio of the MTTF with an unreliable switching system to that with a reliable one, and by δ the ratio of λ to $\lambda_0$. Then, from Eqs. (6.31) and (6.36),

$$r_0 = 0.5\left[1 + \delta\left(1 - e^{-1/\delta}\right)\right]. \tag{6.38}$$

From Eq. (6.30) with n = 2 and Eq. (6.37),

$$r_1 = 0.5\left[1 + \delta\left(1 - \frac{\delta}{1+\delta}\right)\right]. \tag{6.39}$$
Figure 6.15 shows $r_0$ and $r_1$ for different values of δ. It indicates that the unreliability of the switching system has a greater effect on the MTTF than on the reliability of the whole system. Both quantities decrease considerably when $\lambda_0$ is greater than 10% of λ. The effect weakens as $\lambda_0$ decreases, and becomes nearly negligible when $\lambda_0$ is less than 1% of λ.

Fig. 6.15 Graphs of $r_0$ and $r_1$ for various values of δ

Example 6.8 [3] Refer to Example 6.7. Assume that the switching system can fail, with an exponentially distributed life and $\lambda_0 = 2.8 \times 10^{-5}$ failures per hour. Determine the power plant reliability at 5000 h and the mean time to failure.

Solution Inserting the data into Eq. (6.36) gives

$$R(5000) = e^{-3.6\times10^{-5}\times5000}\left[1 + \frac{3.6\times10^{-5}}{2.8\times10^{-5}}\left(1 - e^{-2.8\times10^{-5}\times5000}\right)\right] = 0.9756.$$

The mean time to failure is obtained from Eq. (6.37):

$$\mathrm{MTTF} = \frac{1}{3.6\times10^{-5}} + \frac{1}{2.8\times10^{-5}} - \frac{3.6\times10^{-5}}{2.8\times10^{-5}\left(3.6\times10^{-5} + 2.8\times10^{-5}\right)} = 4.34\times10^{4}\ \text{hours}.$$

Comparing these results with those of Example 6.7 shows the harmful effect of the unreliable switching system.
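The numbers of Example 6.8 and the ratios of Eqs. (6.38) and (6.39) can be reproduced with the short sketch below. It assumes exponential lives for both the elements and the switching system, as in the text; the function names are illustrative.

```python
import math

def reliability_unreliable_switch(t, lam, lam0):
    """Eq. (6.36): two identical elements, exponential switch life."""
    return math.exp(-lam * t) * (1 + lam / lam0 * (1 - math.exp(-lam0 * t)))

def mttf_unreliable_switch(lam, lam0):
    """Eq. (6.37)."""
    return 1 / lam + 1 / lam0 - lam / (lam0 * (lam + lam0))

def ratio_r0(delta):
    """Eq. (6.38), with delta = lam / lam0."""
    return 0.5 * (1 + delta * (1 - math.exp(-1 / delta)))

def ratio_r1(delta):
    """Eq. (6.39), with delta = lam / lam0."""
    return 0.5 * (1 + delta * (1 - delta / (1 + delta)))

lam, lam0 = 3.6e-5, 2.8e-5   # failures per hour, Examples 6.7 and 6.8
print(round(reliability_unreliable_switch(5000, lam, lam0), 4))  # 0.9756
print(round(mttf_unreliable_switch(lam, lam0)))                   # 43403
print(ratio_r0(lam / lam0), ratio_r1(lam / lam0))
```

Evaluating `ratio_r0` and `ratio_r1` over a range of δ gives the two curves described for Fig. 6.15.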
6.2.7 Reliability Evaluation of Complex Systems

So far, we have analyzed series, parallel, series–parallel, parallel-series, k-out-of-n, and reserve systems. In practice, these structures are often combined into complex systems in order to meet operational requirements. Some systems, such as power supply grids and telecommunication systems, are so complex that they cannot easily be divided into the typical structures described above. Reliability evaluation of complex systems requires more advanced approaches. Here we explain three simple but powerful ones. For large-scale complex systems, manual calculation of reliability is difficult. Various software programs can calculate the reliability and other reliability indexes of complex systems by simulation.

1. Reduction approach

Some systems consist of independent series, parallel, series–parallel, parallel-series, k-out-of-n, and reserve subsystems. The reduction approach collapses such a system step by step into these subsystems, each represented by an equivalent reliability block. The reliability block diagram is simplified repeatedly until the whole system is represented by a single reliability block. The approach is illustrated by the following example.

Example 6.9 [3] The reliability block diagram of a technical system is shown in Fig. 6.16. The times to failure of the elements are exponentially distributed, with the failure rates shown in the corresponding blocks in units of $10^{-4}$ failures per hour. Determine the system reliability at 600 h of operation.

Fig. 6.16 Technical system for Example 6.9

Solution The stages for determining the system reliability are:

(1) Divide the system into blocks A, B, C, and D, which correspond to a parallel-series, a parallel, a series, and a cold reserve subsystem, respectively, as shown in Fig. 6.17.

Fig. 6.17 Simplified system equivalent to Fig. 6.16

(2) Determine the reliabilities of blocks A, B, C, and D. By Eq. (6.19), the reliability of block A is

$$R_A = 1 - (1 - R_1R_2)(1 - R_3R_4) = 1 - \left(1 - e^{-(1.2+2.3)\times10^{-4}\times600}\right)\left(1 - e^{-(0.9+1.6)\times10^{-4}\times600}\right) = 0.9736.$$

By Eq. (6.10), the reliability of block B is

$$R_B = 1 - (1 - R_5)(1 - R_6) = 1 - \left(1 - e^{-4.8\times10^{-4}\times600}\right)\left(1 - e^{-3.3\times10^{-4}\times600}\right) = 0.955.$$

By Eq. (6.1), the reliability of block C is

$$R_C = e^{-(1.7+2.5)\times10^{-4}\times600} = 0.7772.$$

By Eq. (6.31), the reliability of block D is

$$R_D = \left(1 + 4.3\times10^{-4}\times600\right) e^{-4.3\times10^{-4}\times600} = 0.9719.$$
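The equivalent reliability block diagram is shown in Fig. 6.17.

(3) The equivalent system in Fig. 6.18 contains two elements in series, E and F. Their reliabilities, 0.9462 and 0.99, agree with $R_E = R_A R_D = 0.9736 \times 0.9719$ and $R_F = 1 - (1 - R_B)(1 - R_C) = 1 - (1 - 0.955)(1 - 0.7772)$ computed from the block reliabilities above. The system is then reduced to one block, G.

(4) Determine the reliability of block G:

$$R_G = R_E R_F = 0.9462 \times 0.99 = 0.9367.$$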
Fig. 6.18 Simplified system equivalent to Fig. 6.17

Fig. 6.19 Simplified system equivalent to Fig. 6.18
Fig. 6.20 Bridge system structure
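The reduction of Example 6.9 can be sketched in Python with two small helpers for series and parallel blocks. The block diagrams are not reproduced here, so the grouping of blocks E (A and D in series) and F (B and C in parallel) is an assumption inferred from the numerical values 0.9462 and 0.99 in the example; the helper names are illustrative.

```python
import math

def series(*rs):
    """Reliability of independent blocks in series."""
    return math.prod(rs)

def parallel(*rs):
    """Reliability of independent blocks in active parallel."""
    return 1 - math.prod(1 - r for r in rs)

def exp_rel(rate, t):
    """Reliability of an exponential element."""
    return math.exp(-rate * t)

def cold_standby_pair(rate, t):
    """Eq. (6.31): two identical elements, perfect switching."""
    return (1 + rate * t) * math.exp(-rate * t)

t = 600.0
u = 1e-4  # failure rates are given in units of 1e-4 failures per hour

R_A = parallel(series(exp_rel(1.2 * u, t), exp_rel(2.3 * u, t)),
               series(exp_rel(0.9 * u, t), exp_rel(1.6 * u, t)))
R_B = parallel(exp_rel(4.8 * u, t), exp_rel(3.3 * u, t))
R_C = series(exp_rel(1.7 * u, t), exp_rel(2.5 * u, t))
R_D = cold_standby_pair(4.3 * u, t)

# Assumed grouping, inferred from the numbers in the example:
R_E = series(R_A, R_D)
R_F = parallel(R_B, R_C)
R_G = series(R_E, R_F)
print(R_A, R_B, R_C, R_D, R_G)
```

Carrying full precision through every step gives a system reliability of about 0.9368; the value 0.9367 in the example results from rounding the intermediate block reliabilities.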
Now that the initial system has been reduced to one element, as shown in Fig. 6.19, the reduction process is finished. The system reliability is $R = R_G = 0.9367$.

2. Decomposition approach

The reduction approach is effective for a complex system that can be divided into simple subsystems whose reliabilities can be obtained directly. Some systems are more complex, like the well-known bridge system shown in Fig. 6.20, and cannot be analyzed by the reduction approach. Here we explain the decomposition approach, also known as the conditional probability approach or Bayes' theorem approach.

The decomposition approach begins by selecting a key-element, say A, from the system being analyzed. This is an element that appears to hold the system together. In Fig. 6.20, for example, element 5 is such a key-element. The key-element is first supposed to be 100% reliable and is replaced by a line in the structure. Then the same element is assumed to have failed and is removed from the structure. The system reliability is determined under each assumption. According to the law of total probability, the reliability of the initial system can be expressed as

$$R = \Pr(\text{system good} \mid A)\Pr(A) + \Pr(\text{system good} \mid \bar{A})\Pr(\bar{A}), \tag{6.40}$$

where A is the event that key-element A is 100% reliable, $\bar{A}$ the event that key-element A has failed, $\Pr(\text{system good} \mid A)$ the probability that the system is operable given that element A never fails, and $\Pr(\text{system good} \mid \bar{A})$ the probability that the system is operable given that element A has failed. The efficiency of the approach depends on the choice of the key-element. A good choice leads to an efficient determination of the conditional probabilities.
Fig. 6.21 Bridge system structure for the case when element 5 never fails
Example 6.10 [3] Consider the bridge system in Fig. 6.20. Assume that the reliability of element i is $R_i$, i = 1, 2, …, 5. Determine the system reliability.

Solution Element 5 is selected as the key-element, denoted A. Suppose that it never fails and is replaced by a line in the structure. The system then simplifies as shown in Fig. 6.21. The simplified system has a series–parallel structure, and the conditional reliability is

$$\Pr(\text{system good} \mid A) = [1 - (1 - R_1)(1 - R_3)][1 - (1 - R_2)(1 - R_4)].$$

The next stage is to suppose that element 5 has failed and is removed from the structure. Figure 6.22 shows the resulting parallel-series structure. The conditional reliability is

$$\Pr(\text{system good} \mid \bar{A}) = 1 - (1 - R_1R_2)(1 - R_3R_4).$$

The reliability and unreliability of element 5 are $\Pr(A) = R_5$ and $\Pr(\bar{A}) = 1 - R_5$, respectively. Inserting these expressions into Eq. (6.40) gives the reliability of the initial system:

$$\begin{aligned}
R &= [1 - (1 - R_1)(1 - R_3)][1 - (1 - R_2)(1 - R_4)]R_5 + [1 - (1 - R_1R_2)(1 - R_3R_4)](1 - R_5) \\
&= R_1R_2 + R_3R_4 + R_1R_4R_5 + R_2R_3R_5 - R_1R_2R_4R_5 - R_2R_3R_4R_5 - R_1R_2R_3R_5 \\
&\quad - R_1R_3R_4R_5 - R_1R_2R_3R_4 + 2R_1R_2R_3R_4R_5.
\end{aligned} \tag{6.41}$$
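As a cross-check of Example 6.10, the sketch below computes the bridge reliability once by decomposition on element 5, Eq. (6.40), and once by brute-force enumeration of all $2^5$ element states. The set of working paths (1-2, 3-4, 1-5-4 and 3-5-2) is inferred from the two conditional structures given in the example, since the figure itself is not reproduced here; the function names are illustrative.

```python
from itertools import product

def bridge_by_decomposition(R1, R2, R3, R4, R5):
    """Eq. (6.40) with element 5 as the key-element."""
    good_given_5_up = (1 - (1 - R1) * (1 - R3)) * (1 - (1 - R2) * (1 - R4))
    good_given_5_down = 1 - (1 - R1 * R2) * (1 - R3 * R4)
    return good_given_5_up * R5 + good_given_5_down * (1 - R5)

def bridge_by_enumeration(R1, R2, R3, R4, R5):
    """Sum the probabilities of all element states in which at least one
    path (1-2, 3-4, 1-5-4 or 3-5-2) is intact."""
    rel = (R1, R2, R3, R4, R5)
    total = 0.0
    for state in product((0, 1), repeat=5):
        e1, e2, e3, e4, e5 = state
        works = (e1 and e2) or (e3 and e4) or (e1 and e5 and e4) or (e3 and e5 and e2)
        if works:
            p = 1.0
            for up, r in zip(state, rel):
                p *= r if up else 1 - r
            total += p
    return total

print(bridge_by_decomposition(0.9, 0.9, 0.9, 0.9, 0.9))
print(bridge_by_enumeration(0.9, 0.9, 0.9, 0.9, 0.9))
```

Enumeration grows as $2^m$ with the number of elements m, which is why decomposition on a well-chosen key-element is preferred for hand calculation.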
Fig. 6.22 Bridge system structure for the case when element 5 has failed

As shown in Example 6.10, determining the reliability of the bridge system requires only one key-element, and Eq. (6.40) is applied once. For some complex systems, the reliabilities of the decomposed systems cannot be obtained directly. In such cases, additional key-elements can be selected and Eq. (6.40) applied repeatedly until every term in the equation is easily obtainable. For example, if $\Pr(\text{system good} \mid A)$ cannot be obtained immediately, the decomposed system with A operational can be decomposed further by choosing an additional key-element, say B. Applying Eq. (6.40) to key-element B, the reliability of the initial system can be expressed as

$$R = \Pr(\text{system good} \mid AB)\Pr(A)\Pr(B) + \Pr(\text{system good} \mid A\bar{B})\Pr(A)\Pr(\bar{B}) + \Pr(\text{system good} \mid \bar{A})\Pr(\bar{A}). \tag{6.42}$$

The decomposition approach described above chooses one key-element at a time. For some complex networks, however, several key-elements can be chosen at the same time. For example, if two key-elements, say A and B, are chosen, the initial system is decomposed into four subsystems under the conditions $AB$, $\bar{A}B$, $A\bar{B}$, and $\bar{A}\bar{B}$, respectively. Here $AB$ is the event that both A and B are operable, $\bar{A}B$ the event that A is not operable and B is operable, $A\bar{B}$ the event that A is operable and B is not operable, and $\bar{A}\bar{B}$ the event that neither A nor B is operable. By the law of total probability, the reliability of the initial system can be expressed as

$$\begin{aligned}
R &= \Pr(\text{system good} \mid AB)\Pr(A)\Pr(B) + \Pr(\text{system good} \mid \bar{A}B)\Pr(\bar{A})\Pr(B) \\
&\quad + \Pr(\text{system good} \mid A\bar{B})\Pr(A)\Pr(\bar{B}) + \Pr(\text{system good} \mid \bar{A}\bar{B})\Pr(\bar{A})\Pr(\bar{B}).
\end{aligned} \tag{6.43}$$

Equation (6.43) has four terms. In general, for binary elements, if m key-elements are chosen at the same time, the reliability equation consists of $2^m$ terms. Each term is the product of the reliability of one of the decomposed subsystems and the probability of the condition under which that subsystem is formed.

3. Minimal cut set approach

The decomposition approach explained above is based on the law of total probability. Here we consider an approach to system reliability evaluation based on minimal cut sets and the inclusion–exclusion principle. First, let us define cut sets.
A cut set is a set of elements whose failure cuts off all links between the input and output points, and therefore causes the whole system to fail. In Fig. 6.20, for example, (1, 3, 5) and (2, 4) are cut sets. Some cut sets contain unnecessary elements: if these are removed, failure of the remaining elements still results in system failure. In this example, cut set (1, 3, 5) contains element 5, which can be removed from the cut set without changing the failure condition of the system. Such cut sets can be reduced to minimal cut sets. A minimal cut set is the smallest set of elements whose failures are necessary and sufficient to cause system failure. If any element is removed from the set, the remaining elements no longer form a cut set. Because each minimal cut set causes the system to fail, the event that the system fails is the union of all minimal cut set events. The system reliability can then be expressed as

$$R = 1 - \Pr(C_1 + C_2 + \cdots + C_n), \tag{6.44}$$
where $C_i$ (i = 1, 2, …, n) is the event that the elements in minimal cut set i are all in the failed state, and n is the total number of minimal cut sets. Equation (6.44) can be evaluated by the inclusion–exclusion principle:

$$\Pr(C_1 + C_2 + \cdots + C_n) = \sum_{i=1}^{n}\Pr(C_i) - \sum_{i<j=2}^{n}\Pr(C_iC_j) + \sum_{i<j<k=3}^{n}\Pr(C_iC_jC_k) + \cdots + (-1)^{n-1}\Pr(C_1C_2\cdots C_n).$$

… c, where r is the number of failures in the test and c is the critical value. Since every test element has a binary outcome (i.e., either success or failure), r has a binomial distribution given by

$$P(r) = C_n^r F^r (1-F)^{n-r}, \quad r = 0, 1, \ldots, n,$$

where F is the probability of failure. The probability that the number of failures r is less than or equal to the critical value c is

$$\Pr(r \le c) = \sum_{i=0}^{c} C_n^i F^i (1-F)^{n-i}. \tag{8.1}$$
It is desirable to hold the customer's risk to at most 1 − C when $F = 1 - R_L$. Therefore,

$$\Pr(r \le c \mid F = 1 - R_L) \le 1 - C. \tag{8.2}$$

Combining Eqs. (8.1) and (8.2) yields

$$\sum_{i=0}^{c} C_n^i (1-R_L)^i R_L^{\,n-i} \le 1 - C. \tag{8.3}$$

If c, $R_L$, and C are given, Eq. (8.3) can be solved for the minimal sample size. For c = 0, which is the case in zero-failure testing, Eq. (8.3) simplifies to

$$R_L^{\,n} \le 1 - C. \tag{8.4}$$

Fig. 8.7 Sample sizes for different values of C and $R_L$

From Eq. (8.4), the minimal sample size is

$$n = \frac{\ln(1-C)}{\ln(R_L)}. \tag{8.5}$$
If a sample of size n (the minimal sample size) yields zero failures in the test, it is concluded that the product achieves the required reliability $R_L$ at a 100C% confidence level. Figure 8.7 shows the minimal sample sizes for different values of C and $R_L$. The sample size grows with the required reliability at a given confidence level, and with the confidence level at a given required reliability. It grows sharply as the required reliability approaches 1.

Example 8.1 [5] Find the minimal sample size needed to demonstrate R90/C90, which in industry usually means 90% reliability at a 90% confidence level. What is the minimal sample size for verifying R99/C90?

Solution The minimal sample size for verifying R90/C90 is

$$n = \frac{\ln(1-0.9)}{\ln(0.9)} \approx 21.9,$$

which is rounded up to n = 22. If R = 99% and C = 90%, the minimal sample size is n = 230.

Sometimes the lower reliability limit demonstrated by testing a sample of size n is of interest. If no failures occur in $t_L$, the lower limit on reliability at a 100C% confidence level follows from Eq. (8.5):

$$R_L = (1-C)^{1/n}. \tag{8.6}$$
Example 8.2 [5] A random sample of 30 elements was tested for 15 000 cycles and produced no failures. Determine the lower 90% confidence limit on reliability.

Solution By Eq. (8.6), the lower 90% confidence limit on reliability is

$$R_L = (1-0.9)^{1/30} = 0.926.$$

Note that this reliability applies at 15 000 cycles under the test conditions.
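Examples 8.1 and 8.2 can be reproduced with the sketch below. It searches for the smallest n that satisfies Eq. (8.3), which for c = 0 coincides with Eq. (8.5) rounded up, and evaluates Eq. (8.6). The function names are illustrative; the search over n is one simple way to solve Eq. (8.3), not necessarily the method used for Fig. 8.7.

```python
import math

def min_sample_size(R_L, C, c=0):
    """Smallest n satisfying Eq. (8.3) for at most c allowed failures."""
    n = c + 1
    while sum(math.comb(n, i) * (1 - R_L) ** i * R_L ** (n - i)
              for i in range(c + 1)) > 1 - C:
        n += 1
    return n

def lower_reliability_limit(n, C):
    """Eq. (8.6): lower 100C% confidence limit after n units pass with zero failures."""
    return (1 - C) ** (1 / n)

print(min_sample_size(0.90, 0.90))                   # 22  (R90/C90)
print(min_sample_size(0.99, 0.90))                   # 230 (R99/C90)
print(round(lower_reliability_limit(30, 0.90), 3))   # 0.926
```

Passing c = 1 or more shows how quickly the required sample size grows when failures are allowed in the test.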
8.3.2 Weibull Zero-Failure Testing

We have seen above that the minimal sample size becomes too large to be practical when a high reliability is to be verified. As Example 8.1 shows, 230 elements are needed to demonstrate 99% reliability at a 90% confidence level. The sample size can be reduced if some information about the product life is available from past experience. Assume that the product life has a Weibull distribution with scale parameter θ and shape parameter b, and that b is known. The task is still to demonstrate the lower limit reliability $R_L$ at a 100C% confidence level. To conduct the test, a sample of size $n_0$ is drawn at random and subjected to zero-failure testing for a specified period of time $t_0$. The reliability at $t_0$ is

$$R(t_0) = \exp\left[-\left(\frac{t_0}{\theta}\right)^b\right]. \tag{8.7}$$

The probability that the sample of size $n_0$ yields zero failures is obtained from Eq. (8.1) as

$$\Pr(r = 0) = \exp\left[-n_0\left(\frac{t_0}{\theta}\right)^b\right]. \tag{8.8}$$

Likewise, a sample of size n tested for $t_L$ without failures has

$$\Pr(r = 0) = \exp\left[-n\left(\frac{t_L}{\theta}\right)^b\right]. \tag{8.9}$$

Equating Eqs. (8.8) and (8.9) yields

$$n_0 = n\kappa^{-b}, \tag{8.10}$$

where $\kappa = t_0/t_L$ is called the zero-failure ratio. Inserting Eq. (8.5) into Eq. (8.10) gives

$$n_0 = \frac{\ln(1-C)}{\kappa^b \ln(R_L)}. \tag{8.11}$$
Equation (8.11) reduces to Eq. (8.5) when the zero-failure ratio equals 1, and shows that the sample size can be reduced by increasing the zero-failure ratio (i.e., extending the test duration). The size of the reduction depends on the value of b: the higher this value, the greater the reduction. Table 8.1 lists the sample sizes for various values of $R_L$, C, κ, and b.

Equation (8.11) can also be obtained in another way, due partly to C. Assume that a sample of size $n_0$ is tested for time $t_0$ without failures. The lower 100C% confidence limit on the Weibull scale parameter θ is

$$\theta_L = \left(\frac{2 n_0 t_0^b}{\chi^2_{C,2}}\right)^{1/b}, \tag{8.12}$$

where $\chi^2_{C,2}$ is the 100C-th percentile of the $\chi^2$ distribution with 2 degrees of freedom. The lower limit on reliability at $t_L$ is

$$R_L = \exp\left[-\left(\frac{t_L}{\theta_L}\right)^b\right] = \exp\left(-\frac{t_L^b\,\chi^2_{C,2}}{2 t_0^b n_0}\right). \tag{8.13}$$

Let $\kappa = t_0/t_L$. Then the minimal sample size can be expressed as

$$n_0 = -\frac{\chi^2_{C,2}}{2\kappa^b \ln(R_L)}. \tag{8.14}$$
Since $\chi^2_{C,2} = -2\ln(1-C)$, Eq. (8.14) reduces to Eq. (8.11).

Example 8.3 [5] An engineer wants to demonstrate that a sensor achieves a lower 90% confidence limit reliability of 95% at 15 000 cycles. Analysis of past data has shown that the life distribution is approximately Weibull with a shape parameter between 1.5 and 2. The engineer wants to reduce the sample size by testing the sensors for 33 000 cycles. Find the minimal sample size.

Solution The zero-failure ratio is κ = 33 000/15 000 = 2.2. Assume that the shape parameter is 1.5. With $R_L$ = 0.95, C = 0.9, and b = 1.5, Table 8.1 gives a sample size of 16 for κ = 2 and 12 for κ = 2.5. Linear interpolation yields a required sample size of 14. Direct calculation by Eq. (8.11) also gives $n_0$ = 14. The zero-failure test therefore consists of testing 14 sensors for 33 000 cycles. If no failures occur, a reliability of 95% at 15 000 cycles is demonstrated at a 90% confidence level.

Example 8.4 [5] Refer to Example 8.3. Assume that the maximum permissible sample size is 10. Determine the required test duration.

Solution From Eq. (8.11), the zero-failure ratio is

$$\kappa = \left[\frac{\ln(1-C)}{n_0 \ln(R_L)}\right]^{1/b} = \left[\frac{\ln(1-0.9)}{10\ln(0.95)}\right]^{1/1.5} = 2.72.$$
Table 8.1 Sample size for zero-failure testing of a Weibull distribution [5] b
κ
1.25 1
1.5
100C = 90
100C = 95
100RL
100RL
90 92.5 95 97.5 99
90 92.5 95 97.5 99
90 92.5 95 97.5 99
16 21
32 64
161 22 30
45 91
230 29 39
59 119
299
1.5 10 13
19 39
97 14 18
28 55
139 18 24
36
72
180 126
2
7
9
14 27
68 10 13
19 39
97 12 17
25
50
2.5
5
7
10 21
51
7 10
15 29
73 10 13
19
38
95
3
4
6
8 17
41
6
8
12 24
59
15
30
76
8 10
3.5
4
5
7 14
34
5
7
10 19
48
6
9
13
25
63
4
3
4
6 12
29
4
6
8 17
41
6
7
11
21
53
16 21
32 64
161 22 30
45 91
230 29 39
59 119
299
1 1.5
9 12
18 35
88 12 17
25 50
125 16 21
32
65
163
2
6
8
12 23
57
8 11
16 33
82 11 14
21
42
106
2.5
4
6
8 17
41
6
12 24
58
8 10
15
30
76
8
3
3
4
7 13
31
5
6
9 18
45
6
8
12
23
58
3.5
3
4
5 10
25
4
5
7 14
35
5
6
9
19
46
4
2
3
4
21
3
4
6 12
29
4
5
8
15
38
1.75 1
8
16 21
32 64
161 22 30
45 91
230 29 39
59 119
299
1.5
8 11
16 32
79 11 15
23 45
113 14 19
29
59
147
2
5
10 19
48
9
14 28
9 12
18
36
89
7
7
69
2.5
4
5
7 13
33
5
6
10 19
47
6
8
12
24
60
3
3
4
5 10
24
4
5
7 14
34
5
6
9
18
44
3.5
2
3
4
8
18
3
4
6 11
26
4
5
7
14
34
2
2
3
6
15
2
3
21
3
4
6
4 2
100C = 80 100RL
1 1.5
11
27
16 21
32 64
161 22 30
45 91
4
9
230 29 39
59 119
299
7 10
14 29
72 10 14
20 41
102 13 18
26
133
53
2
4
6
8 16
41
6
8
12 23
58
8 10
15
30
75
2.5
3
4
6 11
26
4
5
8 15
37
5
7
10
19
48
3
2
3
4
18
3
4
5 11
26
4
5
7
14
34
8
3.5
2
2
3
6
14
2
3
4
8
19
3
4
5
10
25
4
1
2
2
4
11
2
2
3
6
15
2
3
4
8
19
59 119
299 120
2.25 1
16 21
32 64
161 22 30
45 91
230 29 39
1.5
7
9
13 26
65
9 12
19 37
93 12 16
24
48
2
4
5
7 14
34
5
7
10 20
49
6
9
13
25
63
2.5
2
3
4
21
3
4
6 12
30
4
5
8
16
38
9
3
2
2
3
6
14
2
3
4
8
20
3
4
5
10
26
3.5
1
2
2
4
10
2
2
3
6
14
2
3
4
8
18
4
1
1
2
3
8
1
2
2
5
11
2
2
3
6
14
(continued)
Table 8.1 (continued) b
2.5
κ
1 1.5
100C = 90
100RL
100RL
100RL
90 92.5 95 97.5 99
90 92.5 95 97.5 99
90 92.5 95 97.5 99
16 21 6
8
32 64 12 24
161 22 30 59
8 11
100C = 95
45 91
230 29 39
59 119
299
17 34
84 11 14
22
109
43
2
3
4
6 12
29
4
6
8 17
41
6
7
11
21
53
2.5
2
3
4
7
17
3
3
5 10
24
3
4
6
12
31
3
1
2
3
5
11
2
2
3
6
15
2
3
4
8
20
3.5
1
1
2
3
7
1
2
2
4
10
2
2
3
6
14
4
1
1
1
2
6
1
1
2
3
8
1
2
2
4
10
59 119
299
2.75 1
3
100C = 80
16 21
32 64
161 22 30
45 91
1.5
6
7
11 21
53
8 10
20
39
98
2
3
4
5 10
24
4
5
7 14
35
5
6
9
18
45
2.5
2
2
3
13
2
3
4
8
19
3
4
5
10
24
6
15 30
230 29 39 76 10 13
3
1
2
2
4
8
2
2
3
5
12
2
2
3
6
15
3.5
1
1
2
3
6
1
1
2
3
8
1
2
2
4
10
4
1
1
1
2
4
1
1
1
3
6
1
1
2
3
7
59 119
299
1
16 21
1.5
5
7
2
2
3
32 64 10 19 4
161 22 30
45 91
230 29 39
48
7
9
14 27
68
9 12
8
21
3
4
6 12
29
4
5
18
36
89
8
15
38
2.5
1
2
3
5
11
2
2
3
6
15
2
3
4
8
20
3
1
1
2
3
6
1
2
2
4
9
2
2
3
5
12
3.5
1
1
1
2
4
1
1
2
3
6
1
1
2
3
7
4
1
1
1
1
3
1
1
1
2
4
1
1
1
2
5
The required test duration is $t_0 = \kappa t_L$ = 2.72 × 15 000 = 40 800 cycles.

As Examples 8.3 and 8.4 show, the reduction in sample size is achieved at the expense of increased test time. In many cases it is impractical to extend a test. Alternatively, elevating the test stress levels is possible and reasonable. If the acceleration factor $A_f$ between the elevated and operational stress levels is known, the actual test time $t_a$ is

$$t_a = \frac{t_0}{A_f}. \tag{8.15}$$
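Equations (8.11) and (8.15) are easy to evaluate directly, which also makes it possible to reproduce individual entries of Table 8.1. The sketch below is a minimal illustration with assumed function names; sample sizes are rounded up to whole units, as in Examples 8.1 and 8.3.

```python
import math

def weibull_zero_failure_n(R_L, C, kappa, b):
    """Eq. (8.11), rounded up to a whole number of test units."""
    return math.ceil(math.log(1 - C) / (kappa ** b * math.log(R_L)))

def zero_failure_ratio(n0, R_L, C, b):
    """Eq. (8.11) solved for kappa = t0 / tL (Example 8.4)."""
    return (math.log(1 - C) / (n0 * math.log(R_L))) ** (1 / b)

def accelerated_test_time(t0, A_f):
    """Eq. (8.15): actual test time under elevated stress."""
    return t0 / A_f

# Example 8.3: R_L = 0.95, C = 0.9, b = 1.5, kappa = 33000 / 15000
print(weibull_zero_failure_n(0.95, 0.90, 33000 / 15000, 1.5))   # 14

# Example 8.4: at most 10 units available
kappa = zero_failure_ratio(10, 0.95, 0.90, 1.5)
print(round(kappa, 2), kappa * 15000)
```

With unrounded κ the required duration comes out slightly above the 40 800 cycles quoted in Example 8.4, which uses κ rounded to 2.72.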
8.4 Life Tests

Series (sequential) life testing tests one object at a time until it fails or until a planned period of time has elapsed. Whenever a new result becomes available, an evaluation is made to decide whether

(1) The specified reliability is demonstrated.
(2) The specified reliability is not demonstrated.
(3) The test should be continued.

From a statistical point of view, series life testing is a concept (hypothesis) testing procedure in which the test results are re-evaluated each time a new result becomes available and then compared against the decision rules. When the dismissal or acceptance rules are satisfied, the test is stopped and a conclusion is reached. Otherwise, the test is continued. The sample size needed to reach a conclusion is therefore a random quantity and cannot be determined in advance. Because of its sequential nature, the approach requires fewer samples than a zero-failure test.
8.4.1 Theoretical Concept

Consider the concepts (hypotheses)

$$H_0: \theta = \theta_0, \qquad H_1: \theta = \theta_1,$$

where θ is a parameter of the life distribution (e.g., an exponential MTTF or a Weibull scale parameter) and $\theta_0$ and $\theta_1$ are the values specified for θ. Broadly, $\theta_0$ corresponds to the upper limit of the reliability requirement, above which the batch of products should be accepted; $\theta_1$ corresponds to the lower limit of the reliability requirement, below which the batch should be dismissed. The ratio

$$d = \frac{\theta_0}{\theta_1} \tag{8.16}$$
is named the discrimination ratio.

Let X be a random variable with the PDF f(x; θ). Assume that a series life test produces $x_1, x_2, \ldots, x_n$, which are n independent observations of X. The probability (likelihood) of the n results is

$$P(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta). \tag{8.17}$$

We define the ratio of the probability at $\theta_1$ to that at $\theta_0$ as

$$RP_n = \frac{P(x_1, x_2, \ldots, x_n; \theta_1)}{P(x_1, x_2, \ldots, x_n; \theta_0)}. \tag{8.18}$$
$RP_n$ is named the probability (likelihood) ratio, since the probability of the sample is the joint PDF of the sample, as shown in Eq. (8.17). Given a data set $x_1, x_2, \ldots, x_n$, the probability depends only on θ. The maximum probability (maximum likelihood) principle states that the probability is maximized when θ takes its true value. We can therefore expect that a value of θ closer to the true one gives a higher probability. By the same reasoning, if $\theta_0$ is closer to the true value of θ than $\theta_1$, then $P(x_1, x_2, \ldots, x_n; \theta_0)$ is greater than $P(x_1, x_2, \ldots, x_n; \theta_1)$, and $RP_n$ is less than 1. $RP_n$ becomes smaller as $\theta_0$ approaches, and $\theta_1$ departs from, the true value. It is reasonable to choose a limit, say A, such that if $RP_n \le A$, we accept $H_0$. Likewise, we can choose a limit, say B, such that if $RP_n \ge B$, we dismiss $H_0$. If $RP_n$ lies between the limits, we can neither accept nor dismiss $H_0$; the test should then be continued to produce more results. The decision rules are:

(1) Accept $H_0$ if $RP_n \le A$.
(2) Dismiss $H_0$ if $RP_n \ge B$.
(3) Take one more element and continue the test if $A < RP_n < B$.

From the decision rules above and the definitions of type 1 and type 2 mistakes, the limits are

$$A = \frac{\beta}{1-\alpha}, \tag{8.19}$$

$$B = \frac{1-\beta}{\alpha}, \tag{8.20}$$
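where α is the probability of a type 1 mistake (type I error, the manufacturer's risk) and β is the probability of a type 2 mistake (type II error, the customer's risk). In many cases it is more convenient to use the log probability ratio for the calculation:

$$\ln(RP_n) = \sum_{i=1}^{n} \ln\left[\frac{f(x_i; \theta_1)}{f(x_i; \theta_0)}\right]. \tag{8.21}$$

The continue-test region then becomes

$$\ln\left(\frac{\beta}{1-\alpha}\right) < \ln(RP_n) < \ln\left(\frac{1-\beta}{\alpha}\right). \tag{8.22}$$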
Note that the true probabilities of the two types of mistake are not exactly equal to the specified values of α and β. The true values are hard to determine, but they are bounded by

$$\alpha' \le \frac{1}{B} \quad \text{and} \quad \beta' \le A,$$

where α′ and β′ are the true values of α and β, respectively. For example, if a test specifies α = 0.1 and β = 0.05, the true values are bounded by α′ ≤ 0.105 and β′ ≤ 0.056. The upper bounds are only slightly higher than the specified values. In general, the maximum relative error of α′ with respect to α is

$$\frac{\alpha' - \alpha}{\alpha} = \frac{1/B - \alpha}{\alpha} = \frac{\beta}{1-\beta}.$$

The maximum relative error of β′ with respect to β is

$$\frac{\beta' - \beta}{\beta} = \frac{A - \beta}{\beta} = \frac{\alpha}{1-\alpha}.$$

The operating feature (O.F.) graph, also called the operating characteristic curve, is useful in concept testing. It shows the probability of accepting $H_0$ for various true values of θ. This probability, $P_a(\theta)$, can be expressed as

$$P_a(\theta) = \frac{B^h - 1}{B^h - A^h}, \quad h \ne 0, \tag{8.23}$$
where h is a constant related to θ. The relation between h and θ is defined by

$$\int_{-\infty}^{\infty} \left[\frac{f(x; \theta_1)}{f(x; \theta_0)}\right]^h f(x; \theta)\, dx = 1. \tag{8.24}$$
Solving Eq. (8.24) yields θ(h). The O.F. graph can then be created as follows:

(1) Choose a series of values for h, for example between −3 and 3.
(2) Determine θ(h) at the chosen values of h.
(3) Determine $P_a(\theta)$ at the chosen values of h using Eq. (8.23).
(4) Create the O.F. graph by plotting $P_a(\theta)$ versus θ(h).

Consider two particular cases of Eq. (8.23). For h = 1, Eq. (8.23) becomes

$$P_a(\theta) = \frac{B - 1}{B - A} = 1 - \alpha. \tag{8.25}$$

For h = −1, Eq. (8.23) becomes

$$P_a(\theta) = \frac{B^{-1} - 1}{B^{-1} - A^{-1}} = \beta. \tag{8.26}$$
Example 8.5 [5] Consider a series life test for the exponential distribution. Assume that $\theta_0$ = 2000, $\theta_1$ = 1000, α = 0.1, and β = 0.1. Find the decision limits and the O.F. graph for the test.

Solution The decision limits are

$$A = \frac{\beta}{1-\alpha} = \frac{0.1}{1-0.1} = 0.111, \qquad B = \frac{1-\beta}{\alpha} = \frac{1-0.1}{0.1} = 9.$$

So, if a series test of n elements leads to $RP_n \le 0.111$, the null concept $\theta_0$ = 2000 is accepted. If $RP_n \ge 9$, the null concept is dismissed. If $0.111 < RP_n < 9$, take one more element and continue the test.

To create the O.F. graph for the test, we first solve Eq. (8.24) for the exponential distribution, where

$$f(x; \theta) = \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right), \quad x \ge 0.$$

From Eq. (8.24), and since f(x; θ) is zero for x < 0,

$$\int_{0}^{\infty} \left[\frac{\theta_0 \exp(-x/\theta_1)}{\theta_1 \exp(-x/\theta_0)}\right]^h \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right) dx = 1.$$

Solving the equation yields

$$\theta = \frac{(\theta_0/\theta_1)^h - 1}{h\,(1/\theta_1 - 1/\theta_0)}. \tag{8.27}$$
From Eq. (8.27), if θ = $\theta_0$, then h = 1, and Eq. (8.25) gives $P_a(\theta_0)$ = 1 − α. That is, if $\theta_0$ is the true MTTF, the probability of accepting the batch is 1 − α. Similarly, if θ = $\theta_1$, then h = −1, and Eq. (8.26) gives $P_a(\theta_1)$ = β; that is, if $\theta_1$ is the true MTTF, the probability of accepting the batch is β. To draw the O.F. graph, set h = −2, −1.9, −1.8, …, 1.8, 1.9, 2 (excluding h = 0, where Eqs. (8.23) and (8.27) are undefined), and compute the corresponding values of θ from Eq. (8.27) and of $P_a(\theta)$ from Eq. (8.23). The O.F. graph is the plot of the resulting pairs of $P_a(\theta)$ and θ, as shown in Fig. 8.8. The graph shows that if the true value is θ = 2000, the probability of accepting the batch is 0.9, which is 1 − α, and if θ = 1000, the probability is 0.1, which is β.
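The steps of Example 8.5 can be sketched as follows. The code computes the decision limits of Eqs. (8.19) and (8.20), the points of the O.F. graph from Eqs. (8.27) and (8.23), and the decision after each new lifetime using Eqs. (8.21) and (8.22). Function names and the sample lifetimes in the usage lines are hypothetical and serve only to illustrate the procedure for exponential lives.

```python
import math

def decision_limits(alpha, beta):
    """Eqs. (8.19) and (8.20): limits A and B."""
    return beta / (1 - alpha), (1 - beta) / alpha

def theta_of_h(h, theta0, theta1):
    """Eq. (8.27) for the exponential distribution (h != 0)."""
    return ((theta0 / theta1) ** h - 1) / (h * (1 / theta1 - 1 / theta0))

def prob_accept(h, A, B):
    """Eq. (8.23) (h != 0)."""
    return (B ** h - 1) / (B ** h - A ** h)

def log_ratio_exponential(xs, theta0, theta1):
    """Eq. (8.21) for exponential lifetimes xs."""
    return sum(math.log(theta0 / theta1) - x * (1 / theta1 - 1 / theta0) for x in xs)

def decide(xs, theta0, theta1, alpha, beta):
    """Decision rules applied to the log probability ratio, Eq. (8.22)."""
    A, B = decision_limits(alpha, beta)
    z = log_ratio_exponential(xs, theta0, theta1)
    if z <= math.log(A):
        return "accept H0"
    if z >= math.log(B):
        return "dismiss H0"
    return "continue"

A, B = decision_limits(0.1, 0.1)
hs = [k / 10 for k in range(-20, 21) if k != 0]
of_graph = [(theta_of_h(h, 2000, 1000), prob_accept(h, A, B)) for h in hs]

print(decide([5000.0], 2000, 1000, 0.1, 0.1))          # hypothetical lifetimes
print(decide([5000.0, 5000.0], 2000, 1000, 0.1, 0.1))
```

The list `of_graph` holds the (θ, $P_a$) pairs that would be plotted to obtain a curve like the one described for Fig. 8.8.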
Fig. 8.8 O.F. graph for the series test plan of Example 8.5

8.4.2 Binomial Series Life Testing

As in zero-failure testing, in series testing we are sometimes interested only in whether a test element fails within a certain period of time. The test result is then either failure or success. The probability of an outcome is therefore described by a binomial distribution,

$$p(x) = p^x (1-p)^{1-x}, \quad x = 0, 1, \tag{8.28}$$
where p is the failure probability, x = 0 if no failure occurs, and x = 1 if a failure occurs.

Assume that $p_0$ is the lower limit of the failure probability, below which the batch of elements should be accepted, and $p_1$ is the upper limit of the failure probability, above which the batch should be dismissed. Clearly, $p_0 < p_1$. The series test is then equivalent to testing the concepts

$$H_0: p = p_0, \qquad H_1: p = p_1.$$

For n results, the log probability ratio given by Eq. (8.21) can be written as

$$\ln(RP_n) = r \ln\left[\frac{p_1(1-p_0)}{p_0(1-p_1)}\right] - n \ln\left(\frac{1-p_0}{1-p_1}\right), \tag{8.29}$$

where r is the total number of failures in n tests, $r = \sum_{i=1}^{n} x_i$.

The continue-test region is obtained by substituting Eq. (8.29) into Eq. (8.22). Simplification yields $A_n < r < B_n$, where

$$A_n = C \ln\left(\frac{\beta}{1-\alpha}\right) + nC \ln\left(\frac{1-p_0}{1-p_1}\right), \quad B_n = C \ln\left(\frac{1-\beta}{\alpha}\right) + nC \ln\left(\frac{1-p_0}{1-p_1}\right), \quad C = \ln^{-1}\left[\frac{p_1(1-p_0)}{p_0(1-p_1)}\right]. \tag{8.30}$$

Here C is a constant of the test plan and is not the confidence level used in Sect. 8.3.
Fig. 8.9 Graph of binomial series test plan
$A_n$ and $B_n$ are the limits of the test. According to the decision rules, we accept $H_0$ if $r \le A_n$, dismiss $H_0$ if $r \ge B_n$, and take one more element and continue the test if $A_n < r < B_n$. $A_n$ and $B_n$ are two parallel straight lines, as shown in Fig. 8.9. The cumulative number of failures can be plotted on the graph to show the current decision and to track the progress of the test.

To draw the O.F. graph for this test, we first solve Eq. (8.24) for the binomial distribution defined by Eq. (8.28) and obtain

$$p = \frac{1 - \left(\dfrac{1-p_1}{1-p_0}\right)^h}{\left(\dfrac{p_1}{p_0}\right)^h - \left(\dfrac{1-p_1}{1-p_0}\right)^h}. \tag{8.31}$$
The probability of accepting $H_0$ when $p$ is the true probability of failure is obtained from Eq. (8.23) with $\theta = p$:
$$P_a(p) = \frac{B^{h} - 1}{B^{h} - A^{h}}, \quad h \ne 0. \tag{8.32}$$
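Equations (8.31) and (8.32) together give one point of the O.F. graph for each nonzero $h$. The sketch below evaluates such points; it assumes $A = \beta/(1-\alpha)$ and $B = (1-\beta)/\alpha$, which is consistent with the values $A = 0.1053$ and $B = 18$ quoted in Example 8.6 but is not restated in this subsection. The function name `binomial_oc_point` is hypothetical.

```python
def binomial_oc_point(h, p0, p1, alpha, beta):
    """Return (p, Pa) from Eqs. (8.31) and (8.32) for a nonzero h."""
    if h == 0:
        raise ValueError("h must be nonzero")
    A = beta / (1.0 - alpha)   # assumed definition of A
    B = (1.0 - beta) / alpha   # assumed definition of B
    q = ((1.0 - p1) / (1.0 - p0)) ** h
    p = (1.0 - q) / ((p1 / p0) ** h - q)      # Eq. (8.31)
    pa = (B ** h - 1.0) / (B ** h - A ** h)   # Eq. (8.32)
    return p, pa


# Points for h = -3, -2.9, ..., 3 (h = 0 skipped), Example 8.6 parameters
oc_points = [binomial_oc_point(k / 10.0, 0.001, 0.01, 0.05, 0.1)
             for k in range(-30, 31) if k != 0]
```

Sorting `oc_points` by `p` gives the curve of acceptance probability versus true failure probability.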
Then the O.F. graph can be drawn by following the steps explained above. In test planning, the minimal number of tests leading to acceptance of $H_0$ is often of interest. The shortest path to this decision is when no failures occur in the tests. The minimal number $n_a$ follows from
$$A_n = C\ln\left(\frac{\beta}{1-\alpha}\right) + n_a C\ln\left(\frac{1-p_0}{1-p_1}\right) = 0$$
or
$$n_a = \frac{\ln\left(\dfrac{1-\alpha}{\beta}\right)}{\ln\left(\dfrac{1-p_0}{1-p_1}\right)}. \tag{8.33}$$
Also, the minimal number of tests leading to rejection of $H_0$ occurs when all tests fail. The minimal number $n_r$ follows from
$$B_n = C\ln\left(\frac{1-\beta}{\alpha}\right) + n_r C\ln\left(\frac{1-p_0}{1-p_1}\right) = n_r$$
or
$$n_r = \frac{C\ln\left(\dfrac{1-\beta}{\alpha}\right)}{1 - C\ln\left(\dfrac{1-p_0}{1-p_1}\right)}. \tag{8.34}$$
The expected number of tests $E(n|p)$ needed to reach an accept or reject decision is given by
$$E(n|p) = \frac{P_a(p)\ln\left(\dfrac{1}{A}\right) + [1 - P_a(p)]\ln\left(\dfrac{1}{B}\right)}{p\ln\left(\dfrac{p_0}{p_1}\right) + (1-p)\ln\left(\dfrac{1-p_0}{1-p_1}\right)}, \tag{8.35}$$
which shows that $E(n|p)$ is a function of the true $p$, which is unknown. In computation it can be replaced with an estimate.

Example 8.6 [5] A car supplier wants to demonstrate the reliability of a one-shot airbag at a specified time and test condition. Assume that the technical specification for the airbag defines $p_0 = 0.001$, $p_1 = 0.01$, $\alpha = 0.05$, and $\beta = 0.1$. Create a series test plan.

Solution Substituting the given data into Eq. (8.30), we get the continue-test area $0.0039n - 0.9739 < r < 0.0039n + 1.2504$. According to the decision rules, we accept $H_0$ (the failure probability is less than or equal to 0.001 at the specified time and test condition) if $r \le 0.0039n - 0.9739$, reject $H_0$ if $r \ge 0.0039n + 1.2504$, and test an additional element if $0.0039n - 0.9739 < r < 0.0039n + 1.2504$. The minimal number of tests leading to acceptance of $H_0$ is found from Eq. (8.33) as $n_a = 249$. The minimal number of tests leading to rejection of $H_0$ is computed by Eq. (8.34) as $n_r = 2$. Next, the expected number of tests can be computed. The vendor was confident that the airbag meets the specified reliability based on the accelerated test data of a similar product, which gave $p = 0.0008$. Substituting this value of $p$ into Eq. (8.31) yields $h = 1.141$. With the given $\alpha$ and $\beta$ values, we get $A = 0.1053$ and $B = 18$. From Eq. (8.32), $P_a(p) = 0.9658$. Then the expected number of tests is computed by Eq. (8.35) as $E(n|p) = 289$. The test plan is drawn in Fig. 8.10. The minimal numbers can also be read from this graph.

Fig. 8.10 Series life test plan for Example 8.6

To draw an O.F. graph for the test, set $h$ to various values between −3 and 3. Then compute the corresponding values of $p$ by Eq. (8.31) and of $P_a(p)$ by Eq. (8.32). The graph of $P_a(p)$ versus $p$ is the O.F. graph, presented in Fig. 8.11. It shows that the probability of accepting $H_0$ decreases sharply as the true $p$ increases while $p$ is less than 0.005. That is, the test plan is sensitive to changes in $p$ in this region. To compare the series life test with the zero-failure test, we can determine the sample size for a zero-failure test that demonstrates 99.9% reliability at a 90% confidence level, which is equivalent to $p_0 = 0.001$ and $\beta = 0.1$ in the example above. Equation (8.5) gives $n = 2\,302$. This sample size is considerably larger than 289 (the expected number of tests in the series life test).

Fig. 8.11 O.F. graph for the series life test of Example 8.6
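The numbers in Example 8.6 can be checked with a short script. The sketch below is an illustration rather than the authors' procedure: it solves Eq. (8.31) for $h$ by simple bisection (an assumed numerical method), takes $A = \beta/(1-\alpha)$ and $B = (1-\beta)/\alpha$ as inferred from the quoted values, and assumes that Eq. (8.5) has the form $n = \ln(1 - CL)/\ln R$ for confidence level $CL$ and reliability $R$. All function names are hypothetical.

```python
import math


def p_of_h(h, p0, p1):
    """Eq. (8.31): true failure probability p for an auxiliary parameter h != 0."""
    q = ((1.0 - p1) / (1.0 - p0)) ** h
    return (1.0 - q) / ((p1 / p0) ** h - q)


def solve_h(p_true, p0, p1, lo=1e-6, hi=10.0):
    """Bisection on (lo, hi); assumes p_of_h(hi) < p_true < p_of_h(lo)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_of_h(mid, p0, p1) > p_true:
            lo = mid  # p decreases as h grows, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def plan_summary(p0, p1, alpha, beta, p_true):
    A = beta / (1.0 - alpha)
    B = (1.0 - beta) / alpha
    C = 1.0 / math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    g = math.log((1.0 - p0) / (1.0 - p1))
    n_a = math.ceil(math.log((1.0 - alpha) / beta) / g)                   # Eq. (8.33)
    n_r = math.ceil(C * math.log((1.0 - beta) / alpha) / (1.0 - C * g))   # Eq. (8.34)
    h = solve_h(p_true, p0, p1)
    pa = (B ** h - 1.0) / (B ** h - A ** h)                               # Eq. (8.32)
    e_n = (pa * math.log(1.0 / A) + (1.0 - pa) * math.log(1.0 / B)) / (
        p_true * math.log(p0 / p1) + (1.0 - p_true) * g)                  # Eq. (8.35)
    return {"n_a": n_a, "n_r": n_r, "h": h, "Pa": pa, "E_n": e_n}


summary = plan_summary(0.001, 0.01, 0.05, 0.1, 0.0008)
# Zero-failure sample size, assuming Eq. (8.5) is n = ln(1 - CL) / ln(R)
n_zero = math.ceil(math.log(1.0 - 0.90) / math.log(0.999))
```

The minimal numbers are rounded up because a fractional number of tests cannot be run; the expected number `E_n` is left unrounded so that it can be compared with the value quoted in the example.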
8.4.3 Exponential Series Life Testing
The exponential distribution can approximate the life distribution of some products, for example, flash memory. Because of its simplicity, this distribution is widely applied and possibly misused. This subsection presents a series life test for this distribution. The exponential PDF is
$$f(t) = \frac{1}{\theta}\exp\left(-\frac{t}{\theta}\right),$$
where $t$ is the lifetime and $\theta$ is the mean time to failure. The series life testing is to test the hypotheses
$$H_0: \theta = \theta_0, \qquad H_1: \theta = \theta_1,$$
where $\theta_0 > \theta_1$. In addition to $\theta_0$ and $\theta_1$, the test also specifies $\alpha$ and $\beta$. To carry out the hypothesis test, we form the log probability ratio using Eq. (8.21) and get
$$\ln(PR_n) = \sum_{i=1}^{n}\ln\left[\frac{(1/\theta_1)\exp(-t_i/\theta_1)}{(1/\theta_0)\exp(-t_i/\theta_0)}\right] = n\ln\left(\frac{\theta_0}{\theta_1}\right) - T\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right), \tag{8.36}$$
where $n$ is the total number of tests and $T$ is the total time to failure of the $n$ elements, $T = \sum_{i=1}^{n} t_i$.
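By Eqs. (8.22) and (8.36), the continue-test area is
$$A_n < T < B_n, \tag{8.37}$$
where
$$A_n = C\ln\left(\frac{\alpha}{1-\beta}\right) + nC\ln\left(\frac{\theta_0}{\theta_1}\right), \quad B_n = C\ln\left(\frac{1-\alpha}{\beta}\right) + nC\ln\left(\frac{\theta_0}{\theta_1}\right), \quad C = \frac{\theta_1\theta_0}{\theta_0 - \theta_1}.$$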
The quantity measured in the test is the time to failure. The decision variable is the total time to failure, not the total number of failures. So, the decision rules are: accept $H_0$ if $T \ge B_n$, reject $H_0$ if $T \le A_n$, and test an extra element and continue the test if $A_n < T < B_n$. The shortest path to the reject decision is testing
$$n = \frac{\ln[(1-\beta)/\alpha]}{\ln(\theta_0/\theta_1)}$$
elements that fail at time zero. The shortest path to the accept decision is testing one element that survives at least the time given by $B_n$ with $n = 1$:
$$B_1 = C\ln\left(\frac{1-\alpha}{\beta}\right) + C\ln\left(\frac{\theta_0}{\theta_1}\right).$$
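The two shortest paths can be evaluated numerically as in the following sketch. The function name `shortest_paths` is hypothetical, the number of elements is rounded up to the next integer (an assumption, since a fractional number of elements cannot be tested), and the input values are those of Example 8.7.

```python
import math


def shortest_paths(theta0, theta1, alpha, beta):
    """Return (n_reject, b1) for the exponential series life test.

    n_reject: smallest number of elements failing at time zero that forces rejection.
    b1: survival time of a single element that forces acceptance (B_n with n = 1).
    """
    C = theta0 * theta1 / (theta0 - theta1)
    n_reject = math.ceil(math.log((1.0 - beta) / alpha) / math.log(theta0 / theta1))
    b1 = C * math.log((1.0 - alpha) / beta) + C * math.log(theta0 / theta1)
    return n_reject, b1


n_reject, b1 = shortest_paths(5000.0, 3000.0, 0.05, 0.1)
```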
The O.F. graph for the test plan can be created by applying Eqs. (8.23) and (8.27). The process was explained in Example 8.5. Applying Eq. (8.37) requires testing elements one at a time to failure. Compared with a truncated test, this approach decreases the sample size but increases the test time. It is recommended when accelerated testing is applicable. Sometimes simultaneous testing of a sample of sufficient size is required. The corresponding decision rules and test plans are described in, for example, MIL-HDBK-781 [4].

Example 8.7 [5] A producer is required to demonstrate that the MTTF of a new electronic product is not less than 5 000 h. Assume that the lower limit of unacceptable MTTF is 3 000 h, $\alpha = 0.05$, and $\beta = 0.1$. Create a series test plan. An accelerated life test of 5 elements gave the failure times 196.9, 15.3, 94.2, 262.6, and 111.6 h. Assume that the acceleration factor is 55. Decide whether to continue the test based on the test data.

Solution The continue-test area is computed by Eq. (8.37) as $-21\,677.8 + 3\,831.2n < T < 16\,884.7 + 3\,831.2n$. According to the decision rules, we conclude that the MTTF of the product meets the requirement of 5 000 h if $T \ge 16\,884.7 + 3\,831.2n$, and that it does not meet the requirement if $T \le -21\,677.8 + 3\,831.2n$. Otherwise, take one more element and continue the test. To decide whether to continue the test, the failure times must be converted to those at the use stress level by multiplying them by the acceleration factor. The equivalent total failure time is $T = 55 \times (196.9 + 15.3 + 94.2 + 262.6 + 111.6) = 37\,433$. The decision limits are $A_5 = -21\,677.8 + 3\,831.2 \times 5 = -2\,521.8$ and $B_5 = 16\,884.7 + 3\,831.2 \times 5 = 36\,040.7$.
Fig. 8.12 Series life test plan and results of Example 8.7
Because $T > B_5$, we conclude that the product meets the MTTF of 5 000 h. The series life test results and the decision process are shown in Fig. 8.12. The accumulated test time crosses the $B_n$ limit after the test of 5 elements.
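The decision process of Example 8.7 can be replayed element by element. The sketch below is a minimal illustration of Eq. (8.37) with the decision rules stated above; the function name `exponential_sprt` and its return format are hypothetical, and the failure times are processed in the order listed in the example.

```python
import math


def exponential_sprt(times, theta0, theta1, alpha, beta, accel=1.0):
    """Apply the series life test to failure times, one element at a time.

    Returns (decision, n, T), where T is the accumulated equivalent time.
    """
    C = theta0 * theta1 / (theta0 - theta1)
    step = C * math.log(theta0 / theta1)          # common slope of A_n and B_n
    a0 = C * math.log(alpha / (1.0 - beta))
    b0 = C * math.log((1.0 - alpha) / beta)
    total = 0.0
    for n, t in enumerate(times, start=1):
        total += accel * t                         # convert to the use stress level
        if total >= b0 + n * step:
            return "accept", n, total
        if total <= a0 + n * step:
            return "reject", n, total
    return "continue", len(times), total


decision, n_used, T_total = exponential_sprt(
    [196.9, 15.3, 94.2, 262.6, 111.6], 5000.0, 3000.0, 0.05, 0.1, accel=55.0)
```

With these data the accumulated time stays inside the continue-test area for the first four elements and reaches the upper limit at the fifth, in agreement with Fig. 8.12.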
8.4.4 Weibull Series Life Testing
The Weibull distribution is the most widely applied because of its flexibility in shape. The Weibull PDF is
$$f(t) = \frac{b}{\theta}\left(\frac{t}{\theta}\right)^{b-1}\exp\left[-\left(\frac{t}{\theta}\right)^{b}\right], \quad t \ge 0,$$
where $b$ is the shape parameter and $\theta$ is the scale parameter. If we define $y = t^{b}$, where $b$ is assumed known, $y$ has an exponential distribution with scale parameter (mean) $\gamma = \theta^{b}$. Then the series life test plan for a Weibull distribution can be obtained by modifying the plan for the exponential distribution explained above. Assume that it is necessary to demonstrate the scale parameter of the Weibull distribution such that if $\theta = \theta_0$ the probability of accepting the batch is $1 - \alpha$, and if $\theta = \theta_1$, where $\theta_1 < \theta_0$, the probability of acceptance is $\beta$. This is equivalent to testing the exponential hypotheses
$$H_0: \gamma = \gamma_0, \qquad H_1: \gamma = \gamma_1,$$
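where $\gamma_0 = \theta_0^{b}$ and $\gamma_1 = \theta_1^{b}$. By Eq. (8.37), the continue-test area is
$$A_n < T < B_n, \tag{8.38}$$
where
$$T = \sum_{i=1}^{n} t_i^{b}, \quad A_n = C\ln\left(\frac{\alpha}{1-\beta}\right) + nbC\ln\left(\frac{\theta_0}{\theta_1}\right), \quad B_n = C\ln\left(\frac{1-\alpha}{\beta}\right) + nbC\ln\left(\frac{\theta_0}{\theta_1}\right), \quad C = \frac{(\theta_1\theta_0)^{b}}{\theta_0^{b} - \theta_1^{b}}.$$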
We accept $H_0$ if $T \ge B_n$, reject $H_0$ if $T \le A_n$, and continue the test otherwise. The O.F. graph can be drawn by applying the equations and algorithms for the exponential distribution with the substitutions $\gamma_0 = \theta_0^{b}$ and $\gamma_1 = \theta_1^{b}$. The test approach assumes a known shape parameter of the Weibull distribution. In practice, it can be estimated from the accelerated test data obtained in earlier development stages or from past data on a similar product. If such data are not available, the shape parameter can be estimated from the series life test itself. But the test plan then needs to be adjusted as the updated estimates become available. The process is as follows:
(1) Test at least three elements, one at a time, until all have failed.
(2) Estimate the shape and scale parameters from the test data.
(3) Compute $A_n$ and $B_n$ using the estimate of the shape parameter.
(4) Apply the decision rules to the failure times in the order in which they were observed. If a reject or accept decision is made, stop the test. Otherwise, go to step 5.
(5) Take one more element and continue the test until it fails or until a decision to accept is reached. If it fails, return to step 2.
Although the test results supply a better estimate of the shape parameter, the estimate can still differ considerably from the true value. This difference, of course, affects the actual type 1 and type 2 errors. Thus, it is recommended that the sensitivity of the test plan to the uncertainty of the estimate be evaluated.

Example 8.8 [5] The life of a mechanical element can be modeled with a Weibull distribution with $b = 1.5$. The producer is required to demonstrate that the scale parameter meets the standard of 55 000 cycles. For $\theta_1 = 45\,000$ cycles, $\alpha = 0.05$, and $\beta = 0.1$, develop a series test plan.

Solution Substituting the specified data into Eq. (8.38), we get the continue-test area as
$$-106 \times 10^{6} + 11.1 \times 10^{6}\,n < \sum_{i=1}^{n} t_i^{1.5} < 82.7 \times 10^{6} + 11.1 \times 10^{6}\,n.$$
The test plan is drawn in Fig. 8.13. The vertical axis $T$ is the total transformed failure time. The O.F. graph is drawn by applying Eqs. (8.23) and (8.27) and the transformation $\gamma_i = \theta_i^{b}$ $(i = 0, 1)$. Figure 8.14 shows the O.F. graph, which gives the probability of acceptance at various true values of the Weibull scale parameter, where $\theta = \gamma^{1/b}$.
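The limits of Example 8.8 follow directly from Eq. (8.38). The sketch below computes the intercepts and the slope of the limit lines together with the transformed total time; the function names `weibull_sprt_limits` and `transformed_total` are hypothetical.

```python
import math


def weibull_sprt_limits(theta0, theta1, b, alpha, beta):
    """Return (a0, b0, step) so that A_n = a0 + step*n and B_n = b0 + step*n."""
    C = (theta0 * theta1) ** b / (theta0 ** b - theta1 ** b)
    step = b * C * math.log(theta0 / theta1)
    a0 = C * math.log(alpha / (1.0 - beta))
    b0 = C * math.log((1.0 - alpha) / beta)
    return a0, b0, step


def transformed_total(times, b):
    """Decision variable T: sum of failure times raised to the known shape b."""
    return sum(t ** b for t in times)


a0_w, b0_w, step_w = weibull_sprt_limits(55000.0, 45000.0, 1.5, 0.05, 0.1)
```

The same decision rules as for the exponential test then apply to `transformed_total(times, b)` instead of the plain sum of failure times.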
Fig. 8.13 Series life test plan of Example 8.8
Fig. 8.14 O.F. graph for the series life test of Example 8.8
8.5 Accelerated Tests
8.5.1 Principles for Accelerating Tests
The principles of accelerating reliability tests of machinery parts are a set of theoretical and experimental regularities or justified assumptions on the basis of which the test duration is reduced [6]. Reduced tests are based on the principles of compressed operational cycles and extrapolation in time. Accelerated tests are based on the principles of load range revision, increased operational cycle frequency, load extrapolation, "break completely", progressive load forcing, and alternating loading modes. A set of rules for applying these principles to determine or to inspect the reliability of groups or types of products forms the method of accelerated testing for machinery. The listed principles can be applied in conjunction with each other to create a test program and approach that significantly reduces the test duration.
The above principles for accelerating tests are based on the hypothesis that the reliability of a test object, if the physical pattern of failures is maintained, depends on the amount of its past service life and does not depend on how that service life was expended:
$$r(T) = \int_{0}^{T}\omega(t)\,dt, \tag{8.39}$$
where $r(T)$ is the service life for the full operational time and $\omega(t)$ is a non-negative function expressing some parameter of the part that deteriorates in the course of operation. The service life of a part in the accelerated test must match its service life under normal operational load. Then the acceleration factor is
$$C_a = \frac{r_o(T)}{r_a(T)} = \frac{\int_{0}^{T}\omega(t_o)\,dt}{\int_{0}^{T}\omega(t_a)\,dt}. \tag{8.40}$$
The acceleration factor indicates how many times the duration of accelerated tests is less than the duration of tests conducted under the operational conditions specified in the technical documentation for this product.
8.5.2 Operational Cycles Compressing
It involves reducing interruptions in the operation of the part, eliminating downtime, and reducing auxiliary maintenance time. Compression of operating cycles is allowed only in cases where the interruptions do not affect the rate of the processes leading to failures. The principle of operational cycle compression is implemented by round-the-clock tests, as well as by automated tests with control of operational parameters and test modes. The acceleration factor of tests according to the principle of compressed operational cycles is
$$C_a = \frac{T_o}{T_a} = \frac{t_o + t_{int}}{t_o}, \tag{8.41}$$
where $t_o$ is the operational time and $t_{int}$ is the time of interruptions. Taking into account the management features of accelerated tests, the acceleration factor according to the principle of compressing operational cycles in time is
$$C_a = \frac{C_{ua}\,t_{sh\,a}\,i_a}{C_u\,t_{sh}\,i}, \tag{8.42}$$
where $C_u$ and $C_{ua}$ are the utilization ratios of a part in operation and during the test, respectively; $t_{sh}$ and $t_{sh\,a}$ are the shift durations in operation and during the test, respectively; and $i$ and $i_a$ are the numbers of shifts in operation and during the test, respectively.
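As a small worked illustration of Eqs. (8.41) and (8.42), the sketch below evaluates both acceleration factors. The function names and all numerical values are hypothetical and chosen only to show the arithmetic.

```python
def ca_compressed_cycles(t_o, t_int):
    """Eq. (8.41): acceleration from removing interruptions of duration t_int."""
    return (t_o + t_int) / t_o


def ca_shift_based(cu_a, tsh_a, i_a, cu, tsh, i):
    """Eq. (8.42): acceleration from utilization ratio, shift length, and shift count."""
    return (cu_a * tsh_a * i_a) / (cu * tsh * i)


# Hypothetical case: 6 h of operation and 18 h of interruptions per day
ca_cycles = ca_compressed_cycles(6.0, 18.0)
# Hypothetical case: one 8 h shift at 50% utilization in service,
# three 8 h shifts at 90% utilization on the testbed
ca_shifts = ca_shift_based(0.9, 8.0, 3, 0.5, 8.0, 1)
```

In the first case the factor is 4, and in the second it is 5.4, which shows that round-the-clock testing alone can give a several-fold reduction in calendar time.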
8.5.3 Extrapolation in Time
This approach is based on the idea that it is possible to predict part reliability with high validity. The extrapolation is carried out on the basis of the failure model. The MTBF is estimated from the results of short-term revised tests. There are two types of revised tests:
(1) Termination of tests when a specified number of failures is reached (for example, Sect. 8.3 of this chapter).
(2) Termination of tests when a specified operational time is reached (for example, Sect. 8.3 of this chapter).
For the time extrapolation approach, the acceleration factor $C_a$ is defined as the ratio of the average predicted service life $T_p$ to the duration of the test, i.e., the operational time $t_t$ accumulated at the moment the test is terminated:
$$C_a = \frac{T_p}{t_t}. \tag{8.43}$$
In the general case, the random function of the change in the parameter $\omega(t)$ of the product in time $t$ is represented as
$$\omega(t) = m_t + p(t) + d(t) = m_t + p(t) + z(t) + c(t), \tag{8.44}$$
where $m_t$ is the initial value of the parameter (mathematical expectation); $p(t)$ is the change of the parameter due to degradation processes occurring at an average rate; $d(t)$ is a function of the parameter due to slow processes; $z(t)$ is a periodic component, which is a deterministic function; and $c(t)$ is a random component, which is a stationary random function. If the features of the random process $\omega(t)$ are known from the accelerated test results, extrapolation reduces to determining the probability of no-failure operation $P(T_p)$ up to the time $T_p$, considered as the probability of not crossing the set boundary of parameter change $\omega_{max}$ (Fig. 8.15). At the same time, it should be taken into account that almost any dependence of the parameter $\omega$ on the operational time $t$ can be brought to a linear form by an appropriate coordinate transformation, and the coefficients of the equations can be determined using the least squares method. The approach assumes that the function describing the change of the parameter $\omega$ keeps the same form in time until the product reaches the limit state, and that it is monotonic and differentiable throughout this interval.
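The linearization and least squares step mentioned above can be sketched as follows. This is a simplified illustration that extrapolates only the mean trend of the parameter and ignores the periodic and random components $z(t)$ and $c(t)$; the function names and the measurement data are hypothetical.

```python
def fit_line(ts, ws):
    """Ordinary least squares fit w = m + v*t; returns (m, v)."""
    n = len(ts)
    t_mean = sum(ts) / n
    w_mean = sum(ws) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (w - w_mean) for t, w in zip(ts, ws))
    v = sxy / sxx
    return w_mean - v * t_mean, v


def extrapolate_life(ts, ws, w_max):
    """Predict the time T_p at which the fitted trend reaches w_max.

    Returns (T_p, C_a) with C_a = T_p / t_t as in Eq. (8.43),
    where t_t is the last measurement time.
    """
    m, v = fit_line(ts, ws)
    if v <= 0:
        raise ValueError("the fitted trend never reaches the limit")
    t_p = (w_max - m) / v
    return t_p, t_p / ts[-1]


# Hypothetical wear measurements taken during a 400 h test, limit value 3.0
T_p, C_a = extrapolate_life([0, 100, 200, 300, 400], [1.0, 1.2, 1.4, 1.6, 1.8], 3.0)
```

For a dependence that is not linear in $t$, the same fit can be applied after transforming the coordinates, for example by fitting the logarithm of the parameter.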
Fig. 8.15 The layout of the test acceleration by extrapolation in time
It can be assumed that time extrapolation gives satisfactory accuracy results with a test duration of less than 40–70% of the product service life under regular tests.
8.5.4 Revision of Loading Range
It is based on excluding the part of the loads that does not have a significant damaging effect on the test object, which is accompanied by a general increase in the average level of loading and an accelerated loss of performance by the product. Most machinery is exposed to a wide range of random and periodically recurring external factors during operation. The planned set of loads implemented during testing is formed by conducting a statistical analysis of the repeatability of loads and discarding the part of the loads that does not affect the destruction process under consideration (Fig. 8.16). The acceleration factor in this case is determined by the ratio of the total number of cycles $N_o$ for the operational loading mode to the total number of cycles $N_t$ for the revised loading mode:
$$C_a = \frac{N_o}{N_t}. \tag{8.45}$$
A special case of load range revision is the exclusion of the steady-state part of the operational cycle, and in some cases, it is advisable to carry out tests with the start-stop mode.
Fig. 8.16 The layout of the test acceleration by revision of loading range: a—operational mode; b—test mode
The efficiency of the load range revision principle is higher when the modes that do not lead to damage make up most of the total operating time.
8.5.5 Increased Frequency of Operational Cycles
It is based on increasing the frequency of cyclic loading for fatigue tests or the speed of movement under load for parametric load tests. Its application is permissible when the durability of the test part depends on the number of applied loading cycles but does not depend on the frequency of their application. For fatigue tests, the acceleration factor is
$$C_a = \frac{f_a}{f_o}, \tag{8.46}$$
where f a and f o are frequencies of loading for accelerated test and regular operation conditions correspondingly. Application of the approach of increased frequency operational cycles requires a separate estimation of the influence of loading frequency on the endurance limit of the test part and on its service life by the fatigue criterion under the given loading conditions. In order to maintain the specified temperature mode of the test part, it is advisable to perform its cooling for accelerated tests.
8.5.6 Loading Extrapolation
It involves testing at loading levels exceeding regular loading and extrapolating the obtained dependence of the reliability index from the forced to the operational level of loading. For accelerated estimation of the endurance limit, several groups of material specimens or machinery parts are tested at stress levels exceeding the endurance limit. A segment of the left-hand branch (graph 1) of the Wöhler fatigue curve is plotted. By extrapolating this segment to the supposed abscissa $N_C$ of the breaking point of the fatigue curve, an approximate estimate of the endurance limit is obtained (Fig. 8.17). For the preliminary determination of the abscissa of the breaking point of the curve, it is necessary to analyze various sources of information or to use the following dependence:
$$N_C = \left(2 + \frac{\alpha_\sigma}{2}\right)10^{6}, \tag{8.47}$$
Fig. 8.17 The layout of the accelerated estimation of the endurance limit
where $\alpha_\sigma$ is the theoretical stress concentration factor for the dangerous cross-section of the tested specimen of material. In tests for parametric reliability, each representative group of parts is subjected to forcing of a certain level $F_i$ (Fig. 8.18). For each group, the law of service life distribution and the parameters of this law are determined. Using the least squares method, the dependencies of the distribution parameters on the forcing level are established. After that, the reliability for the range of regular loading modes is predicted. The disadvantages of the load extrapolation approach include:
(1) The need for a large number of specimens tested to the limit state.
(2) The impossibility of simultaneous forcing by several parameters of the test mode.
The difference between the maximum $F_{f\,max}$ and the minimum $F_{f\,min}$ of the forcing loads (see Fig. 8.18) has a great influence on the accuracy of the service life estimate. When the difference is small, the extrapolation accuracy decreases; when the difference is large, the test duration increases.

Fig. 8.18 The layout of the test acceleration by loading extrapolation
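The accelerated estimation of the endurance limit described at the start of this subsection can be sketched as a straight-line extrapolation. The sketch assumes that the left-hand branch of the fatigue curve is linear in the coordinates stress versus $\lg N$, which is a modeling choice and not a statement from the text. The breaking point abscissa $N_C$ is passed in as an assumed input, and the test data are hypothetical.

```python
import math


def fit_left_branch(cycles, stresses):
    """Least squares fit of stress = c0 + c1 * lg(N) over the tested levels."""
    xs = [math.log10(n) for n in cycles]
    k = len(xs)
    x_mean = sum(xs) / k
    y_mean = sum(stresses) / k
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, stresses))
    c1 = sxy / sxx
    return y_mean - c1 * x_mean, c1


def endurance_limit_estimate(cycles, stresses, n_c):
    """Extrapolate the fitted left-hand branch to the assumed breaking point n_c."""
    c0, c1 = fit_left_branch(cycles, stresses)
    return c0 + c1 * math.log10(n_c)


# Hypothetical mean lives (cycles) at three stress levels (MPa), assumed N_C = 2e6
sigma_limit = endurance_limit_estimate([1e4, 1e5, 1e6], [400.0, 350.0, 300.0], 2e6)
```

The estimate is only as good as the assumed position of the breaking point, so it is worth repeating the calculation for several plausible values of `n_c`.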
Fig. 8.19 The layout of the test acceleration by “break completely” approach
8.5.7 Break Completely
It is used in accelerated fatigue tests and provides for short-term tests of parts at the operational stress $\sigma_o$ followed by "breaking completely" the test part with a higher breaking load at stress $\sigma_b$ (Fig. 8.19). After simplification, the equation for estimating the residual life at the stress $\sigma_b$ has the form
[ ] ( )1−lg n o lg N b −1 n b ≈ μn o N b (8.48)
where $n_b$ is the mathematical expectation of the residual life of a tested part in the "break completely" stage at $\sigma_b$; $\mu$ is a factor that characterizes the behavior of the structural material of the part under a one-time load; $n_o$ is the duration of the preliminary operational test of the part at $\sigma_o$; $N_b$ is the average service life at the forced stress $\sigma_b$; and $N_o$ is the required average life of the part at the operational stress $\sigma_o$. The unknown parameters $\mu$ and $N_o$ of Eq. (8.48) are determined by iteration, sequential approximation, or maximum likelihood methods from the results of several tests with different preliminary operational test durations $n_o$. However, the expected value of service life is only 2–3 times higher than the maximum preliminary operational test duration. There are several modified "break completely" approaches that give more accurate and stable estimates of fatigue life by using nomograms, which allow the optimal preliminary test time to be selected.
8.5.8 Progressive Load Forcing
This approach is used for accelerated determination of the endurance limit. It is based on the constant growth of forced loading of the test part over time until it reaches the limit state. There are several variants of this approach, including the methods of Prot, Enomoto, and Locati. Prot's accelerated method of fatigue testing involves testing specimens to failure with a linearly increasing amplitude of the stress cycle. Depending on the design of the testing equipment, the stress increase may be stepwise or continuous (Fig. 8.20).
Fig. 8.20 The layout of the test acceleration by continuous stress amplitude increase
To determine the endurance limit by Prot's method, at least three or four batches of specimens must be tested. The rate of stress amplitude increase $\alpha$ is different for each batch. The maximum loading rate is chosen so that the stress $\sigma_b$ at the moment of fracture does not exceed the yield strength of the material. The minimum rate is set as low as possible. However, it must be kept in mind that the duration of Prot's tests is mainly determined by the stages with the minimum loading rate, i.e., the effectiveness of the method depends largely on the level of the minimum rate of stress increase. Usually, the rate of stress amplitude increase is chosen in the range $\alpha = 5\cdot10^{-3}$ to $5\cdot10^{-5}$ MPa/cycle. Tests of all batches are conducted at the same initial amplitude of the stress cycle $\sigma_o$, which for steels is taken 10–15% higher than the expected value of the fatigue limit. For other alloys, the initial amplitude of the stress cycle is taken equal to the expected value of the endurance limit for the base of $10^{7}$ cycles. Decreasing the initial amplitude of the stress cycle below the specified values reduces the efficiency of accelerated tests. As a result of the tests, a different value of fracture stress $\sigma_{bi}$ is obtained for each rate of stress increase. The required value of the endurance limit is determined by extrapolating the experimental dependence of the fracture stress on the rate of stress increase on the basis of the equation
$$\sigma_b = \sigma_{-1} + A\,\alpha^{C}, \tag{8.49}$$
where $\sigma_b$ is the median of the breaking stresses corresponding to the given rate of stress increase; $\sigma_{-1}$ is the endurance limit of the tested part for the symmetrical cycle; $\alpha$ is the rate of stress amplitude increase; and $A$ and $C$ are coefficients of the equation. The accelerated Enomoto's method involves testing one batch of four to five specimens at a constant rate of stress cycle amplitude increase. The initial stress level is chosen in the same way as in Prot's method. The determination of the endurance limit in Enomoto's accelerated tests is based on the assumption that, at a constant rate of stress amplitude increase, the ratio of the breaking stress to the endurance limit is a constant for the same type of material, i.e.,
$$\sigma_{-1} = \frac{\sigma_b}{K}, \tag{8.50}$$
where $K = f(\alpha)$ is a factor that depends on the rate of the stress amplitude increase. Values of $K$ and its root mean square error $\delta_K$ are given in Table 8.2. They were obtained from the results of tests of different grades of steels, aluminum alloys, and magnesium alloys at different rates of stress amplitude increase. Analysis of the results shows that the error in determining the endurance limit by Enomoto's method reaches 10–15%. Therefore, this method can be used for the approximate estimation of the endurance limit. Locati's method is applicable to materials whose fatigue curve has a right-hand section that can be approximated by a straight line parallel to the abscissa axis (for example, carbon steels). One batch of specimens is tested at a constant rate of stress amplitude increase. The stress increase is usually assumed to be stepwise. For each specimen tested, the value of accumulated damage is calculated as
$$a = \sum \frac{n_i}{N_i} \tag{8.51}$$
based on three conditional (assumed) fatigue curves a, b and c (Fig. 8.21), covering the possible scatter area of the fatigue performance of the material. Then a graph of the dependence of the sum of accumulated damage, corresponding to the given fatigue curves, on the accepted value of the endurance limit is plotted (Fig. 8.22). This graph is used to determine the required value of the endurance limit as the value of the abscissa corresponding to the ordinate equal to one, i.e., for $\sum n_i/N_i = 1$. It should be noted that the dispersion of the value obtained from the results of tests of a batch of identical specimens by Locati's method cannot be used as an estimate of the dispersion of the endurance limit related to the heterogeneity of material properties, surface condition, etc., since the dispersion of the results of accelerated tests by Locati's method is largely caused by errors in the choice of the shape and parameters of the conditional fatigue curves. Locati's method achieves an acceleration factor of 25 for testing a batch of 5–6 parts.

Table 8.2 $K$ and $\delta_K$ for different rates of stress amplitude increase [7]

| Grade of material | $\alpha\cdot10^{4}$, MPa/cycle | $K$ | $\delta_K$ |
|---|---|---|---|
| Steel | 0.1 | 1.08 | 0.08 |
| Steel | 0.5 | 1.13 | 0.12 |
| Steel | 1.0 | 1.18 | 0.14 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 0.3 | 1.30 | 0.05 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 0.5 | 1.33 | 0.06 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 1.0 | 1.38 | 0.07 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 2.0 | 1.46 | 0.10 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 4.0 | 1.63 | 0.12 |
| Aluminum and magnesium alloys with the base of $10^{7}$ cycles | 6.0 | 1.76 | 0.18 |
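The Prot extrapolation of Eq. (8.49) and the Enomoto estimate of Eq. (8.50) can be illustrated with the sketch below. Two assumptions are made and should be read as such: the exponent $C$ in Eq. (8.49) is treated as known and set to 0.5, so that the fit becomes linear in $\alpha^{C}$, and the fracture stresses used for the Prot fit are synthetic values generated from an assumed curve. The function names are hypothetical; the $K$ values for steel are taken from Table 8.2.

```python
# Table 8.2, steel: rate alpha*1e4 (MPa/cycle) -> K
K_STEEL = {0.1: 1.08, 0.5: 1.13, 1.0: 1.18}


def enomoto_endurance_limit(sigma_b, k):
    """Eq. (8.50): endurance limit from the breaking stress and the factor K."""
    return sigma_b / k


def prot_endurance_limit(alphas, sigma_bs, c=0.5):
    """Fit sigma_b = sigma_-1 + A * alpha**c by least squares for a fixed, assumed c.

    Returns (sigma_-1, A); sigma_-1 is the intercept at alpha -> 0.
    """
    xs = [a ** c for a in alphas]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(sigma_bs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sigma_bs))
    A = sxy / sxx
    return y_mean - A * x_mean, A


# Synthetic batch medians generated from an assumed curve (sigma_-1 = 200 MPa, A = 1000)
rates = [5e-5, 5e-4, 5e-3]
medians = [200.0 + 1000.0 * a ** 0.5 for a in rates]
sigma_prot, A_prot = prot_endurance_limit(rates, medians)

# Enomoto estimate for a steel specimen broken at 226 MPa with alpha*1e4 = 0.5
sigma_enomoto = enomoto_endurance_limit(226.0, K_STEEL[0.5])
```

If $C$ is not known in advance, the same fit can be repeated for several trial values of `c` and the value giving the smallest residual chosen.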
Fig. 8.21 Conditional fatigue curves for computing the ultimate sum of accumulated damage
Fig. 8.22 Graph for determining the endurance limit by Locati's accelerated method
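The procedure of Figs. 8.21 and 8.22 can be sketched numerically: the damage sum of Eq. (8.51) is computed for each assumed fatigue curve, and the endurance limit is read off where the sum equals one. The shape of the conditional fatigue curves used here (a power-law left branch and a horizontal right branch) and all numerical values are hypothetical choices for illustration, as is the linear interpolation between the assumed curves.

```python
def cycles_to_failure(sigma, sigma_r, n_g=2.0e6, m=6.0):
    """Hypothetical conditional fatigue curve with endurance limit sigma_r.

    Horizontal right branch at sigma_r, power-law left branch above it.
    """
    if sigma <= sigma_r:
        return float("inf")
    return n_g * (sigma_r / sigma) ** m


def damage_sum(steps, sigma_r):
    """Eq. (8.51): sum of n_i / N_i over the loading steps (stress, cycles)."""
    return sum(n / cycles_to_failure(s, sigma_r) for s, n in steps)


def locati_endurance_limit(steps, assumed_limits):
    """Interpolate the assumed endurance limit at which the damage sum equals 1."""
    pts = sorted((sr, damage_sum(steps, sr)) for sr in assumed_limits)
    for (s1, d1), (s2, d2) in zip(pts, pts[1:]):
        if (d1 - 1.0) * (d2 - 1.0) <= 0.0 and d1 != d2:
            return s1 + (s2 - s1) * (d1 - 1.0) / (d1 - d2)
    raise ValueError("damage sums do not bracket 1")


# Hypothetical stepwise loading history of one specimen: (stress in MPa, cycles)
steps = [(220.0, 1e5), (240.0, 1e5), (260.0, 1e5), (280.0, 1e5), (300.0, 5e4)]
sigma_locati = locati_endurance_limit(steps, [200.0, 220.0, 240.0])
```

A higher assumed endurance limit gives longer lives at every step and therefore a smaller damage sum, which is why the three sums bracket the value one.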
8.5.9 Alternating Modes Loading
It is used for accelerated tests of parts for parametric reliability in those cases when failure is caused by some parameter, changing monotonically in time, reaching its limit level. This approach is based on the alternation of operational $P_o$ and forced $P_f$ loading modes (Fig. 8.23). The service life of the test part is expended mainly at the stages of forced loading. Tests can be started after the running-in of parts and finished at a stage with the operational mode. The duration of an operational mode stage is determined by the condition of accumulating the minimum parameter change $\Delta a_{min}$ that can be reliably measured. The duration of loading at the stages during testing can increase or decrease depending on the mathematical expectation of the parameter change rate and on how this rate changes as the service life is exhausted.

Fig. 8.23 The layout of the test acceleration by alternating modes loading: a: loading modes; b: change in the rate of the reliability parameter during test; c: recomputing of the results of accelerated tests to the operational mode

The number of operational loading stages $m_o$ is limited by the condition
$$2 \le m_o \le \frac{a_{lim} - a_0}{\Delta a_{min}}. \tag{8.52}$$
It is advisable to conduct tests at each forced stage until the parameter change accumulates to the value
$$\Delta a_f = \frac{a_{lim} - a_0 - m_o\,\Delta a_{min}}{m_o - 1}, \tag{8.53}$$
where $a_{lim}$ is the maximum permissible value of the checked parameter and $a_0$ is the value of the checked parameter after the running-in of parts. During the test, the rates of parameter change at the stages of operational and forced loading are monitored continuously. The parameter change $\Delta a_f$ accumulated at a forced loading stage is recomputed to the operational mode by "extending the areas" under the rate graph of the forced parameter change $\xi_{fi}$, using the parameter change rates known from the tests before the start, $\xi_{oi}$, and after the finish, $\xi_{o(i+1)}$, of each $i$-th forced stage. In the range $(t_2 - t_1)$, the cumulative value of the parameter change for the first forced stage is
$$\Delta a_{f1} = \int_{t_1}^{t_2}\xi_{f1}(t)\,dt. \tag{8.54}$$
The time range $\Delta t_{o1} = (t_2' - t_1)$ of the operational stage that corresponds to the accumulated value of the parameter change $\Delta a_{f1}$ at the forced stage is determined by
$$\int_{t_1}^{t_2}\xi_{f1}(t)\,dt = 0.5\,[\xi_{o1} + \xi_{o2}]\,(t_2' - t_1), \tag{8.55}$$
$$\Delta t_{o1} = \frac{2}{\xi_{o1} + \xi_{o2}}\int_{t_1}^{t_2}\xi_{f1}(t)\,dt. \tag{8.56}$$
The time range $\Delta t_{o2}$ of the operational stage, corresponding to the accumulated value of the parameter change $\Delta a_{f2}$ in the range $(t_4 - t_3)$ of the forced stage, is found similarly:
$$\Delta t_{o2} = \frac{2}{\xi_{o2} + \xi_{o3}}\int_{t_3}^{t_4}\xi_{f2}(t)\,dt. \tag{8.57}$$
For the time range $(t_{2n} - t_{2n-1})$ and the accumulated value of the parameter change $\Delta a_{fn}$,
$$\Delta t_{on} = \frac{2}{\xi_{on} + \xi_{o(n+1)}}\int_{t_{2n-1}}^{t_{2n}}\xi_{fn}(t)\,dt. \tag{8.58}$$
The value of the service life of the tested part according to the parameter $a$, recomputed to the operational mode, is
$$t'_{2n+1} = t_0 + t_{01} + \Delta t_{o1} + t_{23} + \Delta t_{o2} + \ldots + \Delta t_{on} + t_{2n(2n+1)}. \tag{8.59}$$
The acceleration factor can be determined by
$$C_a = \frac{t'_{2n+1}}{t_{2n+1}}. \tag{8.60}$$
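The recomputation of Eqs. (8.54) to (8.60) can be sketched as follows. The sketch integrates the measured forced-stage rates with the trapezoidal rule (an assumed numerical method), converts each forced stage to an equivalent operational duration, and forms the acceleration factor. The function names, the data layout, and all numerical values are hypothetical.

```python
def trapezoid(ts, xs):
    """Trapezoidal integral of sampled rates xs over times ts."""
    return sum(0.5 * (xs[k] + xs[k + 1]) * (ts[k + 1] - ts[k])
               for k in range(len(ts) - 1))


def equivalent_operational_times(forced_stages, xi_o):
    """Eqs. (8.56)-(8.58): operational duration equivalent to each forced stage.

    forced_stages[i] = (times, rates) sampled during the i-th forced stage;
    xi_o[i] and xi_o[i + 1] are the operational rates before and after it.
    """
    result = []
    for i, (ts, xs) in enumerate(forced_stages):
        da_f = trapezoid(ts, xs)                            # Eq. (8.54)
        result.append(2.0 * da_f / (xi_o[i] + xi_o[i + 1]))
    return result


def acceleration_factor(t0, op_durations, forced_stages, xi_o):
    """Eq. (8.60): recomputed operational life divided by the actual test time."""
    dt_o = equivalent_operational_times(forced_stages, xi_o)
    forced_time = sum(ts[-1] - ts[0] for ts, _ in forced_stages)
    t_equiv = t0 + sum(op_durations) + sum(dt_o)            # Eq. (8.59)
    t_test = t0 + sum(op_durations) + forced_time
    return t_equiv / t_test


# Hypothetical test: three 100 h operational stages and two 20 h forced stages
forced = [([100.0, 120.0], [0.05, 0.05]), ([220.0, 240.0], [0.05, 0.05])]
xi_operational = [0.001, 0.0012, 0.0013]
dt_equiv = equivalent_operational_times(forced, xi_operational)
C_a_alt = acceleration_factor(0.0, [100.0, 100.0, 100.0], forced, xi_operational)
```

In this hypothetical case each 20 h forced stage replaces roughly 800 to 900 h of operation, so the 340 h test corresponds to about 2 000 h of operational life.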
Fig. 8.24 Load configuration for the step-by-step increase of loads during accelerated testing
The advantages of testing using the alternating approach are high acceleration factors, sufficient accuracy for practice in assessing the durability of parts, and the possibility of obtaining reliable data with limited specimens of parts under the test. A variation of the alternating approach is the test based on the step-by-step increase of loads on the stages of operational loading in the range of operating modes from Po1 to Po n (Fig. 8.24). It makes it possible to determine the service life of a part by a specific parameter depending on the modes of its operation.
8.5.10 Some Specifics of the Accelerated Tests
The complexity and responsibility of the tasks solved by modern machinery impose very high requirements on its reliability, all the more so because these tasks require longer operation times while preserving the initial set of reliability indicators. It should also be noted that, at high specified operation times, the probability of no-failure operation must be not less than 0.97–0.99. Under such high requirements, testbeds and experimental facilities shall provide the following permissible deviations of the simulated external factors: temperature ±3 °C; relative humidity ±3%; pressure ±5%; vibration amplitude ±15%; vibration frequency ±2 Hz at frequencies up to 50 Hz and ±5 Hz at frequencies above 50 Hz; acceleration (vibration, shock) ±20%. To determine the compliance of the machinery with the technical specifications, it is necessary to conduct long tests on large samples. Therefore, the specifics of preparing accelerated tests of machinery lie in the need to solve two problems simultaneously: reducing the duration of tests and reducing the number of tested samples. The duration of tests can be reduced rationally by using the laws of mathematical statistics and the general theory of experiment planning. The mathematical basis of the approach should rest on the regularities of the destruction processes of products during operation and testing.
To solve the second task, it is necessary to draw on preliminary data of a physical or statistical nature about the aging processes occurring in machinery, measuring systems, and their parts, and to use modern mathematical methods to make optimal use of the statistical data of multifactorial experiments. When analyzing the disadvantages of accelerated tests, it should be taken into account that the widespread application of computer technology eliminates the disadvantages related to a large volume of calculations. To increase test efficiency and reduce costs, the volume of calculations should be increased if this simplifies or shortens the tests themselves. In planning accelerated tests, the selection of influencing factors is of great importance: single-factor (temperature or humidity) or multifactor (temperature, biological factors, pressure or mechanical influences, etc.).
8.6 Verification of Reliability Based on Prior Information

The zero-failure tests and series life tests explained above can require a large sample size when high reliability is to be shown at a high confidence level. The sample size can be decreased by using the Bayesian approach if prior information about the life parameter to be verified is available. Fortunately, such information is sometimes accessible from accelerated tests completed at the development stages and/or from the failure data of previous-generation products. Prior information can be incorporated into the test plan by applying the Bayesian approach, which involves extensive statistical calculations. The Bayesian approach is as follows:

(1) Determine the prior PDF $\rho(\theta)$ of the life parameter $\theta$ to be shown. For example, $\theta$ is the reliability in binomial zero-failure testing and the MTTF in exponential series life testing. This step consists of accumulating prior failure data, choosing the prior distribution, and estimating the distribution parameters.

(2) Select a probability distribution to model the test result, say $x$, given the parameter $\theta$. The distribution is conditional on $\theta$ and is denoted here as $h(x|\theta)$. For example, if the test results are either success or failure, the distribution is binomial.

(3) Compute the conditional joint PDF of the $n$ independent test results given the parameter $\theta$. This is the likelihood in Sect. 8.4 and can be expressed as

$$L(x_1, x_2, \ldots, x_n|\theta) = \prod_{i=1}^{n} h(x_i|\theta).$$

(4) Compute the joint PDF of the $n$ independent test results and the parameter $\theta$. This is done by multiplying the conditional joint PDF and the prior PDF:
$$f(x_1, x_2, \ldots, x_n; \theta) = L(x_1, x_2, \ldots, x_n|\theta)\,\rho(\theta).$$

(5) Find the marginal PDF of the $n$ results, $k(x_1, x_2, \ldots, x_n)$, by integrating the joint PDF with respect to the parameter $\theta$ over its whole range. That is,

$$k(x_1, x_2, \ldots, x_n) = \int f(x_1, x_2, \ldots, x_n; \theta)\,d\theta.$$

(6) Apply Bayes' rule to find the posterior PDF of the parameter $\theta$. It is calculated by dividing the joint PDF by the marginal PDF of the $n$ results:

$$g(\theta|x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n; \theta)}{k(x_1, x_2, \ldots, x_n)} = \frac{L(x_1, x_2, \ldots, x_n|\theta)\,\rho(\theta)}{\int L(x_1, x_2, \ldots, x_n|\theta)\,\rho(\theta)\,d\theta}.$$
(7) Develop a test plan by applying the posterior PDF of the parameter $\theta$ and the specified type I and type II errors.

The process above is illustrated below by applying it to the design of a zero-failure test. As explained in Sect. 8.3, a binomial zero-failure test is intended to show $R_L$ at a 100C% confidence level. The sample reliability, say $R$, is a random variable, and its prior distribution is assumed known. It is well accepted that the prior information on $R$ can be modeled with a beta distribution given by

$$\rho(R) = \frac{R^{a-1}(1-R)^{b-1}}{\beta(a, b)}, \quad 0 \le R \le 1,$$

where $\beta(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$, $\Gamma(\cdot)$ is the gamma function, and $a$ and $b$ are unknown parameters to be estimated from the prior data. There are approaches for estimating $a$ and $b$. Because a zero-failure test produces a binary result (either success or failure), the test result is described by the binomial distribution with a given $R$. The PDF is

$$h(x|R) = (1-R)^{x} R^{1-x},$$

where $x = 0$ if a success occurs and $x = 1$ if a failure occurs. Following the Bayesian approach, the posterior PDF of $R$ for $n$ successes (i.e., no failures are permitted in testing) is

$$g(R|x_1 = 0, x_2 = 0, \ldots, x_n = 0) = \frac{R^{a+n-1}(1-R)^{b-1}}{\beta(a+n, b)}. \tag{8.61}$$
The posterior distribution is also a beta distribution, but with parameters $a + n$ and $b$. Planning a zero-failure test with no failures permitted means finding the sample size required to show $R_L$ at a 100C% confidence level. This is equivalent to choosing $n$ such that the probability of $R$ being not less than $R_L$ equals $C$. Then

$$\int_{R_L}^{1} g(R|x_1 = 0, x_2 = 0, \ldots, x_n = 0)\,dR = C$$

or

$$\int_{R_L}^{1} \frac{R^{a+n-1}(1-R)^{b-1}}{\beta(a+n, b)}\,dR = C. \tag{8.62}$$

Equation (8.62) is solved numerically for $n$. The resulting sample size is smaller than that given by Eq. (8.5).
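The numerical solution of Eq. (8.62) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the integral is approximated with a simple midpoint rule (adequate when $b \ge 1$, so that the integrand stays bounded near $R = 1$), and the classical sample size is computed with the standard binomial zero-failure formula $n = \ln(1 - C)/\ln R_L$, which is assumed here to correspond to Eq. (8.5).

```python
import math


def posterior_prob_at_least(r_l, a, b, n, steps=2000):
    """Approximate Pr(R >= r_l) for a Beta(a + n, b) posterior (midpoint rule).

    Assumes b >= 1 so that the density is bounded on [r_l, 1].
    """
    a_post = a + n
    # log of the beta function beta(a + n, b), via log-gamma for stability
    log_beta = math.lgamma(a_post) + math.lgamma(b) - math.lgamma(a_post + b)
    width = (1.0 - r_l) / steps
    total = 0.0
    for i in range(steps):
        r = r_l + (i + 0.5) * width  # midpoints never touch R = 1
        log_pdf = ((a_post - 1.0) * math.log(r)
                   + (b - 1.0) * math.log1p(-r) - log_beta)
        total += math.exp(log_pdf)
    return total * width


def bayesian_zero_failure_n(r_l, c, a, b, n_max=10000):
    """Smallest n such that Pr(R >= r_l | n successes, no failures) >= c."""
    for n in range(n_max + 1):
        if posterior_prob_at_least(r_l, a, b, n) >= c:
            return n
    raise ValueError("no sample size up to n_max meets the requirement")


def classical_zero_failure_n(r_l, c):
    """Zero-failure sample size without prior information (assumed Eq. 8.5)."""
    return math.ceil(math.log(1.0 - c) / math.log(r_l))


if __name__ == "__main__":
    # Hypothetical requirement: show R_L = 0.9 at 90% confidence.
    print(classical_zero_failure_n(0.9, 0.9))
    print(bayesian_zero_failure_n(0.9, 0.9, a=1.0, b=1.0))   # uniform prior
    print(bayesian_zero_failure_n(0.9, 0.9, a=19.0, b=1.0))  # informative prior
```

The prior parameters in the usage lines are invented for illustration. A linear search over $n$ is sufficient because the posterior probability $\Pr(R \ge R_L)$ increases monotonically with the number of successes.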
8.7 Verification of Reliability by Means of Degradation Tests

The preceding sections explained reliability verification by zero-failure testing and sequential life testing. These test approaches can require a long test time to reach a conclusion, particularly when high reliability is to be demonstrated. Sometimes the failure of a product is defined in terms of a performance characteristic exceeding a certain threshold. Degradation of the performance is closely related to reliability. Thus, it is possible to verify reliability by analyzing performance measurement data. For example, an acceptance sampling plan for keyboards can be based on degradation data. Two approaches are described here.

The first approach is to verify the one-sided lower-limit reliability at a certain confidence level. Assume that, as in zero-failure testing, a required reliability $R_L$ at time $t_L$ has to be shown at a 100C% confidence level. The process for calculating the lower-limit reliability explained below is based on destructive degradation analysis. Other methods, such as pseudo-life analysis and random-effect analysis, can be applied, but the resulting accuracy and the amount of computation should be considered. The process is as follows:

(1) Calculate the maximum likelihood estimates (MLEs) of $\beta$ and $\theta$ by maximizing the sample log likelihood given by

$$P(\beta, \theta) = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \ln\left[f_y(y_{ij}; t_j)\right], \tag{8.63}$$
where $n_j$ is the number of samples inspected at time $t_j$, $m$ is the number of destructive inspections, $y_{ij}$ is the measurement on sample $i$ at inspection time $t_j$, $\beta = [\beta_1, \beta_2, \ldots, \beta_p]$ is the vector of $p$ unknown parameters, and $\theta = [\theta_1, \theta_2, \ldots, \theta_k]$ is the vector of $k$ unknown parameters.

(2) Calculate the estimate of reliability at $t_L$, denoted $\hat{R}(t_L)$, by applying

$$F(t) = \Pr(T \le t) = \Pr[y(t) \le G] = F_y(G; t) \tag{8.64}$$

for $y \le G$, and similarly for $y \ge G$. Here $y$ is a critical performance characteristic and $G$ is a certain threshold.

(3) Compute the variance–covariance matrix for $\hat{\beta}$ and $\hat{\theta}$ by evaluating the inverse of the local Fisher information matrix.

(4) Calculate the variance of $\hat{R}(t_L)$, denoted $\widehat{Var}[\hat{R}(t_L)]$, by applying

$$Var(\hat{g}) \approx \sum_{i=1}^{k} \left(\frac{\partial g}{\partial \theta_i}\right)^{2} Var(\hat{\theta}_i) + \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \left(\frac{\partial g}{\partial \theta_i}\right) \left(\frac{\partial g}{\partial \theta_j}\right) Cov(\hat{\theta}_i, \hat{\theta}_j), \tag{8.65}$$

where the $\partial g/\partial \theta_i$ are evaluated at $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$.

(5) The one-sided lower 100C% confidence limit of the reliability, based on a normal approximation, is

$$\hat{R}(t_L) - z_C \sqrt{\widehat{Var}[\hat{R}(t_L)]}.$$

If the lower limit is greater than $R_L$, it can be concluded that the object meets the reliability requirement at a 100C% confidence level.

The calculation above assumes that the specific forms of $\mu_y(t; \beta)$ and $\sigma_y(t; \theta)$ are known. In practice, they are often unknown but can be determined from test data. First, the location and scale parameters are estimated at every inspection time. Then linear or nonlinear regression analysis of the estimates yields the specific functions. In many cases, the scale parameter is constant, which significantly simplifies the subsequent analysis.

The approach explained above is analytically demanding. An approximate yet simple approach is as follows. Assume that a sample of size $n$ is tested until $t_0$, where $t_0 < t_L$. If the tests were continued until $t_L$, the sample would yield $r$ failures. Then $\hat{p} = r/n$ estimates the probability of failure $p$ at $t_L$. The number of failures $r$ is unknown, but it can be computed with the pseudo-life approach. Specifically, a degradation model is fitted to every degradation path, and the degradation characteristic at $t_L$ is then predicted from the model. If the predicted characteristic reaches the threshold, the unit is considered to have failed. Generally, $r$ follows a binomial distribution. The reliability verification is then equivalent to testing the hypotheses
$$H_0: p \le 1 - R_L, \qquad H_1: p > 1 - R_L.$$

When the sample size is comparatively large and $p$ is not very close to 0 or 1, the distribution of the test statistic

$$Z_0 = \frac{r - n(1 - R_L)}{\sqrt{n R_L (1 - R_L)}} \tag{8.66}$$

can be approximated by the standard normal distribution. The decision rule is to accept $H_0$ at a 100C% confidence level if $Z_0 \le z_C$, where $z_C$ is the 100C-th percentile of the standard normal distribution.

Example 8.9 [5] In an electrical welding operation, an electrode is considered to have failed when the diameter of a weld spot is less than 4 mm. The diameter decreases with the number of spots welded by the electrode. To show that a newly designed electrode meets the lower 95% confidence limit reliability of 92.5% at 50 000 spots, 75 electrodes were selected, and each was tested until 35 000 spots were welded. Degradation analysis showed that five electrodes would fail if the tests were continued until 50 000 spots. Determine whether the electrode meets the R92.5/C95 requirement.

Solution From Eq. (8.66) we have

$$Z_0 = \frac{5 - 75 \times (1 - 0.925)}{\sqrt{75 \times 0.925 \times (1 - 0.925)}} = -0.274.$$

Because $Z_0 < z_{0.95} = 1.645$, it can be concluded that the electrode meets the specified reliability requirement.
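The calculation in Example 8.9 can be reproduced with a short Python sketch. The function name and return convention are illustrative assumptions; only the test statistic of Eq. (8.66) and the decision rule $Z_0 \le z_C$ are taken from the text. The normal percentile comes from the standard library class `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def degradation_reliability_test(r, n, r_l, c):
    """Normal-approximation test of H0: p <= 1 - R_L, Eq. (8.66).

    r   -- number of units predicted (by pseudo-life analysis) to fail by t_L
    n   -- sample size
    r_l -- required reliability R_L at t_L
    c   -- confidence level C (e.g. 0.95)
    Returns (Z0, z_C, accept_H0).
    """
    p0 = 1.0 - r_l
    z0 = (r - n * p0) / math.sqrt(n * r_l * p0)
    z_c = NormalDist().inv_cdf(c)  # 100C-th percentile of the standard normal
    return z0, z_c, z0 <= z_c


if __name__ == "__main__":
    # Data of Example 8.9: 5 predicted failures out of 75 electrodes, R92.5/C95.
    z0, z_c, accepted = degradation_reliability_test(r=5, n=75, r_l=0.925, c=0.95)
    print(f"Z0 = {z0:.3f}, z_C = {z_c:.3f}, requirement met: {accepted}")
```

As the text notes, the approximation is only appropriate when the sample is comparatively large and $p$ is not very close to 0 or 1; the sketch does not check this condition.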
8.8 Questions

1. What is the difference between design verification and process verification?
2. What are the targets of reliability and life testing?
3. Explain the contents of consideration process.
4. Explain the specifics of life testing and reliability.
5. What reliability tests do you know?
6. Explain the specifics of burn-in test.
7. Explain the specifics of life test.
8. Explain the specifics of event test.
9. Explain the specifics of accelerated test.
10. Explain the specifics of environmental test.
11. Explain the specifics of development test.
12. Explain the specifics of qualification test.
13. Explain the specifics of demonstration test.
14. Explain the specifics of quality confidence test.
15. List the standard stages of test planning.
16. List data types that should be presented in a test plan.
17. What are the differences between zero-failure and series life tests?
18. What are the approaches for selecting the number of test samples?
19. What are the approaches for selecting the test stress levels?
20. What are the approaches for selecting the test time?
21. Explain some specifics of the zero-failure testing.
22. How do you understand the sample size reduction by tail testing?
23. Explain some specifics of the series life testing.
24. What is the reliability verification using prior information?
25. How do you understand the verification of reliability through degradation testing?
References

1. Haasl D F. Advanced concepts in fault tree analysis [C]. System Safety Symposium, University of Washington Library, Seattle, 1965.
2. Campbell J. Reliability Handbook [M]. Burlington: Clifford/Elliot Publication, 1999.
3. Aleksandrovskaya L, Kruglov V, Kuznetsov A, et al. Theoretical Foundations of Testing and Experimental Processing of Complex Technical Systems [M]. Moscow: Logos, 2003.
4. US DOD. MIL-HDBK-781 Reliability test methods, plans, and environments for engineering development, qualification, and production [S]. Washington DC: US Department of Defense, 1996.
5. Guangbin Y. Life Cycle Reliability Engineering [M]. Hoboken: John Wiley & Sons, Inc., 2007.
6. Porter A. Accelerated Testing and Validation [M]. New York: Elsevier, 2004.
7. Reshetov D. Mechanical engineering, Encyclopedia, Machine Parts, Structural Strength [M]. Moscow: Mashinostroenie, 1995.
Chapter 9
Risk Analysis of Aircraft Structure and Systems
9.1 Introduction

This chapter presents the probabilistic risk analysis of aeroengine life-limited parts (ELLPs) and of the aircraft bleed air system. For the risk assessment of ELLPs, the airworthiness regulations for ELLPs are studied, the airworthiness requirements for key aeroengine parts are sorted out, and the elements of the engineering plan, manufacturing plan, and service management plan for engine life-limited parts are analyzed. A method and process for determining ELLPs based on FMEA are put forward and studied, taking the CFM56-5B high-bypass-ratio turbofan engine as an example. A probabilistic damage tolerance life prediction and risk assessment system based on residual strength, crack growth, and damage detection is established. Based on an analysis of traditional damage tolerance theory, the important parameters affecting residual strength, crack growth, and damage detection are determined, their randomness is studied, their probability density and distribution functions are determined, and the life prediction of engine components is carried out on this basis.

For the risk warning of the aircraft bleed air system (BAS), a complete data-driven BAS risk warning methodology based on the multivariate state estimation technique (MSET) is presented. The improved memory matrix construction method based on vector similarity can not only cover the normal working space but also reduce the estimation error of the MSET model. The similarity function between the observed value and the MSET estimated value describes the abnormality of the system or equipment. Dynamic adaptive risk warning thresholds are set for the similarity sequence and are continually updated based on changes in the data. A complete application example of BAS risk warning using QAR data is introduced.
© Science Press 2024 Y. Sun et al., Reliability Engineering, https://doi.org/10.1007/978-981-99-5978-5_9
9.2 Risk Assessment of Aeroengine Life-Limited Parts

Aeroengines are power machines that work with high-speed, high-temperature, strongly coupled combustion gases. Their structure and working conditions are extremely complex, and they must run in very adverse environments while meeting many special requirements for performance, operational applicability, environment, and other aspects. As the power plant of the aircraft, the engine also directly affects aircraft performance. With the development of aeroengines, their performance has improved, but safety has always been the focus of aeroengine manufacturers. Historically, for both military and civil aircraft, hazardous accidents caused by aeroengine safety problems are common, and these accidents are largely caused by structural failures of ELLPs, such as the Sioux City accident in 1989 and the Pensacola accident in 1996 [1, 2], illustrated in Fig. 9.1. The structural safety of ELLPs therefore plays an important role in the safety of aeroengines.

For these reasons, countries all over the world attach great importance to research on the airworthiness technology of ELLPs and issue timely advisory circulars on the airworthiness of ELLPs, working procedures, and other supporting technical data. For instance, the Federal Aviation Administration (FAA) issued “14 CFR Part 33: Airworthiness Standards: Aircraft Engines”, which requires the applicant to establish the integrity of each engine life-limited part by means of three plans. The European Aviation Safety Agency (EASA) issued the airworthiness standard CS-E, which formulates a life management plan for ELLPs in AMC E 515 [3]. The Civil Aviation Administration of China (CAAC) issued the “Aircraft Engine Airworthiness Regulations” (CCAR-33-R2), which put forward requirements for the design service life and validation of ELLPs, consumable parts, bearings, and attachments [4].

Fig. 9.1 Flight accidents caused by ELLPs failure: a the Sioux City accident; b the Pensacola accident
9.2.1 ELLPs Airworthiness Regulations and AC Analysis

1. Evolution of airworthiness regulations for ELLPs

According to FAR-33, ELLPs are the engine rotating and major static structural parts whose primary failure may lead to hazardous engine effects. Typical ELLPs include, but are not limited to, disks, spacers, hubs, shafts, high-pressure casings, and non-redundant mount components [5]. Primary failure means that the failure of a part is not the result of the prior failure of another part or system. A hazardous engine effect is any of the conditions listed in Advisory Circular (AC) 33.75 [6].

Because engine parts, especially critical rotor parts, may cause fatal crashes after fatigue fracture, relevant regulations are needed to improve the safety level of engines, for the sake of public safety and the healthy development of the entire commercial transportation industry. In 1971, the FAA issued Notice No. 71-12, which proposed adding the requirement “33.14 start-stop cyclic stress (low-cycle fatigue)” to Part 33 of the Federal Aviation Regulations to reduce non-containment incidents caused by disk failures. In 1974, the sixth amendment was passed, formally incorporating “33.14 start-stop cyclic stress (low-cycle fatigue)” into FAR-33. Section 33.14 was revised by Amendment 33-10 of FAR-33, issued in 1984. Amendment 33-10 extended the requirement, which had involved only rotor structural parts, to all parts whose failure may lead to a hazard to the aircraft. The term “start-stop stress cycle” was redefined. The amendment also provides that, in addition to the methods specified in the regulations, the applicant may determine the limits of use and the amount of extension by other methods acceptable to the authority.

In 2007, the FAA issued Amendment No. 33-22, which officially replaced “FAR 33.14 start-stop cyclic stress” with the newly added “FAR 33.70 engine life-limited parts”. Three plans were put forward for ELLPs: the engineering plan, the manufacturing plan, and the service management plan. By formulating and executing these three plans, closed-loop management of ELLPs can be achieved, as shown in Fig. 9.2. The three plans constitute a closed-loop system that links the assumptions made in the engineering plan with how the parts are manufactured and how they are repaired in service. The three plans must function as a complete system.

Fig. 9.2 ELLPs closed-loop management system

In 2007, Amendment 33-22 of FAR-33 repealed Section 33.14 and replaced it with a new Section 33.70, which requires the type certification applicant to establish the integrity of each life-limited part through the engineering plan, manufacturing plan, and service management plan. At the same time, AC 33.14-1, which corresponded to Section 33.14, was replaced by the ACs corresponding to Section 33.70, including AC 33.70-1 (Guidance Material for Aircraft Engine Life-Limited Parts Requirements) and AC 33.70-2 (Damage Tolerance of Hole Features in High-energy Turbine Engine Rotors). In addition, AC 33.70-3 is planned for later release, for the use of engineering and technical personnel. Table 9.1 summarizes the amendments relating to ELLPs in FAR-33.

Table 9.1 Amendments relating to ELLPs in FAR-33

| Amendment number | Amendment contents | Quantity of amendments |
|---|---|---|
| 6th amendment | Instructions; engine ratings and operating limitations; start-stop cyclic stress (low-cycle fatigue); fire prevention… | 37 |
| 10th amendment | Engine ratings and operating limitations; start-stop cyclic stress (low-cycle fatigue); materials; fire prevention; durability… | 26 |
| 22nd amendment | Turbocharger; engine life-limited parts | 3 |

EASA has also paid great attention to engine critical parts (engine life-limited parts) and put forward airworthiness requirements for critical parts in the first version of CS-E in 2003. Because the gap between EASA and the FAA in aviation industry technology is small, the two organizations have continually compromised on and mutually recognized their differences in the airworthiness certification of the relevant aircraft, and FAR-33 and CS-E have gradually converged through continuous revision. EASA has therefore remained basically consistent with the FAA on the airworthiness requirements for ELLPs, and CS-E and FAR-33 can be regarded as having equal influence. CS-E includes two parts: the airworthiness specifications and the acceptable means of compliance with them. CS-E is characterized by being more specific and more readily applicable, and it provides specific methods for meeting its requirements.

CS-E 515 also puts forward three plans for the airworthiness requirements of critical parts: the engineering plan, the manufacturing plan, and the service management plan, which constitute a closed-loop system. These plans may propose limitations, which are published in the airworthiness limitations section of the instructions for continued airworthiness. AMC E 515 provides a way to establish these plans. It also proposes a method for identifying engine critical parts, and the engineering plan specifies the life assessment methods and technology. The concept of approved life and the process of determining the approved life of rotating parts are put forward, including method and material data, the life determination method, development and verification tests, service life, and product assurance requirements. The main elements of the approved life determination method are set out, including working conditions, thermal analysis, stress analysis, life analysis, and damage tolerance evaluation. In addition, the method for determining the approved life of stator parts under pressure load is described, and the method for maintaining the approved life is proposed.

EASA and the FAA specify in detail the requirements and methods for the life determination and life extension, processing, manufacturing, and service management of critical parts. These requirements summarize more than half a century of experience in determining the life of engine critical parts and form the basis for establishing a system for the design, testing, processing, manufacturing, and service management of critical parts. Rolls-Royce, Pratt & Whitney, and General Electric have followed the airworthiness requirements and accumulated more than half a century of safe service experience [5, 7, 8].

China's technology level and manufacturing processes have not yet met the requirements of the airworthiness regulations, and the airworthiness regulations and technology of aircraft engines need to be studied comprehensively. With reference to FAR-33, the CAAC issued the “Aircraft Engine Airworthiness Regulations”, which put forward requirements for the design service life and validation of ELLPs, consumable parts, bearings, and attachments.
The Chinese general engine specification GJB 241-2010 was derived from the U.S. military standard. The life management of critical parts of military engines is relatively mature [9].

2. Analysis of the differences in ELLPs airworthiness regulations

The airworthiness regulations for ELLPs of the CAAC, FAA, and EASA are basically the same. First, the definition of ELLPs is uniform: ELLPs are rotor and major static structural parts whose primary failure is likely to result in a hazardous engine effect. Second, all of them meet the integrity requirements for ELLPs by means of an engineering plan, a manufacturing plan, and a service management plan. Third, the detailed guidelines of EASA and the FAA for the engineering plan, manufacturing plan, and service management plan are basically the same.

At the same time, the FAA, EASA, and CAAC provisions for ELLPs differ in some respects. First, the FAA and CAAC both use the term engine life-limited parts, while EASA refers to them as engine critical parts. Second, the airworthiness clauses for ELLPs of EASA and the FAA contain detailed guidelines that separately analyze the engineering, manufacturing, and service management plans, while the airworthiness clauses of the CAAC contain no detailed guidelines. Third, some terms in the detailed guidelines of EASA and the FAA for the three plans differ. For example, the specific content of each chapter is arranged differently, the elements of the engineering plan differ, and the typical process for determining the approved life of the engine rotor differs.

3. Determining process of ELLPs

According to the aeroengine airworthiness regulations, ELLPs are those engine rotating and major static structural parts whose primary failure may lead to a hazardous engine effect. Whether a part is a life-limited part is determined mainly on the basis of this definition and of meeting the integrity requirements. The determination of ELLPs must meet the following two conditions:

1) The ELLPs are those engine rotating and major static structural parts whose primary failure is likely to result in a hazardous engine effect. A hazardous engine effect is any of the conditions listed in Advisory Circular (AC) 33.75 [10]:

(1) Non-containment of high-energy debris.
(2) Concentration of toxic products in the engine bleed air intended for the cabin sufficient to incapacitate crew or passengers.
(3) Significant thrust in the opposite direction to that commanded by the pilot.
(4) Uncontrolled fire.
(5) Failure of the engine mount system leading to inadvertent engine separation.
(6) Release of the propeller by the engine, if applicable.
(7) Complete inability to shut the engine down.

2) The integrity of the structural parts is determined in strict accordance with the engineering plan, manufacturing plan, and service management plan. To meet the defined integrity requirements, the life management activities included in the engineering, manufacturing, and service management plans must be implemented.

(1) Engineering plan. A plan that contains the steps required to ensure that each engine life-limited part is withdrawn from service at an approved life before hazardous engine effects can occur.
These steps include validated analysis, test, or service experience which ensures that the combination of loads, material properties, environmental influences, and operating conditions, including the effects of other engine parts influencing these parameters, is sufficiently well known and predictable so that the operating limitations can be established and maintained for each engine life-limited part. Applicants must perform appropriate damage tolerance assessments to address the potential for failure from material, manufacturing, and service-induced anomalies within the approved life of the part. Applicants must publish a list of the life-limited engine parts and the approved life for each part in the Airworthiness Limitations Section of the Instructions for Continued Airworthiness, as required by Section 33.4 of this part.
(2) Manufacturing plan. A plan that identifies the specific manufacturing constraints necessary to consistently produce each engine life-limited part with the attributes required by the engineering plan.

(3) Service management plan. A plan that defines in-service processes for maintenance and the limitations to repair for each engine life-limited part that will maintain attributes consistent with those required by the engineering plan. These processes and limitations will become part of the Instructions for Continued Airworthiness.

The ELLPs determined for several aeroengines are listed in Table 9.2. AC 33.70-1 points out that if a part is made of various sub-parts that are finally integrated in an inseparable manner into a unique part, and any one of the sub-parts is identified as an ELLP, then the entire part is treated as an ELLP. The fan rotor, for example, can be divided into the inseparable parts fan disk, turbocharger rotor, fan shaft, and fan straightening blades. In addition, the catalogue and quantity of ELLPs are not fixed. According to actual use and experience, the types and quantity of ELLPs may need to be increased or decreased, and the ELLPs of different types of aeroengines also differ. For example, 26 ELLPs were initially identified in the military Spey engine, and the high-pressure turbine disk centering bushings were later added on the basis of service experience.

Table 9.2 List of ELLPs

| Engine type | Total number of ELLPs | List of ELLPs |
|---|---|---|
| CFM56 | 19 | Fan: Fan disk, Turbocharger rotor, Fan shaft. Compressor: Front shaft, 1–2-level rotors, 3-level disk, 4–9-level rotors, Rear sealing disk of compressor. Turbine: Front shaft, Turbine disk, Sealing disk, Rear shaft, Low-pressure turbine shaft, Low-pressure turbine minor shaft, Conical bearing |
| CF34-1A | 26 | Fan: Fan disk, Fan front shaft, Fan drive shaft. Compressor: 1-level disk, Front shaft, 2-level disk, Rear drum, Rear shaft CDP seal ring, Rear shaft, 9-level disk. Turbine: Balance piston seal ring, HP turbine shaft, 1-level disk, 2-level disk, External torque connector, Internal torque connector, LP turbine shaft, 3-level disk, 3–4-level sealing ring, 4-level disk, 5-level disk, 5–6-level sealing, 6-level disk, 4–5-level sealing rings, Conical bearing |
| V2500 | 24 | Fan: 1-level fan disk, Minor shaft. Compressor: 1–12-level disks, Front shaft, Rear sealing disk of compressor. Turbine: 1-level hub, HP 1-level turbine labyrinth seal, HP 2-level turbine labyrinth seal, HP 2-level turbine blade baffle, 3–5-level turbine labyrinth seal, 6-level turbine labyrinth seal, 7-level turbine labyrinth seal, Low-pressure turbine shaft |
9.2.2 Determination Method of ELLPs Based on FMEA

1. Process of determination of ELLPs based on FMEA

According to the airworthiness regulations for ELLPs, determining whether a structural part is an ELLP first requires determining whether its primary failure will cause a hazardous engine effect, which can be done with the FMEA method. After FMEA has identified the structural parts whose failure causes hazardous engine effects, the engineering plan, manufacturing plan, and service management plan are analyzed for these parts. Based on the work elements of these three plans, the parts identified by FMEA are analyzed to determine and maintain a safe life, and the list of ELLPs is finally determined. The flow chart for determining ELLPs based on FMEA is illustrated in Fig. 9.3.

An FMEA is a structured, inductive, bottom-up analysis that is used to evaluate the effects on the engine system of each possible element or component failure. When properly formatted, it aids in identifying latent failures and the possible causes of each failure mode. The general flow of the FMEA method is illustrated in Fig. 9.4.

In system design, the FMEA method can effectively analyze the fault connections between the system and its units: what the potential failure modes of a unit are, what impact these failure modes have on the function of each level above the unit, and whether the consequences of the failure are serious. Through this analysis, effective preventive and improvement measures are put forward to ensure that the unit and system achieve very high reliability and safety in the design process.
For example, the analysis asks whether the failure of an engine part will cause the failure of the component that contains it, or even of the engine; which type of component and engine failure results from the failure of different parts or from different failure modes of the same part; and whether the faults at each level (part, component, and engine) are serious enough to cause catastrophic consequences for the engine. After this systematic analysis, targeted prevention and improvement suggestions are put forward. The FMEA method is emphasized in the engine design stage rather than the service stage. It is a “bottom-up” study of the fault links between units and the system, in which failure modes are repeatedly analyzed, classified, and evaluated and improvement measures are proposed, so that as many faults as possible are eliminated in the design phase, hazardous engine consequences are avoided, and engine reliability and safety are improved. In the FMEA, the prevention measures, their implementation time, and the assignment of responsibility should be recorded. Once the improvement measures have been completed, the results obtained should be assessed and recorded in terms of severity, occurrence frequency, and difficulty of detection, and the risk priority number is then recalculated for the safety risk assessment of the life-limited part. The risk priority number should be significantly lower than before, which shows that the measures taken have reduced the risk of failure.
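The before-and-after comparison of the risk priority number can be illustrated with a short sketch. It assumes the common FMEA convention that the risk priority number (RPN) is the product of the severity, occurrence, and detection-difficulty ratings, each scored on a 1–10 scale where a higher score is worse. The scale, the function name `risk_priority_number`, and the example ratings are illustrative assumptions and are not taken from the CFM56-5B analysis.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Return RPN = severity x occurrence x detection (assumed 1-10 rating scales)."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} rating must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical ratings for one failure mode before and after improvement measures.
rpn_before = risk_priority_number(severity=8, occurrence=5, detection=6)
rpn_after = risk_priority_number(severity=8, occurrence=2, detection=3)

print("RPN before:", rpn_before, "RPN after:", rpn_after)
```

In this made-up case the severity rating is unchanged, and the RPN falls because the measures reduce the occurrence and make the failure easier to detect. Note that the 1–10 rating used here runs in the opposite direction to the severity categories of Table 9.3, where category 1 is the most severe.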
9.2 Risk Assessment of Aeroengine Life-Limited Parts
Fig. 9.3 The flow chart to determine ELLPs based on FMEA
The purpose of FMEA is to reduce the overall risk and the likelihood of each failure mode and to keep the risk within an acceptable range. In the FMEA process, when the severity of the failure mode of a structural part is category 1 or 2 (Table 9.3 lists the severity categories), the part will cause hazardous consequences for the engine and therefore satisfies the first condition for becoming an ELLP. An integrity analysis is then carried out; if the part meets the engineering plan, manufacturing plan, and service management plan, it is identified as a life-limited part.
Fig. 9.4 The flow chart of FMEA
Table 9.3 Severity categories

| Severe degree | Severity category | Mishap result criteria |
|---|---|---|
| Catastrophic | 1 | Loss of all functions of the engine system and serious damage to the system |
| Fatal | 2 | Loss of part of the main functions of the engine system and partial damage to the system |
| Medium | 3 | Has certain influence on the engine system; the system is slightly damaged |
| Mild | 4 | Has certain influence on the engine system; the system is slightly damaged |
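The screening rule stated before Table 9.3 (severity category 1 or 2, followed by the check against the three plans) can be written as a small predicate. This is a minimal sketch: the class `PartAssessment`, the function `is_ellp`, and the three example parts with their assessments are hypothetical and only illustrate the logic.

```python
from dataclasses import dataclass


@dataclass
class PartAssessment:
    name: str
    severity_category: int       # Table 9.3 category: 1 (catastrophic) to 4 (mild)
    engineering_plan_ok: bool    # meets the engineering plan elements
    manufacturing_plan_ok: bool  # meets the manufacturing plan elements
    service_plan_ok: bool        # meets the service management plan elements


def is_ellp(part: PartAssessment) -> bool:
    """A part is listed as an ELLP if its failure is hazardous (category 1 or 2)
    and it satisfies all three plans."""
    hazardous = part.severity_category in (1, 2)
    plans_ok = (part.engineering_plan_ok
                and part.manufacturing_plan_ok
                and part.service_plan_ok)
    return hazardous and plans_ok


# Hypothetical assessments, not results of the CFM56-5B analysis.
candidates = [
    PartAssessment("Part A", 1, True, True, True),
    PartAssessment("Part B", 3, True, True, True),
    PartAssessment("Part C", 2, True, False, True),
]
ellp_list = [p.name for p in candidates if is_ellp(p)]
print(ellp_list)
```

Part B is dropped because its failure is not hazardous, and Part C is dropped because it fails one of the three plan checks.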
2. Case analysis of ELLPs determination method based on FMEA
Taking the high-bypass-ratio turbofan engine CFM56-5B as an example, the determination method and process of ELLPs based on FMEA were studied. After the structural parts that may cause a hazardous engine effect are identified through the FMEA method, the engineering plans, manufacturing plans, and service management plans are analyzed for these parts to determine whether each part can establish and maintain a safe service life in accordance with the three plans and meet the integrity requirements. The structural parts that do not meet the requirements of the three plan elements are removed, and the ELLPs list of the CFM56-5B engine is finally determined. FMEA is a method for the reliability design and analysis of aeroengines: first, all possible failure modes at each level are identified; then the potential impact of each fault on the adjacent upper and lower levels is determined, together with the failure probability and the severity of the fault impact; finally, the weak links in the engine and parts design are found and effective prevention and improvement measures are put forward. This process needs to be repeated to eliminate all possible failures and avoid hazardous engine effects.
3. FMEA analysis of engine structural parts
The CFM56 engine system mainly consists of six parts: the turbofan engine, the engine fuel and control system, the engine control system, the engine indication system, the engine lubricating oil system, and the engine starting system. The initial indenture level specified in the FMEA is the engine, and the lowest indenture level is the engine parts. The functional block diagram and reliability block diagram of the engine system are illustrated in Figs. 9.5 and 9.6, respectively, and Fig. 9.7 is the functional hierarchy diagram determined according to the engine specification. Engine failures mainly include damage, fatigue, looseness, and maladjustment; the failure modes of ELLPs are shown in Table 9.4.
Fig. 9.5 Engine function block diagram
Fig. 9.6 ELLPs reliability block diagram
Table 9.4 Failure mode of ELLPs

| Failure type | Failure mode |
|---|---|
| Damage | Fracture, fragmentation, scratch, indentation, bonding, pitting, ablation, corrosion, bending deformation, twisting deformation, elongation deformation, compression deformation, creep |
| Fatigue | Degradation, peeling, wear, crack, stall, instability, vibration, abnormal sound, overheating |
| Looseness | Loose, falling off |
| Maladjustment | Clearance out of tolerance, speed out of limit, interference, jamming |
Failures are found by staff visually or with diagnostic instruments, using visual inspection, X-ray non-destructive testing, endoscopic inspection, and metal chip detection. Each failure mode has a different degree of impact on each level of the engine, and the severity category of the failure is determined according to the severity of the consequences, as shown in Table 9.3. The causes of failure at all levels of the engine system are carefully analyzed, and measures to mitigate or eliminate the failures are formulated for all stages of engine design, manufacturing, and service management. The corresponding content is filled in according to the standard FMEA form, and the FMEA is performed on the accessories of each part of the engine. After the FMEA, the CFM56-5B components and structural parts whose failure was determined to have catastrophic consequences for the engine are the fan rotor, compressor rotor, compressor stator, high-pressure turbine rotor, low-pressure turbine rotor, and turbine stator (Fig. 9.7). Based on the analysis results of FMEA, the analysis of engineering plan, manufacturing plan, and service management plan are carried out for the above structural
Fig. 9.7 Functional hierarchy of ELLPs
Table 9.5 The list of ELLPs parts

| Component serial number | Component name | Part serial number | List of ELLPs |
|---|---|---|---|
| 70-30-01 | Fan rotor | 70-30-01-01 | Fan disk |
| | | 70-30-01-02 | Turbocharger rotor |
| | | 70-30-01-03 | Fan shaft |
| 70-30-03 | Compressor rotor | 70-30-03-01 | Front shaft |
| | | 70-30-03-02 | 1–2-level rotors |
| | | 70-30-03-03 | 3-level disk |
| | | 70-30-03-04 | 4–9-level rotors |
| | | 70-30-03-05 | Rear sealing disk of compressor |
| 72-30-04 | Compressor stator | 70-30-04-01 | Front casing of compressor |
| 72-50-01 | High-pressure turbine rotor | 72-50-01-01 | Front shaft |
| | | 72-50-01-02 | Turbine disk |
| | | 72-50-01-03 | Sealing disk |
| | | 72-50-01-04 | Rear shaft |
| 72-50-03 | Low-pressure turbine rotor | 72-50-03-01 | 1–4-level turbine disks |
| | | 72-50-03-02 | Low-pressure turbine shaft |
| | | 72-50-03-03 | Low-pressure turbine minor shaft |
| | | 72-50-03-04 | Conical bearing |
| 72-50-04 | Turbine stator | 72-50-04-01 | Low-pressure turbine casing |
parts to judge whether they meet the integrity requirements stipulated in the three plan elements and can maintain a safe service life. The structural parts that do not meet the requirements of the three plan elements are eliminated, and the ELLPs list of the CFM56-5B engine is finally determined. The ELLPs list is given in Table 9.5.
9.2.3 Probabilistic Risk Assessment of ELLPs
Compared with electronic equipment, ELLPs have a higher cost of failure tests and fewer data samples. Therefore, the assessment here is mainly based on fracture mechanics and damage tolerance analysis of the key ELLPs. The actual crack propagation length of an ELLP is treated as the “stress” and the allowable crack propagation length as the “strength”, and an ELLP probabilistic risk assessment model based on damage tolerance is established. The damage tolerance characteristic of a structure has three main elements, which are also the main contents of the damage tolerance analysis:
(1) Residual strength (critical crack size). This covers two aspects of work: the maximum allowable damage to the structure under the required residual strength load, and the residual strength that the structure can withstand at a specified damage size.
(2) Crack propagation. Under the load spectrum and environmental spectrum of the structure, the crack grows from the detectable crack size (initial crack size) to the critical crack size; the corresponding life is obtained by crack propagation analysis. The main factors affecting the crack propagation life are the load spectrum, the crack propagation resistance of the material, and the structure type. The crack propagation resistance of the material is mainly determined by the fatigue crack growth rate parameters, which are a function of material type, material thickness, and environment, and the effect of the structure type on the crack is determined by the coefficient of the crack-tip stress intensity factor. The crack growth model connects these three factors to provide a basis for analysis.
(3) Damage inspection. Specified detection and maintenance means are used to detect and assess the damage, so that damage caused by fatigue, environmental deterioration, or accident is detected, prevented, or repaired in time and the airworthiness of the aircraft is maintained within the design service goal; this is the duty of the inspection program. The damage inspection should settle the inspection location, inspection method, and inspection interval, and these aspects should be considered and implemented in the structural maintenance program.
The above three elements of damage tolerance characteristics are equally important.
The three elements can act individually or in combination to bring the safety of the structure to a specified level, as shown in Fig. 9.8.
1. Damage tolerance design
Damage tolerance design is the foundation of the damage tolerance theory system. It ensures the reliability and safety of the aeroengine structure and compliance with airworthiness regulations, and it is the basic criterion for avoiding hazardous engine effects caused by ELLPs. The traditional damage tolerance design procedure is illustrated in Fig. 9.9. The counterpart of damage tolerance design is safe life design, and the difference between the two is clear. Safe life design focuses on the crack formation life and is validated by full-scale fatigue tests. Damage tolerance design recognizes that the structure has initial defects but must still meet the specified load-bearing capacity; it establishes an inspection interval for inspectable structure and sets crack growth and residual strength limits for uninspectable structure.
2. Damage tolerance analysis
Damage tolerance analysis mainly uses fracture mechanics methods for crack growth and residual strength analysis and test verification, to accurately and quantitatively
Fig. 9.8 Main research contents of probabilistic damage tolerance
Fig. 9.9 Risk assessment process of engine life-limited critical parts based on traditional damage tolerance
Fig. 9.10 Traditional damage tolerance design procedure
evaluate the residual strength and crack growth life of the structure, and to formulate a structural safety inspection interval that ensures the crack will not grow to an unacceptable degree of structural damage within the damage detection interval. The traditional damage tolerance analysis process is shown in Fig. 9.10.
(1) Residual strength
The residual strength indicates the static load-bearing capacity of a structure containing cracks, and it has a great influence on the safety of aeroengine components. To avoid the brittle failure of cracked structures, the allowable value of the residual strength of the structure at the critical crack size must be greater than or equal to its required value:
$$[\sigma]_{rs} \ge [\sigma]_{rep}. \tag{9.1}$$
(2) Crack growth life
The crack growth life is affected first by the properties of the material and varies from material to material. The main influencing factors are the structural geometry, the crack propagation resistance, and the load spectrum. The crack propagation life can be calculated from these main influencing parameters:
$$T = f(S, M, G), \tag{9.2}$$
in which $S$ denotes the load spectrum parameter, $M$ denotes the crack propagation resistance parameter, and $G$ denotes the geometric configuration parameter.
After the material of the engine parts is determined, the stress ratio $R = K_{\min}/K_{\max}$ and the stress intensity factor range $\Delta K$ are the main factors affecting the crack growth rate, which is expressed as
$$\frac{da}{dN} = f(\Delta K, R). \tag{9.3}$$
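Equation (9.3) leaves the function f unspecified. As an illustration only, the sketch below assumes the Paris form da/dN = C(ΔK)^m with ΔK = Δσ·β·√(πa), ignores the stress-ratio dependence, and integrates it numerically to obtain the number of cycles for a crack to grow from a0 to ac. The function names, the constant geometry factor `beta`, and all numerical inputs are hypothetical. The closed-form integral for constant β and m ≠ 2 is included as a cross-check.

```python
import math


def paris_cycles(a0, ac, delta_sigma, c, m, beta=1.0, steps=20000):
    """Cycles to grow a crack from a0 to ac (metres) by midpoint integration.

    Assumed growth law: da/dN = c * delta_K**m, with
    delta_K = delta_sigma * beta * sqrt(pi * a) in MPa*sqrt(m).
    """
    h = (ac - a0) / steps
    cycles = 0.0
    for i in range(steps):
        a = a0 + (i + 0.5) * h                      # midpoint of the sub-interval
        delta_k = delta_sigma * beta * math.sqrt(math.pi * a)
        cycles += h / (c * delta_k ** m)            # dN = da / (da/dN)
    return cycles


def paris_cycles_closed_form(a0, ac, delta_sigma, c, m, beta=1.0):
    """Analytical integral of the same law, valid for constant beta and m != 2."""
    k = c * (delta_sigma * beta * math.sqrt(math.pi)) ** m
    p = 1.0 - m / 2.0
    return (ac ** p - a0 ** p) / (k * p)


# Hypothetical inputs: 0.5 mm to 10 mm crack, 200 MPa stress range, C = 1e-11, m = 3.
n_numeric = paris_cycles(0.5e-3, 10e-3, 200.0, 1e-11, 3.0)
n_exact = paris_cycles_closed_form(0.5e-3, 10e-3, 200.0, 1e-11, 3.0)
print(round(n_numeric), round(n_exact))
```

The numerical route is the one that generalizes: a crack-length-dependent β(a) or a stress-ratio correction can be placed inside the loop without changing the structure of the calculation.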
3. Assessment of damage tolerance
Damage tolerance evaluation is an important means of checking the design quality and improving the design. The results of the damage tolerance analysis and tests of ELLPs are evaluated using experience and comprehensive methods. When traditional damage tolerance analysis is carried out with the deterministic (fixed value) method, the crack growth life obtained for engine parts has the meaning of a median value. Its reliability is ensured by dividing this deterministic crack growth life by a specified dispersion factor, without considering the uncertainty and dispersion of each influencing factor, so the traditional damage tolerance analysis method has certain limitations. The probabilistic damage tolerance analysis method, which combines the probability description of random variables with the traditional damage tolerance analysis method, can analyze the crack growth life, structural risk, and residual strength reliability at a given level of risk more accurately and reasonably. The flow chart of probabilistic damage tolerance analysis is shown in Fig. 9.11. An exact solution for the probability is generally difficult to obtain analytically. The Monte Carlo simulation sampling method, the response surface method, and the probability space integral transformation method can be used to solve for the probability of the random variables. Probabilistic damage tolerance has gradually become a branch of solid mechanics and a development direction of engineering technology, but it has not yet formed a complete theoretical system. Therefore, research on the theoretical system and key technologies of probabilistic damage tolerance has both academic value and practical engineering value.
Fig. 9.11 Probability damage tolerance analysis process
4. Probability density and distribution of important parameters based on residual strength
The static load-bearing capacity of the cracked parts of an aeroengine determines the residual strength. Reasonable determination of the residual strength load of parts can ensure the safety and reliability of the aeroengine in service. Many parameters affect the residual strength and crack growth of components, and they are random to some degree. To reduce the complexity of the residual strength analysis, parameters whose randomness has little impact on the residual strength and whose distribution is relatively concentrated can be approximated by deterministic values, while parameters whose randomness has a large impact on the residual strength and whose dispersion is large are treated as random variables. The probability density and distribution of these important parameters, such as fracture toughness, extreme stress, allowable residual strength, initial crack size, and critical crack size, are determined through probabilistic damage tolerance analysis. According to the residual strength theory, the fracture failure of a cracked component at any time can be expressed by the following equation:
$$K(t) > K_C, \tag{9.4}$$
in which $K_C$ denotes the fracture toughness of the material and $K(t)$ denotes the maximum value of the stress intensity factor, as shown in the following equation:
$$K(t) = \sigma_{\max}\,\beta(a(t))\,\sqrt{\pi a(t)}, \tag{9.5}$$
in which $a(t)$ denotes the crack size at time $t$ and $\sigma_{\max}$ denotes the maximum stress that the structure may bear during its life. Substituting Eq. (9.5) into Eq. (9.4) gives the condition for fracture failure of the cracked structure:
$$\sigma_{\max}\,\beta(a(t))\,\sqrt{\pi a(t)} > K_C. \tag{9.6}$$
5. Life prediction and probability risk assessment based on crack propagation
Crack growth analysis is one of the main contents of probabilistic damage tolerance analysis. Its main purpose is to find the law relating crack growth rate and crack size, determine the crack growth life, evaluate the safety risk of cracks, and ensure the reliability and safety of engine parts in the presence of cracks. In the probabilistic damage tolerance analysis, the randomness of crack growth must be analyzed according to the theory of probabilistic fracture mechanics to determine the probability density function of the crack growth life of engine components at a given crack size, or the failure probability of engine components at a given growth life:
(1) Probability density function of crack propagation
In the probabilistic damage tolerance analysis of engine parts, a probabilistic method is used to reflect the random process of crack growth. The crack growth rate of engine parts is given by the following equation:
$$\frac{da}{dt} = q(a)\,X(t), \tag{9.7}$$
in which $X(t)$ denotes the random process of crack growth. Combined with engineering practice, the above equation can be changed to
$$\frac{da(t)}{dt} = Z \cdot q(a), \tag{9.8}$$
in which $Z$ is a random variable. Experiments are carried out by simulating the actual working load spectrum and working environment. The random variable $Z$ generally follows a log-normal distribution, and unbiased estimates of the mean and variance of $\ln Z$ can be obtained:
$$\hat{\mu}_z = \frac{1}{n}\sum_{i=1}^{n}\ln Z_i, \qquad s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\ln Z_i - \hat{\mu}_z\right)^2. \tag{9.9}$$
The approximate probability density function of the crack growth variable $Z$ can be expressed as
$$f_Z(Z) = \frac{1}{\sqrt{2\pi}\,s_z}\exp\left[-\frac{1}{2}\left(\frac{\ln Z - \hat{\mu}_z}{s_z}\right)^2\right]\cdot\frac{1}{Z}. \tag{9.10}$$
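Equations (9.9) and (9.10) translate directly into a few lines of code. The sketch below is illustrative: the function names and the sample values of Z are assumptions, not test data from the text.

```python
import math


def estimate_log_params(z_samples):
    """Eq. (9.9): sample mean and unbiased sample variance of ln Z."""
    logs = [math.log(z) for z in z_samples]
    n = len(logs)
    if n < 2:
        raise ValueError("at least two samples are needed for the variance")
    mu_hat = sum(logs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in logs) / (n - 1)
    return mu_hat, var_hat


def lognormal_pdf(z, mu, s):
    """Eq. (9.10): log-normal density of Z with log-mean mu and log-std s."""
    exponent = -0.5 * ((math.log(z) - mu) / s) ** 2
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * s * z)


# Hypothetical measured values of the growth-rate variable Z.
z_samples = [1.8e-4, 2.1e-4, 2.6e-4, 1.5e-4, 2.3e-4]
mu_hat, var_hat = estimate_log_params(z_samples)
s_hat = math.sqrt(var_hat)
print(mu_hat, s_hat, lognormal_pdf(2.0e-4, mu_hat, s_hat))
```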
(2) Life prediction based on crack propagation
After the probability density function of the crack growth variable $Z$ is obtained, the probability density function of the crack growth life is derived by a change of variables:
$$f_{t(a)}(t) = f_Z(z)\left|\frac{dz}{dt}\right|_{z=z(t)} = f_Z(z)\cdot\frac{1}{t^2}\int_{a_0}^{a}\frac{dv}{q(v)}. \tag{9.11}$$
Integrating Eq. (9.8) gives
$$Z = \frac{1}{t}\int_{a_0}^{a}\frac{dv}{q(v)}. \tag{9.12}$$
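Equations (9.11) and (9.12) can be evaluated numerically once q(a) is chosen. The sketch below assumes q(a) = a purely for illustration, so that the integral of dv/q(v) equals ln(a/a0). It compares the change-of-variables form of Eq. (9.11) with the closed form obtained by substituting Eqs. (9.10) and (9.12) into it, and checks that the life density integrates to approximately one. The parameter values and function names are hypothetical.

```python
import math


def lognormal_pdf(z, mu, s):
    """Eq. (9.10)."""
    return math.exp(-0.5 * ((math.log(z) - mu) / s) ** 2) / (
        math.sqrt(2.0 * math.pi) * s * z)


def growth_integral(a0, a):
    """Integral of dv/q(v) from a0 to a for the assumed q(v) = v."""
    return math.log(a / a0)


def life_pdf(t, a0, a, mu, s):
    """Eq. (9.11): f_t(t) = f_Z(z) * |dz/dt| with z = I / t from Eq. (9.12)."""
    integral = growth_integral(a0, a)
    z = integral / t
    return lognormal_pdf(z, mu, s) * integral / t ** 2


def life_pdf_closed_form(t, a0, a, mu, s):
    """Closed form after substituting Eqs. (9.10) and (9.12) into Eq. (9.11)."""
    integral = growth_integral(a0, a)
    exponent = -((math.log(integral / t) - mu) ** 2) / (2.0 * s ** 2)
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * s * t)


# Hypothetical parameters: log-mean and log-std of Z, crack sizes in mm.
MU, S = math.log(2.0e-4), 0.4
A0, A = 0.5, 27.0


def total_probability(t_min=1000.0, t_max=200000.0, step=50.0):
    """Trapezoidal integral of the life density over [t_min, t_max]."""
    n = int((t_max - t_min) / step)
    total = 0.0
    for i in range(n):
        t0 = t_min + i * step
        t1 = t0 + step
        total += 0.5 * step * (life_pdf(t0, A0, A, MU, S) + life_pdf(t1, A0, A, MU, S))
    return total


print(life_pdf(2.0e4, A0, A, MU, S), life_pdf_closed_form(2.0e4, A0, A, MU, S))
print(total_probability())
```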
Substituting Eqs. (9.10) and (9.12) into Eq. (9.11) gives the probability density function of the crack propagation life of the engine parts at the specified crack size:
$$f_{t(a)}(t) = \frac{1}{\sqrt{2\pi}\,s_z}\exp\left\{-\frac{\left[\ln\left(\dfrac{1}{t}\displaystyle\int_{a_0}^{a}\dfrac{dv}{q(v)}\right)-\hat{\mu}_z\right]^2}{2s_z^2}\right\}\cdot\frac{1}{t}. \tag{9.13}$$
(3) Risk assessment of crack propagation based on Monte Carlo sampling
An important part of the probabilistic damage tolerance safety analysis is to determine whether a crack that has reached a certain life poses a risk of failure and whether the cracked component can still meet the residual strength requirements of aeroengine components. The Monte Carlo sampling method is used to determine the failure probability of aeroengine components and evaluate their failure risk to ensure the reliability and safety of the engine. Based on the theory of probability and statistics, the Monte Carlo method computes a numerical solution for random events through random sampling and thereby evaluates the failure risk of engine components.
6. Determine the important parameters and distribution of sampling analysis
The important parameters determined during probabilistic damage tolerance analysis are also the random variables to be sampled, and their distributions are given in Table 9.6. The analysis of the important parameters shows that they are independent random variables. When sampling, the four random number sequences $({}^{i}\sigma_{\max})$, $({}^{i}K_C)$, $({}^{i}Z)$, and $({}^{i}a_0)$ form a random sequence $\{{}^{i}a_0, {}^{i}Z, {}^{i}K_C, {}^{i}\sigma_{\max}\}$ $(i = 1, 2, \ldots, n)$.

Table 9.6 Important parameters and their distribution functions

| Important parameter | Parameter distribution characteristic |
|---|---|
| Stress extreme value $\sigma_{\max}$ | Obeys extreme value distribution function |
| Material fracture toughness $K_C$ | Obeys normal distribution function |
| Crack growth rate parameter $Z$ | Obeys log-normal distribution function |
| Initial crack size $a_0$ | Obeys log-normal distribution function |

7. Damage detection analysis based on common detection methods
The current airworthiness regulations clearly stipulate that aeroengine components must be designed and evaluated according to the damage tolerance principle, so that if serious damage occurs during the service life, the damage is detected in a timely manner and with a relatively high probability before the residual strength of the engine parts drops below the prescribed safe load. Choosing the damage detection method rationally and determining the inspection interval are therefore critical, and the following steps are followed:
(1) Damage detection method
Three types of visual inspection are commonly used: general, surveillance, and detailed visual inspection. If visual inspection fails to meet the requirements, a non-destructive inspection (NDI) method can be used. The common non-destructive inspection methods are given in Table 9.7.
(2) Crack detection probability distribution for damage detection
When aeroengine parts reach the inspection time, non-destructive testing is carried out. If the crack size is found to exceed the specified value, repair or replacement is required to improve the safety of engine parts. The probability of detection (POD) of a crack by visual inspection is given by the following equation:
$$\mathrm{POD} = 1 - \exp\left[-\left(\frac{a - a_0}{\lambda - a_0}\right)^{\alpha}\right], \tag{9.14}$$
in which $a$ denotes the length of the detectable crack, $a_0$ denotes the minimum length of detectable crack, $\lambda$ denotes the characteristic length, and $\alpha$ denotes the shape factor. The values of $a_0$, $\lambda$, and $\alpha$ depend on the detection method; the values corresponding to the three types of visual inspection are shown in Table 9.8. For high-frequency eddy current, ultrasonic, and magnetic particle non-destructive testing, the probability of detection of surface cracks in engine parts is given by the following equation:
$$\mathrm{POD} = 1 - \exp\left[-\left(\frac{a/a_{\mathrm{NDI}} - 0.5}{0.5}\right)^{0.44}\right]. \tag{9.15}$$

Table 9.7 Commonly used NDI methods and applications

| Detection method | Type of material | Crack form |
|---|---|---|
| X-ray | Metal, non-metal | Surface, subsurface, interior |
| Ultrasonic | Metal, some non-metallic | Surface, subsurface |
| High-frequency eddy current | Metal, magnetic or non-magnetic | Surface (steel, aluminum, titanium), near surface (0.125 mm) (aluminum, titanium) |
| Low-frequency eddy current | Metal, non-magnetic or low-conductivity material | Subsurface (to 9 mm) |
| Magnetic particle | Steel, magnetic stainless steel | Surface, near surface |
| Dye penetrant | Metal | Surface |

Table 9.8 Parameter value in probability distribution of crack detection

| Detection technology | $\alpha$ | $a_0$/mm | $\lambda$/mm |
|---|---|---|---|
| General visual inspection | 1.82 | 7.51 | 301 |
| Surveillance visual inspection | 1.82 | 5.03 | 76 |
| Detailed visual inspection | 1.82 | 3.76 | 51 |
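The Monte Carlo risk assessment of item (3), the sampled parameters of Table 9.6, the fracture criterion of Eq. (9.6), and the detection probabilities of Eqs. (9.14) and (9.15) can be combined in one short simulation. The sketch below is an illustration under stated assumptions, not the procedure used in the text: the distribution types follow Table 9.6, but every numerical distribution parameter, the constant geometry factor β = 1.12, the Gumbel form of the extreme value distribution, the growth function q(a) = a, and the inspection and service times are hypothetical. Cracks detected at the inspection are assumed to be repaired or replaced and are not counted as failures.

```python
import math
import random


def pod_visual(a_mm, a0_mm=3.76, lam_mm=51.0, alpha=1.82):
    """Eq. (9.14); defaults are the detailed visual inspection row of Table 9.8."""
    if a_mm <= a0_mm:
        return 0.0
    return 1.0 - math.exp(-(((a_mm - a0_mm) / (lam_mm - a0_mm)) ** alpha))


def pod_ndi(a_mm, a_ndi_mm):
    """Eq. (9.15); a_ndi_mm is the reference crack length a_NDI."""
    x = (a_mm / a_ndi_mm - 0.5) / 0.5
    return 0.0 if x <= 0.0 else 1.0 - math.exp(-(x ** 0.44))


def fractures(a_mm, sigma_max, k_c, beta=1.12):
    """Eq. (9.6) with an assumed constant geometry factor beta (crack in metres)."""
    return sigma_max * beta * math.sqrt(math.pi * a_mm / 1000.0) > k_c


def failure_probability(n, t_inspect, t_end, inspect=True, seed=1):
    """Estimate the failure probability at t_end with one inspection at t_inspect."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n):
        # Distribution types follow Table 9.6; all numerical values are assumed.
        a0 = rng.lognormvariate(math.log(0.5), 0.3)        # initial crack, mm
        z = rng.lognormvariate(math.log(2.0e-4), 0.4)      # growth variable, 1/cycle
        k_c = rng.gauss(60.0, 5.0)                         # toughness, MPa*sqrt(m)
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        sigma_max = 400.0 - 20.0 * math.log(-math.log(u))  # Gumbel extreme value, MPa
        u_detect = rng.random()                            # drawn even if not inspecting

        # Assumed q(a) = a in Eq. (9.8), so a(t) = a0 * exp(Z * t).
        a_inspect = a0 * math.exp(z * t_inspect)
        a_end = a0 * math.exp(z * t_end)

        if fractures(a_inspect, sigma_max, k_c):
            failures += 1                                  # failed before the inspection
        elif inspect and u_detect < pod_visual(a_inspect):
            continue                                       # crack found and removed
        elif fractures(a_end, sigma_max, k_c):
            failures += 1                                  # missed crack grows to failure
    return failures / n


p_without = failure_probability(20000, 15000.0, 20000.0, inspect=False)
p_with = failure_probability(20000, 15000.0, 20000.0, inspect=True)
print("failure probability without inspection:", p_without)
print("failure probability with inspection:   ", p_with)
```

Because both runs use the same seed and draw the same random numbers, the difference between the two estimates isolates the effect of the inspection. Replacing `pod_visual` with `pod_ndi` shows how a more sensitive method changes the risk.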
9.2.4 Case Analysis
The stress–strain field was calculated after the stress at the notch root was obtained at all rotational speeds. Taking the stress field as an example, calculating the stress field strength requires solving the integral
$$\sigma_{FI} = \frac{1}{V}\int_{\Omega} f(\sigma_{ij})\,\varphi(\vec{r})\,dv.$$
There are two methods to determine the stress $\sigma_{ij}$ at each location in the failure zone: (1) it is obtained by interpolation from several surrounding points; (2) the strain field function of the whole failure zone is fitted. Considering that the calculated results should be mesh-independent and have high precision, the second method is selected. A polynomial is used for fitting, and the fitting function is as follows:
$$W(x,y,z) = A_1 + A_2 z + A_3 z^2 + A_4 z^3 + y(A_5 + A_6 z + A_7 z^2) + y^2(A_8 + A_9 z) + A_{10} y^3 + x(A_{11} + A_{12} z + A_{13} z^2) + xy(A_{14} + A_{15} z) + A_{16} x y^2 + x^2(A_{17} + A_{18} z) + A_{19} x^2 y + A_{20} x^3.$$
The maximum inheritance method was used to fit the stress field at the fatigue danger point of the disk. According to the above stress results, node 417,172 is determined to be the fatigue danger point, as shown in Fig. 9.12. After fitting, the values of $A_1$–$A_{20}$ are shown in Table 9.9. After the stress field is obtained by fitting, MATLAB programming is used to calculate the stress field strength. The calculated first principal stress and stress field strength at the danger point at each rotating speed are shown in Table 9.10.
1. Life calculation based on stress field strength
According to the life analysis method, it can be seen from Table 9.10 that the stress ratio of each stage load is approximately equal to 0. After fitting, the S–N curve can be represented by the following equation:
Fig. 9.12 Fatigue danger point
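Table 9.9 Parameter values of the fitting function

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| $A_1$ | $1.16 \times 10^{4}$ | $A_{11}$ | $1.66 \times 10^{2}$ |
| $A_2$ | $-1.80 \times 10^{2}$ | $A_{12}$ | $-2.21$ |
| $A_3$ | $8.72 \times 10^{-1}$ | $A_{13}$ | $-7.23 \times 10^{-3}$ |
| $A_4$ | $-3.06 \times 10^{-3}$ | $A_{14}$ | $-9.48$ |
| $A_5$ | $-3.06 \times 10^{-3}$ | $A_{15}$ | $-5.44 \times 10^{-2}$ |
| $A_6$ | $-5.50$ | $A_{16}$ | $-9.70 \times 10^{-2}$ |
| $A_7$ | $4.59 \times 10^{-2}$ | $A_{17}$ | $5.65 \times 10^{-2}$ |
| $A_8$ | $1.39 \times 10^{2}$ | $A_{18}$ | $-3.11$ |
| $A_9$ | $-8.42 \times 10^{-1}$ | $A_{19}$ | $-8.69 \times 10^{-1}$ |
| $A_{10}$ | $-4.10 \times 10^{-1}$ | $A_{20}$ | $1.77 \times 10^{-2}$ |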
Table 9.10 Stress field strength of fatigue danger points Stage
Rotate speed (r/min)
1
500
2
Dangerous point (MPa)
Stress intensity (MPa)
1.11
1.01
9550
805.32
791.64
9850
833.64
819.03
11,000
927.92
927.33
500 8550
5.36
5.08
946.41
944.55
$$\lg N = 15.7976 - 4.736\,\lg(S_{\max} - 703.84). \tag{9.16}$$
The maximum stress field strength of each stage load is substituted into the above equation to obtain the life under each stage load, as shown in Table 9.11.
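The substitution into Eq. (9.16) can be checked with a few lines of code. The sketch below uses the maximum stress field strengths listed for the four stages in Table 9.11; the function name `cycles_to_failure` is an illustrative choice. Because the tabulated stresses are rounded to two decimals, the recomputed lives agree with the tabulated values only to within a small rounding difference.

```python
import math


def cycles_to_failure(s_max_mpa):
    """Eq. (9.16): lg N = 15.7976 - 4.736 * lg(S_max - 703.84)."""
    if s_max_mpa <= 703.84:
        raise ValueError("Eq. (9.16) is only defined for S_max above 703.84 MPa")
    return 10.0 ** (15.7976 - 4.736 * math.log10(s_max_mpa - 703.84))


# Maximum stress field strength of each stage (MPa), taken from Table 9.11.
stage_stress = {1: 791.64, 2: 819.03, 3: 927.33, 4: 935.13}
lives = {stage: cycles_to_failure(s) for stage, s in stage_stress.items()}

for stage, n_f in lives.items():
    print(f"stage {stage}: N_f is about {n_f:,.0f} cycles")
```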
Table 9.11 Stress field strength of fatigue danger points

| Stage | $\sigma_{FI\,\max}$/MPa | $\sigma_{FI\,\min}$/MPa | $N_f$ | $n$ |
|---|---|---|---|---|
| 1 | 791.64 | 1.01 | 3,919,313 | 55,300 |
| 2 | 819.03 | 1.01 | 1,083,277 | 35,000 |
| 3 | 927.33 | 1.01 | 46,937 | 6338 |
| 4 | 935.13 | 50.8 | 39,898 | 32,658 |
In order to calculate the number of cycles of the fourth stage fatigue load, according to the Palmgren–Miner theory, the total damage at structural failure is
$$D = \frac{n_1}{N_{f1}} + \frac{n_2}{N_{f2}} + \frac{n_3}{N_{f3}} + \frac{n_4}{N_{f4}} = 1. \tag{9.17}$$
The number of cycles that can theoretically be sustained under the fourth stage load is
$$n_4 = \left[1 - \left(\frac{n_1}{N_{f1}} + \frac{n_2}{N_{f2}} + \frac{n_3}{N_{f3}}\right)\right] N_{f4} = 32658. \tag{9.18}$$
2. Life analysis of nominal stress method
(1) Determination of stress concentration factor
A finite element model is established after removing the pin hole on the rotor disk. A force of 29 725 N is applied evenly at the center of the location of the original rotor disk pin hole, as shown in Fig. 9.13. After calculation, the stress at the position of the maximum stress point is 348.30 MPa, and this stress is selected as the nominal stress for determining the stress concentration factor.
Fig. 9.13 Load application for nominal stress calculation
(2) S–N curve correction
The S–N curve can be represented by the following equation:
$$S_{\max} = 7954\,N^{-0.343} + 333.9. \tag{9.19}$$
(3) Life calculation
By substituting the maximum nominal stress of each stage into the modified S–N model, the life under each stage load can be obtained, as shown in Table 9.12. In order to calculate the number of fourth-stage fatigue load cycles, according to the Palmgren–Miner theory, the total damage at structural failure is
$$D = \frac{n_1}{N_{f1}} + \frac{n_2}{N_{f2}} + \frac{n_3}{N_{f3}} + \frac{n_4}{N_{f4}} = 1. \tag{9.20}$$
Therefore, the number of theoretically acceptable cycles under the fourth stage load can be calculated as follows:
$$n_4 = \left[1 - \left(\frac{n_1}{N_{f1}} + \frac{n_2}{N_{f2}} + \frac{n_3}{N_{f3}}\right)\right] N_{f4} = 939964. \tag{9.21}$$
3. Comparative analysis of life results
In order to investigate the accuracy of the fatigue life prediction, the following method is adopted. A certain aeroengine low-pressure three-stage disk has run 55 300 cycles, 35 000 cycles, and 6 338 cycles successively under three loads. The fatigue life of the disk under the fourth load is calculated using the stress field strength method and the nominal stress method, respectively, and the calculation accuracy of the two methods is compared with the test results. It is known from the test that the life of the disk under the fourth fatigue load is about 15 000 cycles. The stress field strength method gives 32 658 cycles, with a dispersion coefficient of 2.18. The traditional nominal stress method gives 939 964 cycles, with a dispersion coefficient of 62.66. The life evaluation result for the three-stage low-pressure disk obtained with the stress field strength method is clearly far better than that of the traditional nominal stress method: its predicted life is 2.18 times the test life, which falls within the dispersion band of three times the actual life and is within the acceptable engineering error range (Table 9.13).

Table 9.12 Stress field strength of fatigue danger points

| Stage | Stress at dangerous point $\sigma_{\max}$/MPa | Stress at dangerous point $\sigma_{\min}$/MPa | Nominal stress $S_{\max}$/MPa | Nominal stress $S_{\min}$/MPa | $N_f$ | $n$ |
|---|---|---|---|---|---|---|
| 1 | 791.64 | 1.01 | 342.70 | 0.44 | 41,531,305 | 55,300 |
| 2 | 819.03 | 1.01 | 354.56 | 0.44 | 34,496,137 | 35,000 |
| 3 | 927.33 | 1.01 | 401.44 | 0.44 | 1,091,371 | 6338 |
| 4 | 935.13 | 50.8 | 404.82 | 2.20 | 946,547 | 939,964 |
Table 9.13 Comparison of life evaluation results and test results

| Test item | Test results | Stress field strength method | Nominal stress method |
|---|---|---|---|
| Lifetime/cycle index | 15 000 | 32 658 | 939 964 |
| Coefficient of dispersion | – | 2.18 | 62.66 |
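The Palmgren–Miner calculation of Eq. (9.18) and the dispersion coefficients of Table 9.13 can be reproduced with a short script. The sketch below uses the applied cycles and the stress field strength lives from Table 9.11, the test life of 15 000 cycles, and the reported nominal stress result of 939 964 cycles; the function name `remaining_cycles` is an illustrative choice.

```python
def remaining_cycles(applied, lives, final_life):
    """Eq. (9.18): n4 = [1 - sum(n_i / N_fi)] * N_f4 (linear damage accumulation)."""
    damage = sum(n / n_f for n, n_f in zip(applied, lives))
    if damage >= 1.0:
        return 0.0  # the earlier stages already consumed the whole life
    return (1.0 - damage) * final_life


applied_cycles = [55300, 35000, 6338]     # n1..n3
field_lives = [3919313, 1083277, 46937]   # N_f1..N_f3 from Table 9.11
n4_field = remaining_cycles(applied_cycles, field_lives, 39898)

test_life = 15000                          # test result, cycles
dispersion_field = n4_field / test_life
dispersion_nominal = 939964 / test_life    # reported nominal stress result

print(round(n4_field), round(dispersion_field, 2), round(dispersion_nominal, 2))
```

The dispersion coefficient here is simply the ratio of predicted life to test life, so a value closer to 1 indicates a better prediction.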
9.3 Risk Warning of Aircraft Bleed Air System
The civil aircraft engine Bleed Air System (BAS) is a type of air-source system that provides compressed air with pressure and temperature regulation for the air-source user systems, including the engine starting system, environmental control system (air conditioning system), thermal anti-icing system, hydraulic oil tank (pressurization), and water tank (pressurization), and it plays a vital role in the normal flight of the aircraft. A ground failure of the BAS often leads to flight delays or even cancellations; an in-flight failure may affect the pressurization system and the air conditioning temperature control system, which may degrade the equipment’s performance or threaten passengers’ safety. Early risk warning is an effective strategy to solve the above problems: it can help the operator find abnormalities of the BAS early, reduce unscheduled maintenance, and improve the safety of the aircraft. With the expanding application of sensors on aircraft and the popularization of mobile terminals, aircraft have accumulated a large amount of sensor monitoring data during operation. According to the type of recording or transmission, these data can be classified into FDR (Flight Data Recorder), QAR (Quick Access Recorder), ACARS (Aircraft Communications Addressing and Reporting System) data, etc., but in fact they come from the same sensors. These data cover a variety of flight information, including environment, load, status, and performance data during the operation of the aircraft, which can not only be used for flight quality monitoring and evaluation but also provide abundant data sources for aircraft risk warning and fault diagnosis [11]. Cohen et al. [12] studied a large amount of QAR and FDR data collected on the aircraft, evaluated the flight quality of the aircraft, identified relevant hazard sources, and proposed APRAM, a risk assessment model for aircraft performance. Ding et al.
[13] used artificial intelligence methods such as machine learning to predict the exhaust temperature of aeroengines. Najjar et al. [14] evaluated the severity of scaling in the heat exchanger of the air-conditioning system based on the aircraft flight recorder data. Chen [15] proposed an intelligent diagnostic approach for aircraft complex systems, which can synchronize design knowledge and real-time monitoring messages to make full use of available information. Yang [16] used a multi-stage classification model and particle filtering-based method to predict the remaining service life of the auxiliary power unit (APU). Sun et al. [17] built a model for aircraft air conditioning system health state estimation and fault prediction based on ACARS data. Although research scholars have carried out some studies using flight record data, there are few studies about the BAS. Shang [18] developed a heat exchanger fouling detection method based on the valve control command and a fault detection method for temperature sensors and valve actuators of the engine bleed
air temperature regulation system. Abdelrahman [19] used backpropagation neural network algorithms to model the failure of the most important valve of the B-737 BAS in desert conditions. Peltier [20] conducted an experimental investigation of the performance of different BAS designs. However, these studies on the BAS have focused on design improvements and troubleshooting of specific components, without using flight recorder data for risk warning.

Among the existing research on failure risk warning for systems or equipment, the methods can be classified as state estimation, parameter estimation, evolutionary algorithms, and neural networks, which can be further divided into model-based and data-based methods [21]. The multivariate state estimation technique (MSET) is a state estimation method and one of the most widely used data-based methods for condition monitoring. MSET does not require fault samples and can make full use of the healthy historical flight data generated during aircraft operation. Compared with traditional neural networks, it has the advantages of a simple modeling process, clear physical meaning, and real-time computability. Owing to its low computational overhead and high accuracy, MSET has been widely used for anomaly detection in nuclear power plant equipment, key satellite components, and various types of industrial equipment, as well as for electronic product life prediction [22–26]. However, the construction of an effective memory matrix for the MSET model is still an unsolved problem. In the MSET applications described above, the memory matrix is composed of historical observation vectors chosen without a specific sampling method or selected by extreme value sampling. Such selections make the memory matrix cover only a part of the operating states and do not guarantee the accuracy of the MSET model.
One approach to this issue is to select healthy historical data similar to the observation vectors to form the memory matrix. In this study, a similarity-based dynamic memory matrix is proposed to improve the prediction accuracy of the MSET model. The main contribution of this study is a complete data-driven BAS risk warning methodology based on MSET. The improved memory matrix construction method based on vector similarity not only covers the normal working space but also reduces the estimation error of the MSET model. The similarity function between the observed value and the MSET estimated value describes the abnormality of the system or equipment. Dynamic adaptive risk warning thresholds are set for the similarity sequence and are continually updated as the data change. A complete risk warning application example of the BAS using QAR data is presented.
9.3.1 Risk Warning Method

In this section, the algorithm principle and overall framework of the risk warning method are introduced. The MSET modeling principles, memory matrix construction process, similarity function, warning threshold, and warning judgment criteria are introduced in turn. The framework of the risk warning methodology is illustrated in Fig. 9.14.
Fig. 9.14 Framework of the risk warning methodology
The framework is divided into two parts: the offline modeling process and the online warning process. The first step of the offline modeling process is to collect normal operating data according to the working condition information. The training data is then selected from the normal operating data. The selected data is multivariate and interconnected, and it should cover the entire range of normal working conditions. Finally, the memory matrix is obtained and a normal-state MSET model based on the improved memory matrix construction method is established. In the online warning process, the MSET model is used to obtain the estimated value of the new operating data, and the similarity index is calculated from the deviation between the new operating data and the estimated data. The dynamic threshold is then calculated from the similarity index. If the similarity index is lower than the limit, an alarm is triggered to alert the operator to take appropriate measures.

1. Modeling principle of MSET

MSET was originally a non-linear, non-parametric modeling method developed by Argonne National Laboratory and was later developed by SmartSignal into similarity-based modeling (SBM) technology. By selecting important monitoring parameters of the system or equipment and using historical operating data under normal conditions, the relationship between the parameters can be mined. For a new observation state, whether the current operating status of the system or equipment is healthy can be judged by calculating the "similarity" between the actual operating data and the healthy historical data. Appropriate samples are selected from the healthy historical data obtained in the normal operating state to construct the historical memory matrix, which is denoted as $D \in \mathbb{R}^{n \times m}$:
$$
D = [X(t_1), X(t_2), \ldots, X(t_m)] =
\begin{bmatrix}
x_1(t_1) & x_1(t_2) & \cdots & x_1(t_m) \\
x_2(t_1) & x_2(t_2) & \cdots & x_2(t_m) \\
\vdots & \vdots & \ddots & \vdots \\
x_n(t_1) & x_n(t_2) & \cdots & x_n(t_m)
\end{bmatrix}_{n \times m},
\tag{9.22}
$$
where $n$ is the number of monitoring parameters and $m$ is the number of historical observation vectors. The $n$ monitoring parameters selected should reflect the working state of the BAS and be convenient for modeling. The $m$ observation vectors contained in the historical memory matrix $D$ are not randomly selected; an appropriate number of states should be chosen so that the entire normal workspace is covered. $X(t_j) = [x_1(t_j), x_2(t_j), \ldots, x_n(t_j)]^T$ is the observed value of the $n$ parameters at time $t_j$. Each column represents an observation vector at a specific time, and each row represents the monitoring status of a certain parameter at $m$ moments. Suppose that the observation vector of the system or equipment at a certain moment is $X_{obs}$ and the estimated value is $X_{est}$, which is obtained from [17]

$$
X_{est} = D \cdot \left(D^T \otimes D\right)^{-1} \cdot \left(D^T \otimes X_{obs}\right),
\tag{9.23}
$$
where $\otimes$ is a non-linear operator introduced to overcome the possible non-invertibility of $D^T \cdot D$. There are many choices of non-linear operators, among which the Euclidean distance operator is found to provide the best combination of estimation accuracy and insensitivity to measurement error. Therefore, the Euclidean distance is selected as the non-linear matrix operator

$$
\otimes(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.
\tag{9.24}
$$
In Eq. (9.24), $x_i$ represents the i-th point of state $X$, and $y_i$ represents the i-th point of state $Y$. The non-linear operator reflects the similarity between the observation vector and the memory matrix through the spatial distance. To ensure that $D^T \otimes D$ in Eq. (9.23) is invertible, no two column vectors of $D$ may be identical, i.e., $X(t_j) \neq X(t_k)$ for $j \neq k$.

2. Operation of non-linear operators

In ordinary matrix multiplication, each element of the resulting matrix is the inner product of two vectors; the non-linear operator replaces this inner product. When the operands are two matrices or a matrix/vector pair, the operation is as follows. Let $A$ be an $m \times p$ matrix and $B$ be a $p \times n$ matrix, and let $C = A \otimes B$; then the element in the i-th row and j-th column of the matrix $C$ can be expressed as
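$$
(A \otimes B)_{ij} = \sqrt{\sum_{k=1}^{p} (a_{ik} - b_{kj})^2}
= \sqrt{(a_{i1} - b_{1j})^2 + (a_{i2} - b_{2j})^2 + \cdots + (a_{ip} - b_{pj})^2}.
\tag{9.25}
$$

For example, let

$$
D = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \end{bmatrix}_{2 \times 3},
$$

then

$$
D^T \otimes D =
\begin{bmatrix}
\sqrt{(a_{11} - a_{11})^2 + (a_{12} - a_{12})^2} & \sqrt{(a_{11} - a_{21})^2 + (a_{12} - a_{22})^2} & \sqrt{(a_{11} - a_{31})^2 + (a_{12} - a_{32})^2} \\
\sqrt{(a_{21} - a_{11})^2 + (a_{22} - a_{12})^2} & \sqrt{(a_{21} - a_{21})^2 + (a_{22} - a_{22})^2} & \sqrt{(a_{21} - a_{31})^2 + (a_{22} - a_{32})^2} \\
\sqrt{(a_{31} - a_{11})^2 + (a_{32} - a_{12})^2} & \sqrt{(a_{31} - a_{21})^2 + (a_{32} - a_{22})^2} & \sqrt{(a_{31} - a_{31})^2 + (a_{32} - a_{32})^2}
\end{bmatrix}_{3 \times 3}.
\tag{9.26}
$$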
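The operator of Eq. (9.25) and the estimate of Eq. (9.23) can be illustrated with a minimal sketch. It uses only the Python standard library; the function names (`euclid_op`, `solve`, `mset_estimate`) and the toy values in `D` are assumptions made for illustration and do not come from the text. The linear system is solved by plain Gaussian elimination instead of forming the inverse explicitly.

```python
import math

def euclid_op(A, B):
    """Non-linear operator of Eq. (9.25): C[i][j] is the Euclidean distance
    between row i of A (m x p) and column j of B (p x n)."""
    p, n = len(B), len(B[0])
    return [[math.sqrt(sum((row[k] - B[k][j]) ** 2 for k in range(p)))
             for j in range(n)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def solve(G, b):
    """Solve G y = b by Gaussian elimination with partial pivoting."""
    n = len(G)
    aug = [list(G[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[piv][c]) < 1e-12:
            raise ValueError("singular matrix: check D for duplicate columns")
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * y[k] for k in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y

def mset_estimate(D, x_obs):
    """Eq. (9.23): X_est = D (D^T (x) D)^-1 (D^T (x) X_obs).
    D is n x m (columns are memory vectors); x_obs has length n."""
    Dt = transpose(D)
    G = euclid_op(Dt, D)                                   # m x m distance matrix
    g = [row[0] for row in euclid_op(Dt, [[v] for v in x_obs])]
    w = solve(G, g)                                        # weight vector, length m
    return [sum(D[i][j] * w[j] for j in range(len(w))) for i in range(len(D))]

# Toy memory matrix: n = 2 parameters, m = 3 stored states (made-up values).
D = [[1.0, 2.0, 4.0],
     [1.0, 3.0, 2.0]]
G = euclid_op(transpose(D), D)
x_est = mset_estimate(D, [2.0, 3.0])   # observation equal to the 2nd column of D
```

When the observation coincides with a stored column of D, the weight vector reduces to the corresponding unit vector, so the estimate reproduces that column; this is a convenient sanity check for any implementation.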
3. Memory matrix construction process

The construction of the historical memory matrix is a key step of MSET modeling. Ideally, all the historical data from normal operation would be selected so that every normal state of the BAS is covered. In practice, however, the amount of healthy historical data accumulated over a year of flights is very large, and a large historical memory matrix slows the calculation so that the real-time requirements of early warning cannot be met. Therefore, the size of the historical memory matrix should be minimized while ensuring the prediction accuracy of the MSET model. The traditional memory matrix is usually constructed by extreme value sampling, which covers the maximum and minimum values of the normal operating state but does not guarantee the prediction accuracy of the MSET model and may produce poor estimates.

According to the modeling principle of MSET, the predicted value of the input observation vector depends on similar historical data: the greater the similarity, the closer the predicted value of the observation vector is to the actual value. Therefore, this study selects healthy historical data similar to the observation vector to form the memory matrix. For each input observation vector, several similar observation vectors are selected from the healthy historical data set, with the similarity measured by the Euclidean distance of Eq. (9.24). The specific construction steps of the dynamic historical memory matrix are shown in Fig. 9.15. The proposed method constructs the memory matrix $D \in \mathbb{R}^{n \times m}$ ($n$ is the number of monitoring parameters, $m$ is the number of historical observation vectors) by selecting from the data pool the historical records that are similar to the input observation vector.
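The selection step can be sketched as a nearest-neighbour search over the healthy data pool. This is only an illustration under stated assumptions: the function name `build_memory_matrix`, the removal of duplicate records, and the toy data are hypothetical, and the detailed procedure of Fig. 9.15 may differ.

```python
import math

def build_memory_matrix(history, x_obs, m):
    """Pick the m healthy historical vectors closest to x_obs (Euclidean
    distance, Eq. 9.24) and return them as the columns of an n x m matrix."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, x_obs)))
    # Assumption: drop exact duplicates so that no two columns of D are identical.
    unique = list(dict.fromkeys(tuple(v) for v in history))
    nearest = sorted(unique, key=dist)[:m]
    return [list(row) for row in zip(*nearest)]   # transpose: vectors become columns

# Made-up pool of healthy records with n = 2 parameters each.
history = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0), (2.0, 3.0), (9.0, 9.0)]
D = build_memory_matrix(history, x_obs=(2.1, 2.9), m=2)
```

Sorting the whole pool is the simplest choice for a sketch; for a large data pool a partial selection such as `heapq.nsmallest` would avoid the full sort.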
Compared with the traditional method, the memory matrix constructed by this method is smaller in scale and is dynamically updated for each input observation vector, so the modeling efficiency is higher and the interference of irrelevant historical records is reduced. The number of observation vectors in the historical memory matrix depends on the amount of healthy historical data and is determined by repeated experiments.

Fig. 9.15 Flow chart of historical memory matrix construction process

4. Similarity and risk warning threshold

In the non-parametric model of the normal operating state established by MSET, the difference between the actual observed value and the estimated value reflects the implicit fault information of the system or equipment. However, two major problems must be solved before a warning can be issued from the captured risk information:

1) How can the similarity between the observation vector and the estimated vector be measured quantitatively? If the degree of difference between the two vectors is measured reasonably, the risk information of the system or equipment can be fully extracted.

2) How can the risk warning threshold be established on a sound basis? A threshold that is too large or too small reduces the accuracy of the warning model and leads to missed alarms and false alarms. A well-founded threshold distinguishes normal fluctuation from abnormal fluctuation and thus enables accurate risk warning.

(1) Similarity function

The residuals between the observed and estimated values of MSET can reflect system or equipment anomalies. If the residual between the actual observation value and the estimated value of a single variable is used directly as the risk judgment indicator, it is easy
to miss the abnormal information of other variables. Therefore, using the similarity between the observed vector and the estimated vector for risk warning can fully exploit the anomalous information of each variable in the observed vector. There are many ways to measure the similarity between two vectors, such as cosine similarity and Euclidean distance. As a comprehensive risk index of the system or equipment, the similarity function $S(X_{obs}, X_{est})$ between the observed vector and the estimated vector is defined as follows:

$$
S(X_{obs}, X_{est}) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} \left(X_i^{obs}(t_k) - X_i^{est}(t_k)\right)^2}}.
\tag{9.27}
$$
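A minimal sketch of this similarity index follows, assuming Eq. (9.27) has the form 1/(1 + Euclidean distance). The function name `similarity` and the sample values are illustrative assumptions; the optional `weights` argument anticipates the per-variable weights that the text introduces next and defaults to equal weights of 1.

```python
import math

def similarity(x_obs, x_est, weights=None):
    """Similarity between observed and estimated vectors:
    1 / (1 + sqrt(sum_i w_i * (obs_i - est_i)^2))."""
    if weights is None:
        weights = [1.0] * len(x_obs)          # unweighted form
    d2 = sum(w * (o - e) ** 2 for w, o, e in zip(weights, x_obs, x_est))
    return 1.0 / (1.0 + math.sqrt(d2))

s_same = similarity([40.0, 200.0], [40.0, 200.0])   # identical vectors
s_off = similarity([40.0, 200.0], [37.0, 204.0])    # Euclidean distance 5
```

Because the index depends on absolute residuals, variables with very different scales would normally be normalized before the similarity is computed; that preprocessing is not shown here.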
Since each variable carries different risk information and has a different sensitivity to risk, the weight of each variable in the risk warning should be determined. The similarity function of the observed vector and the estimated vector in the above equation is modified to give the following equation:

$$
S'(X_{obs}, X_{est}) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} \omega_i \left(X_i^{obs}(t_k) - X_i^{est}(t_k)\right)^2}}.
\tag{9.28}
$$
In Eq. (9.28), $\omega_i$ represents the weight of the i-th variable's contribution to the risk warning, which can be quantified by the analytic hierarchy process (AHP). $S'(X_{obs}, X_{est}) \in (0, 1]$ and has the properties of symmetry, non-negativity, and uniqueness. When $S'(X_{obs}, X_{est}) = 1$, $X_{obs} = X_{est}$, i.e., $X_{obs}$ and $X_{est}$ are completely similar; when $S'(X_{obs}, X_{est})$ tends to 0, $X_{obs}$ and $X_{est}$ are completely dissimilar.

(2) Dynamic threshold

In current warning research based on MSET, the warning threshold is usually taken as the maximum or minimum value of the similarity function. In practical applications, however, the data of the system or equipment varies greatly, and large deviations appear even in the normal state. In addition, data noise and uncertain factors can cause misjudgment. A fixed threshold cannot follow these changes in the data, which reduces the accuracy of the warning model. A dynamic threshold changes automatically according to the actual behavior of the observed parameters. Following the idea of interval estimation in statistics, the mean $\mu$ and standard deviation $\sigma$ of the similarity function $S'(X_{obs}, X_{est})$ are used to track changes in the observation vector, so that the warning threshold adapts to the normal variation of the system or equipment and false alarms are reduced. During a certain period, the similarity sequence based on MSET is denoted as

$$
S'(X_{obs}, X_{est}) = [S_1, S_2, \ldots, S_N, \ldots],
\tag{9.29}
$$
where $S_i$ is the similarity between the observed vector and the estimated vector at time $i$; $S_i$ is a random variable. Under normal circumstances, its expectation is $E(S_i) = \mu < \infty$ and its variance is $D(S_i) = \sigma^2$. According to Chebyshev's inequality, for any $\varepsilon > 0$,

$$
P(|S_i - \mu| \ge \varepsilon) \le \sigma^2 / \varepsilon^2.
\tag{9.30}
$$
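Let $\varepsilon = n'\sigma$, where $n'$ is the bandwidth coefficient. Substituting into Eq. (9.30) bounds the probability that a normal sample falls outside the interval by the false alarm rate $\alpha = 1/(n')^2$. At time $N$, the normal interval $C$ of $S_i$ is then

$$
C: \left[\mu_N - n'\sigma_N,\ \mu_N + n'\sigma_N\right],
\tag{9.31}
$$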
where $\mu_N$ is the mean value of the similarity at time $N$ and $\sigma_N$ is the standard deviation of the similarity at time $N$. The expressions of $\mu_N$ and $\sigma_N$ are

$$
\mu_N = \frac{1}{N}\sum_{i=1}^{N} S_i, \qquad
\sigma_N = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (S_i - \mu_N)^2}.
\tag{9.32}
$$
It can be seen from Eq. (9.32) that calculating the mean and standard deviation of the similarity sequence at time $N$ requires all the $S_i$ up to that moment. As time goes by, the amount of calculation gradually increases, as does the memory space required by the computer. To reduce memory usage and improve operating efficiency, Eq. (9.32) is rewritten in iterative form:

$$
\mu_N = \frac{1}{N} S_N + \frac{N-1}{N}\,\mu_{N-1}, \qquad
\sigma_N = \sqrt{\frac{N-2}{N-1}\,(\sigma_{N-1})^2 + \frac{1}{N}\,(S_N - \mu_{N-1})^2}.
\tag{9.33}
$$
According to Eq. (9.33), the mean similarity at time $N$ requires only the mean similarity at time $N-1$ and the similarity value at time $N$. Likewise, the standard deviation at time $N$ requires only the standard deviation at time $N-1$, the similarity value at time $N$, and the mean similarity at time $N-1$. The iterative form greatly reduces the amount of calculation and increases the calculation speed.

5. Warning judgment criterion

It can be seen from Eq. (9.28) that the maximum similarity value is 1, whereas the upper warning threshold $\mu_N + n'\sigma_N$ may exceed 1. Therefore, only the lower warning threshold is used. In actual risk warning monitoring, when two consecutive samples of
the similarity sequence are lower than the warning limit $\mu_N - n'\sigma_N$, the operating state of the BAS is judged to be beyond normal conditions, and a risk warning is issued. The mean $\mu$ and standard deviation $\sigma$ at the alarm time do not participate in the update of the alarm threshold at the next time; that is, the alarm threshold at the next time is equal to the alarm threshold at the previous time.
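The iterative update of Eq. (9.33) and the two-consecutive-sample rule can be combined in a short sketch. The class name `DynamicThreshold`, the two-sample warm-up before any check, and the choice to exclude every below-limit sample from the update (not only the sample at the alarm time) are assumptions of this illustration; the similarity values are made up.

```python
import math

class DynamicThreshold:
    """Running mean and variance of the similarity sequence (Eq. 9.33)
    with the lower warning limit mu_N - n' * sigma_N."""

    def __init__(self, bandwidth):
        self.k = bandwidth   # bandwidth coefficient n'
        self.n = 0           # number of samples absorbed into the statistics
        self.mu = 0.0        # running mean
        self.var = 0.0       # running variance (sigma_N squared)
        self.below = 0       # consecutive samples under the lower limit

    def lower_limit(self):
        return self.mu - self.k * math.sqrt(self.var)

    def step(self, s):
        """Feed one similarity value; return True when a warning is raised."""
        # Assumption: at least two samples are needed before checking.
        if self.n >= 2 and s < self.lower_limit():
            self.below += 1
            # Assumption: below-limit samples do not update mu and sigma,
            # so the threshold stays at its previous value.
            return self.below >= 2
        self.below = 0
        self.n += 1
        if self.n == 1:
            self.mu = s
        else:
            prev_mu = self.mu
            self.mu = s / self.n + (self.n - 1) / self.n * prev_mu
            self.var = ((self.n - 2) / (self.n - 1)) * self.var \
                + (s - prev_mu) ** 2 / self.n
        return False

monitor = DynamicThreshold(bandwidth=3)
sequence = [0.90, 0.92, 0.91, 0.89, 0.90, 0.50, 0.50]
flags = [monitor.step(s) for s in sequence]
```

In this run the five normal samples give a mean of 0.904 and a sample variance of 0.00013, the same values that Eq. (9.32) yields for the full batch, and the warning is raised only at the second of the two low samples.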
9.3.2 Application of Risk Warning on the BAS

In this section, the application case is introduced in the following parts: the BAS description, the selection of MSET modeling variables, the determination of the modeling variable weights, the data preprocessing, and the results and discussion.

1. BAS description

The object studied in this application is the BAS of the Airbus A320 series engine. The control components of the BAS include the High-Pressure Valve (HPV), Intermediate-Pressure Bleed Check Valve (IPCV), Pressure Regulating Valve (PRV), Fan Air Valve (FAV), Over Pressure Valve (OPV), Temperature Limitation Thermostat (TLT), Temperature Control Thermostat (TCT), Precooler Exchanger (PCE), and Bleed Management Computer (BMC). The principal configuration of the BAS is shown in Fig. 9.16.

Fig. 9.16 Schematic drawing of the BAS. Both No. 1 and No. 2 are pressure sensors, and No. 3 is a temperature sensor. No. 1 sensor monitors the pressure upstream of the PRV (pressure downstream of the HPV) and determines the HPV operating status. No. 2 sensor monitors the pressure downstream of the PRV, which is the bleed air pressure (BAP). No. 3 monitors the temperature downstream of the PCE, which is the bleed air temperature (BAT)
The high-temperature, high-pressure air from the engine compressor is reduced to a constant pressure of 40 PSI by the pressure-regulating components of the system and cooled to 200 °C by the temperature-regulating components. It is then sent to the air intake manifold to supply the downstream air conditioning system, cabin pressurization system, thermal anti-icing system, and other users. The BAS is controlled by two BMCs, which monitor the position indications from the valve microswitches (HPV, PRV, FAV, and OPV), the outlet temperature of the precooler, and the electrical signals from the transducers TLT and TCT to convert pressure and regulate pressure. The two BMCs are cross-linked, and if one fails, the other performs most of its work. The monitoring data of the BAS is recorded by the collection component and stored in the recorder, and decoding software converts the binary raw data in the recorder into readable engineering values with units. The monitoring parameters mainly include the bleed air temperature, the bleed air pressure, and the position signals of the various control valves. Since the BAS must meet the needs of downstream users such as the air conditioning and pressurization systems, it is affected by the working conditions of other systems and by the external environment. Therefore, other sensor parameters that describe the performance of the engine and the aircraft, such as Mach number, altitude, and atmospheric conditions, need to be included in further analysis. Simple signal statistics and expert experience are used to select the initial set of monitoring parameters of the BAS, as shown in Table 9.14. These monitoring parameters contain information about faults and can reflect the operating status of the BAS. The monitoring data acquisition frequency can be adjusted between 1 and 8 Hz, and the sampling time is the duration of the entire flight cycle of the aircraft.
During the operation of the BAS, pre-set threshold alarm mechanisms can shut down the BAS of a single engine when it fails, to prevent the failure from spreading. However, this kind of mechanism is ex-post protection and often cannot reveal potential hidden dangers, so it cannot provide a risk warning of BAS failure.

2. MSET modeling variables selection

The modeling variables are selected with practical engineering application in mind: they should reflect the dynamic changes of the BAS and accurately describe the system behavior. The selection of feature vectors should follow the principles of availability, sensitivity to failure risks, and minimization. Table 9.14 shows the monitoring parameters of the BAS. Each parameter is related to the BAS, but not all of these parameters are suitable for modeling. The following procedure can be used to construct the fault monitoring architecture. First, select all the monitoring parameters in Table 9.14 according to the availability principle; second, remove the following parameters based on the principle of sensitivity to failure risks: ENG_BLEED_PB, WING_A_I_SYSON, FAV_FC, FAV_FO, PRV_ENG, HPV_ENG, and PACK_PB; finally, reduce the remaining monitoring parameters according to the minimization principle. For the redundant variables in the monitoring parameters, the Pearson correlation coefficient method is used to eliminate the variables with low linear correlation.
Table 9.14 Initial subset of QAR parameters for the BAS

| Parameter | Description | Unit | Type |
|---|---|---|---|
| FLIGHT_PHASE | The stage of the aircraft flight, such as takeoff, taxi out, etc. | / | Discrete |
| ALT | The flying altitude of the aircraft | FEET | Continuous |
| N1 | Engine low-pressure rotor speed | % RPM | Continuous |
| N2 | Engine high-pressure rotor speed | % RPM | Continuous |
| SAT | Static air temperature | °C | Continuous |
| TAT | Total air temperature | °C | Continuous |
| MACH | Flight Mach number | / | Continuous |
| ENG_BLEED_PB | BAS press-button | / | Binary |
| FAV_FC | FAV not fully closed | / | Binary |
| FAV_FO | FAV not fully open | / | Binary |
| PRV_ENG | PRV not fully closed | / | Binary |
| HPV_ENG | HPV not fully closed | / | Binary |
| BAT | Bleed air temperature | °C | Continuous |
| BAP | Bleed air pressure | PSI | Continuous |
| PACK_PB | Pack press-button | / | Binary |
| WING_A_I_SYSON | Wing anti-ice system on | / | Binary |
The Pearson correlation coefficient calculation equation is as follows:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\tag{9.34}
$$
where $\bar{x}$ and $\bar{y}$ are, respectively, the sample means of variables $x$ and $y$, and $n$ is the number of samples. $r \in [-1, 1]$; the larger $|r|$, the higher the correlation. A value of $|r|$ from 0 to 0.3 indicates weak correlation, 0.3 to 0.7 moderate correlation, and 0.7 to 1.0 strong correlation. One month of historical BAS data for a certain aircraft is selected, and the correlation between the remaining parameters is calculated according to Eq. (9.34). The resulting correlation matrix is shown in Table 9.15. Table 9.15 shows a strong correlation between SAT, TAT, and MACH, which arises because SAT is calculated from TAT and MACH. Taking BAT and BAP as the key variables, the parameters that are strongly correlated ($|r| > 0.7$) with BAT and BAP are selected as the modeling variables of the warning model. The modeling variables are therefore BAT, BAP, N1, N2, and MACH; because they are strongly correlated, they may also reveal sensor failures.
Table 9.15 Correlation coefficients for monitoring parameters of the BAS

| Item | BAT | BAP | TAT | SAT | ALT | N1 | N2 | MACH |
|---|---|---|---|---|---|---|---|---|
| BAT | 1 | 0.778 | -0.468 | -0.578 | 0.586 | 0.749 | 0.811 | 0.716 |
| BAP | 0.778 | 1 | -0.536 | -0.659 | 0.671 | 0.920 | 0.842 | 0.814 |
| TAT | -0.468 | -0.536 | 1 | 0.966 | -0.945 | -0.657 | -0.511 | -0.812 |
| SAT | -0.578 | -0.659 | 0.966 | 1 | -0.997 | -0.771 | -0.621 | -0.928 |
| ALT | 0.586 | 0.671 | -0.945 | -0.997 | 1 | 0.778 | 0.628 | 0.946 |
| N1 | 0.749 | 0.920 | -0.657 | -0.771 | 0.778 | 1 | 0.890 | 0.874 |
| N2 | 0.811 | 0.842 | -0.511 | -0.621 | 0.628 | 0.890 | 1 | 0.761 |
| MACH | 0.716 | 0.814 | -0.812 | -0.928 | 0.946 | 0.874 | 0.761 | 1 |
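The screening step can be reproduced from the BAT and BAP rows of Table 9.15 with a short sketch. The function name `pearson` and the dictionary layout are illustrative assumptions, and the rule is read here as requiring |r| > 0.7 with both key variables; the small test series passed to `pearson` is made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of Eq. (9.34)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Correlations of each candidate with the key variables, copied from Table 9.15.
r_bat = {"TAT": -0.468, "SAT": -0.578, "ALT": 0.586,
         "N1": 0.749, "N2": 0.811, "MACH": 0.716}
r_bap = {"TAT": -0.536, "SAT": -0.659, "ALT": 0.671,
         "N1": 0.920, "N2": 0.842, "MACH": 0.814}

# Keep candidates that are strongly correlated with both BAT and BAP.
selected = [p for p in r_bat if abs(r_bat[p]) > 0.7 and abs(r_bap[p]) > 0.7]

# Sanity check of the formula on a perfectly anti-correlated toy series.
r_demo = pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
```

With the table values, the candidates that pass are N1, N2, and MACH, which together with BAT and BAP give the five modeling variables named in the text.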
3. Modeling variables weights determination

The Analytic Hierarchy Process (AHP) combines qualitative and quantitative decision-making; it can quantify expert experience to a certain extent and is suitable for calculating the weights of complex objects. Here, AHP is used to determine the weights of the modeling variables. Each variable carries a different share of the BAS failure risk information, and because the sensors measuring the variables work in different environments, the measurement reliability of each variable also differs. The influencing factors are therefore taken to be the amount of failure risk information and the measurement reliability, and AHP is used to construct a three-layer hierarchical structure, as shown in Fig. 9.17. Based on the engineering experience of airline technicians and on system principles, the evaluation indicators of each layer are compared pairwise to construct the comparison matrix of each layer and determine the weights with respect to the target. Finally, the weights of BAT, BAP, N1, N2, and MACH are calculated as 0.34, 0.34, 0.11, 0.11, and 0.10, respectively.

Fig. 9.17 Hierarchical structure of five variables importance for risk warning

4. The calculation process of AHP

The specific steps of AHP are as follows:
(1) Establish the hierarchy: define the target layer (first layer), the criteria layer (second layer), and the alternatives layer (third layer), and identify the influential factors for each layer.
(2) Construct the comparison matrix: starting from the second layer of the hierarchy, compare the priority of each pair of factors in the same layer to obtain the quantitative comparison matrix.
(3) Individual hierarchical ranking: calculate the eigenvector corresponding to the maximum eigenvalue of each comparison matrix, and then normalize it to obtain the weight vector of each factor.
(4) Consistency check: if the comparison matrix passes the consistency check, the resulting vector is the weight vector of the factors at the lower layer with respect to the upper layer.
(5) General hierarchical ranking: calculate the combined weight vector of the alternatives layer with respect to the target.

A hierarchical structure is shown in Fig. 9.18. In consultation with airline technicians, relative priorities are assigned to the different criteria using the AHP 1–9 scale, and the priorities are then compared between the elements of each layer. The second layer of evaluation criteria includes risk information and measurement reliability. Determining the weights of the second layer relative to the first layer obtains the comparison matrix A as [ A=