English Pages 171 [164] Year 2023
Translational Systems Sciences 37
Takafumi Nakamura
System of Human Activity Systems
A Novel Way to Visualize Invisible Risks
Translational Systems Sciences Volume 37
Editors-in-Chief Kyoichi Kijima, School of Business Management Bandung Institute of Technology, Tokyo, Japan Hiroshi Deguchi, Faculty of Commerce and Economics Chiba University of Commerce, Tokyo, Japan
Takafumi Nakamura Department of Business Management Daito Bunka University Itabashi-ku, Tokyo, Japan
ISSN 2197-8832 ISSN 2197-8840 (electronic) Translational Systems Sciences ISBN 978-981-99-5133-8 ISBN 978-981-99-5134-5 (eBook) https://doi.org/10.1007/978-981-99-5134-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The demand for effective ways to manage risk and achieve safety in technical systems is growing daily. Risk management tends to be viewed as part of the process of planning technical systems, if not simply part of daily operations. However, risk must also be considered for unpredictable and emergent events that require a response different from planned operational procedures. This book proposes a meta-methodology for viewing system failures holistically so that they are not repeated. The methodology has three characteristics that facilitate double-loop learning, which changes the mental model underlying the activity: (1) it provides a common language for understanding system failures; (2) it complements the shortcomings of existing methodologies by including the perspectives of various stakeholders to achieve technical safety; and (3) it addresses dynamic aspects to avoid the side effects caused by ad hoc interim measures. To realize these characteristics, (1) the classification of system failures (Van Gigch [78]) was used as the common language; (2) Jackson’s [38] meta-methodology (SOSM: system of systems methodologies) was used as the basis for a methodology that is inclusive of diverse stakeholders’ perspectives; and (3) a new meta-methodology was proposed to address the dynamic aspects of the system. The three failure classes are Class 1 (deviation from standard), Class 2 (interface failure), and Class 3 (evolutionary failure). Based on the above considerations, the new meta-methodology proposed in this book is named SOSF (system of system failures).
Then, by reviewing the current methodologies from the perspective of SOSF, their shortcomings are clarified, and two new methodologies are proposed. By applying them to the ICT domain, the author shows that the proposed meta-methodology and the two methodologies overcome the shortcomings of the current methodologies. The two methodologies are based on extracting the failure cases of the target system over a certain period (e.g., a year or a month), together with the problems and countermeasures revealed by those cases. To anticipate future risk trends and improve the appropriateness of countermeasures, the SOSF is further expressed as a three-dimensional space, and individual failures are represented as points in that space. In other words, by introducing metrics into the SOSF space, failures of the same system can be traced chronologically and quantitatively. By relating the metrics to the system risk trend, it is possible to determine, by observing the system failures over time, whether the system has shifted toward higher risk. The goal is thus to control the risk trend ex ante. The two methodologies and the effort to predict risk trends are also unique and important contributions in that they can shift the handling of ICT system failures from reactive to proactive measures by facilitating double-loop learning from previous system failures. Sorting out the mechanisms that tend to cause failures and developing workarounds is an invaluable contribution to society as a whole. Furthermore, by generalizing the application of SOSF to the entire human activity system (HAS), the author reconsiders SOSF from the perspective of a metasystem of human activity systems (i.e., SOHAS: system of HAS), which opens up the prospect that SOSF can be applied to HAS in general. In this attempt, human error is taken as the target HAS, the system of human errors (SOHE) is defined as an example of SOHAS (SOHE ∈ SOHAS), and its results are confirmed.
Fundamental solutions to social problems surrounding Japan today (e.g., aging population, aging social infrastructure, food and medicine safety, public transportation safety, response to financial globalization, securing pension resources, creating sustainable electricity, and cybersecurity) are urgent issues. These issues are not simply Japan’s problems in today’s globally connected world beyond national borders and cultures but are relevant to the entire world. Moreover, none of these problems can be solved by a single academic discipline, and a metasystem approach is needed to formulate appropriate countermeasures from a bird’s-eye view of the entire range of problems pursued in this book. The author hopes that this book will provide hints for problem-solving to executives, managers, and leaders of various organizations who struggle with daily problems and that theorists pursuing systems theory will make further progress in their research using the metasystem approach pursued in this book as the basis of their logical structure. At the end of this book, the author has clarified the direction of future research on the application of SOHAS to these social problems as a HAS. The author hopes to prevent the repetition of similar problems in different fields by accumulating successful cases (specifically, by organizing and accumulating various methodologies on the common foundation of meta-methodology) through the application of the meta-methodology proposed in this book to social problems surrounding a globally connected world today. Itabashi-ku, Tokyo, Japan Takafumi Nakamura
Acknowledgment
The meta-methodology that forms the basis of this book owes much to the findings compiled by the author under the guidance of Professor Kyoichi Kijima in the doctoral course of Value Systems, Graduate School of Decision Science and Technology, Tokyo Institute of Technology. The meta-methodology described in this book is applied to ICT systems, and the author would like to acknowledge the support of Fujitsu Limited and Fujitsu Fsas Inc., to which the author belonged. Furthermore, it is a great pleasure to present this book to the world through Springer. This book is based in part on a book published in Japanese by Nakanishiya Shuppan in 2022, titled “A Metasystem Approach to Overcoming System Failures,” and extends its theory and case studies. That publisher was located near the Yoshida campus, where the author was a student at Kyoto University. Last but not least, I would like to express my utmost gratitude to my wife, Kyoko, whose understanding allowed me to publish this book while devoting much of my time to my interests.
Abstract
This book differs from conventional books on preventing system failures in that it provides a method that views human activities from a meta-methodological perspective, based upon an interdisciplinary understanding of those activities. It proposes a common methodological basis that can be applied to various problems surrounding society today (e.g., aging social infrastructure, food and medicine safety, public transportation safety, creation of sustainable electricity, cybersecurity). Furthermore, since failures of human activities are expressed in a three-dimensional space in which topological metrics are implemented, failure trajectories can be monitored quantitatively as time series so that effective preventive measures can be taken. To implement the topological metrics, the causes of each failure are classified along two dimensions: the degree of coupling between system elements and the interaction between the target system and the external environment. Because the metrics are introduced in this general way, diverse systems can share them. Consequently, failures can be understood across industries through a common meta-systemic language, and mutual learning between different industries and the solution of social problems can be widely promoted. The system of system failures (SOSF) is proposed as this meta-methodology, its effectiveness is confirmed for ICT systems, and it is then extended to human activity systems as a whole (SOHAS: a system of human activity systems). The application examples show that SOHAS proactively promotes double-loop learning from previous system failures. They are unique and significant in shifting the handling of failures in human activities from reactive to proactive. Therefore, SOHAS becomes an academic foundation for theoretical research on meta-methodology, and it helps practitioners prevent system failures by accumulating knowledge of failures and learning from other industries. The clarification of mechanisms that are prone to failure and of the corresponding preventative measures will be invaluable to society as a whole.
Contents

1 Introduction
  1.1 Purpose of this Book
  1.2 Structure of this Book
2 Survey of Current Methodologies
  2.1 Limitations of Current Structuring Methodologies and Risk Analysis Techniques
  2.2 Limitations of Current Troubleshooting Techniques
  2.3 Approaches from Social Systems Science
  2.4 Approaches from Self-Organization
  2.5 Epigrams from the Past
3 Proposal of a New Methodology to Overcome Current Methodological Shortcomings
  3.1 System of System Failures (SOSF)
    3.1.1 Three Success Factors of Double-Loop Learning and New Methodologies
    3.1.2 Introduction of System of System Failures (SOSF)
  3.2 Failure Factors Structuring Methodology
    3.2.1 Overview of Maintenance Systems
    3.2.2 FFSM: A New Methodology for Learning from Failures
  3.3 System Failure Dynamic Model
    3.3.1 Understanding System Failures through Dynamic Models
4 Application to ICT System Failures
  4.1 Scenarios for Applying SOSF to ICT System Failures
    4.1.1 SOSF and the Diagnostic Flow of System Failures
    4.1.2 SO Space Map
    4.1.3 OP Matrix
    4.1.4 A New Cycle of Learning to Avoid System Failures
  4.2 Application of FFSM to Long-Time Down Incidents
    4.2.1 Phase 1 (Structural Model Analysis: ISM)
    4.2.2 Discussion of Phase 1 Analysis
    4.2.3 Phase 2 (Quantification Theory Type III) Analysis
    4.2.4 Discussion of Phase 2 Analysis
    4.2.5 Phase 3 (Exploring the System: Become Aware of the Meaning)
    4.2.6 Discussion of Phase 3 Analysis
  4.3 Application of SFDM to Server Noise Problems
    4.3.1 Design Failure or Installation Failure?
5 Discussion of the Application Results
  5.1 Results of the Application of the FFSM
  5.2 Results of the Application of SFDM
6 Transformation of SOSF Space into Topological Space to Quantify and Visualize Risk
  6.1 Background
    6.1.1 Summary of SOSF and Further Extension
  6.2 Review of Current Methodologies (Revisit Chap. 2)
    6.2.1 Features of Existing Structuring Methodologies and Risk Analysis Techniques
    6.2.2 Issues and Challenges of Current Troubleshooting Methodologies
  6.3 Overview of Introducing Topology into SOSF Space
  6.4 Proposed Methodology for Introducing Topology (Risk Quantification/Visualization Methodology)
    6.4.1 Normal Accident Theory and IC Chart
    6.4.2 Close-Code Metrics as an Example Taxonomy of System Failures
    6.4.3 Introduction of the Metric into SOSF Space
  6.5 Application Examples to ICT Systems
    6.5.1 Application Example 1: Topological Presentation of SRF for Various ICT Systems
    6.5.2 Application Example 2: Application to ICT Systems Complexly Coupled with Cloud and Network Technologies
  6.6 Results and Discussion of Application to ICT Systems
    6.6.1 Results of the Application Example 1
    6.6.2 Results of the Application Example 2
7 Reconsidering SOSF from the Perspective of HAS
8 Viewing Human Error as a HAS (Proposed Framework for Ensuring Holistic Measures and its Application to Human Error)
  8.1 Background
    8.1.1 Socio-Technical Systems and Safety
  8.2 Current Methodologies to Achieve System Safety
    8.2.1 Risk Management and Crisis Management
    8.2.2 Static (Safety and 4M) and Dynamic (Individual and Team) Perspectives
    8.2.3 Safety Is a Systems Problem
    8.2.4 Two Major Organizational Theories
  8.3 Proposal of the Human Error Framework
    8.3.1 General Perspectives on Crisis
    8.3.2 Contributions of Human Error (Team Errors and Individual Errors)
    8.3.3 Hypotheses
  8.4 Application to ICT Systems
  8.5 Results and Discussion of Application to ICT Systems
9 Total System Intervention for System Failures and its Application to ICT Systems
  9.1 Introduction
  9.2 TSI for SF Methodology as an Application Procedure
    9.2.1 Simple Linear System Failure Model (Domino Metaphor)
    9.2.2 Complex Linear System Failure Model (Swiss Cheese Metaphor)
    9.2.3 Non-linear or Systemic Model (Unrocking Boat Metaphor)
  9.3 Application to ICT Systems
    9.3.1 Misunderstanding Class 2 or 3 Failure as Class 1 Failure (Problem)
    9.3.2 Erosion of Safety Goals Accompanied by the Incentive to Report Fewer Incidents
    9.3.3 Fix that Fails Archetype (Side Effect)
    9.3.4 Double-Loop Learning for Class 2 Failure Archetype (Solution)
    9.3.5 Double-Loop Learning for Class 3 Failure Archetype (Solution)
    9.3.6 Double-Loop Learning for Fix that Fails Archetype (Solution)
  9.4 Conclusion
10 Conclusions and Toward Future Research
  10.1 Conclusions
  10.2 Toward Future Research
Appendix A. Taxonomy of System Failures
Appendix B. Sample Incident Matrix
Appendix C. Sample Incidents with Attributes
Afterword
References
Name Index
Subject Index
Abbreviations

BIC      Balancing intended consequences
BSI      British Standards Institution
BUC      Balancing unintended consequences
DEMATEL  Decision-Making Trial and Evaluation Laboratory
EC       Engineering Change
FFSM     Failure Factors Structuring Methodology
FMEA     Failure Mode and Effects Analysis
4M       Man, Machine, Media, Management
FTA      Fault Tree Analysis
HAS      Human Activity System
HRO      High Reliable Organization
IC       Interaction and Coupling
ICT      Information and Communication Technology
IEC      International Electrotechnical Commission
IS       Inquiring System
ISM      Interpretive Structural Modeling
ISO      International Organization for Standardization
MECE     Mutually exclusive and collectively exhaustive
NAT      Normal Accident Theory
OP       Objective-Problem
RAS      Reliability, Availability, Serviceability
RIC      Reinforcing intended consequences
RUC      Reinforcing unintended consequences
SFDM     System Failure Dynamic Model
SO       Subjective-Objective
SOHAS    System of Human Activity System
SOHE     System of Human Errors
SOSF     System of System Failures
SOSM     System of Systems Methodologies
SSM      Soft Systems Methodology
VSM      Viable System Model
Chapter 1
Introduction
Abstract This chapter describes the overall purpose of this book and the structure of the chapters. A meta-methodology (system of system failures (SOSF)) is proposed to prevent system failures by double-loop learning, its effectiveness is confirmed in an ICT system, and its application is then extended to all human activity systems (HAS). Specifically, a meta-human activity system (SOHAS: a system of human activity system) will be developed and applied to human error. After confirming the effectiveness of SOHAS, the direction for expanding the application area is described at the end of this book.

Keywords System of system failures (SOSF) · Human activity system (HAS) · System of human activity systems (SOHAS) · Double-loop learning · Information and Communication Technology (ICT)
1.1 Purpose of this Book
The purpose of this book is to propose a meta-methodology for learning from past system failures and achieving safety. The need for this perspective is revealed by examining the various methodologies currently used to manage system failures. These methodologies use a reductionist approach and are, furthermore, based on static models (Nakamura and Kijima [54–59], Nakamura [60–62]). Most of them have difficulty dealing proactively with emergent problems and cannot avoid the various side effects of quick (i.e., temporary) fixes, which leads to the repetition of similar system failures. The main reason for this situation is that system failures are viewed as stand-alone, static events. As a result, organizational learning is limited to single-loop learning rather than double-loop learning, which rectifies the model of the model (i.e., the meta-model) of the activity (i.e., organizational behavioral norms). This indicates the need for a meta-methodology that promotes double-loop learning, thereby increasing the effectiveness of countermeasures and managing the dynamic aspects of system failures.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_1
Today’s ICT infrastructure, which is essential to daily life and business activities, is highly dependent on computing systems. Prolonged downtime of the ICT infrastructure therefore has a significant impact on both. Major incidents of recent decades, in ICT and beyond, include the Fukushima nuclear power plant accident following the major earthquake in Japan on March 11, 2011, the outbreak of BSE (bovine spongiform encephalopathy), the concealment of serious defects by automobile manufacturers, the loss of the space shuttles Challenger and Columbia, and system failures in banks and securities firms. If major incidents are inevitable, how can we react promptly to their impact and complexity to ensure an acceptable level of safety and security? This book is written to answer this question. According to Wang and Roush ([83], Chap. 2, p. 44): “Engineers should design in anticipation of failures that could result in loss of assets, damage to the user’s environment, injury or loss of life. Through analysis and study of technical failures and their mechanisms, modern engineers can learn what not to do and how to design to reduce the likelihood of failure.” In this book, the author proposes a meta-methodology called the System of System Failures (SOSF), along with a flow for diagnosing failures, which aims to overcome the shortcomings of current methodologies. The author then proposes two new methodologies under SOSF. The first, the failure factors structuring methodology (FFSM), re-examines the current framework of perception of maintenance technology by structuring and visualizing the causes of failures. The second, the system failure dynamic model (SFDM), manages the dynamic aspects of system failures and checks for side effects of countermeasures. The meta-methodology is then applied to the field of information technology to demonstrate its effectiveness.
Furthermore, to expand the scope of application of SOSF to the entire human activity system (HAS: Human Activity System), SOHAS (System of Human Activity System), which is an extension of SOSF, is constructed. Next, the author took human error as an example of a human activity system and verified its effectiveness. Specifically, SOHAS, which integrates and encompasses various methodologies that already exist in the target human activity system from a meta-viewpoint, can theoretically be applied to all human activity systems. This provided a basis for expanding the number of cases in which SOHAS can be applied in the future. Verification of effectiveness and refinement of SOHAS can be expected through the accumulation of case studies. The salient feature of this book is that it promotes system safety by learning from past failures, not individually but holistically, to find measures to prevent failures that cannot be obtained elsewhere.
1.2 Structure of this Book
Following the introduction in this chapter, Chap. 2 surveys the current methodologies and identifies their shortcomings. In Chap. 3, a meta-methodology, the System of System Failures (SOSF), is first proposed to overcome the limitations of the current methodologies. Then, two new methodologies are proposed to complement the shortcomings that a review through SOSF reveals: the Failure Factors Structuring Methodology (FFSM) and the System Failure Dynamic Model (SFDM). The results of their application to ICT systems are presented in Chap. 4. Chap. 5 shows that these results overcome the current methodological shortcomings. Chap. 6 then introduces metrics into the SOSF space, which allows the failure trajectory of the target system to be captured quantitatively. This makes it possible to evaluate the trend of the entire system from the perspective of risk and to discuss future points of change for the target system; the effect is verified with a new case study. Up to this point, the meta-methodology has been applied to system failures and its results confirmed. Chap. 7 extends it to all human activities and describes how to create a meta-methodology for human activity systems (SOHAS: System of Human Activity System). Chap. 8 describes a meta-methodology for human error (SOHE: System of Human Errors) as an application of SOHAS to a human activity system. Chap. 9 summarizes the methods described so far as Total System Intervention for System Failures and describes examples of their application. Chap. 10 concludes with a summary of the book and future research directions. Figure 1.1 shows the relation between SOSF, FFSM, and SFDM, and Fig. 1.2 shows the relation between SOSF, SOHAS, and HAS; Fig. 1.2 presents the SOSF of Fig. 1.1 in terms of SOHAS.
Finally, Fig. 1.3 shows the structure of this book.
Fig. 1.1 The relation between SOSF, FFSM, and SFDM
Fig. 1.2 The relation between SOSF, SOHAS, and HAS
Fig. 1.3 The structure of the book
Chapter 2
Survey of Current Methodologies
Abstract This chapter investigates current methodologies for failure. A survey of current troubleshooting and risk analysis methodologies shows that the majority of approaches are element-reductive and have limited effectiveness in their application. In the field of social systems science, the author has examined the Normal Accident Theory (Perrow), Man-made Disaster (Turner and Pidgeon), Heinrich’s law, and the “normalized deviance effect” as the cause of space shuttle accidents. As a result, it was found that although they attempt to understand system failures from a broader perspective than troubleshooting and risk analysis methodologies, the models are abstract and difficult to apply to real-world system failures. The self-organization approach is another important contribution from the field of cybernetics, but it is difficult to apply to reality. Finally, the warnings from our predecessors pointed out the importance of the viewpoints from which we look at the world and the essentials of decision-making that were full of insights. In summary, the author pointed out the following three issues with the current methodology:
1. It does not address the worldviews of multiple stakeholders.
2. It does not address emergent failures.
3. It does not address the dynamic behavior of the system to avoid the normalized deviance effect.

Keywords Troubleshooting · Risk analysis · Normal Accident Theory · Man-made disaster · Normalized deviance effect
2.1 Limitations of Current Structuring Methodologies and Risk Analysis Techniques
Unfortunately, existing structuring methodologies and risk analysis techniques do not fully satisfy the requirements described in the previous chapter. Table 2.1 summarizes the existing methodologies for clarifying the structure of a problem. Among © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_2
5
6
2 Survey of Current Methodologies
Table 2.1 Overview of existing structural methodologies

| | Decision dynamic model analysis (DEMATEL) | Interpretive structural modeling (ISM) | Hayashi’s quantification theory type III | Cognitive mapping | Soft systems methodology (SSM) |
|---|---|---|---|---|---|
| Element features | Can be defined quantitatively | Ambiguous but qualitative | Ambiguous but qualitative | Ambiguous and complex | Ambiguous and complex |
| System structure | The model is difficult to modify once created (black box) | The model is very flexible and easy to modify. Causal relationships of elements can be defined. Possible to measure the relationship quantitatively. | Enables quantification of the qualitative elements, as well as grouping and visualization | Enables analysis of cognitive structure based on a causal chain. | Enables structuring of a problem from the gap between the ideal and real-world (current status). |
| Purpose | To determine numeric values (i.e., optimal values or estimates) | To clarify the relationship between factors | To extract the main factors behind complex symptoms | Decision-making related to social problems | To provide a link to the problem-solving management layers |
| Applications | Systems in which the causal relationships can be defined quantitatively | Systems that include human processes and for which causal relationships cannot be defined quantitatively | Groups of symptoms | Complex social systems. Political decisions. | All human processes |
| Ability to manage emergent properties | – | – | – | – | ○ |
| Related literature | Developed by Battelle Geneva Institute in the 1970s (Warfield [84]) | Developed by Warfield in the 1970s (Sage [71]; Warfield [84], [85]) | Developed by Hayashi in the 1950s (Hayashi [24]; Gifi [20]; Van de Geer [77]; Greenacre [21], [22]) | Cognitive mapping techniques have been widely used in strategic management and political science to depict and explore the cognitive structures of organization members (Huff [26]) | Developed by Checkland in the 1970s (Checkland [13], [14], [15]; Jackson [38], [40]; Kijima [45]) |
them, SSM (Checkland and Scholes [14]) looks at managing emergent properties of the problem and implementing preventative measures. There are two widely used failure analysis techniques: failure mode effect analysis (FMEA: IEC 60812 [33]) and fault-tree analysis (FTA: IEC 61025 [34]). FMEA handles single-point failures by decomposing the system into its constituent elements, connecting them to the top events in a bottom-up fashion, and expressing the relationships among all the elements in a tabular form. FTA, on the other hand, represents combinations of failures in a tree structure in a top-down manner and visualizes them as a logic diagram. Both methodologies are mainly utilized in the design phase and rely heavily on personal experience and knowledge. In particular, FTA tends to overlook the combination of failure modes, especially emergent failures. Major risk analysis techniques (including FMEA and FTA) are discussed in Bell [6]; Wang and Roush [83], Chap. 4; Beroggi and Wallace [7]. Most studies analyzing failure have their basis in FMEA or FTA. However, FMEA and FTA are rarely conducted simultaneously, and when they are, they are conducted as separate and independent tasks rather than being analyzed with an emphasis on relationships. As noted above, current methodologies tend to lose a holistic perspective on the root causes of system failures. In addition, most of them do not effectively incorporate double-loop learning as a preventive measure, even when the structure of the problem is identified. As a result, systems often repeat similar failures. In other words, the current methodology does not encompass all the required elements (the necessary characteristics of a methodology as discussed in Sects. 3.1.1 and 3.1.2) within its methodology. 
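As a rough illustration of the top-down logic that FTA encodes, the following Python sketch evaluates a small fault tree built from AND/OR gates and lists its minimal cut sets, including the single-point failures that FMEA examines bottom-up. The tree, the event names, and the helper functions are hypothetical and are not taken from IEC 60812 or IEC 61025.

```python
from itertools import combinations

# Hypothetical fault tree: a gate is ("AND" | "OR", [children]);
# a leaf is the name of a basic event.
FAULT_TREE = ("OR", [
    "power_loss",
    ("AND", ["primary_disk_fault", "backup_disk_fault"]),
])


def basic_events(node):
    """Collect the names of all basic events under a node."""
    if isinstance(node, str):
        return {node}
    _, children = node
    events = set()
    for child in children:
        events |= basic_events(child)
    return events


def top_event_occurs(node, failed):
    """Evaluate the tree top-down for a given set of failed basic events."""
    if isinstance(node, str):
        return node in failed
    gate, children = node
    results = [top_event_occurs(child, failed) for child in children]
    return all(results) if gate == "AND" else any(results)


def minimal_cut_sets(tree):
    """Brute-force search for minimal event sets that trigger the top event."""
    events = sorted(basic_events(tree))
    cuts = []
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            candidate = set(combo)
            if any(cut <= candidate for cut in cuts):
                continue  # a smaller cut set is already contained in this one
            if top_event_occurs(tree, candidate):
                cuts.append(candidate)
    return cuts


cut_sets = minimal_cut_sets(FAULT_TREE)
single_points = [cut for cut in cut_sets if len(cut) == 1]
print("Minimal cut sets:", cut_sets)
print("Single-point failures:", single_points)
```

The brute-force enumeration is only practical for very small trees. Note also that the tree contains only the events its designer anticipated, which is one way to read the remark above that such analyses rely on personal experience and can miss emergent failures.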
The methodologies described above mainly clarify the current risk structure, while regression analysis in statistics is widely used both to clarify the current risk structure and to predict future risk trends. However, forecasting the future assumes that the principle of uniformity of nature holds, so discontinuous changes in human activities cannot be adequately forecast. Hume [27] named the assumption that nature would move in the same way throughout the past and future the uniformity of nature. In other words, the uniformity of nature is the “principle that the course of nature continues always uniformly the same.” Table 2.2 shows the relationship between consumption expenditures and disposable income from 2015 to 2020. The data were obtained from the Statistics of Japan’s e-portal [74]. The Japanese government provided 100,000 yen to all households in 2020 to recover from the drop in consumption caused by the COVID-19 pandemic, but consumption expenditures did not increase. The author finds that the extracted model equation does not explain consumption expenditures in 2020. Figure 2.1 shows the model equation obtained by regressing consumption expenditure on disposable income from 2015 to 2019. According to this regression analysis, the regression model equation is C = 219,689 + 0.215D. Using this regression model equation to estimate consumption expenditure in 2020 yields a large error of ¥-21,306 (the 2020 row in Table 2.2). This indicates that new methods that do not assume the principle of uniformity of nature are needed to predict not only natural phenomena but also human activities. The author will discuss measures to deal with this issue in Chap. 6.

Table 2.2 The income and consumption paradox

| Year | C (Consumption expenditure) | D (Disposable income) | Model prediction | Measurement error |
|---|---|---|---|---|
| 2015 | ¥315,379 | ¥427,270 | ¥311,741.68 | ¥3,637 |
| 2016 | ¥309,591 | ¥428,697 | ¥312,049.12 | ¥-2,458 |
| 2017 | ¥313,057 | ¥434,415 | ¥313,281.03 | ¥-224 |
| 2018 | ¥315,314 | ¥455,125 | ¥317,742.90 | ¥-2,429 |
| 2019 | ¥323,853 | ¥476,645 | ¥322,379.27 | ¥1,474 |
| 2020 | ¥305,811 | ¥498,637 | ¥327,117.33 | ¥-21,306 |

Fig. 2.1 Regression analysis and model equations for disposable income and consumption
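The regression step can be retraced with a few lines of Python. The sketch below fits an ordinary least-squares line to the 2015–2019 figures from Table 2.2 and applies it to 2020. The use of plain least squares is an assumption about how the model equation was obtained, and the function and variable names are illustrative only.

```python
# Annual figures from Table 2.2 (yen): year -> (disposable income D, consumption C).
DATA = {
    2015: (427_270, 315_379),
    2016: (428_697, 309_591),
    2017: (434_415, 313_057),
    2018: (455_125, 315_314),
    2019: (476_645, 323_853),
    2020: (498_637, 305_811),
}


def fit_line(points):
    """Ordinary least squares for C = intercept + slope * D."""
    n = len(points)
    mean_d = sum(d for d, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    sxy = sum((d - mean_d) * (c - mean_c) for d, c in points)
    sxx = sum((d - mean_d) ** 2 for d, _ in points)
    slope = sxy / sxx
    intercept = mean_c - slope * mean_d
    return intercept, slope


# Fit on 2015-2019 only, then extrapolate to 2020.
training = [DATA[year] for year in range(2015, 2020)]
intercept, slope = fit_line(training)

d_2020, c_2020 = DATA[2020]
predicted_2020 = intercept + slope * d_2020
error_2020 = c_2020 - predicted_2020  # measured minus predicted

print(f"C = {intercept:,.0f} + {slope:.3f} D")
print(f"2020 prediction: {predicted_2020:,.2f}, error: {error_2020:,.0f}")
```

Under these assumptions the fitted coefficients should agree with the model equation quoted in the text up to rounding, and the 2020 residual is several times larger than any residual in the fitting period, which is the discontinuity the paragraph describes.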
2.2
Limitations of Current Troubleshooting Techniques
All technical systems are designed to achieve a goal or set of goals. The failure of a system in the process of reaching a goal in such a system points to the inadequacy of the technical design, regardless of whether the result is a catastrophe or not. If so, system failure can be defined as follows. System failure is a characteristic of
sub-systems, namely not contributing to the goal fulfillment of the supersystem. Alternatively, system failure is “the termination of the ability of an item to perform its required function” (Turner and Pidgeon [76]). The ICT troubleshooting techniques that dominate the status quo are founded on a predefined goal-seeking model. Van Gigch [79] points out the main drawbacks of improving goal-seeking model systems as follows.
(1) Engineers try to find the cause of dysfunction within the system domain. The rationale for system improvement treats the system itself as final (without considering that the system exists to satisfy requirements from the larger system in which it is contained).
(2) The engineer seeks a way to restore the system to its normal state. Permanent solutions do not come from improving the operations of the current system. That is, operational improvements do not lead to lasting improvements.
(3) Engineers tend to hold on to old assumptions and goals that may no longer be valid. Organizational assumptions and goal formations are not clear throughout the organization. It makes no sense to practice system improvement in the context of this organizational culture.
(4) Engineers act as “planners’ followers” rather than “planners’ leaders”. Another manifestation of the problem of false assumptions and incorrect goal pursuit stems from different conceptions of the roles of planning and planner. In system design, the planner must lead by influencing trends rather than follow by merely protecting existing ones.
This book focuses on the aspects of system failure that current methodologies cannot adequately manage from the perspective pointed out by Van Gigch [79]. In summary, these aspects are soft, systemic, emergent, and dynamic (i.e., they encompass the worldviews of multiple stakeholders).
Technology is changing faster than we can handle system failures; the CPU speed-to-price ratio is growing exponentially as known from Moore’s Law. Moore’s law is the observation that the number of transistors in a dense integrated circuit (IC) doubles about every 2 years. Moore’s law is an observation and projection of a historical trend. Rather than a law of physics, it is an empirical relationship linked to gains from experience in production. Furthermore, the number of stakeholders associated with computer systems continues to grow. Computer architects cannot satisfy the requirements of ICT system owners unless they also include their customers’ customers (i.e., end users) as stakeholders. The environment of speed and complexity surrounding ICT systems has been increasing over time. The problem is that once a system failure occurs in such an environment, it is very difficult to identify the true root cause. Most troubleshooting methodologies view system failures as the result of a sequence of events. Furthermore, they focus primarily on the technical aspects of system failures. These models are only suitable for a technical perspective surrounded by single-minded stakeholders in a relatively simple system. The following four key characteristics are commonly pointed out for current troubleshooting methodologies surrounding the ICT systems environment: understanding system failures in an elemental reductionist approach (e.g., a chain of events of work or errors) is not beneficial for designing better systems (Rasmussen [66]; Leveson [49]). In addition, Perrow [64] argues that technical methods of achieving safety (such as issuing warnings or building protective mechanisms to
increase safety) will eventually fail because complexity makes system failures inevitable.
1. Current methodologies are technically well constructed (e.g., ISO [37] and IEC [28] standards). However, they are not always effective in understanding the true meaning of measures, and from outside the technical arena it is difficult to understand whether these measures are real solutions or merely interim measures. Furthermore, most methodologies are based on an elemental reductionist worldview.
2. Current mainstream troubleshooting methodologies use cause-consequence analysis (or event chain analysis) to find the true root cause. This analysis can be done in a forward sequence (like FMEA or an event tree, working along a timeline from a component to the final event to identify the behavior of higher-level components when that component malfunctions) or in a backward sequence (like a fault tree, following the timeline backward from the final event to identify the components at the problem location). FMEA and FTA are often used (IEC 60812 [33], IEC 61025 [34]). Toyota Motor Corporation’s corporate slogan is “repeat ‘why’ five times” to get to the root cause. This helps to find the “what” to explore countermeasures to the problem. However, depending on how it is used, this method may become a tool for finding victims, attributing blame to a specific individual or group rather than finding the true root cause.
3. Rapidly evolving technology leads to various misunderstandings among stakeholders of ICT systems. This gap in responsibility cannot be adequately addressed by current methodologies.
4. Improvements that deviate from operational standards will eventually fail. Van Gigch [79] argues that dealing with system problems by improving the operations of the existing system will lead to failure.
Current troubleshooting methodologies focus on the following main issues:
• The system does not meet its stated goals.
• The system does not produce the expected results.
• The system does not work as originally intended.
The basic assumption of improvement is based on hard systems thinking, which assumes that goals and operating norms are static and predetermined in the design phase. The above four characteristics prevent the examination of system failures from a holistic perspective and make it impossible to manage the soft, systemic, emergent, and dynamic aspects of system failures.
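The forward and backward sequences of cause-consequence analysis mentioned in item 2 can be pictured as two traversals of the same event graph. The Python sketch below is a minimal illustration under that assumption; the graph, the event names, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical cause -> consequences graph for a small ICT service.
CAUSES = {
    "disk_wear": ["disk_failure"],
    "disk_failure": ["database_down"],
    "config_error": ["database_down"],
    "database_down": ["service_outage"],
    "service_outage": [],
}


def reachable(graph, start):
    """Breadth-first search returning every event reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Reverse every edge so that consequences point back to their causes."""
    inverted = {node: [] for node in graph}
    for cause, effects in graph.items():
        for effect in effects:
            inverted.setdefault(effect, []).append(cause)
    return inverted


# Forward sequence: from one component fault to the events it can lead to.
forward = reachable(CAUSES, "disk_wear")
# Backward sequence: from the final event to every candidate cause.
backward = reachable(invert(CAUSES), "service_outage")

print("Forward from disk_wear:", sorted(forward))
print("Backward from service_outage:", sorted(backward))
```

Both traversals only follow edges that were drawn in advance, so they inherit the limitation discussed above: a chain of events that nobody modeled cannot be found by walking the chain.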
2.3 Approaches from Social Systems Science
There are several approaches to overcoming system failures from the perspective of social systems science. The first is Heinrich’s law (Heinrich et al. [25]). This law is well known in the industry and states that behind every serious injury, there are 29
minor injuries and 300 further incidents without injury. This suggests that there are sufficient precursors to serious system failures and serious injuries. However, an elemental reductionist approach (i.e., a chain of work or error events) is insufficient to improve the system (Rasmussen [66]; Leveson [49]). Still, it is useful to analyze the frequent near-accidents to detect organizational problems and intervene before a fatal accident happens. The second is to view organizational failures as systems. Bignell and Fortune [8] argue that how problems that lead to failure are perceived depends on the values held by the decision makers. Thus, categorizing systems that lead to failure leads to more in-depth and effective learning of system failures compared to other approaches. The third is “Normal Accident Theory.” Perrow [64] proposed Normal Accident Theory to understand social system failures. Normal Accident Theory assumes that errors are inevitable in all systems. Systems therefore incorporate several defense mechanisms to correct errors and break the chain that leads to failure. However, in a system with many complex interactions, as opposed to interactions that occur in a linear order, two or three errors, themselves insignificant, can interact in ways that the designer or operator cannot anticipate and break through the defense mechanisms. If the system is also tightly coupled (no gaps, no alternatives, no variations, no loose ends), the initial errors propagate sequentially, leading to system downtime. While much rarer than the more common incident of a single point of failure, the possibility of an unexpected interaction of multiple system failures is a “normal” and unavoidable system property. In addition, Perrow [64] argues that traditional, technical efforts to issue numerous warnings and incorporate protective mechanisms will ultimately fail because the complexity of the system makes system failures inevitable.
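A back-of-the-envelope calculation shows why a rare coincidence of minor errors can still be “normal” over a long operating life. The Python sketch below assumes that the minor errors are independent and equally likely in every operating cycle. This is a simplifying assumption made only for illustration, the probabilities are invented, and Perrow’s argument is precisely that unanticipated interactions make real systems less forgiving than this.

```python
def coincidence_probability(p_error, n_errors, n_cycles):
    """Chance that n_errors independent minor errors coincide at least once.

    p_error  : assumed probability of each minor error in one cycle
    n_errors : number of minor errors that must coincide
    n_cycles : number of operating cycles observed
    """
    p_all = p_error ** n_errors            # all errors in the same cycle
    return 1.0 - (1.0 - p_all) ** n_cycles  # at least once over all cycles


# Illustrative numbers only: three errors, each with a 1% chance per cycle.
single_cycle = coincidence_probability(0.01, 3, 1)
long_run = coincidence_probability(0.01, 3, 1_000_000)

print(f"One cycle:          {single_cycle:.2e}")
print(f"One million cycles: {long_run:.3f}")
```

With these illustrative numbers the coincidence is about one in a million per cycle, yet more likely than not over a million cycles.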
This suggests that a new model is needed that can manage the dynamic aspects of system failure. The model should ensure the effectiveness of countermeasures through the promotion of double-loop learning. A new model that can manage dynamic system failures is described in Sect. 3.3, and a new method that combines Normal Accident Theory and SOSF to capture system failures in a quantitative and time series manner and predict future risk trends is described in Chap. 6. The fourth is a “man-made disasters model.” Turner and Pidgeon [76] proposed an original “man-made disasters model” that looks at the organizational preconditions leading to technical accidents as a system. Relevant cases include the Chornobyl accident (a nuclear accident that occurred on April 26, 1986, at 1:23 a.m. (Moscow Standard Time) in Reactor No. 4 of the Chornobyl Nuclear Power Plant in the Ukrainian Soviet Socialist Republic, then a member of the Union of Soviet Socialist Republics (USSR)), the Challenger explosion (January 28, 1986, when the U.S. space shuttle Challenger broke apart 73 seconds after launch, killing all seven crew members), and the Columbia midair breakup (February 1, 2003, when the U.S. space shuttle Columbia broke apart during re-entry over Texas and Louisiana, killing all seven crew members). The model highlights that the interaction of technology and organizational failure should be taken into account in the search for the causes of many modern large-scale accidents, such as the gas leakage from a chemical plant in Bhopal, Madhya Pradesh, India in 1984. So-called “organizational accidents” are errors or events that lurk hidden against culturally accepted values that grow over time and
surface as a mass of organizational knowledge failures. Vaughan [80] describes the “erroneous decision-making” caused by the “normalization of deviance” as follows. Vaughan [80] studied the Challenger disaster as a case of bad decision-making and found that NASA decision makers, using decision characteristics, strategies, and models, completely misjudged the potentially disastrous consequences of the Challenger launch. The Challenger catastrophe stemmed from a misguided process that led to a fatal outcome. Such an accident must not be repeated. In summary, the approach from social systems science can be more substantive than the technical methodologies described in the previous section because it seeks to understand system failures from a broader perspective of the social environment. However, the abstract nature of these models makes it difficult to apply them to avoid system failures or to use them as concrete examples.
2.4 Approaches from Self-Organization
There are several contributions from the field of cybernetics. Deutsch [16] proposed consciously stored feedback as a second message to trigger internal change, and Mesarovic et al. [51] proposed a multi-level hierarchical system. Imada [35] argued that self-organizing movements are not organized by external environmental changes but are triggered from within the self. He claims that there are four prerequisites for promoting self-organizational change: (1) to promote organizational fluctuations to establish an organizational order, (2) to prioritize the creative individual over the organization as a whole, (3) to accept chaos, and (4) to reject control centers. These studies are valuable in shifting organizational learning from reactive to proactive.
2.5 Epigrams from the Past
1. “Death sneaks upon us from the back door, not the front” (Yoshida [90]).
2. “Watch out for the undertow.” “Death, it seems, does not like to wait until we are prepared for it. Death is indulgent and enjoys, when it can, a flair for the dramatic” (Irving [36]).
The world is full of uncertainty, as this Japanese essayist and American novelist observed about the unpredictability of life.
3. “Analysis destroys wholes. Some things, magic things, are meant to stay whole. If you look at their pieces, they go away” (Waller [82]).
This quote reminds us that a reductionist approach deprives an object of its holistic features.
4. “Fools say they learn from experience; I prefer to learn from the experience of others” (Otto von Bismarck, cited in Samuel [72]).
5. “Discover something new by looking back on the past” (old Chinese saying).
These two quotes show the importance of learning from others. Furthermore, moving to the implementation of the proposed activities is urgent; we should not wait until we have all the information. We all face the dilemma between “paralysis by analysis” and “extinction by instinct.” In this regard, the quote from General Colin Powell’s Primer in Leadership (Harari [23]) is practical:
6. Part I: “Use the formula P=40 to 70, in which P stands for the probability of success and the numbers indicate the percentage of information acquired.” Part II: “Once the information is in the 40 to 70 range, go with your gut.”
There is also a well-known quote attributed to Eleanor Roosevelt: “Do something that scares you every day.” In the same vein, Kenzaburo Oe titled a novel “Leap before you look” [63]. The motif of the novel is the following poem by W. H. Auden [2], which has the same title.
The sense of danger must not disappear:
The way is certainly both short and steep,
However gradual it looks from here;
Look if you like, but you will have to leap.
It is interesting to note that politicians, novelists, and poets across countries and cultures all make the same claim about the process of changing oneself. Another epigram concerns the pitfall of the goal-seeking model, which often leads us to a place where we are all burned out.
7. “The meaning of stability is likely to remain obscured in western cultures until they rediscover the fact that life consists in experiencing relations, rather than in seeking goals or ‘ends’….. The barren contradiction of life, where this truth is overlooked, seems to me to be well, though unconsciously, expressed in lines by Louis Untermeyer which have, significantly, become a favorite quotation in North America –
From compromise and things half done
keep me with stern and stubborn pride,
and when at last the fight is won,
God keep me still dissatisfied” (Vickers [81])
8. “Never underestimate the power of a small group of committed people to change the world. In fact, it is the only thing that ever has.” As noted in Sect. 2.4 (Approaches from Self-Organization), the American cultural anthropologist Margaret Mead emphasizes that any change begins with a determined individual.
The various epigrams make us aware that the world is full of uncertainty and make us humble, although they offer little practical help in struggling with the real problems we face. The closing note is a passage from the great poet William Wordsworth [89] (1770–1850). He also noticed, at the end of his life, the importance of holistic properties in his work “The Prelude.”
Dust as we are, the immortal spirit grows
Like harmony in music; there is a dark
Inscrutable workmanship that reconciles
Discordant elements, makes them cling together
In one society.
(The Prelude, 1850) (Wordsworth [89]).
Chapter 3
Proposal of a New Methodology to Overcome Current Methodological Shortcomings
Abstract In this chapter, the author proposes a new methodology to overcome the three shortcomings of the current methodology described in the previous chapter. Section 3.1.1 introduces the SOSF with reference to the system of system methodologies (SOSM) (Jackson) meta-methodology, the Viable System Model (VSM) (Beer), the organizational structure models (Kickert), and others. The SOSF is represented in a three-dimensional space. The first dimension is the participants’ properties dimension (i.e., unitary, plural, and coercive). The second dimension is the system dimension (i.e., simple and complex). The third dimension is the failure classes dimension, which is divided into three classes, Classes 1, 2, and 3. Class 1 failures are own failures, Class 2 failures are failures of other system interfaces, and Class 3 failures are evolutionary failures. In Sect. 3.2, the author introduces FFSM, which covers the dimension of a unitary participant in the SOSF space, and in Sect. 3.3, the author introduces SFDM, which covers the entire SOSF space. FFSM utilizes interpretive structural modeling (ISM) and the Quantification theory type III structuring methods, while SFDM uses System Dynamics and various Archetypes to overcome the challenges.
Keywords SOSM (the system of system methodologies) · VSM (the Viable System Model) · Failure classes · ISM (the Interpretive structural modeling) · Quantification theory type III · System Dynamics
Based on the review in Chap. 2, there are three major shortcomings in the current methodologies:
1. Lack of methodology to cover multistakeholders’ worldviews.
2. Lack of methodology to cover emergent failures.
3. Lack of methodology to cover dynamic behavior of system failures to avoid normalized deviance effects.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_3
The author first proposes a new meta-methodology (Sect. 3.1) called System of System Failures (SOSF) as a countermeasure to the first shortcoming and then confirms the lack of methodologies that is the root cause of the second and third shortcomings. The author then proposes two new methodologies, the Failure Factors Structuring Methodology (FFSM) (Sect. 3.2) and the System Failure Dynamic Model (SFDM) (Sect. 3.3), to address the second and third shortcomings, respectively. Finally, the author presents a total picture of SOSF that complements the current methodological shortcomings described above.
3.1 System of System Failures (SOSF)
3.1.1 Three Success Factors of Double-Loop Learning and New Methodologies
The most important success factor is the ability to question current norms of activity (i.e., mental models). To overcome the shortcomings of the current methodologies discussed above, it is necessary to incorporate double-loop learning. This is because the skill of double-loop learning is the ability to question basic assumptions, which leads to the opportunity to rethink mental models. As shown in Fig. 3.1, the current mental model not only modifies the activity but also creates the activity toward a more desirable goal (Morgan [53]; Argyris and Schoen [1]; Senge [73]). Mental models imply modeling that extracts properties from the concrete world of things to perceive phenomena in the physical world. It is easy for mental models to conceptualize the properties obtained from the standpoint of things from a higher standpoint than things. This process of abstraction can be applied to the mental model itself to obtain a model called a meta-model. For this reason, any discussion of mental models must include a discussion of the meta-model from which the mental model is derived. This is because a meta-model cannot create a mental model. The relationship of meta-models to mental models is the same as the relationship between “design theory” and “design” or between “decisions about decisions” and “decision making” (Van Gigch [79]). Double-loop learning is feedback consciously stored internally by a second message (information that changes internal conditions) (Deutsch [16]; Imada [35]). It then corresponds to the decision-making hierarchy of the self-organizing layer (Mesarovic et al. [51]). Double-loop learning should work on all three hierarchies shown in Table 3.1. That is, the reality layer is the transformation of the activity, the model layer is the transformation to the desired goal, and the meta-layer is the modification of the mental model.
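The difference between the two kinds of learning can be shown with a toy feedback loop. In the Python sketch below, single-loop learning only adjusts the action to close the perceived gap, while double-loop learning also questions the desired goal when the gap persists. The environment, the numbers, and the function names are hypothetical and greatly simplify the layered picture described above.

```python
def achieved(action, capacity=15.0):
    """Hypothetical environment: results saturate at a capacity
    that the actor's mental model does not know about."""
    return min(action, capacity)


def single_loop(action, goal, steps=10, gain=0.5):
    """Single-loop learning: change only the action to close the gap."""
    for _ in range(steps):
        gap = goal - achieved(action)
        action += gain * gap
    return action, goal - achieved(action)


def double_loop(action, goal, rounds=5, tolerance=0.1):
    """Double-loop learning: if the gap persists, question the goal
    (the operating norm) instead of pushing the same action harder."""
    for _ in range(rounds):
        action, gap = single_loop(action, goal)
        if abs(gap) <= tolerance:
            break
        goal -= gap  # revise the norm in light of what was actually achieved
    return action, goal, gap


action_1, gap_1 = single_loop(0.0, 20.0)
action_2, goal_2, gap_2 = double_loop(0.0, 20.0)

print(f"Single loop: action={action_1}, remaining gap={gap_1}")
print(f"Double loop: goal revised to {goal_2}, remaining gap={gap_2}")
```

In this sketch the single loop keeps escalating the action without ever closing the gap, whereas the double loop resolves it by revising the goal. Which norm should be revised, and by whom, is exactly what the toy leaves out.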
Figure 3.1 illustrates single- and double-loop learning in a multiple-stakeholder environment based on double-loop learning (Morgan [53]). The dashed line in Fig. 3.1 indicates a particular stakeholder that is seeking to achieve a goal. It can be said that one single stakeholder is not sufficient to overcome the shortcomings of the current methodology. Therefore, it is necessary to extend the methodology to
Fig. 3.1 Single and double-loop learning under a multi-stakeholder environment (diagram elements shown for each of Stakeholder A and Stakeholder B: Action, Achieved Goal, Comparison, Desired Goal, Perceived Gap, Mental Model with single-loop learning and Class 1 failure, Meta Model with double-loop learning and Class 3 failure; Class 2 failure is shown between the two stakeholders)

Table 3.1 Relations between the organization structure (Kickert) and VSM (Beer) models

| | Organization structure | Objective | VSM |
|---|---|---|---|
| Meta | Aspect system: What | Mental model | System 5 |
| Model | Subsystem: Who | Operating norm | System 4 |
| Reality | Phase system: When | Operation | Systems 1–3 |
double-loop learning to address the situation of multiple stakeholders. There are three critical success factors to overcome the shortcomings of the current methodology in this context. First, there should be a common language to represent the mental models among stakeholders (the mental model boxes displayed for each stakeholder in Fig. 3.1). Otherwise, failures caused by gaps in mental models among stakeholders will not be effectively resolved. Second, there should be a
meta-methodology to realize double-loop learning (box of meta-model in Fig. 3.1). This meta-methodology should be the same across stakeholders, otherwise, it is difficult to achieve MECE (mutually exclusive and collectively exhaustive). For this reason, Fig. 3.1 represents the boxes of the meta-model to be shared among stakeholders. Third, a class of failures should be identified that identifies the origin of the failure. This is essential to ensure the effectiveness of the measures: there are three origins of system failures: (1) mental models, (2) mental model gaps among stakeholders, and (3) meta-models. These three origins of failure correspond to three failure classes: failure class 1 (failure due to deviation from the norm), failure class 2 (interface failure), and failure class 3 (predictability failure or failures of foresight). The above three critical success factors are organized as follows: 1. We should have a common language to understand system failures. It is important to examine system failures from diverse perspectives. System safety can be achieved through the activities of various stakeholders. One such common language is the taxonomy of system failures (Appendix A) developed by Van Gigch [78]. According to it, there are six categories of system failures: (1) technological, (2) behavioral, (3) structural, (4) regulatory, (5) rational, and (6) evolutionary failures. 2. The meta-methodology must result in more substantive and correct measures, rather than ad hoc interim measures that have long-lasting side effects. To remedy a system malfunction or failure, one must first translate the specific failure into the model world to examine the nature of reality as a whole. 
Then, it is necessary to discuss in the modeling phase (i.e., meta-modeling phase) the model of the failure of the system to seek why the failure occurred, what the countermeasures are, and what the learning process should be in the organization so that the failure is not repeated in the future. Kickert [44] described the organizational structure corresponding to the organizational objectives and separated the organizational structure into three hierarchies. They are the aspect system, the subsystems, and the phase system. These three layers correspond to “what,” “who,” and “when,” respectively. Systems 1–3 of the VSM model (Beer [4, 5]) are at the operational level, and systems 4 and 5 are at the meta-level where operational norms are determined through communication with the external environment of the system surrounding the organization. There are hierarchical similarities between Kickert’s and Beer’s models as follows: VSM systems 1 through 3 correspond to the part of the system that manages “when” (phase system). This level ensures internal harmony and maintains internal homeostasis. Systems 1, 2, and 3 correspond to when operations should be performed, how operations should be coordinated, and how to maintain overall corporate control. VSM system 4 corresponds to the strategic management of the enterprise, corresponding to the subsystem that manages the “who.” This level integrates internal and external inputs to draw up the corporate strategy (i.e., external homeostasis) and to identify “who” should be responsible for this strategy.
VSM System 5 corresponds to normative corporate management, corresponding to the outer system (the aspect system) that manages “what.” This level forms (i.e., predicts and plans) long-term policies and decides what to implement. Kickert’s organizational model and Beer’s VSM model decompose the organization into three levels: reality (operations), model (adaptation), and meta (evolution). The reality and model layers pursue the “who” and “how,” while the meta-layer pursues the “what.” This difference is essential to ensure the effectiveness of measures. Table 3.1 summarizes the relationship between the organizational structure model (Kickert [44]) and the VSM model (Beer [4, 5]).
3. Three classes of failures should be distinguished according to the origin of the failure, to avoid the dynamic effects of system failures (i.e., erosion of safety goals over time; the dynamic behavior is explained in more detail in Sect. 3.3.1). These classes of failures should be intentionally identified in conjunction with the VSM model. System boundaries and problem characteristics (i.e., predictable or unpredictable) should be clarified. Classes of failures can be logically identified according to the following criteria:
• Class 1 (failure of deviance): The root causes are within the system boundary, and conventional troubleshooting techniques are applicable and effective.
• Class 2 (failure of interface): The root causes are outside the system boundary but predictable at the design phase.
• Class 3 (failure of foresight): The root causes are outside the system boundary and unpredictable at the design phase.
The class of failure depends on whether the root cause is inside or outside the system boundary, and a Class 3 failure for one particular person may correspond to a Class 1 or 2 failure for another person. Therefore, the definition of failure is relative and recursive, and it is important to identify the problem owners from two perspectives.
These are the group of stakeholders and the VSM system (i.e., any of the systems 1 through 5). Without clarity on these two aspects, a class of failures cannot be identified. It is necessary to recognize the system level of the organization in order to modify the norms of operation. To avoid repeating system failures, it is not enough to change only systems 1–3 of the VSM model, or the phase system of the organizational structure model that pursues when and how. As pointed out above, the current technical model focuses mainly on the operational area, and interim measures therefore cause side effects. Event chain models usually focus on proximate events that lead immediately to the failure, whereas the origin of a system failure can often be traced back many years before the failure occurs. In this situation, Beer’s VSM model and Kickert’s organizational structure model help to understand the true root cause. In a static environment, it is effective to control the safety of work and maintenance activities through a manual of activity rules developed in a top-down fashion. However, in a dynamically changing environment, static approaches are insufficient, and a fundamentally different system model is required. Therefore, in Sect. 3.3, the author describes why failure measures often have unexpected side effects and presents the dynamic model that ultimately helps to introduce more effective measures.

Fig. 3.2 Systems approaches related to problem context in the System of System Methodologies (SOSM) [Figure: grid with a participants axis (unitary, pluralist, coercive) and a systems axis (simple, complex), locating hard systems thinking, system dynamics, organizational cybernetics, complexity theory, soft systems approaches, emancipatory systems thinking, and postmodern systems thinking]
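The three classification criteria listed in this section can be sketched in a few lines of Python. This is a minimal illustration rather than part of the author's methodology: the names `FailureClass` and `classify_failure` are hypothetical, and the two Boolean inputs stand for judgments that an analyst makes for one particular problem owner.

```python
from enum import Enum


class FailureClass(Enum):
    DEVIANCE = 1   # Class 1: root cause inside the system boundary
    INTERFACE = 2  # Class 2: outside the boundary, predictable at design
    FORESIGHT = 3  # Class 3: outside the boundary, unpredictable at design


def classify_failure(cause_inside_boundary: bool,
                     predictable_at_design: bool) -> FailureClass:
    """Apply the Class 1/2/3 criteria for one problem owner (hypothetical helper)."""
    if cause_inside_boundary:
        return FailureClass.DEVIANCE
    if predictable_at_design:
        return FailureClass.INTERFACE
    return FailureClass.FORESIGHT


# The same event seen by two problem owners with different boundaries.
print(classify_failure(True, True))    # owner whose boundary contains the cause
print(classify_failure(False, False))  # owner for whom the cause is external and unforeseen
```

Because the inputs are evaluated per problem owner, the same event can yield different classes, which is the relative and recursive character described above.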
3.1.2 Introduction of System of System Failures (SOSF)
Based on the above considerations, the author proposes a new methodology called a System of System Failures (SOSF) that facilitates double-loop learning and satisfies the three critical success factors mentioned above. Double-loop learning is essential in determining the adequacy of operational norms (i.e., mental models) (Morgan [53]; Argyris and Schoen [1]; Senge [73]). It also provides a meta-methodology for changing mental models to overcome system improvement shortcomings (Van Gigch [79]; Rasmussen [66]; Leveson [49]; Perrow [64]). Among the meta-methodologies proposed in the general context, the System of System Methodologies (SOSM) proposed by Jackson [38] is a typical and excellent example. The main features of SOSM are: i) a meta-systemic approach (soft systems thinking that fosters double-loop learning), and ii) a complementarity principle that encompasses multiple theoretical frameworks (i.e., a contingent approach that combines and utilizes various methodologies from diverse theoretical frameworks depending on the problem situation). Figure 3.2 illustrates the SOSM framework. The various systems approaches are arranged in a two-dimensional space whose dimensions are the participants and the systems. The current troubleshooting techniques discussed in Sect. 2.2 (i.e., FTA [34], FMEA [33], IEC [28], and ISO [37]) belong to the unitary-simple domain of SOSM. In particular, the SOSF is designed by mapping each failure type of the taxonomy of system failures (Van Gigch [78]) onto the SOSM space (Fig. 3.3). Since this book focuses on the technical domain, it does not cover the coercive domain of SOSM. Stakeholders achieving technical safety are fully covered in the unitary and pluralist areas of the SOSM space; individual failure types can simply be assigned from the SOSM space to the SOSF space.

Fig. 3.3 System of System Failures (SOSF) [Figure: grid with a participant axis (unitary, plural) and a system axis (simple, complex), locating failures of technology, behavior, rationality, evolution, structure, and regulation]

Fig. 3.4 Meta modeling of system failures and SOSF by using SOSM [Figure: three layers (reality, methodology, meta methodology) shown for SOSM (Jackson) and for SOSF, with the labels methodologies, system failures, taxonomy of failures (Van Gigch [78]), meta failures, and meta methodology for understanding failures]

Figure 3.4 shows the structure linking SOSM and SOSF. The left side of Fig. 3.4 represents the layers of abstraction from reality through methodology to meta-methodology. In the world of system failures, the system failures displayed at the bottom of Fig. 3.4 map to the reality layer. The common language (i.e., the taxonomy of system failures) maps to the methodology layer. Meta-failures (i.e., SOSF) map to the meta-methodology layer. Thus, SOSF is an extension of SOSM to the world of system failures. It is important to point out the recursive character of SOSF, which depends on
the perspective from which we look at the system. When the system under consideration is decomposed into subsystems, each subsystem possesses its own SOSF. Thus, a technology failure may be positioned as an evolutionary failure from the viewpoint of a subsystem one level below. Furthermore, from the viewpoint of a system one level above, an evolutionary failure may be positioned as a regulatory failure.

To satisfy the third feature pointed out in Sect. 3.1.1 (distinguishing three classes of failure), the author introduces a third dimension: the classes of failure. Figure 3.5 shows the SOSF space, which becomes three-dimensional when the dimension of failure classes is added to the two-dimensional SOSF (Fig. 3.3). As noted above, because of this recursive property it is important to identify who (i.e., stakeholders) and where (i.e., system level, the vertical axis in Table 3.1). Table 3.2 organizes a general notation of system failures to ensure that measures are mutually exclusive and collectively exhaustive (MECE). The “who,” “where,” and “what” correspond to the stakeholders, systems 1–5, and failure classes, respectively. The horizontal arrows in Table 3.2 are comparisons between stakeholders to identify the responsible party. Once stakeholders are identified, the system levels (1–5) and objectives (what, who, and when) are identified using the vertical arrows. This method ensures the effectiveness of the double-loop learning that transforms the model of the model (i.e., the meta-model of operational norms).

Fig. 3.5 Three-dimensional SOSF space with the taxonomy of system failures

Table 3.2 General notation of system failure

Based on the survey of current methodologies and the meta-methodologies introduced in this section, the main system failure methodologies (i.e., FTA [34], FMEA [33], IEC [28], and ISO [37]) lie mainly in the unitary, Class 1 domain, as shown in Fig. 3.6. The FFSM in Fig. 3.6 is explained in the next section.

Fig. 3.6 Three-dimensional SOSF space with FFSM
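The “who,” “where,” and “what” notation of Table 3.2 can be pictured as a small record type. The sketch below is only an illustration under stated assumptions: the class name `FailureNotation`, its field names, and the helper `needs_boundary_review` are hypothetical and do not come from the source. The helper mimics the comparison between stakeholders by checking whether they assign different classes to the same failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureNotation:
    """One (who, where, what) entry; field names are illustrative assumptions."""
    who: str    # stakeholder (problem owner)
    where: int  # VSM system level, 1 to 5
    what: int   # failure class, 1 to 3

    def __post_init__(self) -> None:
        if not 1 <= self.where <= 5:
            raise ValueError("where must be a VSM system level from 1 to 5")
        if not 1 <= self.what <= 3:
            raise ValueError("what must be a failure class from 1 to 3")


def needs_boundary_review(entries: list[FailureNotation]) -> bool:
    """True if stakeholders disagree on the class of the same failure."""
    return len({entry.what for entry in entries}) > 1


# Hypothetical example: one failure recorded by two stakeholders.
entries = [
    FailureNotation(who="operator", where=1, what=3),
    FailureNotation(who="designer", where=4, what=1),
]
print(needs_boundary_review(entries))
```

A disagreement of this kind is a cue to revisit the system boundary and the problem owners before choosing countermeasures.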
3.2 Failure Factors Structuring Methodology
The author proposes a new methodology that covers all failure classes in a unitary domain in the SOSF space. The new methodology is called FFSM (failure factors structuring methodology) and is shown in Fig. 3.6.
3.2.1 Overview of Maintenance Systems
A maintenance system consists of three parts. They are the maintenance worldview, the systems to be maintained (i.e., the maintenance objective systems), and the maintenance system itself (IEC 60300-3-10 [31]; IEC 60706-2 [32]; Bignell and Fortune [8]; Reason and Hobbs [69]). Figure 3.7 illustrates the relationship between the three.
Fig. 3.7 Maintenance system overview [Figure labels: maintenance worldview; group of maintenance objective systems; maintenance system with design, configuration, operations, and evaluation subsystems; periodic maintenance; applied maintenance; system failure; achieve operational goal (safety and security); feedback (single-loop learning); feedback (double-loop learning); Phases 1 to 3]
Maintenance Worldview
The maintenance worldview provides a cognitive framework for identifying maintenance systems (IEC 60300-3-10 [31], 60706-2 [32]). The cognitive framework specifies the maintenance system and the system to be maintained. The maintenance worldview is influenced by several interrelated factors, including organizational culture, customs, the level of safety and security achieved by the maintained system, and public opinion. For example, acceptable system downtime and maintenance organization (hardware, software, etc.) are implicit assumptions of the maintenance worldview regarding the design, deployment, operation, and modification of maintenance systems.

Maintenance System
A maintenance system used to maintain a target system typically encompasses the following four subsystems (IEC 60300-1 [29], 60300-2 [30]). The maintenance system achieves its design goals through the stable, safe, and reliable operation of the system under maintenance.
1. The purpose of the design subsystem is to define the availability of the system to be maintained and to establish a maintenance plan. The maintenance plan includes the design of planned maintenance (scheduled maintenance) and the implementation of temporary maintenance in case of system failure.
2. The configuration subsystem is required to organize maintenance, write maintenance manuals, and build the system to be maintained based on the maintenance plan defined by the design subsystem.
3. The operations subsystem is used by the maintenance organization to maintain the system to be maintained, using the maintenance manuals written by the configuration subsystem.
4. The evaluation subsystem is used to evaluate both the maintenance system itself and the systems to be maintained (i.e., the maintenance objective systems) based on both the operational history and the history of system failures. This subsystem implements the learning loop displayed in Fig. 3.7.
Single-loop learning rests on the ability to detect and correct errors relative to a given set of operating norms, while double-loop learning depends on being able to take a second look at a situation by questioning the relevance of the operating norm (Morgan [53], pp. 86–89).

Maintenance Objective Systems
Following IEC 60300-3-10 [31] and 60706-2 [32], this book considers a maintenance objective system to have the following characteristics:
1. The system subject to maintenance consists of a single (or multiple) logical (or physical) device(s).
2. The maintained system becomes obsolete in software (e.g., security measures) or hardware (e.g., corrosion) over time.
3. The system to be maintained is operated and maintained by operators and maintenance technicians according to the maintenance manuals prepared in advance.
Failure of a system to be maintained (i.e., a maintenance objective system) is called a system failure.
Overview of Managing System Failures
A system failure is a result that is unexpected with respect to the design of the maintenance system or of the system being maintained. The operations subsystem is responsible for managing system failures. It restores system operations by detecting system failures, selecting recovery methods, and carrying out the recovery, despite the diversity of root causes of system failures. The evaluation subsystem feeds back problem occurrences and their remedies to the appropriate subsystems within the maintenance system, or implements appropriate preventive measures for a group of systems to be maintained. However, the evaluation subsystem usually operates reactively, only after a system failure has occurred.
This is one reason why serious incidents occur in rapid succession. To overcome this problem, an effective avoidance methodology is needed. For this reason, various types of feedback are needed to transform the worldview of the current maintenance system, as shown in Fig. 3.7.
3.2.2 FFSM: A New Methodology for Learning from Failures
Why do system failures happen, seemingly without end? One reason is that organizations do not fully understand how to learn from system failures. If this is the case, then we need a methodology that not only isolates root causes but also manages consequences. In the following, the author first lists the characteristics that such a methodology should have and compares some typical existing methodologies against the list. This comparison motivates a new methodology with the required features. The author then proposes a methodology that fully satisfies them.

Required Characteristics of the Methodology
The author believes that such a methodology should have the following features, because viewing a system holistically and reflecting what is learned upon the current cognitive worldview is indispensable for managing risk proactively (Checkland and Scholes [14]).
1. It provides a means to generate countermeasures derived from the analysis of individual system failures while the system is being maintained. In other words, it supports the practice of double-loop learning.
2. It allows us to obtain the observations necessary to modify or dismiss the current maintenance worldview, by letting us structure, visualize, and get a bird’s-eye view of the factors that may have caused the system’s failure.
3. As a result, it supports decision-making through a structured understanding of the nature of the problems inherent in system failures and how they are related.
The above characteristics mean that FFSM is a procedure that internalizes the double-loop learning mechanism as a structuring methodology and reviews the perspective of the conceptual world in addition to the perspective of the real world from a holistic viewpoint.

FFSM: A New Methodology for Learning from Failures
The author now proposes a new methodology called FFSM.
This methodology takes into account the shortcomings of the current methodology described above and satisfies the requirements described in Sect. 3.1.
This new methodology promotes double-loop learning through a holistic view of the system. Failures in complex systems are typically caused by a combination of various factors. Each factor is often qualitative, which also creates the need for individual human intervention in the maintenance system. Therefore, the proposed methodology must be able to respond both qualitatively and quantitatively. To this end, it is important to clarify the quantitative relationships that exist among qualitative factors and problem groups, and to identify known and latent factors as a whole. The methodology also provides the awareness needed to modify the maintenance worldview (i.e., to facilitate double-loop learning). Figure 3.8 provides a general overview of the FFSM, and Table 3.3 describes the objectives of each phase. Figures 3.9, 3.10, and 3.11 describe the detailed flow of Phases 1, 2, and 3, respectively. Each phase of the FFSM is described below.

Fig. 3.8 A system able to provide its maintenance system (FFSM) [Figure labels: problems; define the maintenance system (worldview); Phase 1 structuring; Phase 2 visualizing; Phase 3 system exploration; decision-making; feedback; single-loop learning; double-loop learning]

Table 3.3 Objectives of Phases 1, 2, and 3
Phase | Feature | Objective
Phase 1 | Holistic approach (structuring factor relationships) | Discover root causes by clarifying the relationships between factors
Phase 2 | Holistic approach (grouping factors and problems) | Extract hidden factors behind complex symptoms by grouping factors and problems
Phase 3 | Viewing a system from a conceptual viewpoint as well as a real-world viewpoint; double-loop learning | Discover preventative measures for emergent properties by mapping factors into maintenance subsystems

Phase 1: Structuring the Entire System (Dependencies and Relationships among Factors)
This phase structures the causes of system failures. Steps 1 and 2 in Fig. 3.9 define the problem (failure) groups and the corresponding factors. This is important for analyzing the problem as a whole rather than as individual problems. To do this, it is necessary to derive quantitative relationships from qualitative factors. To this end, interpretive structural modeling (ISM) (Sage [71]; Warfield [84], [85]) is applied in this phase.

Fig. 3.9 Detailed flow of Phase 1 [Figure labels: Step 1, define problem groups (Problem 1 (Failure 1) to Problem n (Failure n)); extract; Step 2, define factor groups (Factor 1 to Factor m); Step 3, structure factor relation; go to Phase 2]

Phase 2: Visualization of Factors of Failure and Grouping of Similar Factors
This phase reveals hidden factors that cannot be extracted by analyzing individual cases of failure. To find such hidden factors, Quantification theory type III (Hayashi [24]; Gifi [20]; Van de Geer [77]; Greenacre [21], [22]) is used. This method is a form of correspondence analysis (Greenacre [21], [22]) and is useful for quantifying and visualizing all qualitative factors of failure (Fig. 3.10).
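The structuring step of Phase 1 can be illustrated with the two standard ISM operations: building a reachability matrix by transitive closure, and partitioning the factors into levels. The sketch below is a minimal pure-Python version under stated assumptions. The function names, the four factors, and their “influences” relations are hypothetical, and the source does not specify how the author implements ISM.

```python
def reachability(adj):
    """Reachability matrix: transitive closure of `adj` plus self-reachability."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                      # Warshall's algorithm
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def ism_levels(adj):
    """Partition factors into ISM levels, outcomes first and root drivers last."""
    reach = reachability(adj)
    remaining = set(range(len(adj)))
    levels = []
    while remaining:
        level = []
        for i in remaining:
            reach_set = {j for j in remaining if reach[i][j]}
            antecedents = {j for j in remaining if reach[j][i]}
            # A factor is on the current level if everything it still
            # reaches can also reach it.
            if reach_set <= antecedents:
                level.append(i)
        levels.append(sorted(level))
        remaining -= set(level)
    return levels


# Hypothetical relations: adj[i][j] = 1 means factor i influences factor j.
# Here 0 -> 1, 1 -> 2, and 3 -> 2.
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
print(ism_levels(adj))  # [[2], [1, 3], [0]]
```

In this toy example factor 2 is the visible outcome and factor 0 sits on the deepest level, which is where a root-cause search would start.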
Fig. 3.10 Detailed flow of Phase 2 [Figure labels: from Phase 1; Step 4, group factors and problems; Group l (Factor m, n; Problem a, b); Group m (Factor p, q; Problem e, f); hidden factor; go to Phase 3]
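The grouping idea of Phase 2 can be sketched on a small problems-by-factors incidence table. The code below computes the first non-trivial axis of a correspondence analysis by reciprocal averaging. This is a swapped-in textbook algorithm chosen because it needs no external library, and the source does not say how the scores are actually computed. The function name and the example table are hypothetical.

```python
import math


def reciprocal_averaging(table, iterations=200):
    """Scores on the first non-trivial correspondence-analysis axis.

    `table[i][j]` is 1 if factor j was present in problem i (every row and
    column needs at least one non-zero entry). Returns (row_scores, col_scores).
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_tot)

    def score_rows(col_scores):
        # Each problem is scored as the average score of its factors.
        return [sum(table[i][j] * col_scores[j] for j in range(n_cols)) / row_tot[i]
                for i in range(n_rows)]

    col_scores = [float(j) for j in range(n_cols)]  # arbitrary starting scores
    for _ in range(iterations):
        row_scores = score_rows(col_scores)
        # Each factor is scored as the average score of the problems it appears in.
        col_scores = [sum(table[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
                      for j in range(n_cols)]
        # Remove the trivial constant solution, then rescale to unit weighted variance.
        mean = sum(c * s for c, s in zip(col_tot, col_scores)) / total
        col_scores = [s - mean for s in col_scores]
        norm = math.sqrt(sum(c * s * s for c, s in zip(col_tot, col_scores)) / total)
        if norm == 0.0:
            break  # degenerate table: no non-trivial axis
        col_scores = [s / norm for s in col_scores]
    return score_rows(col_scores), col_scores


# Hypothetical data: rows are problems P1..P4, columns are factors F1..F4.
table = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
problem_scores, factor_scores = reciprocal_averaging(table)
print([round(s, 2) for s in factor_scores])
```

Factors with scores of the same sign fall on the same side of the axis and are candidates for one group. Plotting problem and factor scores together gives the kind of visualization that Phase 2 aims for.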
Fig. 3.11 Detailed flow of Phase 3 [Figure labels: from Phase 2; Step 5, find preventative measures; mapping factors/problems into the maintenance frame (design, configuration, operation, evaluation); Step 6, decision making and action; check result; go to Step 1]
Phase 3: System Exploration (Obtaining Observations): Factor and Example Mapping into a Maintenance Framework
This phase facilitates double-loop learning and enables the discovery of workarounds to manage emergent problems. Figure 3.11 shows the detailed flow of Phase 3. Figure 3.12 shows the learning loop for recognizing maintenance failures: the four subsystems are classified into four quadrants defined by two dimensions. The vertical dimension is time (pre- vs. post-operation), and the horizontal dimension is the boundary between the conceptual world and the real world. The closed loop in each quadrant represents single-loop learning within that quadrant. The evaluation subsystem provides feedback to the systems to be maintained (i.e., the maintenance objective systems) so that the countermeasures acquired by analyzing the operation history and failure history can be horizontally deployed (to other systems to be maintained). The arrows (⇨) in Fig. 3.12 indicate feedback to the maintenance system itself, including changes in the maintenance worldview (double-loop learning). By mapping the factors identified in Phases 1 and 2 onto the maintenance frame shown in Fig. 3.12, it is possible to gain insights that lead to a change in the existing maintenance cognitive frame. Table 3.4 organizes the three phases of the FFSM, and Table 3.5 shows the new structure of the maintenance system.

Fig. 3.12 Learning loop for recognizing maintenance system failures

Boundary Conditions for FFSM Applicable Areas
Systems to which FFSM is applicable are not limited to maintenance systems. FFSM applies to systems with the following characteristics:
• The system components encompass human activity processes.
• The behavior of the system depends on human knowledge.
• Causal relationships for results are complex and unclear.
Table 3.4 Failure factors structuring methodology (FFSM)
Phase | Analysis method | Purpose | Effects | Ability to manage emergent properties
Phase 1 (structuring) | Structured model analysis | Bird’s eye view of the entire structure (to clarify factor structure) | Prioritizes countermeasures, clarifies causal relationships | –
Phase 2 (grouping) | Hayashi’s Quantification theory Type III | Extraction of factors, a grouping of similar problems | Identifies countermeasures, clarifies directions of countermeasures | –
Phase 3 (making observations, modifying worldviews) | Mapping of factors into maintenance frame (Fig. 3.12) | Observation of new maintenance cognition frame | Detection of shortcomings in the current maintenance system | ○
Table 3.5 New maintenance system configuration
Worldview of maintenance Period Applicable boundary
Methodology (input) Maintenance system functions
Production output
Configuration Operation Evaluation Design subsystem subsystem subsystem subsystem (cognitive frame of maintenance) organizational culture, ritual, attitudes to safety, awareness of the cost of safety, maintenance organization, and so on…. System design System System operation Evaluation configuration Individual All maintenance Individual – All maintenance objective systems maintenance maintenance objective systems objective objective systems systems – Maintenance system – Maintenance worldview Maintenance planning (design Operation design, Evaluation design, review) failures failure information FFSM Evaluation of Apply proactive Develop Proactive maintenance, apply maintenance operating maintenance operation (failure manuals, define reactive design, reactive rates, MTBF, cost, maintenance maintenance maintenance etc.) organization design Feedback Maintenance Operating Operating log, fix improvements and manuals manuals applications horizontal (functional enhancements, bug deployment, modification of fixes), system maintenance recovery worldview
From the above, we can see that FFSM applies to a variety of human activity systems. As will be discussed in Chap. 8, this shows the diversity of application areas of this meta-methodology as well as of FFSM.
3.3 System Failure Dynamic Model
The author proposes a new methodology that covers the entire SOSF space. The new methodology is called the System Failure Dynamic Model (SFDM) and is shown in Fig. 3.13.
3.3.1 Understanding System Failures through Dynamic Models
Fig. 3.13 Three-dimensional SOSF space with SFDM

System failures caused by deviations from operational standards occur frequently but are rarely understood. For example, such deviance is believed to have led to NASA’s Challenger and Columbia space shuttle disasters (The Columbia Accident Investigation Board Report [75], Chap. 6, p. 130). This normalized deviance effect is hard to understand from a static failure analysis model. NASA points out the notion of “History as Cause” for repeated disastrous failures (The Columbia Accident Investigation Board Report [75], Chap. 8). This normalized deviance is closely related to the so-called “incubation period” before catastrophic disasters (Turner and Pidgeon [76]; Vaughan [80]). These considerations suggest that it is useful to focus on the dynamic aspects of the cause and effect of system failures rather than
the static aspects. Dynamic model analysis is applicable in all technology arenas, including high-risk technology domains like that of NASA. However, certain pitfalls exist concerning the introduction of countermeasures. Ad hoc interim measures (i.e., quick fixes) may appear to work in the short term, but over time they gradually become less effective, or they may even degrade the organizational capacity to a state that is worse than the original state. This phenomenon can be explained through a dynamic model that we call the safety archetype. There are well-known archetypes of “fixes that fail,” “eroding safety goals,” and “degrading the incident reporting scheme” (Braun [10]).

The traditional dynamic model incorporates several key notations that are useful for examining system failures. Table 3.6 summarizes the symbols used in the dynamic model. In particular, the notation of system boundaries (indicated by a solid line and introduced to ensure that the methodology works) is useful to avoid side effects introduced by reinforcing incorrect countermeasures. The R and B symbols are used in combination with IC or UC. For example, BIC stands for a balancing intended consequences loop, and RUC stands for a reinforcing unintended consequences loop. The + sign means that an increase (decrease) in state 1 causes an increase (decrease) in state 2. Conversely, the − sign means that an increase (decrease) in state 1 causes a decrease (increase) in state 2. The archetype of a problem and its side effects reveals the leverage point at which the countermeasure should be introduced.

Table 3.6 Symbols used in dynamic models
Symbol/Notation | Feature
R | Reinforcing loop
B | Balancing loop
= | Time delay of an effect
(solid line) | System boundary
IC | Intended consequences (combination with R or B)
UC | Unintended consequences (combination with R or B)
+ | Positive feedback loop
− | Negative feedback loop
Problem | Problem type of dynamic model
Side effect | Side effect type of dynamic model
Solution | Solution type of dynamic model

Various Archetypes Related to Failures of Technical Systems
There are three different problem archetypes and a corresponding solution archetype for each: (1) a system failure archetype for all failure classes, (2) an archetype that misinterprets failure classes 2 and 3 as failure class 1, and (3) an archetype that misinterprets failure class 1 as failure class 2 or 3. Because (3) is a special case of (1), it is explained under (1) and is not treated separately. Figure 3.14 shows the evolution of the safety archetypes of a technical system over time. Both (1) and (2) have a solution archetype obtained by single-loop learning (column 3 in Fig. 3.14). These solution archetypes appear to work for a short period, but gradually various side effects appear (column 4 in Fig. 3.14). The solution archetype with double-loop learning reaches the true root cause and promotes the safety of the technical system (column 5 in Fig. 3.14). In the following sections, the author describes each scenario of the dynamic model presented in Fig. 3.14.

Fig. 3.14 Problem and solution archetypes in engineering system failures through time

Turner and Pidgeon [76] found that organizations responsible for failure commonly exhibit a “failure of foresight.” They say that the disaster had a long “incubation period” characterized by several discrepant events signaling potential danger. These events are typically overlooked or misunderstood and accumulate unnoticed. Turner and Pidgeon [76] decompose catastrophic disasters over time into six stages, from the initial phase to cultural readjustment (Turner and Pidgeon [76], p. 88). Table 3.7 displays the characteristics of each stage and the relationship between each stage, the failure classes, and the safety archetypes described above.

System Failure Archetype: Problem
System failures require measures that act on the root cause and ultimately reduce the Class 1 failures. This is a simple scenario since the failure and its cause exist in the same system. This archetype is displayed in the order (i) to (iii) in Fig. 3.15. The
arrow (i) with the + sign indicates that action increases as system failures increase. The arrow (ii) with the − sign represents the removal of the root cause by the increased action, while the arrow (iii) with the + sign represents the decrease in system failures that follows the removal of the root cause. However, since the outcome of this archetype is a BIC loop, it will eventually reach a saturated state. If this saturated state meets or exceeds the intended goal, there is no problem. If not, the goal is only half achieved, and another solution is needed.

Table 3.7 Six stages of development of system failures and their relation to safety archetypes
Stage of development | Feature
Stage I) Initial beliefs and norms | Failure to comply with existing regulations
Stage II) Incubation period | Events unnoticed or misunderstood because of erroneous assumptions; events unnoticed or misunderstood because of difficulties in handling information in complex situations; effective violation of precautions passing unnoticed because of “cultural lag” in formal precautions; events unnoticed or misunderstood because of a reluctance to fear the worst outcome
Stage III) Precipitating event | –
Stage IV) Onset | –
Stage V) Rescue and salvage | –
Stage VI) Full cultural readjustment | The establishment of a new level of precautions and expectations
Failure class: Safety archetype
Class 1: System failure archetype (Fig. 3.15); Goal introduction (Fig. 3.15); Reinforcement of current action (Fig. 3.15)
Class 3: Complacency (Fig. 3.15)
Class 2 and 3: Misunderstanding class 2 or 3 failure as class 1 (Fig. 3.16)
Class 2: Fix that fail (Fig. 3.17)
Class 1 and 3: Erosion of safety goals (Fig. 3.16)
Class 3: The incentive to report fewer incidents (Fig. 3.16)
Class 3: Close disjunction between stakeholders (Fig. 3.18); Introduction of absolute goal (Fig. 3.19); Enlargement of system boundary (Fig. 3.17)
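The saturation of the balancing (BIC) loop in the system failure archetype can be made concrete with a toy discrete-time simulation. Everything numeric here is an assumption for illustration: the function name, the parameter values, and in particular the constant inflow of new root causes, which is added so that the loop settles at a non-zero level. None of this comes from the source.

```python
def simulate_bic(steps, root_cause0=10.0, inflow=1.0,
                 failures_per_cause=1.0, action_gain=0.5,
                 removal_per_action=0.4):
    """Return the system-failure level at each step of a balancing loop.

    Links follow arrows (i) to (iii): failures raise action (+), action
    removes root causes (-), and root causes produce failures (+).
    """
    root_cause = root_cause0
    history = []
    for _ in range(steps):
        failures = failures_per_cause * root_cause          # arrow (iii)
        action = action_gain * failures                     # arrow (i)
        removed = removal_per_action * action               # arrow (ii)
        root_cause = max(0.0, root_cause + inflow - removed)
        history.append(failures)
    return history


history = simulate_bic(60)
print(round(history[0], 2), round(history[-1], 2))
```

With these assumed parameters the failure level falls from 10 and flattens near inflow / (removal_per_action × action_gain × failures_per_cause) = 5. If the goal were, say, 2, the loop alone would leave it half achieved, which is the situation the text describes.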
Fig. 3.15 System failure archetype [Figure: causal loop diagram linking Action, Root Cause, System Failure, Compare Goal & Adjust Action, Perceived Safety, and Oversight through arrows (i) to (viii), with BIC, RIC, and BUC loops]
System Failure Archetype: Solution
A simple solution to system failure is to determine a goal, compare it to the current situation, and adjust the action. This allows the adjustment of actions to continue until the goal is achieved. The RIC loop on the right side of Fig. 3.15 breaks the state of balance. The flow of this archetype is indicated by (iv) and (v) in Fig. 3.15. The arrow (iv) with the + sign indicates that the countermeasure “Compare Goal & Adjust Action” increases as “System Failure” increases. Arrow (v) with the + sign indicates that an increase in “Compare Goal & Adjust Action” leads to an increase in “Action.” This is a simple scenario of a solution archetype for a system failure. It is a typical case of single-loop learning, a characteristic of most current troubleshooting techniques.

Complacency Archetype: Side Effects
This problem archetype is the side effect archetype of the system failure archetype (solution) explained above. The action loop of the system failure archetype (solution) continues for some time. This raises safety awareness within the system boundary, but eventually oversights occur and the system fails again. This half-achieved state is the reason for repeated system failures over long time intervals. The sequence of this archetype is shown in Fig. 3.15, from (vi) to (viii). Arrow (vi) with the − sign indicates that “Perceived Safety” increases as “System Failure” decreases. Arrow (vii) with the + sign indicates that “Oversight” increases as “Perceived Safety” increases, and arrow (viii) with the + sign indicates that increased “Oversight” leads to an increase in “System Failure.”
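The link signs given for Fig. 3.15 can be checked against the usual system-dynamics rule that a closed loop with an odd number of negative links is balancing and one with an even number is reinforcing. The source does not state this rule, so it is brought in here as an assumption. The function name is hypothetical. Only the two loops whose links are fully enumerated in the text are used; the RIC loop is left out because the text lists only two of its links.

```python
def loop_polarity(link_signs):
    """Classify a closed causal loop from the signs of its links."""
    if any(sign not in ("+", "-") for sign in link_signs):
        raise ValueError("each link sign must be '+' or '-'")
    negatives = sum(1 for sign in link_signs if sign == "-")
    return "balancing" if negatives % 2 == 1 else "reinforcing"


# Link signs as given in the text for Fig. 3.15.
bic_loop = ["+", "-", "+"]  # arrows (i), (ii), (iii)
buc_loop = ["-", "+", "+"]  # arrows (vi), (vii), (viii)
print(loop_polarity(bic_loop))  # balancing
print(loop_polarity(buc_loop))  # balancing
```

Both results agree with the B in the BIC and BUC labels: each loop pushes the failure level back toward a steady level rather than amplifying it.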
Fig. 3.16 Misunderstanding system failure archetype [Figure: causal loop diagram linking Class 1 Failure, Action, Quick Fix, Root Cause, Compare Goal & Reinforce Action, and Pressure to Adjust Goal; Incentive to Report Fewer Incidents through arrows (i) to (ix), with BIC, RIC, and BUC loops; two links are marked “No effect, open loop”]
Misunderstanding Class 2 or 3 Failure as Class 1 Archetype: Problem Figure 3.16 explains why this archetype leads to system failures due to ad hoc interim measures (i.e., quick fixes) and inappropriate measures. Ad hoc interim measures (i.e., quick fixes) may reduce system failures in the short term, but their effectiveness gradually declines to a level that does not meet organizational goals. The BIC loop in the upper part of Fig. 3.16 remains open because ad hoc interim measures do not produce further effects. The BUC loop at the bottom of Fig. 3.16 remains open because it has no intrinsic effect on the original problem as a result of a misunderstanding of the class of system failures. The order of this archetype is indicated by (i) through (v) in Fig. 3.16. Arrow (i) with the + sign indicates that an increase in “Class 1 failure” causes “action,” while arrow (ii) with the + sign indicates that a “Quick Fix” solution to class 1 failure is introduced. The dashed arrow (iii) indicates that the “ad hoc interim measures (i.e., Quick Fix)” contribute only marginally to the reduction of “Class 1 failures.” The dashed arrow (iv) indicates that the “root cause” is outside the system boundary and is not affected by the dashed arrow (iv). Therefore, the arrow (v) with the + sign has the impact of increasing “Class 1 Failures.” Misunderstanding Failure Classes 2 and 3 as Failure Class 1 Archetype: Solution In the single-loop learning scenario (Fig. 3.16), where activities are reinforced based on deviations from predetermined goals, the RIC loop simply repeats the scenario where further interim measures are introduced as activities to improve the situation. The arrows (vi) and (vii) in Fig. 3.16 illustrate this process. 
Arrow (vi) with the + sign indicates that an increase in “Class 1 Failures” strengthens “Compare Goal & Reinforce Action.” Arrow (vii) with the + sign indicates that, as “Compare Goal & Reinforce Action” is strengthened, the “Action”
is intensified. This RIC loop generates various side effects that create incentives to degrade safety goals and to reduce incident reporting. These side effects are very difficult to detect: simply checking quantitative performance misses them, because the alarms for malfunctioning organizational activities do not work. Van Gigch [79] has pointed out that such system improvements fail. In such half-achieved situations, root causes outside the system boundary need to be addressed.
Erosion of Safety Goals and Incentive to Report Fewer Incidents: Side Effects
This side effect arises because the RIC loop is strengthened further even though system failures no longer decline. A BUC loop emerges, in which increased pressure to achieve a goal leads to changing the goal (i.e., lowering the target value) or hiding the current status of quality and safety from the management layer. In this half-achieved state, it becomes difficult for managers inside the system boundary to see the actual status of achievement. This is why many Japanese manufacturers have a slogan of “3R-ism,” which asks managers to check that they have identified a problem at the “real site,” confirmed it with “real objects,” and discussed it with the “real person in charge” before taking any action. Arrows (viii) and (ix) in Fig. 3.16 show the sequence of this archetype. Arrow (viii) with the + sign indicates that an increase in “Class 1 Failures” creates “pressure to adjust goals and an incentive to report fewer incidents,” and arrow (ix) with the - sign shows the BUC loop that covers up the “Class 1 Failures.”
Fix that Fails Archetype: Side Effects
Figure 3.17 displays a typical case of local optimization. The activity applied is a short-term countermeasure to the problem, leading to Class 2 and 3 failures with delayed and unanticipated effects outside the system boundary.
For example, in response to a sudden increase in system failures, an operations manager may shift personnel from a team responsible for proactive tasks to a team responsible for reactive tasks, but this may create an RUC loop that generates additional system failures. Such a management approach wastes human resources and leads to a loss of organizational capability in the long run. The sequence of this archetype is represented by arrows (i) through (vii) in Fig. 3.17. Arrow (i) with the + sign indicates that an increase in “Class 2 and 3 failures” leads to an increase in “Action” within the system boundary. The dashed arrow (ii) with the + sign indicates that, at this stage, the “Action” does not address the correct “Root Cause.” Instead, as marked by the time-delay symbol “=,” arrow (ii) may increase “Class 2 or 3 failures” as a delayed side effect of the locally optimal measure. Arrows (iii) and (iv) with the + sign provide the function of “adjusting goals and reinforcing action” without reducing “Class 2 and 3
Fig. 3.17 Fix that fails archetype (side effect) [diagram: “Class 2 or 3 Failure,” “Action,” “Adjust Goal & Reinforcing Action,” “Root Cause,” and “Enlarge System Boundary,” linked by arrows (i)–(vii) in loops BIC and RIC]
Fig. 3.18 Double-loop learning for Class 2 failure (solution) [diagram: “Class 2 Failure,” “Action,” “Adjust Goal & Define Ultimate Solution,” “Awareness Gap Between Subjective and Objective Responsibility,” and “Root Cause,” linked by arrows (i)–(vi) in loops BIC and RIC]
failures.” Arrows (v) through (vii) represent the double-loop learning solution to the Fix that Fails archetype, which addresses this side effect.
Double-Loop Learning for Class 2 Failure Archetype: Solution
As noted above, attention needs to be paid to the potential side effects of half-accomplished situations and ad hoc interim measures. Gaps in perception that stem from stakeholders’ implicit assumptions need to be closed through discussion so that responsibility is shared without gaps. Figure 3.18 shows the solution to the archetype shown in Fig. 3.16, which misinterprets Class 2 and 3 failures as a Class 1
Fig. 3.19 Double-loop learning for Class 3 failure (solution) [diagram: “Class 3 Failure,” “Action,” “Adjust Goal & Define Ultimate Solution,” “Awareness Gap Between Current and Ideal Goals,” “Introduce Ideal (Absolute) Goal,” and “Root Cause,” linked by arrows (i)–(viii) in loops BIC and RIC]
failure. The sequence of this archetype is indicated by arrows (i) through (vi) in Fig. 3.18. Arrow (i) with the + sign indicates that an increase in “Class 2 Failures” increases “Action” inside the system boundary. This “Action” causes various side effects that degrade safety goals and reduce incident reporting, as described in “Erosion of Safety Goals and Incentive to Report Fewer Incidents: Side Effects.” Arrow (ii) with the + sign indicates that the perception gap between stakeholders is reviewed and the ultimate goal is redefined or realigned. Arrow (iii) with the + sign triggers new “Action.” Arrow (iv) with the - sign indicates that the new “Action” addresses the true “Root Cause” outside the system boundary. Arrow (v) with the + sign shows the “Class 2 Failures” decreasing, and arrow (vi) with the + sign leads to “Adjust Goal & Define Ultimate Solution.”
Double-Loop Learning for Class 3 Failure Archetype: Solution
As explained in Sect. 2.3, the speed of technological progress and the growth of complexity are unpredictable. Therefore, current goals become obsolete over time. This is the root cause of failures that cannot be attributed to anyone within the framework of current organizational activities. In other words, system failures emerge for which no one is responsible. This type of failure can be avoided by regularly monitoring the achievement of goals and benchmarking against competitors. The sequence of this archetype is indicated by arrows (i) through (viii) in Fig. 3.19.
Arrow (i) with the + sign indicates an increase in “Action” within the system boundary due to an increase in “Class 3 Failures.” At this stage, this “Action” does not address the correct “Root Cause,” so the dashed arrow (ii) indicates that it does not reduce the “Class 3 Failures.” Arrows (iii) and (iv) with the + sign indicate that the “gap between the current goal and the ideal goal” is noticed and that the goal is corrected toward the “ideal goal and the ultimate solution.” Arrow (v) with the + sign leads to new “Action,” and arrow (vi) with the – sign correctly addresses the “Root Cause” so that “Class 3
failures” are reduced. Arrow (viii) with the + sign leads further to “Adjust Goal & Define Ultimate Solution.”
Double-Loop Learning for Fix that Fails Archetype: Solution
The solution to this archetype is to raise the perspective from which the problem is viewed (Fig. 3.17). By extending the assumed system boundary, Class 2 and 3 failures become Class 1 failures. This solution is indicated by arrows (v) through (vii) in Fig. 3.17. Arrow (v) extends the system boundary so that the “Root Cause” falls within it, which turns Class 2 and 3 failures into Class 1 failures. Arrow (vi) with the - sign then attacks the right “Root Cause,” which decreases “Class 2 or 3 Failure” via arrow (vii).
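The arrow signs used throughout these archetypes can be read with the usual system dynamics convention that a closed loop is reinforcing when the product of its link signs is positive and balancing when it is negative. The sketch below applies that convention to two loops described above. The function name and the reading of the loop codes (B for balancing, R for reinforcing) are assumptions made for illustration, not definitions taken from the text.

```python
from math import prod


def loop_polarity(link_signs):
    """Classify a closed causal loop from the signs (+1 or -1) of its links."""
    if any(sign not in (1, -1) for sign in link_signs):
        raise ValueError("link signs must be +1 or -1")
    return "reinforcing" if prod(link_signs) > 0 else "balancing"


# Fig. 3.16, arrows (viii) and (ix):
# failures -> pressure to adjust goal (+), pressure -> visible failures (-).
BUC_LOOP = [+1, -1]

# Fig. 3.17, arrows (i) and (ii):
# failures -> action (+), action -> delayed side-effect failures (+).
SIDE_EFFECT_LOOP = [+1, +1]

print(loop_polarity(BUC_LOOP))
print(loop_polarity(SIDE_EFFECT_LOOP))
```

Under these assumptions the first loop comes out as balancing and the second as reinforcing, which matches the B and R prefixes the text attaches to them.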
Chapter 4
Application to ICT System Failures
Abstract This chapter first introduces the tools used when the meta-methodology SOSF is applied in practice: the diagnostic flow of system failures, the SO space map, the OP matrix, and a new learning cycle. With these tools in place, the two methodologies introduced in the previous chapter, FFSM and SFDM, are applied to real incidents in the world of ICT failures. First, FFSM is applied to the problem of prolonged downtime of ICT systems in three phases. In Phase 1, structural model analysis is conducted using ISM to structure the multiple causes of long downtime and clarify the relationships among the individual factors. In Phase 2, Quantification theory type III is used to identify the hidden factors that cause long downtime. In Phase 3, the results of Phases 1 and 2 are used to identify preventive measures. Next, SFDM is applied to a server noise problem to determine whether the problem is caused by the server design or the server operation. System dynamics methods and various system archetypes are used to solve the problem.
Keywords ISM (Interpretive structural modeling) · Quantification theory type III · System Dynamics · System Archetype · Information and Communication Technology (ICT)
4.1 Scenarios for Applying SOSF to ICT System Failures
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_4
SOSF is a meta-methodology for system failures. While SOSF can examine the dynamic aspects of system failures, a long-term perspective, which NASA refers to as “history as cause” (Sect. 3.3.1), must be consciously adopted when applying SOSF in practice. Reason [68] describes the lifespan of a hypothetical organization through production-protection space (Fig. 4.1) and explains why organizational
Fig. 4.1 Lifespan of a hypothetical organization through production-protection space [axes: Protection and Production; labels: “Bankruptcy,” “Catastrophe,” “Unrocked Boat,” and “Better Defenses Converted to Increased Production”]
accidents repeat, with this history ending in catastrophe. This is why we must identify the side effects of the organizational dynamic.
4.1.1 SOSF and the Diagnostic Flow of System Failures
SOSM, introduced by Jackson [38], covers several paradigms, such as hard systems thinking and soft systems thinking (Checkland [13]; Checkland and Holwell [15]). Figure 4.2 shows the diagnostic failure flow as a cognitive filter, which allows us to examine all paradigms in the SOSF space. It is essential to check all possible causes thoroughly and identify their organizational paradigms. The SO space map and the OP matrix are special tools used in the system failure diagnostic flow and are described in the following sections.
4.1.2 SO Space Map
SO stands for the subjective and objective viewpoints, so the SO space map is two-dimensional. When goals and causes are the objects of investigation, they are mapped to the parties responsible for them. A realistic scenario for using the SO space map has two stages: setting goals, and identifying causes when diagnosing system failures. As Fig. 4.3 shows, each stakeholder’s subjective view of a given responsibility should be identical (horizontal arrow), and the objective responsibilities of the stakeholders should be mutually exclusive and collectively exhaustive (MECE) (vertical arrow).
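The two checks that the SO space map asks for can be sketched as set operations. This is a minimal illustration under stated assumptions: the stakeholders A, B, and C follow Fig. 4.3, but the goals g1 to g4, the data layout, and both function names are hypothetical.

```python
# Hypothetical goal-setting data: VIEWS[observer][owner] is the set of goals
# that `observer` believes `owner` is responsible for.
ALL_GOALS = {"g1", "g2", "g3", "g4"}
VIEWS = {
    "A": {"A": {"g1"}, "B": {"g2"}, "C": {"g3"}},
    "B": {"A": {"g1"}, "B": {"g2"}, "C": {"g3", "g4"}},
    "C": {"A": {"g1"}, "B": {"g2", "g4"}, "C": {"g3"}},
}


def subjective_disjunctions(views):
    """Owners whose responsibility is described differently by different observers."""
    owners = sorted({owner for view in views.values() for owner in view})
    return [
        owner
        for owner in owners
        if len({frozenset(view.get(owner, set())) for view in views.values()}) > 1
    ]


def mece_gaps(view, all_goals):
    """Return (goals nobody owns, goals with more than one owner) for one view."""
    assigned = [goal for goals in view.values() for goal in goals]
    missing = all_goals - set(assigned)
    overlapping = {goal for goal in assigned if assigned.count(goal) > 1}
    return missing, overlapping


DISJUNCTIONS = subjective_disjunctions(VIEWS)
GAPS_IN_A_VIEW = mece_gaps(VIEWS["A"], ALL_GOALS)
print(DISJUNCTIONS, GAPS_IN_A_VIEW)
```

In this made-up data, B and C each assume the other owns g4, and A's view leaves g4 unowned, which is the kind of responsibility gap the map is meant to expose.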
4.1.3 OP Matrix
OP stands for objective and problem. The OP matrix examines the mismatch between objectives and problems to verify that the current objectives completely cover the past failures of the system (Fig. 4.4). The first quadrant, labeled (P,
Fig. 4.2 Diagnostic failures flow as a cognitive filter
O) = (OK, OK), is a normal situation since the goal has been achieved and there is no repetition of similar problems. The second quadrant with (P, O) = (NG, OK) indicates that there may be a discrepancy in perception among stakeholders. This indicates that there is a problem that is not anyone’s responsibility in the current framework of organizational activities. The third quadrant labeled (P, O) = (OK,
Fig. 4.3 SO space map (goal setting) [axes: S (subjective), with stakeholders A, B, and C, and O (objective); cells such as “B’s responsibility (B’s view)” and “B’s responsibility (C’s view)”; the subjective disjunction should be identical, and the objective disjunction should be MECE]
P (problem repetition) | O (goal achievement): OK | O (goal achievement): NG
OK | Normal | Sign of system failure
NG | Disjunction with SO space map | Hard approach effective
Fig. 4.4 OP matrix (objective-problem)
NG) may be a sign of system failure that has not yet fully occurred. Perhaps the goal should be revised to capture the reality of the recurring problem. The fourth quadrant, labeled (P, O) = (NG, NG), may benefit from a hard systems thinking approach (i.e., current troubleshooting methodologies). A realistic and desirable scenario is to use the OP matrix during regularly scheduled management reviews, to understand why system failures occur and why fixes stop working over time. Figure 4.5 illustrates a vicious cycle of repeated system failures. Caution is needed even when a management review of the safety of a technical system finds the first quadrant, (P, O) = (OK, OK). From this state, there can be a transition to the fourth quadrant, (P, O) = (NG, NG), due to the complacency described in Sect. 3.3.1, “Complacency Archetype: Side Effects.” The misinterpretation of the system failure described in “Misunderstanding Class 2 or 3 Failure as Class 1 Archetype: Problem” may lead to a transition to another quadrant. A transition to the third quadrant, (P, O) = (OK, NG), may occur as the “Erosion of Safety Goals and Incentive to Report Fewer Incidents: Side Effects,” driven by the move to reinforce current activities described in “Misunderstanding Class 2 or 3 Failure as Class 1 Archetype: Problem.” The incentive not to report incidents, described in “Erosion of Safety Goals and Incentive to Report Fewer Incidents: Side Effects,” may cause a transition to the second quadrant, (P, O) = (NG, OK). These transitions would be followed by a return to the first quadrant, giving managers the false impression that the safety goal has been achieved. This is exactly what happened in the space shuttle disaster caused by NASA’s deviation from the norm, and it illustrates how an organization can drift over time toward catastrophe (Fig. 4.1).
Fig. 4.5 Vicious circles indicating repeated system failures
The diagnostic flow (Fig. 4.2) examines the whole SOSF space to gain a holistic understanding of system failures, thus ensuring that measures are mutually exclusive and collectively exhaustive (MECE). The OP matrix can also capture the dynamic aspects of system failures over time, facilitating double-loop learning to change the model of activities.
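The quadrant logic of the OP matrix is small enough to express as a lookup table. The sketch below is illustrative only: the quadrant labels come from Fig. 4.4, while the function name and the sequence of management reviews are hypothetical.

```python
# Quadrant number and label for each (P, O) pair, as described for the OP matrix.
OP_MATRIX = {
    ("OK", "OK"): (1, "Normal"),
    ("NG", "OK"): (2, "Disjunction with SO space map"),
    ("OK", "NG"): (3, "Sign of system failure"),
    ("NG", "NG"): (4, "Hard approach effective"),
}


def classify(problem_repeated, goal_achieved):
    """Map one management review onto an OP matrix quadrant."""
    p = "NG" if problem_repeated else "OK"
    o = "OK" if goal_achieved else "NG"
    return OP_MATRIX[(p, o)]


# Hypothetical sequence of reviews: (similar problem repeated?, goal achieved?)
REVIEWS = [(False, True), (True, False), (False, False), (True, True), (False, True)]
TRAJECTORY = [classify(repeated, achieved)[0] for repeated, achieved in REVIEWS]
print(TRAJECTORY)
```

Tracking the quadrant across successive reviews, rather than only the latest one, is what lets a return to the first quadrant be recognized as part of a cycle instead of as a success.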
4.1.4 A New Cycle of Learning to Avoid System Failures
So far, the author has introduced various tools related to SOSF (the diagnostic flow of failures, the SO space map, and the OP matrix). These tools facilitate double-loop learning that transforms the model of the activity. The dynamic model analysis is effective in raising the level of the problem owner’s perspective. By extending the system
Fig. 4.6 New learning cycle for preventing system failures [diagram: vertical layers Reality, Model (change operating norm), and Meta Model (change mental model); horizontal phases Design and Operation over time t; tools: OP matrix, SO space map (goal setting), and SOSF with system failures flow diagnosis]
boundaries, Class 2 and 3 failures become Class 1 failures. This is an example of extending thinking to the meta-model layer in pursuit of a real-world solution. Thus, the first step is to extend the system boundary as widely as possible until the problem owner can manage the problem. By doing so, Class 2 and 3 failures become real problems under the problem owner’s control. As in the discussion of the dynamic model, the SO space map can be used to introduce an ultimate goal and resolve gaps in perception among stakeholders. Figure 4.6 shows a new cycle of learning that avoids system failures. The vertical dimension is consciously separated into reality, model, and meta-model to ensure that double-loop learning is facilitated. The learning cycle extends over the whole system life cycle, with the horizontal dimension separating the design and operational phases. The scenario for identifying system failures follows the basic steps of the system failure diagnostic flow:
1. Identify the location in the SOSF space of the system failure of interest.
2. Identify side effects due to dynamic movement.
3. Extend the system boundary as far as possible to convert system failures from Classes 2 and 3 to Class 1.
4. Formulate countermeasures according to the results of steps 1 to 3.
4.2 Application of FFSM to Long-Time Down Incidents
This section describes the results of applying FFSM to a maintenance system that manages incidents in which PC servers are down for long periods. It is necessary to analyze PC server incidents that occurred over a certain period, clarify the structure of the factors leading to long downtime, and assign appropriate quantitative weights to these factors.
Sample data for analysis.
Period: April to July 2004.
Number of samples: 58 incidents that occurred during the above period (incidents for which more than 3 h elapsed from occurrence to recovery).
Each incident was associated with the appropriate factor (or factors) from among the eight factors for long downtime, which produced a 58 × 8 incident-factor matrix (Appendix B). The eight factors, extracted from the engineers’ experience-based knowledge of dealing with long downtime, are as follows:
S1: Product.
S2: Isolation (diagnosis of defective parts).
S3: Maintenance organization (skills, size, and deployment).
S4: Maintenance parts sufficiency (parts placement and delivery).
S5: Failure of maintenance parts.
S6: Fix not applied (EC* not applied).
S7: Recovery process.
S8: Software bugs.
EC*: EC stands for Engineering Change and refers to the correction of problems in the field by engineers.
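The incident-factor matrix described above has one row per incident and one 0/1 column per factor. The sketch below shows that layout on three made-up incidents. The incident names and factor assignments are hypothetical, since the real 58 records are in Appendix B and are not reproduced here.

```python
FACTORS = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]

# Hypothetical incidents, each tagged with the factors judged to apply to it.
INCIDENTS = {
    "incident-01": {"S2", "S7"},
    "incident-02": {"S8"},
    "incident-03": {"S3", "S4", "S7"},
}


def incident_factor_matrix(incidents, factors):
    """Build one row per incident with a 1 wherever a factor applies."""
    return [[int(factor in assigned) for factor in factors]
            for assigned in incidents.values()]


MATRIX = incident_factor_matrix(INCIDENTS, FACTORS)
FACTOR_COUNTS = {factor: sum(row[j] for row in MATRIX)
                 for j, factor in enumerate(FACTORS)}

for name, row in zip(INCIDENTS, MATRIX):
    print(name, row)
print(FACTOR_COUNTS)
```

With the actual data the same construction would give the 58 × 8 matrix that Phase 2 takes as input.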
4.2.1 Phase 1 (Structural Model Analysis: ISM)
Figure 4.7 shows the direct influential matrix X* obtained by analyzing the causal relationships among the eight factors (S1–S8) described above. The direct influential matrix is a causality matrix whose rows and columns are both indexed by S1 to S8.
X* = (x_{jk}), where x_{jk} = 3, 2, or 1 denotes the strength of a direct causal relationship from row j to column k (3: strongly related; 2: moderately related; 1: mildly related), and x_{jk} is blank when there is no direct causal relationship from row j to column k.
Figure 4.8 shows the adjacent matrix A. Each element is either 1 or blank, with 1 indicating a direct causal relationship and blank indicating no direct causal relationship. Figure 4.9 shows the vector graph generated from Figs. 4.7 and 4.8, with arrows indicating the direct causal relationships between elements.
Fig. 4.7 Direct influential matrix X* [8 × 8 matrix over S1–S8; entries 3, 2, 1, or blank]
Fig. 4.8 Adjacent Matrix A
Fig. 4.9 Vector graph [directed graph over nodes S1–S8]
Figure 4.10 shows the reachable matrix T. It is obtained by adding the unit matrix I to the adjacent matrix A and raising the sum to successive powers under Boolean algebra. If (A + I)^{r-1} ≠ (A + I)^{r} = (A + I)^{r+1} = T, then T = (t_{ij}) is the reachable matrix. Each node is then checked against the condition R_i ∩ A_i = R_i, where
S = {S1, S2, ..., S8} is the set of nodes,
R_i = {S_j ∈ S | t_{ij} = 1} is the set of nodes reachable from node i (lower nodes), and
A_i = {S_j ∈ S | t_{ji} = 1} is the set of nodes from which node i is reachable (upper nodes).
R_i ∩ A_i = R_i means that S_i is a lowest node, since A_i encompasses R_i.
   | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8
S1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1
S2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0
S3 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1
S4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1
S5 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1
S6 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1
S7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
S8 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1
Fig. 4.10 Reachable matrix T (0 denotes a blank cell)
The first step is to find the lowest node among the eight nodes. As Table 4.1 shows, S7 is the lowest node and is therefore the level 1 node. Next, we find the next lowest node by removing the level 1 node S7 from Table 4.1. As Table 4.2 shows, S2 is the level 2 node. By repeating this process, we find that S4, S5, and S8 are the level 3 nodes (Table 4.3), that S6 is the level 4 node (Table 4.4), and that S1 and S3 are the level 5 nodes (Table 4.5). All eight nodes are thus classified into five levels. The significance of the levels is that higher-level factors are the root causes of lower-level factors.
Table 4.1 Level 1 node
Node | Ri | Ai | Ri∩Ai | Ri∩Ai = Ri
S1 | 1,2,4,5,6,7,8 | 1 | 1 |
S2 | 2,7 | 1,2,3,4,5,6,8 | 2 |
S3 | 2,3,4,5,6,7,8 | 3 | 3 |
S4 | 2,4,5,7,8 | 1,3,4,5,6,8 | 4,5,8 |
S5 | 2,4,5,7,8 | 1,3,4,5,6,8 | 4,5,8 |
S6 | 2,4,5,6,7,8 | 1,3,6 | 6 |
S7 | 7 | 1,2,3,4,5,6,7,8 | 7 | ○
S8 | 2,4,5,7,8 | 1,3,4,5,6,8 | 4,5,8 |
Table 4.2 Level 2 node
Node | Ri | Ai | Ri∩Ai | Ri∩Ai = Ri
S1 | 1,2,4,5,6,8 | 1 | 1 |
S2 | 2 | 1,2,3,4,5,6,8 | 2 | ○
S3 | 2,3,4,5,6,8 | 3 | 3 |
S4 | 2,4,5,8 | 1,3,4,5,6,8 | 4,5,8 |
S5 | 2,4,5,8 | 1,3,4,5,6,8 | 4,5,8 |
S6 | 2,4,5,6,8 | 1,3,6 | 6 |
S8 | 2,4,5,8 | 1,3,4,5,6,8 | 4,5,8 |
Table 4.3 Level 3 node
Node | Ri | Ai | Ri∩Ai | Ri∩Ai = Ri
S1 | 1,4,5,6,8 | 1 | 1 |
S3 | 3,4,5,6,8 | 3 | 3 |
S4 | 4,5,8 | 1,3,4,5,6,8 | 4,5,8 | ○
S5 | 4,5,8 | 1,3,4,5,6,8 | 4,5,8 | ○
S6 | 4,5,6,8 | 1,3,6 | 6 |
S8 | 4,5,8 | 1,3,4,5,6,8 | 4,5,8 | ○
To consider indirect causation in the causal analysis, it is necessary to introduce the normalized direct influential matrix X (Fig. 4.11). X is obtained by dividing each element of the direct influential matrix X* by its maximum row sum (11). The total influential matrix Z (Fig. 4.12), which includes indirect causality, is obtained by performing the following operation on X:
\[ Z = X + X^{2} + X^{3} + \cdots = X (I - X)^{-1} \]
The elements of the total influential matrix Z represent the relative weights of the causal relationships, direct and indirect. Figure 4.13 shows the overall structure of the eight factors across the five levels. The number attached to each arrow is the corresponding element of Z. As a check on the normalization, the S1–S2 element of the direct influential matrix X* in Fig. 4.7 is 3, and dividing it by the maximum row sum 11 yields 0.27, the S1–S2 element of the normalized direct influential matrix X in Fig. 4.11.
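The chain of matrix operations in Phase 1 can be retraced on a small example. The sketch below uses a hypothetical three-factor matrix, not the eight-factor data of Fig. 4.7, and all function names are illustrative. It derives the adjacent matrix, the reachable matrix by repeated Boolean squaring of A + I, the normalized matrix X, and the total influential matrix Z by summing the power series term by term. The series only approaches X(I − X)^{-1} when it converges, which holds for this acyclic example.

```python
# Hypothetical direct influential matrix X* for three factors F1..F3
# (0 stands for a blank cell, i.e., no direct causal relationship).
X_STAR = [
    [0, 3, 2],
    [0, 0, 3],
    [0, 0, 0],
]
N = len(X_STAR)


def bool_square(m):
    """Boolean product of a 0/1 matrix with itself."""
    return [[int(any(m[i][k] and m[k][j] for k in range(N))) for j in range(N)]
            for i in range(N)]


def reachable_matrix(adjacent):
    """Square (A + I) under Boolean algebra until it stops changing."""
    current = [[int(adjacent[i][j] or i == j) for j in range(N)] for i in range(N)]
    while True:
        squared = bool_square(current)
        if squared == current:
            return current
        current = squared


def total_influence(x, terms=100):
    """Accumulate X + X^2 + ... + X^terms term by term."""
    total = [[0.0] * N for _ in range(N)]
    power = [row[:] for row in x]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(N)] for i in range(N)]
        power = [[sum(power[i][k] * x[k][j] for k in range(N)) for j in range(N)]
                 for i in range(N)]
    return total


A = [[int(value > 0) for value in row] for row in X_STAR]  # adjacent matrix
T = reachable_matrix(A)                                     # reachable matrix
MAX_ROW_SUM = max(sum(row) for row in X_STAR)               # 5 here, 11 in the chapter
X = [[value / MAX_ROW_SUM for value in row] for row in X_STAR]
Z = total_influence(X)

print("A", A)
print("T", T)
print("Z", [[round(value, 2) for value in row] for row in Z])
```

In this example the F1 to F3 entry of Z exceeds the direct entry of X because it also carries the indirect path through F2, which is the same effect that raises the S1–S2 weight from 0.27 in X to 0.37 in Z.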
Table 4.4 Level 4 node
Node | Ri | Ai | Ri∩Ai | Ri∩Ai = Ri
S1 | 1,6 | 1 | 1 |
S3 | 3,6 | 3 | 3 |
S6 | 6 | 1,3,6 | 6 | ○
Fig. 4.11 Normalized direct influential matrix X [8 × 8 matrix over S1–S8; each entry of X* divided by 11]
Table 4.5 Level 5 node
Node | Ri | Ai | Ri∩Ai | Ri∩Ai = Ri
S1 | 1 | 1 | 1 | ○
S3 | 3 | 3 | 3 | ○
4.2.2 Discussion of Phase 1 Analysis
The top-level factors (i.e., root causes) at level 5 in Fig. 4.13 are S1 (product) and S3 (maintenance organization). S6 (fix not applied) is at level 4; S4 (maintenance parts sufficiency), S5 (failure of maintenance parts), and S8 (software bugs) are at level 3; S2 (isolation (diagnosis of defective parts)) is at level 2; and S7 (recovery process) is at level 1. That S2 and S7 sit at the two lowest levels is counter-intuitive, since these factors tend to be treated as direct causes of long downtime incidents. It indicates that simply improving isolation (S2) and the recovery process (S7) is not sufficient to reduce long downtimes. In addition, the position of S6 (fix not applied) at level 4 indicates that promoting the application of EC (Engineering Change: correction) to PC servers is an excellent
Fig. 4.12 Total influential matrix Z [8 × 8 matrix over S1–S8; entries include the indirect causal weights]
countermeasure. The main upper-level factors for S7 (recovery process) are S1 (product) (0.45), S8 (software bugs) (0.36), and S3 (maintenance organization) (0.35). The numbers in parentheses are the relative weights of the causal relationships, taken from Z. This suggests that the root causes of the long time required for recovery are product quality and the maintenance organization for software-related work. The main upper-level factor for S2 (isolation (diagnosis of defective parts)) is S1 (product) (0.37), followed by S8 (software bugs) (0.29). As with recovery, these weights indicate that product and software bugs are the root causes of the long time required for isolation.
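The five-level hierarchy discussed in this section can be retraced from the reachability sets R_i printed in Table 4.1. The sketch below repeats the peeling procedure: at each round it keeps the nodes whose reachability set lies inside their antecedent set and removes them before the next round. The function name and data layout are illustrative, and node numbers 1 to 8 stand for S1 to S8.

```python
# Reachability sets R_i as printed in Table 4.1.
REACH = {
    1: {1, 2, 4, 5, 6, 7, 8},
    2: {2, 7},
    3: {2, 3, 4, 5, 6, 7, 8},
    4: {2, 4, 5, 7, 8},
    5: {2, 4, 5, 7, 8},
    6: {2, 4, 5, 6, 7, 8},
    7: {7},
    8: {2, 4, 5, 7, 8},
}


def level_partition(reach):
    """Peel off, level by level, the nodes that satisfy R_i ∩ A_i = R_i."""
    remaining = set(reach)
    levels = []
    while remaining:
        # Restrict each R_i to the surviving nodes and rebuild A_i from them.
        r = {i: reach[i] & remaining for i in remaining}
        a = {i: {j for j in remaining if i in r[j]} for i in remaining}
        lowest = sorted(i for i in remaining if r[i] & a[i] == r[i])
        if not lowest:
            raise ValueError("no lowest node found; check the reachability sets")
        levels.append(lowest)
        remaining -= set(lowest)
    return levels


LEVELS = level_partition(REACH)
for depth, nodes in enumerate(LEVELS, start=1):
    print(f"Level {depth}:", ", ".join(f"S{n}" for n in nodes))
```

Starting from Table 4.1 alone, the procedure gives S7, then S2, then S4, S5, and S8, then S6, and finally S1 and S3, in agreement with Tables 4.2 to 4.5.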
4.2.3 Phase 2 (Quantification Theory Type III) Analysis
A total of 58 incidents were analyzed using the incident-factor matrix for the eight long-downtime factors (Appendix B). The author now applies Quantification theory type III, a well-known method of multivariate quantification analysis, to this matrix. For this analysis, the author used the PC software “Excel Statistics 2002 for Windows” [41]. Table 4.6 lists the first six factor axes with their eigenvalues and contribution ratios. The cumulative contribution of the first three axes in Table 4.6 is 53%. This indicates that about half of the long downtime incidents are related to these three axes. The first three axes are named as follows. Each name indicates a hidden factor extracted in the Phase 2 analysis.
• Axis 1, Isolation for faulty parts.
• Axis 2, Software recovery.
• Axis 3, Hardware Maintenance Organization.
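The columns of Table 4.6 are tied together by two simple relations, which the sketch below checks using the printed values. In Quantification theory type III the correlation coefficient of an axis is the square root of its eigenvalue, and the cumulative ratio is the running sum of the contribution ratios. The tolerances are assumptions chosen to absorb the rounding of the printed figures.

```python
import math

# (eigenvalue, contribution ratio %, cumulative ratio %, correlation coefficient)
# as printed in Table 4.6 for axes 1 to 6.
TABLE_4_6 = [
    (0.7960, 19.65, 19.65, 0.8922),
    (0.7124, 17.58, 37.23, 0.8440),
    (0.6558, 16.19, 53.41, 0.8098),
    (0.6090, 15.03, 68.44, 0.7804),
    (0.5343, 13.19, 81.63, 0.7309),
    (0.4157, 10.26, 91.89, 0.6448),
]


def check_axes(rows, tol_corr=2e-4, tol_pct=0.02):
    """Return, per axis, whether both relations hold within rounding tolerance."""
    running = 0.0
    report = []
    for eigenvalue, ratio, cumulative, corr in rows:
        running += ratio
        report.append((
            abs(math.sqrt(eigenvalue) - corr) <= tol_corr,
            abs(running - cumulative) <= tol_pct,
        ))
    return report


REPORT = check_axes(TABLE_4_6)
for axis, (corr_ok, cumulative_ok) in enumerate(REPORT, start=1):
    print(f"axis {axis}: sqrt(eigenvalue) matches={corr_ok}, running sum matches={cumulative_ok}")
```

The running sum of the first three ratios is 53.42, against the printed 53.41, a difference that is consistent with the table having been rounded from unrounded values.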
Fig. 4.13 Overall structure of eight factors in five levels [hierarchy: Level 5: S1, S3; Level 4: S6; Level 3: S4, S5, S8; Level 2: S2; Level 1: S7; arrows weighted by the elements of Z]
Figure 4.14 shows the factor scores for Axis 1. The factors with positive scores, S2 (isolation (diagnosis of defective parts)) and S3 (maintenance organization), are related to the maintenance process, while those with negative scores, S8 (software bugs) and S6 (fix not applied), are related to the product. Thus, the first axis is related to the maintenance process of identifying faulty components.
Table 4.6 Factor axes and attributes
Axis | Eigenvalue | Contribution ratio | Cumulative contribution ratio | Correlation coefficient
1st axis | 0.7960 | 19.65% | 19.65% | 0.8922
2nd axis | 0.7124 | 17.58% | 37.23% | 0.8440
3rd axis | 0.6558 | 16.19% | 53.41% | 0.8098
4th axis | 0.6090 | 15.03% | 68.44% | 0.7804
5th axis | 0.5343 | 13.19% | 81.63% | 0.7309
6th axis | 0.4157 | 10.26% | 91.89% | 0.6448
Fig. 4.14 Factor scores for the first axis (Isolation of faulty parts)
Figure 4.15 shows the factor scores on the second axis. The factors with positive scores, S8 (software bugs) and S3 (maintenance organization), are related to software recovery, while those with negative scores, S4 (maintenance parts sufficiency) and S6 (fix not applied), are related to the product. Thus, the second axis is related to the software recovery process. Figure 4.16 shows the factor scores on the third axis. The factors with positive scores, S3 (maintenance organization) and S1 (product), relate to hardware maintenance organizations, while the factor with a negative score, S8 (software bugs), relates to software bugs. Thus, Axis 3 is related to hardware maintenance organizations. Figure 4.17 shows the plane formed by Axis 1 (isolation for faulty parts) and Axis 2 (software recovery), with each factor mapped on the plane. Figure 4.18 shows the plane formed by Axis 1 and Axis 3 (hardware maintenance organization), with each factor mapped on the plane. The maintenance organization factor (S3) is located in the upper right (first) quadrant of both Figs. 4.17 and 4.18 and is closely related to all three axes. The isolation factor (S2) is close to zero on the second and third axes but has a high positive value on the first axis. This indicates that the isolation factor forms a single isolated group independent of Axes 2 and 3. The software bugs factor (S8) similarly
Fig. 4.15 Factor scores for the second axis (Software recovery)
has a high positive value for the second axis, a high negative value for the first axis,
Fig. 4.16 Factor scores for the third axis (Hardware maintenance organization)
and a value close to zero for the third axis. Therefore, we can say that the software bugs factor is strongly related to the software product that causes long downtimes. The other factors have values close to zero on all axes and therefore do not show any particular characteristics.
4.2.4 Discussion of Phase 2 Analysis
Each sample has three values associated with the three axes (Appendix C), and each sample incident is mapped to the first axis–second axis plane depending on the three scores (Fig. 4.19). The numbers in parentheses in Fig. 4.19 represent the number of sample incidents in the same group. In other words, “m” in the notation of Gm (n) represents the group number (1 ≤ m ≤ 12) and “n” the number of sample incidents. The numbers next to each data point in Fig. 4.19 indicate the number of sample incidents with the same factorial value. Fifty-eight sample incidents were
Fig. 4.17 Factor distribution in the first–second axes space
Fig. 4.18 Factor distribution in the first–third axes space
categorized into 12 groups, which are shown in Appendix C, based on their proximity to each other in the spatial coordinates. Figure 4.20 overlays the factor mapping (Fig. 4.17) on the sample incidents mapping (Fig. 4.19). It shows that G1 is located near S2 (isolation (diagnosis of defective parts)), G7 near S4 (maintenance parts sufficiency (parts placement and delivery)) and S5 (failure of maintenance parts), and G12 near S8 (software bugs). Therefore, G1, G7, and G12 correspond, respectively, to RAS (an acronym for
Fig. 4.19 Sample incidents mapping in the first–second axis space
Fig. 4.20 Sample group distribution in the first–second axes space
Reliability, Availability, and Serviceability, indicating non-functional requirements of ICT systems), maintenance parts, and software (i.e., software recovery or bugs). On the other hand, G8 and G10 are groups that cannot be unambiguously associated with the main factors related to long downtime. However, all incidents in G8 have the same symptom of simultaneous failure of multiple hard disks, which is different from the symptom in G10. Therefore, G8 and G10 are named the simultaneous failure of the multiple hard disks group and the other group, respectively.
60
4 Application to ICT System Failures
Table 4.7 Decision criteria

Quadrant I. Design
Decision criteria:
• S1: Product design.
• S4: Spare parts logistics (parts deployment).
• S6: EC application planning.
Countermeasures: (single-loop learning) Implement preventative measures within the design phase.

Quadrant II. Configuration
Decision criteria:
• S3: Maintenance organization (software/hardware engineer's organization).
• S5: Faulty spare parts.
• S6: EC application deployment.
• S8: Software recovery (software cause).
• Software engineer's intervention.
Countermeasures: (single-loop learning) Implement preventative measures within the configuration phase.

Quadrant III. Operation
Decision criteria:
• S2: RAS (including multiple parts usage).
• S7: Software recovery (including hardware causes; i.e., multi-dead HDD).
• Human error.
Countermeasures: (double-loop learning) Alter existing cognitive frame and implement preventative measures into the design or configuration phase.
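The decision criteria in Table 4.7 amount to a lookup from cause to responsible quadrant. The sketch below transcribes the table into a dictionary; the helper names `quadrants_for` and `needs_double_loop` and the string labels such as "S6 (planning)" are illustrative assumptions, not notation from the book.

```python
# Transcription of Table 4.7. S6 appears in two quadrants (planning vs. deployment),
# so it is split into two hypothetical labels here.
DECISION_CRITERIA = {
    "I. Design": {
        "learning": "single-loop",
        "causes": ["S1", "S4", "S6 (planning)"],
    },
    "II. Configuration": {
        "learning": "single-loop",
        "causes": ["S3", "S5", "S6 (deployment)", "S8", "software engineer's intervention"],
    },
    "III. Operation": {
        "learning": "double-loop",
        "causes": ["S2", "S7", "human error"],
    },
}


def quadrants_for(causes):
    """Return, in table order, every quadrant responsible for at least one cause."""
    return [
        quadrant
        for quadrant, row in DECISION_CRITERIA.items()
        if any(cause in row["causes"] for cause in causes)
    ]


def needs_double_loop(causes):
    """True if any responsible quadrant calls for double-loop learning."""
    return any(
        DECISION_CRITERIA[q]["learning"] == "double-loop" for q in quadrants_for(causes)
    )
```

Because one incident can carry several causes, `quadrants_for` may return more than one quadrant, which is consistent with the quadrant counts in Sect. 4.2.6 adding up to more than the 58 incidents.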
4.2.5 Phase 3 (Exploring the System: Become Aware of the Meaning)

This section is devoted to further discussion of G8 and G10, which involve a large number of factors and for which no feasible countermeasures could be found in Phases 1 and 2. The quadrant numbers in Appendix C correspond to the quadrant numbers in Fig. 3.12 and indicate the quadrant responsible for managing the causes. Table 4.7 shows which quadrant is responsible for managing each cause and the measures taken to address it. For system failures whose root causes relate to RAS, software recovery, and human error, countermeasures can be sought that transform the existing framework of maintenance awareness (i.e., double-loop learning), using the new insights gained from this analysis.
4.2.6 Discussion of Phase 3 Analysis

Using the criteria presented in Table 4.7, 48 (83%) of the total 58 incidents are classified in quadrant III. In addition, 20 (34%) and 14 (24%) incidents are classified in quadrants I and II, respectively (Fig. 4.21). In terms of proactive prevention, there are more reactive measures (quadrant III) than proactive measures (quadrants I and II), which indicates that further learning is essential. In other words, the ability to analyze the causes of failure and to understand why preventative measures could not be taken is necessary. According to the analysis of the operations subsystem (quadrant III in Fig. 4.21 and Appendix C), three main factors affect long downtimes. The first factor is the RAS function (43 incidents), the second factor is the skill of the technicians (30 incidents), and the third factor is human error (1 incident). Only one incident is related to human error, but because human error is a characteristic of human activity and its occurrence is inevitable, it must be properly managed. Measures for human error are discussed in Chap. 8 as a case study of human activity systems. As a result, the Phase 3 analysis leads to the following three new worldviews for PC server maintenance (Table 4.8).
(a) Develop RAS functions (related to system startup).
(b) Train multi-skilled (hardware and software) technicians (to handle multiple simultaneous hard disk failures).
(c) Develop fail-safe functions to avoid major accidents caused by human error.

Fig. 4.21 Factors contributing to PC server extended downtimes and a maintenance cognitive frame
[Figure: the four quadrants (I. Design, II. Configuration, III. Operation, IV. Evaluation) arranged between the conceptual world and the real world and between predictable and unpredictable causes. Quadrant I: 20 incidents. Quadrant II: 14 incidents. Quadrant III: 48 incidents, comprising (a) unable to isolate (RAS), 43 incidents (37 unable to start up); (b) software recovery, 30 incidents (15 multi-dead recovery); (c) human error, 1 incident. The learning cycle reads: alter existing cognitive frame; observe new cognitive frame; learn new cognitive frame ((a) develop RAS features, (b) foster hybrid engineers, (c) develop fail-safe features); define maintenance cognitive frame and operation rule; define operational organization.]

Table 4.8 A new cognitive frame obtained through Phase 3 analysis
(a) Clarified cause for the extended downtime: Unable to isolate the cause. New worldview: Development of RAS features. Related factor: S1.
(b) Clarified cause for the extended downtime: Unable to recover with hardware engineer (need software engineer's intervention). New worldview: Fostering of hybrid engineers. Related factor: S3.
(c) Clarified cause for the extended downtime: Human error leads to a catastrophic situation. New worldview: Development of fail-safe features. Related factor: S1.
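The percentages quoted in this discussion follow directly from the quadrant counts. As a quick check, the sketch below recomputes them; the variable names are illustrative, and the counts are the ones given in the text and in Fig. 4.21.

```python
TOTAL_INCIDENTS = 58

# Quadrant counts as reported in Sect. 4.2.6 and Fig. 4.21.
QUADRANT_COUNTS = {"I. Design": 20, "II. Configuration": 14, "III. Operation": 48}

# Share of all incidents per quadrant, rounded to whole percent.
shares = {q: round(100 * n / TOTAL_INCIDENTS) for q, n in QUADRANT_COUNTS.items()}

# The counts add up to more than 58, so some incidents are assigned to several quadrants.
extra_assignments = sum(QUADRANT_COUNTS.values()) - TOTAL_INCIDENTS
```

The shares come out as 34%, 24%, and 83%, matching the text. The quadrant assignments total 82, that is, 24 more than the number of incidents, so the three percentages describe overlapping sets and should not be read as a partition of the 58 incidents.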
Countermeasure (a) is intended to manage emergent failures related to the interfaces between components, because the initial design of those interfaces cannot handle exceptional problems (emergent problems caused by combinations of components) at system startup. Countermeasure (b) is intended to manage interface problems between hardware and software engineers. Countermeasure (c) is intended to manage human errors (the man–machine interface) that lead to major accidents such as long downtime.
4.3 Application of SFDM to Server Noise Problems

4.3.1 Design Failure or Installation Failure?
This case study concerns noise complaints from several customers using PC servers in an office environment. It took a long time for the server manufacturer to change the design specifications to a noise level acceptable for office use. Initially, there was a design convention for noise levels, and the server's noise level complied with that convention, so the problem was not perceived as a design failure. However, the designer's assumption that the server would be used inside a machine room had not been communicated to the users. The complaints initially went unremedied because the problem was perceived as a Class 1 failure of structure within the SOSF space. This was because it was not escalated to the management level, owing to the degradation of safety goals and the side effect of reduced incident reporting. If the whole SOSF space is reviewed using the diagnostic flow in Fig. 4.2, this failure is identified as a Class 3 evolutionary failure, in which the designer's goals and the installer's (or customer's) goals become misaligned over time. In addition, as shown in the SO space map in Fig. 4.22, this perceived gap originates outside the designer's system boundary. In Fig. 4.22, (G) or (L) is shown before each stakeholder's claim: (G) represents the goal or objective of the stakeholder, and (L) represents the system level of the VSM that is malfunctioning.

Since the root cause lies in a framework that should be treated as a soft system (Checkland [13]; Checkland and Holwell [15]), measures for systems 1–3 alone are not sufficient, and measures for systems 4 and 5 are needed to change the design norms for server noise levels. It is important to carry the measures for systems 4 and 5 over to other server types (e.g., UNIX servers as opposed to PC servers). This will change the design norms of the other servers and avoid the same problem there. Figure 4.23 shows the difference in avoidability levels across the reality, model, and meta-model levels. This difference can be verified using the dynamic model. If the problem is treated as a Class 1 failure, the long-term side effects are degradation of safety goals and reduced incident reporting (Fig. 4.24). Figure 4.25 shows that treating the noise problem as a Class 3 failure can lead to essential improvements. In this sense, the meta-methodology is powerful, and system 4 and 5 problems should be escalated to the management level. Doing so would also reduce the unnecessary costs incurred before the final decision to lower the noise level criteria in the design phase.

Fig. 4.22 SO space map for the server noise problem
[Figure: claims of the designer and the installer plotted in the SO space (axis labels O, I, D, S):
(G) Installer should set up server in machine room environment. (L) Installer's System 1 must be rectified.
(G) Customer specified installation in office environment. The installation manual does not prohibit the installation of that server in an office environment.
(G) Server noise level conforms to current norm.
(G) Server noise level should encompass office use. (L) System 4 should escalate change in server environment to designer's System 5.]
Fig. 4.23 PC server noise failure and prevention level
[Figure: error space and error prevention space across three levels. Reality: an ad hoc solution for a specific user (soundproof wall) gives prevention for a specific PC server. Modeling: changing the noise level design norm for PC servers gives prevention for all PC servers. Meta modeling: changing the noise level design norms for both PC and UNIX servers gives prevention for all PC servers and all UNIX servers.]

Fig. 4.24 Class 1: evolutional noise failure (with the erosion of goals or incident reporting)
[Figure: causal loop diagram with RUC and BIC loops linking "Installation in Machine Room", "Installation in Office Environment", "Noise Problem", the Class 1 treatment (failure of structure and control), the Class 3 label, and "Erosion of Goals; Incentive to Report Fewer Incidents".]

Fig. 4.25 Class 3: evolutional noise failure
[Figure: causal loop diagram with BIC loops linking "Introduce Absolute Goal", "Goal Awareness Gap", "Adjust Noise Design Goal", "Installation in Machine Room", "Installation in Office Environment", and "Noise Problem", with the labels Class 1 (failure of evolution: design error) and Class 3 (evolutional error).]
Chapter 5
Discussion of the Application Results
Abstract This chapter summarizes and discusses the findings from the applications in the previous chapter. In the application of FFSM to the problem of prolonged downtime of ICT systems, Phase 1 identified the root causes, Phase 2 identified the factors behind the complex phenomenon, and Phase 3 found a workaround for the emergent problem. This indicated that FFSM facilitates double-loop learning even in the technically well-established world of server maintenance and that FFSM can apply to a wide range of human activity systems. The boundary conditions for FFSM applicable areas clarified in Sect. 3.2 are:
– The system components encompass human activity processes.
– The behavior of the system depends on human knowledge.
– Causal relationships for results are complex and unclear.
The countermeasures for the human errors detected in Phase 3 are discussed in Chap. 8. In Sect. 5.2, three learnings were obtained regarding the results of the application of SFDM.
1. Extend the system boundary as far as possible to convert failure Classes 2 and 3 to 1.
2. Close stakeholder disjunctions (eliminate the perception gap among stakeholders) to reduce Class 2 failures.
3. Determine absolute targets to reduce Class 3 failures.

Keywords Failure factors structuring methodology (FFSM) · System failure dynamic model (SFDM) · Human error · Human activity system (HAS)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_5
5.1 Results of the Application of the FFSM

The following points were clarified in each phase of the FFSM. First, in the structured model analysis of Phase 1, the overall structure of the factors affecting long downtime was clarified numerically. In Phase 2, a quantification theory type III analysis, three axes of factors contributing to long downtime were extracted, and each sample incident was mapped into the factor space. The 58 sample incidents were then classified into 12 groups based on the proximity of their coordinates in that space (Appendix C), and countermeasures were determined for each group. However, because G8 and G10 had multiple factors and were located in the center of the spatial map (Fig. 4.20), it was difficult to find appropriate countermeasures for them. The Phase 3 analysis revealed three new worldviews on PC server maintenance. All of them are related to factors S1 (product) and S3 (maintenance organization), which are the top-level factors (i.e., root causes) obtained from the Phase 1 analysis (i.e., Fig. 4.13 shows S1 and S3 as the top-level causes, and Table 4.8 shows that all three new worldviews are related to S1 or S3). It is also important to check the results of these measures by applying the FFSM periodically.

Finally, the author confirms that FFSM manages emergent characteristics by incorporating the three success factors introduced in Sect. 3.1. This means that it was as effective as SSM (Checkland and Scholes [15]), which is capable of managing emergent characteristics. However, Jackson [38] points out that one of the criticisms of SSM is that "for hard systems thinkers, SSM provides only a limited perspective on why problems occur" (Jackson [38], pp. 202–207). In this book, FFSM demonstrates its potential to complement SSM and overcome its shortcomings by providing a quantitative rationale for why problems occurred and the appropriate countermeasures to address them. Table 5.1 summarizes the results revealed by the FFSM. In other words, Phase 1 clarified quantitative causal relationships among the causes of failure, Phase 2 revealed the three hidden factors of long downtime, and Phase 3 identified three new worldviews for preventive measures.

Table 5.2 compares the expected effects of the current (reactive) and new (proactive) worldviews. It shows that the improvement under the new (proactive) worldview (83%) is much greater than the improvement under the current (reactive) worldview (58%). In summary, the results of the application confirm that FFSM has the important ability to transform industrial maintenance systems by shifting from reactive to proactive methods. The conditions for the applicability of FFSM, as indicated in Sect. 3.2.2, show that it spans the entire range of human activity systems. The countermeasures for human errors detected in Phase 3 of FFSM are discussed in Chap. 8.
Table 5.1 Summary of the results obtained by FFSM

Phase 1
Objective: To discover root causes
Result:
• S1 (product) and S3 (maintenance organization) are the uppermost factors (i.e., root causes), and S2 (isolation) and S7 (recovery process) are the lowest factors contributing to extended downtimes.
• The major upper factors (i.e., root causes) of S7 (recovery process) are S1 (product) (0.45), S8 (software bug) (0.36), and S3 (maintenance organization) (0.35). The number in parentheses indicates the relative weight of the related factor. This indicates that product and software-related maintenance organizations are the root cause for an extended period being needed for recovery.
• The major upper factors (i.e., root causes) of S2 (isolation) are S1 (product) (0.37) followed by S8 (software bug) (0.29). The numbers in parentheses also indicate that the product and software bugs are the root causes for an extended period being needed for recovery.
• Among the sample incidents, 45% had multiple factors (i.e., 26 out of 58 incidents) (Appendix B).

Phase 2
Objective: To extract hidden factors behind complex symptoms
Result:
• The hidden factors contributing to extended downtimes have three causes represented by three axes: the first axis (isolation of faulty parts), the second axis (software recovery), and the third axis (hardware maintenance organization).
• G1, G7, and G12 are the groups whose dominant factors causing extended downtime are RAS, spare parts, and software (i.e., software recovery or bug).
• G8 (multi-dead) and G10 (miscellaneous) are groups that cannot be related to a dominant factor leading to extended downtime.

Phase 3
Objective: To discover preventative measures for emergent properties
Result:
• Phase 3 analysis creates a new worldview of PC system maintenance (Table 4.8 and Fig. 4.21).
• All three worldviews are countermeasures for emergent problems, none of which can be managed proactively in the design or configuration phase.
• The three new worldviews (Table 4.8) are new inputs to FFSM to confirm further improvement.
Table 5.2 Expected performance improvement for reactive vs. proactive measures

Reduction in reactive maintenance
Reactive measure under the current worldview:
• 20 incidents (34%) in quadrant I will reduce extended downtime.
• 14 incidents (24%) in quadrant II will reduce extended downtime.
Proactive measures under the new worldview: N/A

Gain due to proactive maintenance
Reactive measure under the current worldview: N/A
Proactive measures under the new worldview:
• 48 incidents (83%) in quadrant III will reduce extended downtime.
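The 58% and 83% figures compared in Sect. 5.1 can be reproduced from the rows of Table 5.2. The sketch below is only an arithmetic check with illustrative variable names; whether the quadrant I and II incident sets are disjoint is not stated in this excerpt, so the combined count is treated as an upper bound.

```python
TOTAL_INCIDENTS = 58

# Incident counts from Table 5.2.
reactive_counts = {"quadrant I": 20, "quadrant II": 14}
proactive_count = 48  # quadrant III

# The book's 58% is the sum of the two rounded percentages (34% + 24%).
reactive_pct_from_rounded = sum(
    round(100 * n / TOTAL_INCIDENTS) for n in reactive_counts.values()
)

# Summing the counts first gives a slightly different value; it is an upper bound
# because an incident assigned to both quadrants would be counted twice.
reactive_pct_from_counts = round(100 * sum(reactive_counts.values()) / TOTAL_INCIDENTS)

proactive_pct = round(100 * proactive_count / TOTAL_INCIDENTS)
```

Summing the rounded percentages gives 58, summing the counts first gives 59, and the proactive share is 83. Either way, the gap between the reactive and proactive figures that the text relies on remains about 24 to 25 percentage points.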
5.2 Results of the Application of SFDM
The diagnostic flow of failures (Fig. 4.2) encourages the transformation of mental models, operational norms, and current processes. In this case study, the mental model was changed, absolute (ideal) goals were introduced through competitor benchmarking, operational norms were changed, and design goals (i.e., noise levels) and current processes were changed. The OP matrix in Fig. 5.1 shows the dynamic transition through the six stages of the development of system failures (Table 3.7). Double-loop learning is achieved through the incubation period, with some side effects (i.e., misunderstanding the failure class, reinforcing the current action, and an incentive to report fewer incidents).

Fig. 5.1 Learning cycle through an incubation period
[Figure: OP matrix with axes O (goal achievement, OK/NG) and P (problem repetition (side effect), OK/NG), showing Stage I, Stage II (incubation period), and Stage VI (Fig. 4.25), together with the side effects: incentive to report fewer incidents (Fig. 4.24), misunderstanding failure class (Fig. 3.16), and reinforcing current action (Fig. 3.16).]

The concept of an inquiring system (IS) introduced by Van Gigch [79] describes how black box concepts are clarified in the decision-making process. Epistemology consists of the thinking and reasoning processes by which an IS acquires and guarantees its knowledge. Furthermore, epistemology transforms evidence into knowledge and problems into solutions, designs, and decisions. Learning at the meta-level modifies mental models; at the model level, it modifies desired goals and current operational norms; and at the reality level, it modifies operations. The outcome of double-loop learning is the epistemology of experience, and the application example in Sect. 4.3 shows that the proposed meta-methodology facilitates double-loop learning in practice. The perceptions gained through this case study are the following:
1. Extend the system boundary as far as possible to convert failure Classes 2 and 3 to 1.
2. Close stakeholder disjunctions (eliminate the perception gap among stakeholders) to reduce Class 2 failures.
3. Determine absolute targets to reduce Class 3 failures.
This case study demonstrates the effectiveness of the methodology. If the level of the measures is raised to the meta-model layer, their effectiveness is magnified. Otherwise, similar problems will recur after a certain time. The current mainstream methodology is effective only for Class 1 failures. By utilizing the SOSF and related tools to reflectively recognize system failures, it is possible to achieve technical safety even in an uncertain and rapidly changing environment.
Chapter 6
Transformation of SOSF Space into Topological Space to Quantify and Visualize Risk
Abstract A method is presented for mitigating system failures. Current state-of-the-art methodologies and frameworks are strong as a common language for understanding system failures holistically among various stakeholders. On the other hand, they fall short in quantitative aspects, which is the major obstacle to assessing the effectiveness of measures intended to mitigate system risk. To overcome this shortcoming, this chapter expresses system risk numerically through the coupling and interaction factors between system configuration elements, together with the system failure frequency rate. These three numbers (i.e., coupling, interaction, and frequency) define a three-dimensional space, and tracing a system's trajectory in that space through time visualizes system risk trends, which are the targets for creating effective preventative measures against system failures. A root cause of a system failure is discovered by applying a System Dynamics technique to the trajectory of the system's risk location; then, based on the root cause, effective countermeasures are extracted. Lastly, this methodology is applied to system failure cases in various ICT systems, and countermeasures are extracted. The application example of ICT system failures exhibits the effectiveness of this methodology.

Keywords Risk management · Crisis management · Normal Accident Theory (NAT) · High-Reliability Organization (HRO) · Information and Communication Technology (ICT) · System Dynamics
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_6

6.1 Background

In this book, a meta-methodology for holistically examining system failures is proposed to prevent their further occurrence. This meta-methodology was introduced as a system of system failures (SOSF). SOSF is represented in a three-dimensional space. In addition, a topological method for monitoring failure events within an SOSF space was presented to visualize the trajectory of system failures. A method was developed for quantifying the risk factors of a system failure so that they can be monitored and compared among systems, and its usage promotes system safety and reliability. The method was introduced using an interaction and coupling (IC) chart based on normal accident theory. An IC chart is used to classify object systems based on their interaction (linear or complex) and coupling (tight or loose); however, its effectiveness is limited because the classification is subjective. The proposed method measures the risk factors quantitatively (i.e., objectively) and thus compensates for the subjectivity of an IC chart. Application examples in information and communication technology (ICT) engineering demonstrate that using the proposed method to monitor the risk factors quantitatively helps improve the safety and quality of various object systems.
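The idea of following a system through a three-dimensional risk space can be sketched as a sequence of points. The example below is an illustration under stated assumptions only: the class `RiskPoint`, the numeric values, and the use of Euclidean distance between consecutive points are all hypothetical; the book's own metrics are introduced in Sect. 6.4 and are not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskPoint:
    """One observation of a system in a (coupling, interaction, frequency) space."""
    period: str
    coupling: float      # hypothetical scale: 0 = loose, 1 = tight
    interaction: float   # hypothetical scale: 0 = linear, 1 = complex
    frequency: float     # hypothetical failure frequency rate

    def as_vector(self):
        return (self.coupling, self.interaction, self.frequency)


def step_lengths(trajectory):
    """Euclidean distance moved between consecutive observations (an assumed metric)."""
    return [
        math.dist(a.as_vector(), b.as_vector())
        for a, b in zip(trajectory, trajectory[1:])
    ]


def largest_shift(trajectory):
    """Return the pair of periods between which the system moved the most."""
    steps = step_lengths(trajectory)
    i = max(range(len(steps)), key=steps.__getitem__)
    return trajectory[i].period, trajectory[i + 1].period


# Hypothetical trajectory, not data from the book.
trajectory = [
    RiskPoint("2021", 0.2, 0.3, 0.10),
    RiskPoint("2022", 0.5, 0.7, 0.10),
    RiskPoint("2023", 0.5, 0.7, 0.40),
]
```

In this made-up trajectory the system first becomes more tightly coupled and more complex while its failure rate stays flat, and only afterwards does the failure rate rise. Tracking the path rather than a single snapshot is what lets such a lag be seen.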
6.1.1 Summary of SOSF and Further Extension

In this chapter, a new methodology is proposed to enhance the safety and reliability of a system by visualizing its risk over time. In this sense, this is the core chapter that gave the book its title. The methodology used for understanding the current state of a failure has numerous shortcomings (Nakamura [61, 62]). The first is the lack of a common language for discussing a failure, which makes it difficult for the stakeholders involved to reach a common understanding of the problem. Consequently, similar failures can occur repeatedly. A meta-methodology called a system of system failures (SOSF) was introduced to address this first issue. The second is that it is difficult to understand how the nature of failures changes over time, as opposed to understanding each failure separately. To address the second issue, SOSF is treated as a failure space, and a topology is introduced into that space so that individual failures can be understood in relation to one another rather than in isolation. In this chapter, the first and second shortcomings are linked and organized to improve the safety and reliability of the system. Finally, the new methodology was applied to ICT failures, and its effectiveness was confirmed. ICT was chosen because its environmental changes are severe. These changes were identified in Gartner's study on IT trends [19]. There are three major concerns related to ICT risk: virtualization, fabric technology, and big data. These three concerns will remain the same for 2023. An applied example shows that the risk quantification/visualization methodology helps improve the safety and reliability of ICT systems by capturing the evolution and increasing complexity of ICT technology brought about by virtualization. This application opens up the possibility of an expansion to other social systems.

Chapter 6 is composed of six sections. Section 6.1 describes the significance of introducing metrics into the SOSF proposed so far in order to visualize risks. Section 6.2 surveys current methodologies; for this purpose, the author revisits Chap. 2 to summarize the relevant issues. Section 6.3 introduces the new SOSF meta-methodology, which overcomes the shortcomings of current approaches. Section 6.4 introduces new metrics into the SOSF space for a topological representation of system failure risk factors. Section 6.5 provides two application examples of ICT system failures. Finally, Section 6.6 concludes the chapter by describing the results of the two examples, the effectiveness of this risk visualization method, and areas of future research.
6.2 Review of Current Methodologies (Revisit Chap. 2)

6.2.1 Features of Existing Structuring Methodologies and Risk Analysis Techniques

Two methodologies are widely used: failure mode effect analysis (FMEA) [33] and fault tree analysis (FTA) [34]. FMEA reveals, in table form and in a bottom-up manner, the linear relationship between the causes and consequences that lead to the final event. By contrast, FTA clarifies the causes of a final event in a top-down manner as a logic diagram. Both methodologies are primarily employed in the design phase. However, they depend heavily on personal experience and knowledge. FTA, in particular, tends to miss some failure modes among failure mode combinations, particularly emergent failures. There are two main reasons for missing failure modes. First, FMEA and FTA are rarely applied simultaneously. Accordingly, the identified elements are not guaranteed to be mutually exclusive and collectively exhaustive. Second, current approaches use a linear link between cause and effect, making it difficult to see complex issues involving many different stakeholders.

Other major risk analysis techniques (including FMEA and FTA) have been described in various studies (Bell [6]; Wang and Roush [83], Chap. 4; Beroggi and Wallace [7]). Most failure analyses and studies are based on either FMEA or FTA. The existing structuring methodologies and risk analysis techniques do not sufficiently utilize a holistic approach. Many current methodologies connect cause and effect linearly and cannot respond appropriately to problems that involve many different stakeholders or that are subject to severe changes in the external environment. Therefore, what most current methodologies lack is the capability to tackle emergent problems caused by the complex relationships among the many stakeholders and by the external environmental changes surrounding the problem.

A typical methodology applying a holistic approach is soft systems methodology (SSM; Checkland and Holwell [15]). SSM can manage emergent properties and thus implement preventative measures. Unlike the other current methodologies, SSM addresses the above problem by revealing the nature of the stakeholders surrounding the problem in terms of customers, actors, transformation processes, owners, and worldviews. Current methodologies tend to lose a holistic view of the root causes of a system failure. In addition, although most of them may be able to clarify the problem structures to confirm the effectiveness of a preventative measure, they do not properly monitor system failure trends over time. Therefore, systems often exhibit similar failures.
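The top-down, logic-diagram character of FTA described in Sect. 6.2.1 can be illustrated with a very small fault tree evaluator. This is a generic sketch, not a tool from the book: the tree, the event names, and the function `evaluate` are hypothetical and only AND/OR gates are supported.

```python
def evaluate(node, basic_events):
    """Evaluate a fault tree top-down.

    A node is either a basic-event name (str) or a tuple (gate, child, child, ...),
    where gate is "AND" or "OR". basic_events maps event names to True/False.
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, *children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unsupported gate: {gate}")


# Hypothetical tree: the top event (long downtime) occurs if fault isolation fails,
# OR if a disk fails AND no spare part is available.
TOP_EVENT = ("OR", "isolation_failed", ("AND", "disk_failed", "spare_part_unavailable"))

scenario = {
    "isolation_failed": False,
    "disk_failed": True,
    "spare_part_unavailable": True,
}
long_downtime = evaluate(TOP_EVENT, scenario)
```

The tree can only report combinations that the analyst thought to draw. A failure arising from an interaction that is not a branch of the tree is invisible to this evaluation, which is the limitation regarding emergent failures that the text points out.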
6.2.2 Issues and Challenges of Current Troubleshooting Methodologies

All engineering systems are designed to achieve their goals. Events in which such a system fails to achieve its goals (i.e., system failures) can be attributed to an insufficient design. As Turner and Pidgeon [76] pointed out, a system failure may be defined as a characteristic of subsystems that do not contribute to the fulfillment of the goals of the supersystem. Alternatively, a system failure is the "termination of the ability of an item to perform its required function" [62]. The predominant technology in current ICT troubleshooting is a pre-defined goal-oriented model. For this model, Van Gigch [79] highlighted the main shortcomings of system improvement as follows:
• Engineers tend to try to find malfunctions inside the system boundary.
• Engineers tend to focus on returning the system to normal. Long-term improvements cannot be achieved through operational improvements.
• Engineers tend to have incorrect and obsolete assumptions and goals. In most organizations, the formulation of assumptions and goals is not explicit. In this context, fostering improvements to the system is senseless.
• Engineers tend to act as "planner followers" rather than as "planner leaders." In a system design concept, the planner must be a leader planning to influence trends, rather than a follower planning to satisfy trends.
Explanations of system failures in terms of a reductionist approach (i.e., an event chain of actions and errors) are not useful for improving system designs (Rasmussen [66]; Leveson [49]). In addition, Perrow [64] argued that the conventional engineering approach of ensuring safety by building in more warnings and safeguards fails because the system's complexity makes failures inevitable. The following four key features have commonly been pointed out as limitations of current troubleshooting methodologies in ICT system environments:
1. Most methodologies have a reductionist perspective. This makes it difficult to understand the real meaning of the countermeasures, that is, whether they are effective or merely tentative.
2. The current mainstream troubleshooting approach applies a cause-effect analysis (or event chain analysis) to determine the real root causes. FMEA [33] or event trees utilize forward sequences, and FTA [34] or fault trees utilize backward sequences.
3. The speed of intense technological advances creates critical misunderstandings among stakeholders. Current methodologies cannot properly manage the disjunction among stakeholders.
4. An improvement of the deviation from the operating norm is bound to fail; as Van Gigch [79] pointed out, the treatment of a system problem is bound to fail when it merely improves the operation of existing systems.
To summarize these four points, the current methodology focuses on the following three issues:
• The system does not meet the pre-defined goals.
• The system does not produce the predicted results.
• The system does not work as expected during the design phase.
The basic assumption behind such improvement is that the goal and operating norm are static, predetermined in the design phase, and based on hard-system thinking. These four key features and three issues hinder the examination of system failures from a holistic standpoint, making it difficult to manage the soft, systemic, emergent, and dynamic aspects of such failures. Based on the discussion in this section, there are four major shortcomings of the current methodologies:
1. A lack of a methodology covering the worldviews of multiple stakeholders (Sect. 6.2.2, shortcoming (3)).
2. A lack of a methodology covering emergent failures (Sections 6.2.1 and 6.2.2, shortcomings (2) and (4)).
3. A lack of a methodology covering a holistic view of system failures rather than a reductionistic view (Sect. 6.2.2, shortcoming (1)).
4. A lack of a methodology to monitor system failure trends over time (Sect. 2.1).
Reflecting on the SOSF introduced in Chap. 3, Sect. 6.3 confirms that the SOSF covers the first, second, and third shortcomings. In Sect. 6.4, a risk quantification/visualization method is introduced to address the fourth shortcoming.
6.3 Overview of Introducing Topology into SOSF Space SOSF promotes double-loop learning to overcome the first, second, and third shortcomings mentioned in the previous section. Double-loop learning is indispensable for reflecting on whether operating norms (i.e., mental models) are effective (Morgan [53]; Argyris and Schoen [1]; Senge [73]). A meta-methodology was used to transform the mental models (Van Gigch [79]; Rasmussen [66]; Leveson [49]; Perrow [64]), as described in Sect. 2.2. The system of system methodologies (SOSM) proposed by Jackson [38] is a typical and widely recognized meta- methodology. Its main features are the use of a meta-systemic approach (soft system thinking to promote double-loop learning) and complementarianism by
encompassing multiple paradigms depending on the state of the problem. FTA [34], FMEA [33], and most other analysis types, as discussed in Sect. 3.2 (Fig. 3.6), belong to the simple unitary domain in SOSM. To overcome the first shortcoming (i.e., covering multiple worldviews of stakeholders), SOSF is designed on the SOSM base that covers multiple stakeholders (i.e., the plural domain in SOSM). In particular, SOSF was developed by placing each type of failure from the system failure taxonomy [75] onto a two-dimensional SOSF (left side of Fig. 6.1). Notably, the recursive and hierarchical features of SOSF depend on the viewpoint of the system. These features form a system as a structural aggregation of subsystems, where each subsystem has its own SOSF. It is therefore important to note the hierarchical and relative structures (i.e., a technology failure may be an evolutionary failure depending on the viewpoint of the subsystem). To overcome the second and third shortcomings (i.e., to cover emergent failures and a holistic view), the author introduced a third dimension, namely the failure class. Three failure classes were defined to address the emergent and holistic aspects of a system failure. As Nakamura and Kijima [57] pointed out, failures are classified according to the following criteria:
• Class 1 (deviance failure): The root causes are within the system boundary, and conventional troubleshooting techniques are applicable and effective.
• Class 2 (interface failure): The root causes are outside the system boundary but are predictable at the design phase.
• Class 3 (foresight failure): The root causes are outside the system boundary and are unpredictable during the design phase.
The right-hand side of Fig. 6.1 (i.e., three-dimensional SOSF) is an expansion of the two-dimensional SOSF (left-hand side of Fig. 6.1) with the addition of the system failure dimension (i.e., three failure classes).
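The three failure classes can be read as a two-question decision rule: is the root cause inside the system boundary, and if not, was it predictable at the design phase? The following is a minimal sketch of that rule. The names `FailureClass` and `classify_failure` are illustrative assumptions and do not come from the original methodology.

```python
from enum import Enum


class FailureClass(Enum):
    DEVIANCE = 1   # Class 1: root cause inside the system boundary
    INTERFACE = 2  # Class 2: outside the boundary, predictable at design
    FORESIGHT = 3  # Class 3: outside the boundary, unpredictable at design


def classify_failure(inside_boundary: bool, predictable_at_design: bool) -> FailureClass:
    """Apply the two criteria in order: boundary first, then predictability."""
    if inside_boundary:
        # Predictability is not part of the Class 1 criterion, so it is ignored here.
        return FailureClass.DEVIANCE
    if predictable_at_design:
        return FailureClass.INTERFACE
    return FailureClass.FORESIGHT
```

In practice, deciding the two boolean inputs is the hard part and depends on the viewpoint of the subsystem, as noted above; the sketch only fixes the order in which the criteria are applied.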
6.4 Proposed Methodology for Introducing Topology (Risk Quantification/Visualization Methodology)
This section introduces the IC chart and close-code metrics and describes how topological metrics are introduced into the SOSF space to quantitatively monitor system risk over time.
6.4.1 Normal Accident Theory and IC Chart
It is not unusual for several failures to happen sequentially or simultaneously. None of them is catastrophic in itself; however, the complex (i.e., unexpected) interaction of those failures may have catastrophic results. Tight coupling of a
Fig. 6.1 SOSF formulation process
component involves a cascade of single-point failures that quickly reaches a catastrophic end before safety devices come into effect. This is called a system failure or normal accident, as opposed to a single-point failure. Perrow [64] analyzed system failures using the interaction and coupling of system components. This is called normal accident theory. Tables 6.1 and 6.2 list the interaction features and coupling tendencies, respectively, according to Perrow’s definition [64].
Table 6.1 explains linear and complex system interactions. Linear interactions are those in expected and familiar production or maintenance sequences and are quite visible even if unplanned. Linear systems have minimal feedback loops and thus less opportunity to baffle designers or operators. The information used to run the system is more likely to be directly received and to reflect direct operations. Complex interactions are those of unfamiliar sequences or unplanned and unexpected sequences, and are either not visible or not immediately comprehensible. To summarize, complex systems are characterized by (1) proximity of parts or units that are not in a production sequence, (2) many common-mode connections between components (parts, units, or subsystems) not in a production sequence, (3) unfamiliar or unintended feedback loops, (4) many control parameters with potential interactions, (5) indirect or inferential information sources, and (6) limited understanding of some processes.
Table 6.2 explains the nature of coupling (i.e., tight and loose). Coupling is particularly germane to recovery from the inevitable component failures that occur. One important difference between tightly and loosely coupled systems deserves a more extended comment. In tightly coupled systems, the buffers, redundancies, and substitutions must be designed in; they must be thought of in advance. In loosely coupled systems, there is a better chance that expedient, spur-of-the-moment buffers, redundancies, and substitutions can be found, even though they are not planned ahead of time. What is true for buffers and redundancies is also true for substitutions of equipment, processes, and personnel. Tightly coupled systems offer few occasions for such fortuitous substitutions; loosely coupled ones offer many.
Table 6.1 Linear vs. complex systems
Linear systems:
• Spatial segregation.
• Dedicated connections.
• Segregated subsystems.
• Easy substitutions.
• Few feedback loops.
• Single purpose, segregated control.
• Direct information.
• Extensive understanding.
Complex systems:
• Proximity.
• Common-mode connections.
• Interconnected subsystems.
• Limited substitutions.
• Feedback loops.
• Multiple and interacting controls.
• Indirect information.
• Limited understanding.
Table 6.2 Tight and loose coupling tendencies
Tight coupling:
• Delays in processing not possible.
• Invariant sequences.
• Only one method to achieve goal.
• Little slack possible in supplies, equipment, and personnel.
• Buffers and redundancies are designed-in, deliberate.
• Substitutions of supplies, equipment, personnel limited and designed-in.
Loose coupling:
• Processing delays possible.
• Order of sequences can be changed.
• Alternative method available.
• Slack in resources possible.
• Buffers and redundancies fortuitously available.
• Substitutions fortuitously available.
The IC chart is a table for classifying object systems by interaction and coupling. Figure 6.2 shows the IC chart developed by Perrow [64], who positioned the systems on the chart subjectively. By combining the two variables in this way, several conclusions can be drawn. It is clear that the two variables are largely independent. Examine the top of the chart from left to right. Dams and nuclear plants are roughly on the same line, indicating a similar degree of tight coupling, but they differ greatly on the interaction variable: few unexpected interactions are possible in dams, whereas many are possible in nuclear plants. Looking across the bottom, universities and post offices are quite loosely coupled. If something goes wrong in either of these, there is plenty of time for recovery, and things do not have to be in a precise order. In contrast to universities, however, post offices do not have many unexpected interactions; theirs is a fairly well laid out (linear) production sequence without a lot of branching paths or feedback loops. The IC chart defines two key concepts: the types of interaction (complex and linear) and the types of coupling (loose and tight). These variables have been laid out so that we can locate organizations or activities that interest us and show how the two variables, interaction and coupling, can vary independently of each other. The next section introduces a method
Fig. 6.2 IC chart (Source: Perrow [64])
for using a metric in the IC chart. The metric is a close-code system of an object system’s failures. This enables us to quantitatively monitor a target system’s safety and quality.
6.4.2 Close-Code Metrics as an Example Taxonomy of System Failures
To introduce a metric into the IC chart described in Sect. 6.4.1, the author focused on a close-code system. A close-code system is a type of table that classifies the causes of failures in each industry sector. A close-code is a failure root cause classification taxonomy (e.g., hardware, software, or human error) used in a system. Although the close-code system varies by system and industry, it can be arranged as a close-code matrix with two dimensions. The first dimension consists of phases for creating an object system (i.e., design, configuration, and operation in a time sequence), and the second dimension is the nature of the stakeholders (i.e., simple or complex) responsible for the system failures. A close-code system is a filter for the root cause of a system failure and is an example taxonomy of system failures (Van Gigch [78], [79]). If a taxonomy of system failures is regarded as a generalized close-code system, the different close-code systems of each industry can be standardized. This idea enables a quantitative discussion of the nature of failures within an industry and of their changes over time, and it enables comparisons between industries. Table 6.3 summarizes an example of mapping a close-code system onto a close-code matrix for the ICT industry, and the relationship between a close-code matrix and an IC chart is clarified. The horizontal (vertical) axis of the close-code matrix corresponds to the interaction (coupling) axis of the IC chart. The coordinates of the close-code matrix in Table 6.3 are represented by (m, n), where m is the position along the horizontal axis (m = 1: design; m = 2: configuration; m = 3: operation) and n is the position along the vertical axis (n = 1: simple; n = 2: complex). For example, (2, 2) indicates the configuration-complex area. This (m, n) notation enables risk factors to be quantified. The next section introduces the metric derived from a close-code matrix used in an SOSF space.
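As an illustration of the (m, n) notation, the sketch below tallies incidents per cell of a close-code matrix. It assumes each incident has already been classified by phase (m) and by the nature of the stakeholders (n). The names `Phase`, `Stakeholders`, and `tally_cells`, as well as the sample incident log, are hypothetical and not taken from the original data.

```python
from collections import Counter
from enum import IntEnum


class Phase(IntEnum):         # m: phase of creating the object system
    DESIGN = 1
    CONFIGURATION = 2
    OPERATION = 3


class Stakeholders(IntEnum):  # n: nature of the responsible stakeholders
    SIMPLE = 1
    COMPLEX = 2


def tally_cells(incidents):
    """Count incidents per (m, n) cell of the close-code matrix."""
    return Counter((int(phase), int(nature)) for phase, nature in incidents)


# Hypothetical incident log, already classified by phase and stakeholder nature.
incidents = [
    (Phase.DESIGN, Stakeholders.SIMPLE),
    (Phase.CONFIGURATION, Stakeholders.COMPLEX),
    (Phase.CONFIGURATION, Stakeholders.COMPLEX),
    (Phase.OPERATION, Stakeholders.COMPLEX),
]
counts = tally_cells(incidents)
configuration_complex = counts[(2, 2)]  # incidents in the configuration-complex area
```

A `Counter` returns zero for cells with no incidents, so sparse matrices need no special handling.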
6.4.3 Introduction of the Metric into SOSF Space
In Chap. 3, SOSF was introduced, and in the previous section, a metric that enables a quantitative discussion based on a close-code system was described. Here, that metric is introduced into SOSF. Consequently, SOSF is transformed into a topological space, enabling a quantitative discussion. Each failure in an SOSF space is represented by a point representing the system risk location (SRL). The risk of the target system is represented by a three-tuple, which can be expressed as an SRL within an SOSF space. An SRL is represented by three-dimensional coordinates (X, Y, Z) in an SOSF space, where X, Y, and Z represent the system interaction, the system coupling, and the annual call rate (ACR), respectively. ACR is the ratio of incidents per unit of shipment each year. Isomorphism occurs among the two-dimensional SOSF (left side of Fig. 6.1), the close-code matrix, and the IC chart. The isomorphs of these three perspectives with the component attributes are shown in Fig. 6.3.
There are four steps used to introduce these metrics into an SOSF space, as shown in Fig. 6.4. The first step (Fig. 6.4a) defines the system failure group at any arbitrary time. Fig. 6.4b shows that β is the area inside α; β is obtained by dividing the number of system failures in the (3, 2) area by the total number of system failures. The X-Y axes in Fig. 6.4b correspond to the interaction-coupling axes. The quantification of risk factors is achieved using the (m, n) notation in the close-code matrix. The author defines γ as shown in Fig. 6.4c. The complex and loose risk factors of an object system are represented by γ = (α, β), which is the quantitative coordinate point in the IC chart. Figure 6.4d provides a detailed explanation of γ = (α, β) in an IC chart for the system failure group at any arbitrary time. Adding a new dimension (i.e., the Z-axis representing ACR) to γ produces the SRL (α, β, ACR). Here, ACR is the frequency of system failures and should therefore be incorporated into the system failure metric.
Table 6.3 Example close-code matrix in the ICT industry
Columns (phase): 1 (Design), 2 (Configuration), 3 (Operation). Rows (nature of stakeholders): 1 (simple), 2 (complex).
Failure types listed under columns 1 and 2: failure of technology and structure; failure of regulation; failure of rationality, evolution. Failure type listed under column 3: failure of behaviour and evolution.
Close codes, row 1 (simple): A (Hardware): A(1–5) (note a), with A(B) (note b) under Operation; B (Behaviours): B(A–D, F) (note c), with B(E) (note c) under Operation; P (Obsolete): P.
Close codes, row 2 (complex): A (Hardware): A(6) (note d), with A(U) (note e) under Operation; B (Behaviours): B(G) (note c) under Operation; N (Future plan): N.
Legend (causes A and B have subcategories):
a. A(1)–A(5): CPU, memory, channel, power, and disk failure, respectively
b. A(B): Hardware setup mistake
c. B(A)–B(G): Network setup, IO setup, parameter setup, installation, operation, application coding mistake, and other mistakes, respectively
d. A(6): Other IO
e. A(U): Unknown causes
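The β ratio and the ACR described in this section are simple quotients, so a short sketch may make the SRL tuple concrete. The sketch expresses β as a percentage (an assumption, chosen to match the scale of the tables in Sect. 6.5) and treats α as a given input, because its exact region is defined only in Fig. 6.4a. The names `SRL`, `beta_percent`, and `annual_call_rate`, and all sample numbers, are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class SRL(NamedTuple):
    alpha: float  # interaction coordinate (X), in percent
    beta: float   # coupling coordinate (Y), in percent
    acr: float    # annual call rate (Z)


def beta_percent(cell_counts: Dict[Tuple[int, int], int]) -> float:
    """Share of failures that fall in the (3, 2) area, as a percentage."""
    total = sum(cell_counts.values())
    if total == 0:
        raise ValueError("no failures recorded")
    return 100.0 * cell_counts.get((3, 2), 0) / total


def annual_call_rate(incidents: int, units_shipped: int) -> float:
    """Incidents per unit of shipment for one year."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    return incidents / units_shipped


# Hypothetical failure counts per (m, n) cell for one year.
cell_counts = {(1, 1): 30, (2, 1): 20, (2, 2): 10, (3, 1): 15, (3, 2): 25}
assumed_alpha = 40.0  # hypothetical value; alpha is read from Fig. 6.4a in the text
srl = SRL(assumed_alpha, beta_percent(cell_counts), annual_call_rate(50, 1000))
```

Keeping α as an explicit input avoids guessing at a definition that the text leaves to the figure; only β and ACR are computed from the stated rules.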
Fig. 6.3 Isomorphic structure of three perspectives and its component attributes
Fig. 6.4 Detailed diagram of the metric generation
In Sects. 6.3 and 6.4, a topological SOSF was introduced, so that a system failure can now be discussed numerically. To verify the effectiveness of the SOSF with these metrics, two examples of its application to ICT systems are presented in the following two sections.
6.5 Application Examples to ICT Systems
6.5.1 Application Example 1: Topological Presentation of SRF for Various ICT Systems
To confirm the effectiveness of the SOSF with the metrics introduced, it is applied to several ICT systems as a first example. Figure 6.5 shows the system risk factor distribution among several ICT systems. Table 6.4 lists the data related to specific ICT systems. The SRL (α, β) is calculated based on the incidents that occurred in a system. The third component of the SRL was not considered because there was no reason to compare the ACR between different systems. In Fig. 6.5, the two factors (i.e., interaction and coupling) are correlated across all ICT systems: the more linear the interaction, the tighter the coupling. The number of jobs or tasks in an object system could affect the results. Stock exchanges are more single-goal agencies than the other object systems and tend to reside in the linear-tight domain. On the other hand, multi-goal agencies (i.e., meteorological and healthcare) have the opposite tendency. This result is consistent with the design concept of each system. The quantitative measure thus confirms this qualitative argument.
Fig. 6.5 Distribution of SRL between various ICT systems
Table 6.4 Data from ICT systems (Fig. 6.5)
System           α      β      Incidents   Duration
Stock exchange   48.3   40     145         2010
Meteorology      56.9   50.2   239         2008–9
Healthcare       64.1   54.6   940         2008
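The statement that interaction and coupling move together across the three systems can be checked with a correlation coefficient computed from the α and β columns of Table 6.4. The following sketch does this with a hand-written Pearson formula; the helper name `pearson` is an assumption. With only three data points, the result illustrates the direction of the relationship and cannot establish it statistically.

```python
from math import sqrt

# (alpha, beta) pairs copied from Table 6.4.
table_6_4 = {
    "Stock exchange": (48.3, 40.0),
    "Meteorology": (56.9, 50.2),
    "Healthcare": (64.1, 54.6),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


alphas = [a for a, _ in table_6_4.values()]
betas = [b for _, b in table_6_4.values()]
r = pearson(alphas, betas)  # close to +1 when alpha and beta rise together
```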
Two Hypotheses
There are two hypotheses based on the topological presentation of system risk. The first is: “If a quantified expression of the IC chart is developed, is it possible to visualize system properties across systems and to monitor system quality improvement and system features?” The second is: “If the Plural-Complex-Class 3 domain in the SOSF space (explained in detail in Chap. 3) corresponds to the Complex-Loose domain in the IC chart, together with the failure occurrence ratio, can the effort to improve system quality be represented by migration from the Complex-Loose domain to the Linear-Tight domain in the IC chart with a reduction in the system failure occurrence ratio?” The following two examples attempt to answer these questions, especially for ICT systems.
IA Server Systems Shift to Linear Interaction and Tight Coupling
Figure 6.6 shows the SRL transition of IA server systems (all systems that use Intel Architecture servers). Table 6.5 lists the data used in the SRL calculation. During this transition period, IA server systems migrated toward the linear interaction and tight coupling domain with decreasing ACR. During this period, two countermeasures were used (Nakamura and Kijima [55–57]) in these systems. The first countermeasure is to educate engineers to become hybrid engineers who can handle both hardware and software. This countermeasure removes the need for cumbersome communication between engineers and reduces long-outstanding incidents caused by lengthy communication. The second countermeasure is to alter the noise design
Fig. 6.6 SRL transitions of IA server systems
goal to achieve a noise level that is acceptable even in an office environment. The ACR decreased by 69% compared to that in 2004. The first change contributed to the tight coupling shift through the removal of such cumbersome communication, and the second contributed to the linear interaction shift through the reduction in stakeholders in an office environment.
Healthcare Systems Shift to Complex Interaction and Loose Coupling
Figure 6.7 shows the SRL transition of healthcare systems. Such systems include various tasks or jobs. Typical systems are electronic health record systems and those for processing medical practitioners’ receipts for health insurance claims. They are closely related to national policy because they must comply with the Electronic Data Interchange (EDI) policy. Table 6.6 lists the data used in the SRL calculation; healthcare systems are migrating toward the complex interaction and loose coupling domain with increasing ACR. The main factor for this complex interaction shift is the EDI policy. Following this change, new stakeholders (i.e., medical equipment vendors and politicians) participate in the new EDI processes, which is a main factor for the loose coupling shift of the systems. This shift requires the introduction of a countermeasure that defines system boundaries more clearly to remove stakeholders’ misunderstandings or overlapping objectives between systems. Further research is required to confirm the outcome.
Table 6.5 Data from IA server systems (Fig. 6.6)
Year   α    β      ACR     Incidents
2004   59   43.9   1       82
2007   44   31.6   −0.56   250
2008   41   27.3   −0.69   242
Fig. 6.7 SRL transitions of healthcare systems
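The two migrations described above can be read directly from the α and β columns of Tables 6.5 and 6.6. The sketch below compares the first and last year of each series and labels the direction, treating a decrease in both α and β as a move toward the linear-tight domain and an increase in both as a move toward the complex-loose domain. The function name `migration` and the "mixed" label for all other cases are assumptions for illustration.

```python
# (alpha, beta) per year, copied from Tables 6.5 and 6.6.
ia_server = {2004: (59.0, 43.9), 2007: (44.0, 31.6), 2008: (41.0, 27.3)}
healthcare = {
    2007: (59.7, 51.4),
    2008: (64.1, 54.6),
    2009: (63.9, 54.8),
    2010: (68.5, 58.8),
}


def migration(series):
    """Compare the first and last year of an SRL series.

    Returns (delta_alpha, delta_beta, label), where the label names the
    IC-chart domain the series has moved toward.
    """
    first, last = min(series), max(series)
    d_alpha = series[last][0] - series[first][0]
    d_beta = series[last][1] - series[first][1]
    if d_alpha < 0 and d_beta < 0:
        label = "linear-tight"
    elif d_alpha > 0 and d_beta > 0:
        label = "complex-loose"
    else:
        label = "mixed"
    return round(d_alpha, 1), round(d_beta, 1), label


ia_result = migration(ia_server)
healthcare_result = migration(healthcare)
```

Comparing only the endpoints hides intermediate reversals, such as the small dip in α for healthcare in 2009, so a year-by-year comparison would be the natural refinement.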
6.5.2 Application Example 2: Application to ICT Systems Complexly Coupled with Cloud and Network Technologies
To confirm the effectiveness of the SOSF with the metrics introduced, it is applied to the ICT systems shown in Fig. 6.8 as a second example. The virtualized ICT systems in Fig. 6.8 are mainly composed of three technologies (operating systems, networks, and virtualization platform products). Virtualized ICT systems and other IOTs (i.e., Internet-based information architectural devices) are also complexly connected through the networks. As described in the Introduction, ICT technologies are changing drastically. Complexly connected ICT systems are therefore considered suitable for verifying the effectiveness of the proposed methodology. An overview of the application target system is shown in Fig. 6.8. The SRL transition for virtualized ICT systems over 3 years is shown in Table 6.7; SRL (α, β) was calculated based on the incidents that occurred in the corresponding system components (i.e., OS, virtual platform, and network). Every system failure was identified and gathered from incidents reported by the field operation group and analyzed quantitatively by an ICT company. A close-code matrix (Table 6.3) was used to formulate a metric within the SOSF space. According to Table 6.7, the SRL of the network and virtualization platforms migrates toward the complex-loose direction with an increase in the ACR, whereas the SRL of the OS migrates toward the linear-tight direction with a decrease in the ACR. The SRL trajectory for each product is shown in Fig. 6.9. A brief explanation of the key application results is presented below. The network and virtualization platform migration trends are attributed to changes in the external environment, whereas the OS migration trend is attributed to improved quality and reliability.
Table 6.6 Data from healthcare systems (Fig. 6.7)
Year   α      β      ACR    Incidents
2007   59.7   51.4   1      769
2008   64.1   54.6   1.22   940
2009   63.9   54.8   1.59   1220
2010   68.5   58.8   1.71   1317
Fig. 6.8 Overview of complexly connected ICT systems
Table 6.7 SRL transition
           2016                  2017                  2018
Product    α      β      ACR     α      β      ACR     α      β      ACR
OS         19.3   19.1   15.5    19.5   19.2   20.6    18.1   18.0   17.6
Virtual    16.7   16.4   31.2    18.0   16.8   39.6    20.3   19.2   41.7
Network    12.6   8.9    44.6    13.0   8.7    57.2    15.9   12.1   52.0
6.6 Results and Discussion of Application to ICT Systems
In this chapter, a risk quantification/visualization methodology for an SOSF space was proposed and applied to complexly connected ICT systems. As an application, the risk over time can be visualized using the quantified SRL within the SOSF space.
6.6.1 Results of the Application Example 1
The author obtained several findings from applying the risk quantification method to several ICT systems. The results of this application answered the two research questions (the two hypotheses) introduced in Sect. 6.5.1. For the first research
Fig. 6.9 SRL trajectory of each product
question, the author confirmed that the risk factors for system failures can be quantified and presented as a topological space (i.e., SOSF space with IC metrics). According to the comparisons of various ICT systems, stock exchange systems are more linear and tight than meteorological or healthcare systems. Stock exchanges are single-goal agencies. On the other hand, healthcare systems are migrating toward the complex-loose domain because electronic medical record systems involving various stakeholders are being introduced under the EDI policy. If migration toward complex interaction is inevitable to adapt to environmental change, other countermeasures for preventing migration in the tight coupling direction could be a challenge for healthcare systems. Clarification of job goals or seeking loose coupling between systems within healthcare could promote system safety. However, this requires more research to reach concrete results.
As for improving system safety for IA servers, educating engineers to become hybrid engineers results in migration toward tight coupling, and decreasing noise in the design goal results in migration toward linear interaction. Along with these two shifts, ACR decreased by 69% in 4 years. This answers the second research question: migration toward the linear-tight domain enhances the system safety of IA server systems. The author verified that the linear-tight domain is safer than the complex-loose domain for ICT systems, especially IA server systems. Healthcare systems are migrating toward the complex-loose domain as the ACR increases. Further research is required to confirm whether the countermeasures for preventing migration in the tight coupling direction would reduce the ACR in healthcare systems.
The proposed method for visualizing risk factors by introducing metrics in the SOSF space is effective because it complements the shortcomings of the subjective IC chart. Complex and tight shifting could be prevented by periodically monitoring the SRL trajectory in the SOSF space. This would enable us to objectively compare various systems in terms of risk management and ensure that countermeasures are introduced to migrate toward the ideal domains.
6.6.2 Results of the Application Example 2
The findings of this application are as follows:
1. The SRL of the network and virtualization platforms has been shifting toward a complex-loose domain from a linear-tight domain. In 2018, the ACR increased from its value in 2017.
2. The SRL of the OS platform shifted toward a linear-tight domain from a complex-loose domain. The ACR in 2018 decreased from its value in 2017.
The first result is believed to be due to the diversity of the stakeholders and the complexity of the technology concerning networks and virtual platforms; the complex shift in particular worsens the risk over time. This change supports
Gartner’s analysis [19], as described in the introduction. The loose shift is believed to be due to system redundancy (such as a network or server duplex). The second result is believed to reflect continuous quality improvement by a relatively small number of OS vendors in comparison with the number of network and virtualization development vendors, together with the relatively slow speed of change in OS technologies compared to that of networking and virtualization. For greater safety and higher reliability, the application results suggest the following. If complex shifts are inevitable owing to technological changes, measures such as the introduction of redundancy enhancement equipment contribute toward a movement away from a catastrophic outcome. In other words, measures such as avoiding complex shifts can enhance the reliability and safety of ICT systems. The results of this application in ICT systems suggest that the methodology may be effective for other social systems as well. Through its application in an ICT system, the proposed methodology shows the effectiveness of quantitatively monitoring the level of risk over time. Further research is required to expand this approach to other industries with various close-code matrices. Such an expansion will lead to a refinement of the proposed methodology and thus contribute to enhancing the safety and security of our society as a whole.
Chapter 7
Reconsidering SOSF from the Perspective of HAS
Abstract In this chapter, the author discusses how to extend the system of system failures (SOSF) to the human activity system (HAS) with reference to the 7-stage model of Soft Systems Methodology (SSM) (Checkland). The success of the SOSF may indicate the success of the system of human activity systems (SOHAS) for all human activities. The possibility of applying SOHAS to various human activity systems is demonstrated, and the basis for the specific application examples in the next chapter (human error as a human activity system) is established.
Keywords System of human activity systems (SOHAS) · System of system failures (SOSF) · Human activity system (HAS) · Soft Systems Methodology (SSM) · Human error
A human activity system (HAS) is a model of a conceptual system that includes the activities that humans must perform to pursue a particular goal. According to the definition introduced by Klir [46], any system can be expressed as “S = (T, R),” where S is the system, T (thinghood) is a set of things, and R (systemhood) is a relation defined on T. Applying this definition to HAS, any HAS can be expressed as “HAS = (Thing, Relations).” Thus, a HAS as an object of discussion can be expressed as “A HAS = (A Thing, {A Methodology_i}, i ∈ I)”, where “A Thing” is the object of discussion and “{A Methodology_i}, i ∈ I” is the group of methodologies associated with the object. As explained in previous chapters, it is important to have a meta-methodological system to ensure a holistic view. This is because a meta-methodological system is effective in avoiding short-sighted management (i.e., managerial myopia) and local optimization that can lead to various side effects that prevent the achievement of the objectives.
Therefore, if we introduce a meta-HAS, it can be expressed as “A Meta-HAS = (A HAS, A Meta-Methodology)”, where “A HAS” is the object of the HAS and “A Meta-Methodology” is the meta-methodology over “{A Methodology_i}, i ∈ I” in the HAS equation. In summary, the argument for creating a meta-methodology that solves the problem can be expressed as follows. Figure 7.1 shows the problem-solving sequence. The solid line in Fig. 7.1 represents the boundary that separates the real world (above the solid line) from the conceptual world (below the solid line) according to the 7-stage model of SSM (Checkland [13]). The seven stages are:
Stage 1: Entering the problem situation.
Stage 2: Expressing the problem situation.
Stage 3: Formulating root definitions of relevant systems.
Stage 4: Building conceptual models of human activity systems.
Stage 5: Comparing the models with the real world.
Stage 6: Defining changes that are desirable and feasible.
Stage 7: Taking action to improve the real-world situation.
Applying the above SSM stages to the HAS gives the following steps:
• Step 1: Identify the specific problem as a HAS (SSM Stage 1).
• Step 2: Create a meta-methodology, i.e., a system of HASs (i.e., SOHAS) (SSM Stages 1–4).
• Step 3: View the specific problem through SOHAS (SSM Stage 5).
• Step 4: Evaluate the methodological shortcomings of the current situation (SSM Stage 6).
• Step 5: Create a new methodology to transform reality (SSM Stage 7).
• Step 6: Transform reality by implementing systemically desirable and culturally feasible measures.
Fig. 7.1 The sequence of problem-solving (boxes: Step 1, a HAS viewed as problems (a HAS for system failure); Step 2, create a SOHAS (create a SOSF); Step 3, view a specific problem through SOHAS (view system failures through SOSF); Step 4, assess effectiveness; Step 5, create new methodology (SSFM, SFDM); Step 6, change)
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_7
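The two tuple definitions, “A HAS = (A Thing, {A Methodology_i})” and “A Meta-HAS = (A HAS, A Meta-Methodology)”, map naturally onto nested record types. The following is a minimal sketch under that reading. The class names `HAS` and `MetaHAS` and the choice of plain strings for things and methodologies are illustrative assumptions; the example values (system failures, FTA, FMEA, SOSF) are taken from the surrounding text.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class HAS:
    thing: str                     # "A Thing": the object of discussion
    methodologies: FrozenSet[str]  # {A Methodology_i}, i in I


@dataclass(frozen=True)
class MetaHAS:
    has: HAS               # "A HAS"
    meta_methodology: str  # "A Meta-Methodology"


# System failures as the thing, with FTA and FMEA as current methodologies,
# viewed through the SOSF as the meta-methodology.
system_failures = HAS("system failures", frozenset({"FTA", "FMEA"}))
sosf = MetaHAS(system_failures, "SOSF")
```

Using a frozen set for the methodologies mirrors the indexed family in the definition: order does not matter and duplicates are meaningless.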
Fig. 7.2 The sequence of SOHAS creation (Step 2 expanded into Step 2.1: create common language (Van Gigch [78]); Step 2.2: survey current methodologies (FTA, FMEA); Step 2.3: map them onto SOSM; Step 2.4: add third dimension)
The creation of SOHAS to address various HAS-related issues is important and is discussed in more detail under Step 2. Figure 7.2 shows the sequence of SOHAS creation.
• Step 2.1: Abstract the cause of the problem within the HAS in a common language (SSM Stages 1 to 2).
• Step 2.2: Investigate current methodologies related to the HAS (SSM Stage 3).
• Step 2.3: Map current methodologies onto the SOSM (SSM Stage 4).
• Step 2.4: For each common language abstracted in Step 2.1, identify the responsible party and express it in a third dimension. Add this third dimension to the SOSM to create SOHAS (SSM Stage 4).
Looking back at learning from system failures as one case study of HAS, the same order can be applied to creating SOHAS (Steps 1 through 5). In Figs. 7.1 and 7.2, the individual case of creating SOSF is indicated in parentheses. Steps 1 through 5 below are a reorganization of Sect. 3.1. Step 2 consists of Steps 2.1 through 2.4:
Step 1: Investigate the current methodologies {A Methodology_i}, i ∈ I.
Step 2: Create an SOSF and map {A Methodology_i}, i ∈ I onto the SOSF.
Step 3: View individual failures through SOSF. Identify the location (domain) of each failure in the SOSF space.
Step 4: Evaluate the shortcomings of the current methodologies.
• Place the current methodologies in the SOSF space.
• Visualize that no methodology covers Classes 2 and 3 in the unitary and plural domains.
Step 5: Create a new methodology to transform reality.
• Create FFSM that covers Classes 2 and 3 in the unitary domain and SFDM that covers the entire SOSF domain.
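Steps 4 and 5 amount to a coverage check: place each methodology on the cells of the SOSF space and list the cells that nothing covers. The sketch below illustrates this with a deliberately simplified SOSF grid of two domains (unitary, plural) by three failure classes. That grid, the placement of FTA and FMEA in the unitary Class 1 cell only, and the helper name `uncovered` are assumptions for illustration; the coverage of FFSM and SFDM follows the description in Step 5.

```python
from itertools import product

DOMAINS = ("unitary", "plural")
CLASSES = (1, 2, 3)
ALL_CELLS = set(product(DOMAINS, CLASSES))

# Assumed placement of current methodologies: unitary domain, Class 1 only.
current = {
    "FTA": {("unitary", 1)},
    "FMEA": {("unitary", 1)},
}

# New methodologies as described in Step 5.
proposed = {
    "FFSM": {("unitary", 2), ("unitary", 3)},
    "SFDM": set(ALL_CELLS),
}


def uncovered(methodologies):
    """Return the SOSF cells that no methodology covers, in sorted order."""
    covered = set().union(*methodologies.values()) if methodologies else set()
    return sorted(ALL_CELLS - covered)


gaps_before = uncovered(current)
gaps_after = uncovered({**current, **proposed})
```

Under these assumptions, the gap list before Step 5 contains every Class 2 and Class 3 cell, and the list after Step 5 is empty because SFDM spans the whole grid.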
Fig. 7.3 The relation between SOSF, SOHAS, and HAS
Steps 2.1 through 2.4 summarize Sect. 3.1.
Step 2.1: Abstract the cause of the problem using a common language (SSM Stages 1 to 2).
Step 2.2: Investigate current methodologies related to the SOSF (SSM Stage 3).
Step 2.3: Map current methodologies onto the SOSM (SSM Stage 4).
Step 2.4: For each common language abstracted in Step 2.1, identify the responsible party and express it in a third dimension. Add this third dimension to the SOSM to create the SOSF (SSM Stage 4).
As is clear from a review of the process of creating the SOSF, the SOSF is an example of SOHAS (i.e., SOSF ∈ SOHAS); the success of the SOSF indicates the success of SOHAS for all human activities. Figure 7.3, which is identical to Fig. 1.2 in Chap. 1, shows the relationship between SOSF, SOHAS, and HAS. The discussion above is a detailed description of that figure and its full implications. In the Phase 3 analysis of the application of FFSM in Sect. 4.2, human error was identified as a cause of long downtime. In the discussion of that analysis, it was pointed out, as item (c) in Table 4.8, that a fail-safe function should be developed as a countermeasure against human error to avoid major accidents caused by human error. In the next chapter, the author proposes a framework to secure a holistic view of human error as a case study of SOHAS and verifies the effectiveness of the proposed framework, focusing on human error. That chapter discusses measures that are far more meaningful and effective than recognizing human error as a mere cause and adding safeguard functions.
Chapter 8
Viewing Human Error as a HAS (Proposed Framework for Ensuring Holistic Measures and its Application to Human Error)
Abstract In this chapter, a method is proposed for promoting ICT engineering safety by learning from crisis management. In ICT engineering in particular, human factors play a crucial role in promoting system safety. The Tokyo Stock Exchange system crashed on November 1, 2005, because of an operation error, which had a severe impact on the global economy. Human factors (operator error, maintenance engineers’ error, etc.) have severe impacts not only on ICT systems but also on social systems (nuclear plant systems, transportation systems, etc.). In addition, the progress of ICT technologies (i.e., cloud, virtual, and network technology) inevitably shifts ICT systems toward complex interactions and tight coupling. This trend makes human factors more important than ever, relative to other elements, in promoting safety. The emergent properties arising from the interaction between ICT and human conduct should be dealt with to promote system safety. Crisis management treats holistic properties rather than partial components. The author introduces a human error framework to promote a holistic view for managing system failures. An application example of ICT human error demonstrates the effectiveness of this methodology.
Keywords Risk management · Crisis management · Normal accident theory (NAT) · High-Reliability Organization (HRO) · Information and Communication Technology (ICT)
8.1 Background
Chapter 8 proposes a methodology to promote the safety of ICT systems by incorporating a crisis management approach. Many current methodologies surrounding ICT systems adopt an element-reductive approach and may lack a holistic perspective. Therefore, a methodology with a holistic viewpoint is necessary to promote system safety. To realize system safety, it is necessary to consider not only the technical elements of ICT systems, such as hardware and software, but also the people who operate them. In particular, human factors play an important role in ensuring system safety in the ICT field. The Tokyo Stock Exchange system went down in 2005 and 2020 mainly due to human factors, which had a significant impact on the global economy. In addition, an error in the network configuration of Google Inc. caused a widespread network failure in 2017. Furthermore, Mizuho Bank’s consecutive system failures since February 2021 were attributed, by a special investigation committee consisting of members from outside Mizuho Bank, to the organization’s low sensitivity to risk. Thus, human errors (operator errors, maintenance engineer errors, crises missed because of organizational culture, etc.) have a significant impact not only on ICT systems but also on society at large. Moreover, the progress of ICT technologies (cloud, virtual, and network technology) is shifting ICT systems toward complex interactions with the external environment and tight coupling between systems. This trend means that human factors are more important than ever in achieving safety, and the characteristics of emergent accidents caused by the interaction of ICT and human actions deserve special attention. This chapter develops and proposes a human error framework, with the aid of crisis and risk management methods, to gain a holistic view of managing system failures. The framework is then applied to human errors in ICT systems, and its effectiveness is verified. Finally, the author points out the possibility of developing this approach into a methodology for reducing system failures, including human errors, and provides directions for future research.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 T. Nakamura, System of Human Activity Systems, Translational Systems Sciences 37, https://doi.org/10.1007/978-981-99-5134-5_8
8.1.1 Socio-Technical Systems and Safety
Socio-technical systems are affected by various environmental stresses. The main environmental stresses are politics, public awareness, market conditions, economic conditions, educational attainment, and the speed of technological change. The relationship between safety and socio-technical systems is shown in Fig. 8.1. In the context of systems science, system safety needs to be treated in terms of the 4Ms: Man, Machine, Media (i.e., manuals and procedures), and Management, named after their shared initial letter M. 4M analysis is a widely used method for pursuing the cause of trouble; it clarifies the cause, the process, and the influence of the trouble from the viewpoint of each M. The team operating in the field is the main target of the Man element; the Machine element includes equipment and control system structure and is the target of the operation method; the Media element is the production process of the team in the working and technical environment; and the Management element is embedded in various processes such as safety culture and safety assessment management. Although the 4Ms are a necessary component of system safety, addressing them is not the only way to realize it.
The left side of Fig. 8.1 shows the hierarchy of socio-technical systems. The pursuit of system safety involves multiple disciplines corresponding to each level of the hierarchy. The upper side of Fig. 8.1 is the domain of wholeness, and the lower side represents the parts that make up the whole. The domain of wholeness is represented by safety, the holistic system, and System 5, while the domain of parts is represented by 4M, reduction, and System 1.
Fig. 8.1 Socio-technical system involved in risk management (diagram labels: Public/Society; Government; Regulators, Associations; Company; Management; Staff; Work (Hazardous Process); Technical system (Banks, Trains, etc.); Whole: Safety, Holistic System, System 5; Parts: Component, 4M, Reduction, System 1)
System 5 and System 1 are concepts introduced in the Viable System Model (VSM) proposed by Beer [4, 5]. Based on an analogy with living organisms, a viable system is considered to be composed of five interacting subsystems, each of which corresponds to part of the organizational structure of the system under analysis. In simplified form, Systems 1, 2, and 3 correspond to the limbs of the organism and are related to the “here and now” of organizational operations. System 4 corresponds to the spinal cord of the organism and is concerned with the “future and there,” responding strategically to external, environmental, and future demands on the organization. System 5 corresponds to the organism’s brain, which balances the “here and now” and the “future and there” and provides policy direction to ensure the viability of the organization.
Section 8.2 provides an overview of current methodologies to achieve system safety. It points out that risk management and crisis management attempt to realize system safety from different perspectives, static and dynamic, and then reviews two organizational theories for managing system failures: Normal Accident Theory (NAT) and High-Reliability Organization (HRO) theory. HRO has been studied by Weick [86], Weick and Karlene [87], and Weick et al. [88]. In Sect. 8.3, the author proposes a methodology that complements various existing methods and organizational theories and identifies hypotheses for achieving system safety. Section 8.5 summarizes the findings, clarifies the direction for realizing system safety, and provides directions for future research.
8.2 Current Methodologies to Achieve System Safety
8.2.1 Risk Management and Crisis Management
Risk management is the process of identifying and analyzing uncertainties in investment decisions related to risk measures and deciding whether to accept or reduce them. In other words, risk management involves planning, estimating countermeasures, and making decisions before a situation occurs. Crisis management, by contrast, requires decisions about events that have already occurred, often within a short period. In other words, crisis management concentrates on the “here and now.” If the target system is viewed objectively, the risk management methodology is optimal, but if the target system is viewed subjectively, from the perspective of the people operating the system, a crisis management methodology is necessary. Table 8.1 outlines the differences between risk management and crisis management. Risk management focuses on planning for the future, while crisis management takes a proactive response to risk and crisis that includes all relevant people and assets. Both are important for promoting safety.
8.2.2 Static (Safety and 4M) and Dynamic (Individual and Team) Perspectives
As mentioned above, safety is a systems issue. The 4Ms, on the other hand, concern the ability of components (system elements) to achieve safety. This indicates that measures addressing the 4Ms are not sufficient to achieve safety: systemic problems (i.e., emergent problems) cannot be addressed from a 4M perspective. The left side of Fig. 8.2 shows the perspective of risk management (i.e., static), and the right side shows the perspective of crisis management (i.e., dynamic). The case of a ferry capsizing accident is treated as an example of a systemic failure in the next section. The right side of Fig. 8.2, the human aspect of crisis management, shows that errors in teamwork are systemic problems and that individual errors can be viewed as component errors within teamwork. This implies that measures to avoid individual errors are not sufficient to avoid team errors. Systemic problems (i.e., emergent problems) cannot be addressed from the perspective of avoiding individual work errors.
Table 8.1 Risk management and crisis management
Risk management
• Plan: People are part of the management.
• Focus: This plan addresses the identification of risks and the search for prevention and reaction measures to mitigate the risks. Focused on processes and operations.
• Approach: Static approach. Take preventive action and implement emergency/contingency measures if an emergency or a disaster occurs. The organization is mainly REACTING to a threat.
Crisis management
• Plan: People are the main focus.
• Focus: This plan addresses the causes and the impact of risks, taking into consideration what is at stake. It seeks to protect all people and assets. People come first.
• Approach: Dynamic approach. Implement a crisis management plan as a part of an ongoing crisis management initiative. The organization is ANTICIPATING/BEING PROACTIVE/REACTING.
Fig. 8.2 Different views between risk and crisis management (risk management, static aspect: safety at the system level and 4M at the component level; crisis management, dynamic aspect: team at the system level and individual at the component level)
8.2.3 Safety Is a Systems Problem
As Rasmussen’s [66] analysis of the ferry capsizing accident in the port of Zeebrugge, Belgium (Fig. 8.3) shows, safety is realized at the organizational or social level of the hierarchy, above the physical system. In this accident, boat design, port design, cargo management, passenger management, navigation scheduling, and operations management (left side of Fig. 8.3) were each a system in the physical hierarchy, and each made decisions independently. The decision-makers were not aware of how their decisions interacted with each other. Even if each local decision is “correct” (and “reliable”) within its limited context, uncoordinated individual decisions and organizational actions may interact unexpectedly and lead to an accident, which is what happened here (see the right-pointing arrow in Fig. 8.3). Complex interactions are amplified in the systems we create, and accidents become more likely because of unpredictable interactions between components. To achieve safety, control must be achieved at the system level, not at the component level. In this situation, modeling only in terms of task sequencing and errors is not an effective way to understand system behavior. In the next section, the author reviews the two main organizational theories that address safety, and in Sect. 8.3 the author introduces a framework for understanding and clarifying the mechanisms that shape the fundamental behavior of the system.
Fig. 8.3 Complex pattern of the Zeebrugge accident
8.2.4 Two Major Organizational Theories: Normal Accident Theory and High-Reliability Organization Theory
This section introduces two major organizational theories. Perrow [64] developed the theory known as NAT after the Three Mile Island nuclear power plant accident. His basic argument is that complex interactions between systems and the external environment (such as in nuclear power generation), as well as tight coupling between systems, lead to unpredictable situations. Perrow [64] classified systems along two axes (Fig. 6.2), the interaction axis (linear or complex) and the coupling axis (tight or loose), and expressed the result as an IC chart. The industries in Fig. 6.2 were qualitatively classified and arranged by Perrow [64]. The nuclear system described above is located in the upper right region of Fig. 6.2, i.e., the region of complex interaction and tight coupling. La Porte and Consolini [47] and Karlene [43] claim that some organizations are “highly reliable” because they have achieved continuous safety over time. Weick et al. [88] identified five characteristics of HROs: (1) preoccupation with failure, (2) reluctance to simplify interpretations, (3) sensitivity to operations, (4) commitment to resilience, and (5) deference to expertise. In other words, HRO researchers argue that creating appropriate behaviors and attitudes among organizational members can achieve a highly reliable organization and avoid system accidents (Weick and Karlene [87]). In particular, bureaucratic rules are seen as stifling expert knowledge; according to HRO theory, safety has to be enacted on the frontlines by workers who know the details of the technology being used in the respective situation and who may have to invent new actions or circumvent “foolish” rules to maintain safety, especially during a crisis.
NAT focuses on the nature of the system, whereas HRO focuses on the human side, especially the frontlines. Because the two theories view systems from different perspectives, they do not contradict but rather complement each other. In summary, NAT states that failure is inevitable because of the complex interactions of the system with the external environment and the tight coupling of system elements (hence the name normal accident theory), while HRO claims that safety is achieved by people on the frontlines working in critical phases. Outwardly, the two theories appear to be in conflict (Leveson et al. [50]). In the following sections, the author introduces a general framework (Mitroff [52]) that is useful for acquiring a holistic perspective and aims to use it to show that the two organizational theories complement rather than refute each other. The author then examines the extent to which human error contributes to system failure and proposes several hypotheses using two work models (the sequential and parallel work models). These hypotheses are tested by applying to the ICT domain the human error framework, which is derived by specializing the general framework to human error. As a result, the author shows that the human error framework is effective in promoting system safety.
8.3 Proposal of the Human Error Framework
8.3.1 General Perspectives on Crisis
As the ferry capsizing accident in Sect. 8.2.3 illustrates, partial solutions are not sufficient to achieve safety. Aiming to obtain a holistic perspective on safety, Mitroff [52] proposed a general framework that is effective in identifying such a perspective, including the methodologies, solutions, and organizational culture to be utilized in one’s organization. The general framework has two dimensions. The horizontal dimension relates to the scope or scale at which we instinctively feel a problem or situation should be handled. The vertical dimension relates to the instinctive feelings associated with the issue or situation and the related decision-making process. This framework is important because it reveals that there are at least four very different attitudes or positions on an issue or problem, as well as how and why the issue or problem is important. The author argues that all four perspectives need to be consciously checked to avoid psychological blind spots. Figure 8.4 shows the general framework proposed by Mitroff [52].
Fig. 8.4 General framework (horizontal axis: Parts to Whole; vertical axis: Analytical or Technical to Personal or People)
• Perspective 1: Details, facts, formulas, here and now
• Perspective 2: The big picture, multiple interpretations, systems, future possibilities
• Perspective 3: Specific individuals, stories, personal values, feelings
• Perspective 4: Communities and the entire planet, governance, social values, politics
Figure 8.5 is the risk framework derived by applying the general framework of Fig. 8.4 to risk events. The vertical axis relates to risk-related issues and decision-making processes, while the horizontal axis, as in the general framework of Fig. 8.4, relates to the scope and scale of the risk situation. The 4Ms at the top of the vertical axis are more analytical and technical concepts than safety at the bottom of the vertical axis; conversely, safety is a more human and social concept than the 4Ms.
Fig. 8.5 Risk framework (horizontal axis: Parts to Whole; vertical axis: 4M to Safety)
• Perspective 1: Risk is an objective, quantifiable, measurable, real phenomenon.
• Perspective 2: Risk is designed into and produced by technologies.
• Perspective 3: Risk is a subjective phenomenon.
• Perspective 4: Risk is embedded in social and cultural belief systems.
With the benefit of hindsight, it can be said that the aforementioned ferry capsizing accident occurred because Perspective 2 (i.e., the viewpoint that integrates each element of the 4Ms from a holistic perspective) was lacking. Of the two organizational theories discussed in Sect. 8.2.4, NAT is classified under Perspective 2, while HRO (including crisis communication) is classified under Perspectives 3 and 4. Reason [68] argues that the informed culture necessary to manage the risk of organizational accidents requires a free exchange of information, which in turn requires a just culture, a reporting culture, a learning culture (i.e., learning from one’s own experiences), and a flexible culture. Informed culture theory thus covers all four perspectives. The risk framework is useful for proactively introducing various measures to avoid problems. Mapping the methodologies currently in place onto the risk framework helps to discover their weaknesses. Indeed, the mapping can reveal the issues and problems that matter from each quadrant’s standpoint, which the other quadrants may ignore or dismiss altogether.
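The mapping exercise described above can be sketched in code. The following is a minimal illustration, assuming that each theory is simply tagged with the perspectives this section assigns to it (NAT: Perspective 2; HRO: Perspectives 3 and 4; informed culture: all four). The names COVERAGE and uncovered_perspectives are hypothetical and are not part of Mitroff’s framework or of the author’s method.

```python
# Illustrative sketch only: the lookup table and function name are
# assumptions made for this example, not part of the original framework.

ALL_PERSPECTIVES = frozenset({1, 2, 3, 4})

# Perspectives addressed by each theory, as stated in this section.
COVERAGE = {
    "NAT": {2},
    "HRO": {3, 4},
    "Informed culture": {1, 2, 3, 4},
}


def uncovered_perspectives(methods, coverage=COVERAGE):
    """Return the perspectives that none of the given methods addresses."""
    covered = set()
    for name in methods:
        covered |= coverage[name]
    return sorted(ALL_PERSPECTIVES - covered)


if __name__ == "__main__":
    print(uncovered_perspectives(["NAT"]))               # [1, 3, 4]
    print(uncovered_perspectives(["NAT", "HRO"]))        # [1]
    print(uncovered_perspectives(["Informed culture"]))  # []
```

A non-empty result points to quadrants of the risk framework that the measures currently in place do not address.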
8.3.2 Contributions of Human Error (Team Errors and Individual Errors)
Reason [67] classified human error into three types: mistakes, lapses, and slips. A mistake occurs when work is performed as planned but the planned results are not obtained. This is usually the case when the original plan was incorrect and the work was carried out based on it, leading to unexpected results. Mistakes are failures in decision-making. The two main types of mistakes are rule-based mistakes and knowledge-based mistakes. These errors occur when an action that is believed to be correct is carried out but turns out to be wrong. Lapses are generally unobservable events. They occur when a person forgets what should be done during the execution of a task or forgets part of the task being worked on. Slips, on the other hand, are errors that surface as a result of not performing a task as planned (“not performing the intended task”) and can be objectively observed from the outside. Table 8.2 lists the types of human error and typical error prevention measures. The four types of human error in Table 8.2 are placed in the four quadrants of the human error framework in Fig. 8.6. The vertical dimension corresponds to part versus whole (4M versus safety), and the horizontal dimension corresponds to individual versus team.
Table 8.2 Classification of human error types
Occurring phase and error type
• Planning and decision-making: rule-based mistake, knowledge-based mistake
• Execution: lapse, slip
How to reduce
• Increase worker situational awareness of high-risk tasks on site and provide procedures for predictable non-routine, high-risk tasks.
• Ensure proper supervision for inexperienced workers and provide job aids and diagrams to explain procedures.
• Make all workers aware that slips and lapses do happen.
• Use checklists to help confirm that all actions have been completed.
• Include in your procedures the setting out of equipment, site layout, and methods of work to ensure there is a logical sequence.
• Make sure checks are in place for complicated tasks.
• Try to ensure distractions and interruptions are minimized, e.g., mobile phone policy.
Fig. 8.6 Human error framework (axes: 4M to Safety and Individual to Team; error types placed in the quadrants: lapse, rule-based mistake, knowledge-based mistake, slip)
• Perspective 1: Human error is an objective, quantifiable, measurable, and real phenomenon.
• Perspective 2: Human error is designed into and produced by technologies.
• Perspective 3: Human error is a subjective phenomenon.
• Perspective 4: Human error is embedded in social and cultural belief systems.
When using this framework (Fig. 8.6), it is important to ensure that current measures and management processes take all four perspectives into account so that the measures taken are holistic. The two axes of the IC chart (Fig. 6.2) and those of the human error framework (Fig. 8.6) are isomorphic, as shown in Fig. 8.7. Building on this isomorphism, Fig. 8.8 integrates the two axes of the IC chart (the coupling and interaction axes of Fig. 6.2) into the human error framework (Fig. 8.6). The representative systems in each quadrant of the IC chart are banking, gas/electricity, manufacturing, and education, respectively. These systems are discussed in Sect. 8.5.
Fig. 8.7 Human error framework and IC chart
Fig. 8.8 Systems perspective on IC charts and human error
A further consideration concerns the horizontal axis of Fig. 8.8. To discuss whether individual or teamwork processes are safer, it is necessary to clarify the work process model. Two simple models of work processes are introduced: sequential work and parallel work. Figure 8.9 shows the sequential work model. In this model, safety decreases as the number of people or groups increases. This is because, logically, combining an infinite number of persons or groups sequentially would give a success probability of zero (100% failure). Each box represents a person with an error probability greater than 0% (because humans are not perfect). Si in Fig. 8.9 is the success probability of the i-th person (0 ≤ Si