
Advances in Systems Engineering


EDITED BY

John Hsu
Systems Management and Engineering Services
California State University at Long Beach
Long Beach, California

Richard Curran
Delft University of Technology
Delft, The Netherlands

Volume 252
Progress in Astronautics and Aeronautics

Timothy C. Lieuwen, Editor-in-Chief
Georgia Institute of Technology
Atlanta, Georgia

Published by American Institute of Aeronautics and Astronautics, Inc.

American Institute of Aeronautics and Astronautics, Inc., Reston, Virginia

1 2 3 4 5

Copyright © 2016 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved. Reproduction or translation of any part of this work beyond that permitted by Sections 107 and 108 of the U.S. Copyright Law without the permission of the copyright owner is unlawful. Copies of chapters in this volume may be made for personal and internal use, on condition that the copier pay the per-page fee to the Copyright Clearance Center (CCC). All requests for copying and permission to reprint should be submitted to CCC at www.copyright.com. Employ the ISBN indicated below to initiate your request.

Data and information appearing in this book are for informational purposes only. AIAA is not responsible for any injury or damage resulting from use or reliance, nor does AIAA warrant that use or reliance will be free from privately owned rights.

ISBN 978-1-62410-408-4

PROGRESS IN ASTRONAUTICS AND AERONAUTICS

EDITOR-IN-CHIEF
Timothy C. Lieuwen, Georgia Institute of Technology

EDITORIAL BOARD

Paul M. Bevilaqua, Lockheed Martin (Ret.)
Steven A. Brandt, U.S. Air Force Academy
José A. Camberos, U.S. Air Force Research Laboratory
Richard Christiansen, Sierra Lobo, Inc.
Richard Curran, Delft University of Technology
Jonathan How, Massachusetts Institute of Technology
Christopher H. M. Jenkins, Montana State University
Eswar Josyula, U.S. Air Force Research Laboratory
Mark J. Lewis, University of Maryland
Dimitri N. Mavris, Georgia Institute of Technology
Daniel McCleese, Jet Propulsion Laboratory
Alexander J. Smits, Princeton University
Ashok Srivastava, Verizon Corporation
Oleg A. Yakimenko, U.S. Naval Postgraduate School

TABLE OF CONTENTS

Preface ix
Acknowledgments xv
Author Biographies xvii

Chapter 1 System of Systems Integration: Fundamental Concepts, Challenges and Opportunities 1
Azad M. Madni, University of Southern California, Los Angeles, California; Michael Sievers, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California

I. Introduction 1
II. SoS Typology 5
III. Unique Characteristics of System of Systems (SoS) 5
IV. System of Systems Integration (SoSI) 9
V. SoS Integration Ontology 11
VI. SoSI Model for a Directed SoS 21
VII. Interoperability is a Cross-Cutting Issue 22
VIII. Human-Systems Integration Challenges in SoS Integration 30
IX. Concluding Remarks 31
References 31

Chapter 2 Advances in Sociotechnical Systems 35
William B. Rouse, Michael J. Pennock and Joana Cardoso, Stevens Institute of Technology, Hoboken, New Jersey

I. Introduction 35
II. Origins of Sociotechnical Systems Research 36
III. Contemporary Views of Sociotechnical Systems 43
IV. Methods and Tools 53
V. Conclusions 60
References 60

Chapter 3 Engineering Resilience into Human-Made Systems 65
Scott Jackson, Burnham Systems Consulting, Greater Los Angeles Area, California

I. Overview of Resilience 66
II. Design of a Resilient System 70
III. Creating a Concrete Solution 96
IV. Improvised Solutions as Resilient Systems in Crisis Environments 109
References 122

Chapter 4 Applying SysML and a Model-Based Systems Engineering Approach to a Small Satellite Design 127
Sanford Friedenthal, SAF Consulting, Reston, Virginia; Christopher Oster, Lockheed Martin Corporation, Cherry Hill, New Jersey

I. Introduction 127
II. Spacecraft System Design Example 145
III. Summary 217
Acknowledgments 218
References 218

Chapter 5 A Systems Engineering Approach and Case Study for Technology Infusion for Aircraft Conceptual Design 219
Kelly Griendling and Dimitri Mavris, Georgia Institute of Technology, Atlanta, Georgia

I. Introduction 219
II. Requirements Analysis and Identification of Driving Parameters 222
III. Morphological Analysis for Alternative Generation 235
IV. Modeling and Simulation 239
V. Design Space Exploration 251
VI. Summary 264
References 267

Chapter 6 Powerful Opportunities to Improve Program Performance Using Lean Systems Engineering and Lean Program Management 269
Bohdan W. Oppenheim, Loyola Marymount University, Los Angeles, California; Cecilia Haskins, Norwegian University of Science and Technology, Trondheim, Norway

I. Introduction 269
II. Evolution of Systems Engineering Toward Detailed Requirements 271
III. Myths of Classical Systems Engineering and Program Management 277
IV. Lean Enablers for Systems Engineering, and Lean Enablers for Integrated Systems Engineering and Program Management 279
V. Powerful Opportunities for Improvement 281
VI. Closing Thoughts 285
References 285

Index 287
Supporting Materials 299

PREFACE

In 2007, the Systems Engineering Technical Committee (SETC) of the American Institute of Aeronautics and Astronautics (AIAA) proposed a new Journal of Systems Engineering to join the other eight journals published by AIAA. Through discussions among SETC members, the Publication Committee, and the Technical Advisory Board (TAB), publication of a progressive series of books to build paper-submission momentum was recommended instead of a new journal. By 2010, the SETC had decided to publish a book comprising a progressive series of advanced systems engineering topics. Discussion among SETC members about how best to generate the series revolved around either including papers presented at conferences or inviting preeminent experts to write each chapter. The latter proposal prevailed and gained the approval of SETC members. In my mind, the book should represent unique contributions to Systems Engineering (SE) not found together in any other single book.

Traditional SE has flourished since the United States entered the space race and developed ICBMs (intercontinental ballistic missiles) to compete against the Soviet Union. It was started at TRW and championed by Dr. Simon Ramo. He recognized the continually growing complexity of systems, and the need for a paradigm change and a profession to help program managers and engineers grasp and manage that complexity to field ever more advanced systems. Traditional SE was the catalyst for the United States prevailing in the space race and delivering a superior ICBM. According to Ramo, the increasingly complex systems and systems-of-systems resulting from continuous technological advancement push systems engineering to manage a growing number of systems and interfaces within these systems-of-systems. This lag between system complexity and engineering practice presents tremendous challenges to industry. We hope that this book can help close the gap, answer some of the questions not answered by any one single book, and bring to light modern systems engineering thought processes and paradigms.

Many systems of interest in the modern world can be characterized as systems-of-systems (SoS). Infrastructure systems (power, transportation, water, etc.), the Next Generation Air Transportation System, the Walmart supply chain system, and the National Oceanic and Atmospheric Administration system are prime examples, and methodologies are needed to define them when they operate as an SoS. Traditional SE has treated the human "operator" as separate from the system, but in the current reality the human represents an inseparable (and integral) part of the system. Modern systems engineering paradigms bring the socio (human) aspect together with the technical aspects in an effort to optimize the performance of the system of systems to meet its mission objectives.

Ramo, S., and St. Clair, R. K. “The Systems Approach,” KNI, Incorporated, Anaheim, CA, 1998.


Methodologies are needed to define systems that can withstand major disruptions (hurricanes, earthquakes, terrorist attacks, etc.) and retain all, or acceptable, levels of functionality. The government has set priorities for doing this. This capability is called resilience, and modern SE needs to include it. While resilience to natural disasters is certainly important, resilience to a wide variety of other events, such as cyber attacks, hardware and software obsolescence, and interoperability issues, will also be important throughout the life cycle of a system of systems.

Traditional SE has relied on a document-centric approach for implementing requirements (among other aspects of systems). Modern SE is better able to meet the challenge of increasingly complex systems and SoS by using computer-based models. The Systems Modeling Language (SysML) enables SE for SoS to be performed efficiently by means of models (as opposed to documents). The recent focus on model-based systems engineering (MBSE) represents a potentially significant advance in making requirements development and design an iterative process that helps reduce programmatic risk prior to formal testing, thus greatly decreasing the costs associated with remedying issues discovered during formal testing and the subsequent redesign.

Modern systems incorporate increasing amounts of software. This trend, seen in aircraft, automobiles, and other systems, is not well covered in traditional SE. Merging software and hardware in requirements development, design, and testing has been challenging, but solving this problem will clearly be valuable for optimizing system performance.

Lean Systems Engineering (LSE) should be regarded as a process for amending traditional SE, as opposed to supplanting it with a new body of knowledge. It does not mean less SE but rather better SE (better preparation of the processes, people, and tools).

The six chapters presented in this book derive their motivation from the subject matter addressed above. The purpose of this book is to explore research and development advances that are at the forefront of systems engineering. Today, systems are becoming increasingly complex and invariably operate within a broader SoS context that typically includes social, technical, and natural systems such as healthcare, communications, power, transportation, and human organizations. It is important to understand the needs of the broader SoS before undertaking the design of the technical systems. These systems need to be capable of operating reliably in a wide range of conditions associated with the broader context, which demands robustness and resilience as key attributes of the SoS. Another critical advance is MBSE, which seeks to transform the traditional document-centric approach to systems engineering into a model-based approach. In some sense, this is analogous to the transition from the drawing board to computer-aided design models. In contrast to the document-centric approach, MBSE enhances the capability to more accurately represent systems and integrate multiple views of a system.


A methodology called the Relational Oriented Systems Engineering Technology Tradeoff Analysis (ROSETTA) Framework is introduced to give designers and decision makers a meaningful way to trade the benefits and costs of technology infusion against the ability to meet and even exceed design requirements.

Chapter 1 presents the fundamental concepts, challenges, and opportunities of SoS Integration (SoSI). Emergent behavior is an important aspect of an SoS, and the authors provide an overview of the definitions, theories, interpretations, and open issues related to these behaviors. An iterative SoSI concept is discussed using an SoSI ontology as a guiding construct. Various factors, such as risk management, requirements and interface definitions, configuration management, and verification and validation, are evaluated for SoSI. Interoperability is discussed as a cross-cutting issue.

As discussed in Chapter 2, sociotechnical systems are a type of SoS. This chapter discusses the origins of the concept, contemporary perspectives, and methods and tools for addressing such systems. Eighteen years ago, Simon Ramo noted that "the systems approach is being seen as a unifying, integrating mechanism for application to social problems." The engineering of these systems has to account for how people and organizations will use, as well as misuse, these systems. Without such an accounting, sound traditional engineering cannot be expected to lead to success. The core objective of the analysis of sociotechnical systems is the joint optimization of social and technical components. In this regard, agent-based simulations have allowed social and behavioral scientists to explore the sociotechnical behavior of complex systems at the right level of detail. The simulated behavior of individual agents can be compared to the behavior of individual humans, and the aggregate behavior of the simulation can be compared to the aggregate behavior of the population. It is possible to build a multilevel model that combines a computational model of the technical system with a computational model of the social system; such models are called agent-based models. While there are various interpretations of complexity, social systems are invariably considered complex regardless of the definition used. Some of the key properties of complex systems are highlighted in this chapter.

Like most complex systems, an SoS or sociotechnical system needs to be resilient to confront and recover from human-made threats and natural hazards. This is covered in Chapter 3. Multiple definitions of resilience are covered in an engineering context. A popular definition characterizes resilience as the "ability of the system to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks." Resilience engineering comprises 14 primary principles and 20 support principles. The human-in-the-loop principle is primary in all domains and relates to the subject of Chapter 2, Advances in Sociotechnical Systems. Another primary principle of resilience engineering is the defense-in-depth principle, which helps compensate for vulnerabilities in other principles. Threats and natural disasters are illustrated by several case studies in this chapter. The case studies cover four domains: fire prevention, aviation, rail, and power distribution, and address many of the primary and support principles.


MBSE is an approach to systems engineering in which the model of the system is a primary artifact of the systems engineering process. This is the subject of Chapter 4 and is the modern approach to engineering the systems described in Chapters 1, 2, and 3. The MBSE approach is often contrasted with a more traditional document-based approach in which similar information is captured in document-based artifacts. MBSE is often placed within the broader context of model-based engineering, in which the system model is intended to integrate with other models used by systems engineers and other disciplines. The chapter provides an overview of the Systems Modeling Language (OMG SysML™). An effective application of MBSE also requires a well-defined method that addresses typical systems engineering activities such as Analyze Mission and Stakeholder Needs; Specify Spacecraft System Requirements; Synthesize Alternative Spacecraft System Architectures; Perform Analysis; Manage Requirements Traceability; and Verify System. An MBSE method using SysML is applied to the design of a small satellite to demonstrate how MBSE can enhance the capability to address multiple aspects of a complex system design.

The ROSETTA Framework is introduced in Chapter 5 to illustrate how technology considerations are infused into an aircraft conceptual design. The ROSETTA approach applied to the conceptual design of aircraft begins by harnessing the best information available from experts in the field to make high-level decisions, and then proceeds to higher fidelity, more detailed analysis using historical data and/or physics-based modeling and simulation tools as the design matures. Starting with a detailed understanding of the customer requirements, the designer can use the QFD House of Quality, through repeated mappings from Requirements space to Metrics space to Design space at successively lower hierarchical parametric levels, to find the relative importance of critical design variables. Since it is not necessarily possible to implement all of these technologies simultaneously, a compatibility matrix is created to help identify technology combinations that are unreasonable or impossible. This is followed by a quantitative analysis that is sped up through the use of Design of Experiments (DoE) and surrogate modeling approaches. This approach to aircraft conceptual design can leverage the approaches in the previous chapters, including the approach to integrating social behavior from Chapter 2, the resilience engineering design approach from Chapter 3, and the model-based systems engineering approach from Chapter 4.

Chapter 6 summarizes recent powerful advances in Lean systems engineering (LSE) and Lean program management (LPM) intended to reduce program waste, cost, and schedule while promoting program quality and mission assurance. Recent large defense programs are notorious for high costs, development schedules that extend for years and even decades, and frequently reduced performance. The amount of bureaucracy and politics involved in these programs has been growing steadily, resulting in the deterioration of programs from intended "great engineering and management" to unintended "bureaucracy of artifacts."

Engineers tend to spend 10% or less of their time on engineering and the rest on various forms of non-value-adding waste. The number of top-level requirements has grown to thousands, and, on average, only 18% of them remain stable throughout the program. Unstable requirements penalize programs with added cost, schedule, and reduced performance. The chapter begins by describing the problems and challenges that burden recent programs and the destructive myths regarding interfaces and requirements, which lead to a Catch-22 situation. The bulk of the chapter is devoted to presenting the new body of knowledge of LSE and LPM. Results of two comprehensive studies by large teams of experts in LSE and LPM, in the form of 326 Lean Enablers (that is, best-practice heuristics) for systems engineering and program management, are summarized, and examples of specific enablers are presented. The LSE and LPM discussed in this chapter are applicable to the other five chapters.

John Hsu
August 2016

ACKNOWLEDGMENTS

Members of the AIAA Systems Engineering Technical Committee (SETC) started the initiative for the Journal of Systems Engineering, which transformed into working on a book with a progressive series of advanced systems engineering topics. It has taken nine long years to put the book together. Thanks to the members for their valuable suggestions during the course of the endeavor. Over the years, the book efforts have been led by the SETC chairs: John Hsu, John Day, Sophia Bright, John Dahlgren, John Eiler, and now Michelle Bailey. The contributing members are Satoshi Nagano, Eric Nichols, John Muehlbauer, David Dress, Mat French, Vicki Johnson, Reece Lumsden, and Dennis Van Gemert.

I would like to thank the AIAA Publications Acquisitions and Development Editor, David Arthur, for his constant advice, and the Senior Editor of Books, Pat DuMoulin, and production manager, Nick Barber, for their assistance. Of course, my sincere thanks go to the contributing authors of these six chapters. I am grateful for their patience with my demands and requests. Without their contributed chapters, the book would have been impossible to publish.

Finally, I would like to give many thanks to my longtime colleague and friend at The Boeing Company, Scott Jackson, for his guidance and contributions to this book.

John Hsu


AUTHOR BIOGRAPHIES

Joana Isabel Lacerda da Fonseca Pinto Cardoso is a Doctoral Student in the School of Systems and Enterprises at the Stevens Institute of Technology and a Research Assistant at the Center for Complex Systems and Enterprises. Joana's research interests include the analysis of complex systems and strategic planning for plausible future scenarios.

Sanford Friedenthal is an industry leader in model-based systems engineering (MBSE) and an independent consultant. As a Lockheed Martin Fellow, he led the corporate engineering effort to enable Model-Based Systems Development (MBSD) and other advanced practices across the company. His experience includes the application of systems engineering throughout the system life cycle on a broad range of systems. He led the industry team that developed SysML and is co-author of A Practical Guide to SysML.

Kelly Griendling was awarded a Sam Nunn Security Program Fellowship. She has been the Chief of the Advanced Systems Engineering Division at the Aerospace Systems Design Laboratory and a member of the Georgia Institute of Technology research faculty. Her research expertise is in the area of systems engineering, spanning requirements analysis, modeling and simulation, experimental practices, surrogate modeling, and data analysis, synthesis, and visualization.

Cecilia Haskins is an Associate Professor of Systems Engineering at the Norwegian University of Science and Technology (NTNU). Her research interests focus on applications of lean practices to manufacturing and construction projects in shipbuilding and the oil and gas industry. She holds an MBA from the Wharton School, University of Pennsylvania, and has been a certified Systems Engineering Professional since the inception of the certificate in 2004.

Scott Jackson is a Fellow of the International Council on Systems Engineering (INCOSE). He is now a Principal Engineer with Burnham Systems Consulting in the Greater Los Angeles area, specializing in systems engineering for commercial aircraft. He is the author of the book Systems Engineering for Commercial Aircraft: A Domain Specific Adaptation (Ashgate Publishing, United Kingdom, 2015). He is a doctoral candidate at the University of South Australia, and his doctoral research was the basis for the material in Chapter 3.

Azad M. Madni is a Professor of Astronautics and the Director of the Systems Architecting and Engineering Program in the University of Southern California's Viterbi School of Engineering. He is the founder and Chairman of Intelligent Systems Technology, Inc., a high-tech R&D company specializing in complex systems engineering and game-based educational simulations. He is an elected Fellow of AIAA, AAAS, IEEE, INCOSE, SDPS, and IETE.


Among his numerous awards and honors is the 2011 INCOSE Pioneer Award. He received his BS, MS, and PhD degrees from UCLA.

Dimitri Mavris is the Boeing Chaired Professor of Advanced Aerospace Systems Analysis in the Georgia Institute of Technology's School of Aerospace Engineering, a Regents Professor, and Director of its Aerospace Systems Design Laboratory (ASDL). He is the S.P. Langley NIA Distinguished Professor, an AIAA Fellow, and a member of the ICAS Executive Committee, the AIAA Institute Development Committee, and the U.S. Air Force Scientific Advisory Board. He is also the Director of the AIAA Technical, Aircraft and Atmospheric Systems Group.

Bohdan "Bo" W. Oppenheim is a Professor of Systems Engineering and Director of Healthcare Systems Engineering at Loyola Marymount University, Los Angeles. He is the founder of the Lean Systems Engineering Working Group of INCOSE, co-leader of the effort developing Lean Enablers for Systems Engineering, author of Lean for Systems Engineering, second author of The Guide to Lean Enablers for Managing Engineering Programs, and author of 11 books and book chapters and 30 journal publications. He has been recognized with two Shingo Awards, the INCOSE Best Product Award, a Fulbright Award, and an INCOSE Fellowship.

Christopher Oster is a Systems Architect at Lockheed Martin Advanced Technology Labs and an industry-recognized expert in systems engineering and model integration. Mr. Oster's experience has focused on developing large-scale, software-intensive systems including data processing suites, spacecraft command and control systems, and space platforms. Mr. Oster earned his M.S. and B.S. in Computer Science and Engineering from The Pennsylvania State University and is enrolled in a PhD program in Systems Engineering at Stevens Institute of Technology.

Michael J. Pennock is an Assistant Professor in the School of Systems and Enterprises at the Stevens Institute of Technology and Associate Director of the Center for Complex Systems and Enterprises. Michael's research interests involve issues associated with the computational modeling of enterprise systems and systems of systems in the national security and health care domains. Michael has also worked as a senior systems engineer in various lead technical roles for the Northrop Grumman Corporation.

William B. Rouse is the Alexander Crombie Humphreys Chair within the School of Systems & Enterprises at Stevens Institute of Technology and Director of the Center for Complex Systems and Enterprises. He is also Professor Emeritus, and former Chair, of the School of Industrial and Systems Engineering at the Georgia Institute of Technology. Dr. Rouse has written hundreds of articles and book chapters and has authored or co-authored at least 20 books, including award-winning ones.


He has served as chair of more than four committees, is a member of the National Academy of Engineering, and has been elected a fellow of four professional societies.

Michael Sievers is a Senior Systems Engineer at Caltech's Jet Propulsion Laboratory, where he is the principal avionics architect, systems engineer, and project software systems engineer for a number of JPL Earth-science and advanced electro-optical missions. Dr. Sievers also conducts research in the areas of system reliability, fault protection, resiliency, and model-based systems engineering. He is an Associate Fellow of the AIAA, a Senior Member of IEEE, and a Member of INCOSE.

CHAPTER 1

System of Systems Integration: Fundamental Concepts, Challenges and Opportunities

Azad M. Madni*
University of Southern California, Los Angeles, California

Michael Sievers†,‡
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California

I. INTRODUCTION

A system of systems (SoS) comprises standalone systems, originally designed for specific and distinct purposes, which are brought together within the SoS umbrella to create a new capability needed for a particular mission [Mayk and Madni, 2006; Manthorpe, 1996]. SoS are found in many domains, including the socio-technical systems discussed in Chapter 2, and they inform the technologies used in the future aircraft described in Chapter 5. An SoS may also be required to support multiple missions, some of which cannot be defined with any specificity. An SoS can be assembled dynamically to perform a particular mission and then reorganized as needed to perform other missions, including those yet to be envisioned. Dynamic resource reallocation is an essential attribute of the engineered resilient systems described in Chapter 3. An effective SoS design might comprise modules that are not as efficient as their standalone counterparts that perform the same functions; therefore, these modules are less likely to be deployed independently even though doing so may be technically feasible. An SoS is often characterized using terms such as "interoperable," "synergistic," "distributed," "adaptable," "transdomain," "reconfigurable," and "heterogeneous." As SoS complexity increases [Madni, 2012b; Vuletid et al., 2005], SoS integration (SoSI) becomes increasingly challenging. The Model-Based Systems Engineering (MBSE) tools discussed in Chapter 4 are a popular means for managing complexity. In essence, SoSI is intended to create a new mission capability through the composition of component systems that contribute complementary functions to the overall capability.

*Technical Director, Systems Architecting and Engineering Program.
†Senior Systems Engineer.
‡This work was done as a private venture and not in the author's capacity as an employee of the Jet Propulsion Laboratory, California Institute of Technology.

Copyright © 2014 by the authors. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.


Once component systems are developed and fielded, they are incrementally integrated and tested, and ultimately deployed in the operational environment. Often, the component systems tend to have different customers and purposes and, therefore, tend to employ different terminologies and concepts. Integrating such systems requires resolving semantic and syntactic differences. Furthermore, when SoSI involves the integration of legacy systems, it is invariably the case that the documentation and expertise needed for integration are not available. It is also worth noting that the conditions under which an SoS has to operate are becoming increasingly challenging [Neches and Madni, 2012]. For example, today's defense and aerospace systems, as well as healthcare enterprises and energy grids, have to satisfy affordability, adaptability, security, reliability, and resilience requirements [Madni and Jackson, 2008; Rehage et al., 2004; Mosterman et al., 2005].

Whether a given configuration is an SoS is a matter of degree rather than a binary decision; in fact, a system and an SoS lie on a continuum. Table 1 shows several key characteristics that are typically associated with an SoS. The more a system exhibits these characteristics, the more apt it is to call it an SoS. The criteria in Table 1 can be applied, for example, to determine whether or not the Curiosity rover (which landed on Mars in 2012) is an SoS. Curiosity comprises a number of interoperating subsystems designed to provide basic control, communications, power, thermal, and scientific services. However, these subsystems cannot operate in a standalone fashion and, consequently, do not have operational independence. Therefore, the Curiosity rover does not qualify as an SoS.

TABLE 1 KEY CHARACTERISTICS OF SoS

Operational independence of the elements: If the system-of-systems is disassembled into its component systems, the component systems must be able to usefully operate independently. The system-of-systems is composed of systems that are independent and useful in their own right.

Managerial independence of the elements: The component systems not only can operate independently, but they also do operate independently. The component systems are separately acquired and integrated but maintain a continuing operational existence independent of the system-of-systems.

Evolutionary development: The system-of-systems does not appear fully formed. Its development and existence are evolutionary, with functions and purposes added, removed, and modified with experience and need.

Emergent behavior: The system performs functions and carries out purposes that do not reside in any component system. These behaviors are emergent properties of the entire system-of-systems and cannot be localized to any component system. The principal purposes of the system-of-systems are fulfilled by these behaviors.

Geographic distribution: The geographic extent of the component systems is large. Large is a nebulous and relative concept as communication capabilities increase, but at a minimum it means that the components can readily exchange only information and not substantial quantities of mass or energy.

Source: [Maier and Rechtin, 2009; Sage and Cuppan, 2001].
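The "matter of degree" point can be made concrete by treating the Table 1 characteristics as a simple checklist. The following minimal Python sketch is an illustration only and is not part of the chapter; the scoring scheme is hypothetical, and the Curiosity entries merely restate the assessment in the text (which establishes the lack of operational independence; the remaining entries are illustrative assumptions).

```python
# Illustration only: expressing "SoS-ness" as a matter of degree over the Table 1
# characteristics rather than as a binary decision.
TABLE_1_CHARACTERISTICS = [
    "operational independence of the elements",
    "managerial independence of the elements",
    "evolutionary development",
    "emergent behavior",
    "geographic distribution",
]

def sos_degree(assessment: dict) -> float:
    """Fraction of the Table 1 characteristics that a configuration exhibits."""
    return sum(bool(assessment[c]) for c in TABLE_1_CHARACTERISTICS) / len(TABLE_1_CHARACTERISTICS)

# Curiosity rover: the text notes its subsystems cannot operate standalone, so it
# lacks operational (and hence managerial) independence; other entries are assumed.
curiosity = {
    "operational independence of the elements": False,
    "managerial independence of the elements": False,
    "evolutionary development": False,
    "emergent behavior": False,
    "geographic distribution": False,
}

print(sos_degree(curiosity))  # 0.0: more apt to call Curiosity a system than an SoS
```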

Emergence is a less well understood and more complicated concept in the SoS literature. An SoS could exhibit new behaviors that were not previously observed because the conditions that caused them did not previously exist. In this case, the SoS is doing what it was designed to do, but its behavior appears to be unanticipated because the outcome space was not fully explored. Is this emergence? Many would argue that it is not. Perhaps another way to think of emergence is to consider whether all states, state transitions, and outputs of the SoS are known and observable, or whether there are some states and transitions that are hidden and only become apparent when the observed outputs vary (i.e., diverge) from the expected outputs.

Figure 1 shows a simple example of a hidden Markov model that comprises hidden states and transitions [Baum and Petrie, 1966; Rabiner, 1989]. An SoS that exhibits behavior such as that shown in Fig. 1 is designed to conform to the set of states S and transition probabilities p, and to produce outputs from the set Y. However, under certain conditions, the SoS produces outputs from the set z corresponding to the hidden state set H. The cardinality of H is unknown, as is the connectivity of the hidden states to each other or back to the observable states. An SoS that functions solely within the set S may be called complicated. That is, while it may comprise a very large number of states and state transitions, it is entirely understood from its components and does not exhibit emergent behavior. Transitions to any hidden state may imply emergence if the behaviors cannot be fully explained solely from the behavior of the components.

We can look beyond engineered systems to explore the concept of emergence. For example, one might ask whether biologically inspired systems exhibit emergence [Bongard, 2009]. Over a lifetime of operation, a biologically inspired system may alter its responses to a given set of conditions as the system "learns" and "adapts." But is this truly emergence? Remember that the system executes a predefined algorithm for which the outcomes are completely predictable. What about a chemical solution that exhibits precipitation at some temperature? Is that emergence? That might depend on whether or not precipitation was anticipated. Once again, it comes down to how one chooses to define emergence.
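To make the hidden-state framing more tangible, the following sketch monitors observed outputs against the output set Y of the known model and flags anything outside Y as a candidate z-output, i.e., possible evidence of hidden states H. It is an illustration only, not the authors' method; the states, probabilities, and output names are hypothetical placeholders.

```python
# Illustrative sketch: the known (designed) model has states S, transition
# probabilities p, and per-state output sets whose union is Y. Observed outputs
# outside Y may correspond to hidden states H (outputs z) and warrant investigation.
import random

S = ["nominal", "degraded"]
p = {"nominal": {"nominal": 0.9, "degraded": 0.1},
     "degraded": {"nominal": 0.3, "degraded": 0.7}}
Y_by_state = {"nominal": {"ack", "telemetry_ok"},
              "degraded": {"retry", "telemetry_late"}}
Y = set().union(*Y_by_state.values())

def step(state: str):
    """Advance the known model one step and emit an output from its state's set."""
    nxt = random.choices(list(p[state]), weights=list(p[state].values()))[0]
    return nxt, random.choice(sorted(Y_by_state[nxt]))

def explainable(output: str) -> bool:
    """True if the output is producible by the known model (i.e., lies in Y)."""
    return output in Y

state, emitted = step("nominal")   # normal, explainable behavior of the known model
assert explainable(emitted)

# A stream of observed outputs; "checksum_mismatch" stands in for a z-output
for observed in ["ack", "retry", "telemetry_ok", "checksum_mismatch"]:
    if not explainable(observed):
        print(f"possible hidden-state behavior (candidate emergence): {observed!r}")
```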


Fig. 1 Hidden Markov model. States S and transition probabilities p are known and observable. States H and transitions u are neither known nor observable. Outputs Y and z are observables.

Emergence within complex adaptive systems (CAS) is easier to grasp with physical phenomena such as phase changes. However, biologists do not view this as a CAS phenomenon (no sensate "agents"). It is interesting to note how a phenomenon like phase change, which initially appeared to be almost magical, has become part of intent and design now that the phenomenon is understood. This observation raises further questions about emergence. Should emergence be redefined to cover the expected and designed, or should it be redefined in some other sense? Should emergence be defined along a continuum (i.e., is it a matter of degree)? These are but a few of the questions that need to be answered before emergence can be defined in a consistent and useful way. We note, though, that regardless of how one wishes to define emergence, from an engineering perspective we really need to understand whether our systems are resilient to unexpected behaviors. A number of researchers are investigating means for developing and analyzing resilient systems [Madni and Jackson, 2009].

There are other important considerations that come into play when a system assumes a role within an SoS. For example, it is likely that some degree of autonomy will be sacrificed when a system participates within an SoS. Also, some characteristics can potentially be both time- and goal-dependent. Thus, at a given time, a system may want to control certain local resources and not make them available to the SoS to achieve its overall goal. At some other time, the system may have completed its local task and released the previously tied-up resources to the SoS. In this regard, Baldwin et al. [2012] define a game-theoretic approach that is relevant for analyzing such a collaborative SoS.


The remainder of this chapter is organized as follows. Section II presents the SoS typology in use today. Section III discusses the unique characteristics of SoS. Section IV discusses SoS integration from several different perspectives. Section V presents an SoSI ontology to inform and guide SoS integration. Section VI presents an SoS model for a Mars Colony SoS as described in Shisko et al. [2015]. Section VII discusses SoS interoperability for different SoS types from a variety of perspectives. Section VIII presents human-system integration challenges in SoS. Section IX summarizes the contributions of this chapter.

II. SoS TYPOLOGY

According to the Systems Engineering Guide for SoS (2008), an SoS can be classified according to the way it is managed and its openness to change and new capabilities (Table 2). The form and rigor of SoSI are directly related to SoS type: an unmanaged SoS is inherently more difficult to integrate than a tightly managed SoS. Table 2 offers the coarse SoS classification defined in this guide. The boundaries between these types can be defined in terms of the degree of operational and managerial independence of the components.

In principle, each SoS type shown in Table 2 is open to any system that conforms to the interoperability rules of the SoS. However, an SoS may choose to restrict membership to only well-known, trusted systems. The Internet is an example of the former, while a secure banking network is an example of the latter. In an open, collaborative SoS, new, untrusted systems may enter and leave with unknowable impact on overall SoS integrity. In a closed SoS, the pool of systems is trusted (to an extent) and may also be required to undergo periodic inspection and validation to maintain membership in the SoS.

III. UNIQUE CHARACTERISTICS OF SYSTEM OF SYSTEMS (SoS)

There are several integration perspectives that apply to both systems and an acknowledged SoS (Table 3). These include certification and accreditation (C&A), acquisition, structure, integration mechanisms, and verification (does the SoS conform to established requirements?) and validation (does the SoS have the desired capabilities?). C&A is a major complexity driver because of the multiple management layers, multiple funding sources, and the lack of synchronization in the life cycles of the various systems within an SoS. Acquisition is a complexity driver due to multiple acquisition programs, multiple systems' life cycles across programs, and the need to achieve interoperability among legacy and new systems. Structure is a major complexity driver in that the structure of an SoS can change dynamically as systems continue to enter and exit the SoS. Integration mechanisms are a major complexity driver in that they need to support dynamic interoperability among constituent systems. Verification and validation (V&V) is a complexity driver because of the difficulty in synchronizing across multiple systems' life cycles, the dynamic entry/exit requirement for some of the SoS components, and the lack of a defined set of behaviors or requirements in some SoS.


TABLE 2 TYPOLOGY OF SYSTEM(S) OF SYSTEMS

Virtual SoS: No central management authority; no centrally agreed-upon purpose; systems can enter/exit dynamically based on mission requirements; potential for emergence of large-scale system behavior (which may be desirable); relies on relatively hidden mechanisms for maintenance. Examples: stock market, national economies.

Collaborative SoS: Component systems interact more or less voluntarily to fulfill agreed-upon central purposes; the Internet is a collaborative SoS (the Internet Engineering Task Force defines but cannot enforce standards); central players collectively decide how to provide/deny service (some means to enforce and maintain standards). Examples: World Wide Web, multirobot systems.

Acknowledged SoS: Has recognized objectives, a designated manager, and resources; however, constituent systems retain independent ownership, objectives, funding, development, and sustainment; changes in the system are based on collaboration between the SoS and the system. Example: DoD's Ballistic Missile Defense System.

Directed SoS: Integrated SoS is built and managed to fulfill specific purposes; centrally managed during long-term operation to fulfill those purposes and any new ones set by system owners; component systems maintain the ability to operate independently, but their normal operational mode is subordinated to the centrally managed purpose. Example: JPL's Deep Space Network.

Source: Adapted from OUSD AT&L SE Guide for SoS, 2008.


TABLE 3 COMPARING AN ACKNOWLEDGED SoS WITH A SYSTEM

Knowledge of Stakeholders
System: At one level.
Acknowledged SoS: At both system and SoS levels; include system owners with competing interests; in some cases, the system owner has no vested interest in the SoS; all stakeholders may not be recognized.

C&A
System: Program management aligned with program funding.
Acknowledged SoS: Added complexity due to the existence of management and funding for both the SoS and individual systems; the SoS does not have authority over all systems.

Operational Focus
System: Meet operational objectives.
Acknowledged SoS: Meet operational objectives using systems whose objectives may not necessarily align with SoS objectives.

Acquisition
System: Aligned with acquisition milestones, requirements; has systems engineering plans.
Acknowledged SoS: Quite complex due to multiple systems' life cycles across acquisition programs involving legacy and development systems, new developments, and opportunistic technology insertion; typically have stated capability objectives upfront that may need to be translated into formal requirements.

V&V
System: Generally possible.
Acknowledged SoS: More challenging due to the difficulty of synchronization across multiple systems' life cycles, the complexity of changing "parts," unspecified behaviors and requirements, and the potential for unintended consequences.

Boundaries and Interfaces
System: Pertain to a single system.
Acknowledged SoS: Pertain to systems that contribute to the SoS objective and enable the flow of data, control, and functionality across the SoS while balancing the needs of the system.

Performance and Behavior
System: To meet specified objectives.
Acknowledged SoS: Across the SoS, satisfies the needs of SoS users while balancing the needs of the component systems.

Structure
System: A priori defined interfaces and interactions assure that assembled subsystems work together.
Acknowledged SoS: Dynamic structures and the potential for syntactic and semantic mismatches may lead to "lethal" configurations that could not be predicted, implying that not all objectives of an SoS are achievable or that great care is needed in selecting SoS applications.

Mechanisms
System: Integrated and tested bottom-up after being built up hierarchically; interfaces designed for functional components with limited-use connectivity in mind; component requirement and interface verification of components as the system is built, preceded by functional validation.
Acknowledged SoS: Reusability and adaptability require flexible, generalized integration methods that provide semantic and syntactic compatibility that otherwise do not exist between systems that were not necessarily designed with integration in mind. In unmanaged SoS, formal validation may not be possible; therefore, integration success is determined during SoS operation and use.

9

a complexity driver because of the difficulty in synchronizing across multiple systems’ life cycles, the dynamic entry/exit requirement for some of the SoS components, and the lack of a defined set of behaviors or requirements in some SoS. In general, an SoS tends to be more complex than a system due to the added requirements for integration, different governance, and a different emphasis and approach to design and validation. Creating detailed requirements and evaluating compliance is possible when the objectives of an SoS are clear, and the SoS structure is closely managed. For example, an SoS could provide local situational awareness from data collected and processed by field wearable sensors and processors. Additionally, if reconnaissance aircraft data are available, then information about a neighborhood can be fused with information provided by ground sensors. Big picture situational awareness is achieved from integrating groundbased, aircraft-borne, and spacecraft sensors. Such an SoS has well-defined objectives that drive requirements for sensors, processors, and communications. These requirements can be flowed down to individual systems and validated by independent system engineering teams. In sharp contrast, consider the difficulty in generating requirements when the objectives and architecture of an SoS are unknown until stakeholders get together to specify a needed capability. In this case, requirements refer to communications protocols, information semantics, and usage restrictions. When each system is locally managed and maintained, assessing compliance even for general requirements may be left to local oversight with no external guarantees of compliance in terms of consistency and completeness. There are additional differences between a system and an SoS beyond those identified in Table 2. For example, the typical role of the human in a system tends to be mostly fixed with the human in the role of an operator (or an agent) with predefined interfaces. However, in an SoS, the human can orchestrate the SoS configuration and invariably perform as an agent with changing interfaces to other agents/systems under the SoS. This dynamic behavior occurs because a system can enter and exit an SoS based on changing mission requirements that, in turn, determine the mix of systems in the mission capability package. Specifically, humans in an SoS can experience ongoing changes with respect to the agents they collaborate with as systems enter/exit the SoS. Continuous context switching associated with ephemeral SoS teams can produce excessive cognitive load that subsequently tends to show up in the form of human errors and increased human error rates [Madni, 2010, 2011]. Also, given that some participating systems in an SoS are temporary collaborators, issues of trust and lack of shared understanding can arise compromising overall SoS performance.

IV. SYSTEM OF SYSTEMS INTEGRATION (SoSI) A virtual SoS and a collaborative SoS are not prespecified. The significance of the observation is that there can be no a priori expectation that information paths

10

A. M. MADNI AND M. SIEVERS

and/or information content are consistent in the SoS. Typically, an SoS is constituted and dynamically configured in a “loosely coupled” fashion to achieve the interoperability needed to create a mission capability package, and then reorganized as necessary to satisfy subsequent mission needs. In some cases, the systems, behaviors, and connectivity of an SoS become known only after a mission capability is specified. In this environment, traditional top-down integration and validation normally associated with systems integration are not applicable [Madni and Sievers, 2013]. Conversely, a directed SoS and an acknowledged SoS are prespecified, making them predictable and compatible with traditional functional verification and validation methods. Figure 2 shows a simplified SoSI flow. In this figure, the systems are linked directly or through an intermediary. While the figure shows only two SoSI configurations, there are many more possibilities as the “n” available systems are organically and dynamically configured to satisfy the needs of a mission capability package. Figure 1 implies a well-controlled flow. However, for an unmanaged SoS, structure and integration primarily revolve around the needs of the stakeholders,

Fig. 2 SoSI showing two possible integration configurations. The two SoS are configured organically and dynamically for a particular need, and subsequently reassembled to satisfy other needs.

SYSTEM OF SYSTEMS INTEGRATION

11

technologies involved, and legacy constraints [Madni, 2012a]. Moreover, cost and schedule constraints often conflict with the desire to thoroughly test a given level of integration before moving to the next level. At this point, there are two questions that one might ask: (1) what does it mean to dynamically integrate a collection of possibly unknown resources, and (2) what assurance is there that the SoS correctly provides the services needed by its users? The answer to both questions depends, in part, on the type of SoS and the interoperability and connectivity mechanisms used. Currently, there are no formal or uniformly applied mechanisms to identify and isolate systems that do not comply with the protocols and rules implicit with membership in an unmanaged SoS. Generally, SoS users stop using a “misbehaving” system and prevent that system from interoperating with other systems in the SoS. Various organizations today create and maintain lists of systems that are untrustworthy or have been found to host malicious applications (http:// www.spamhaus.org).

V. SoS INTEGRATION ONTOLOGY As described in an earlier paper [Madni and Sievers, 2013], the definition, management, and evaluation of integration approaches for systems and SoS are hampered by the lack of common terminology and clear definitions of technical and nontechnical interdependencies. To overcome this problem, we propose an SoS integration ontology shown in Fig. 2. This ontology serves to: 1. Define standard terminology to describe integration features. 2. Define a checklist of salient issues and influences that impact integration. 3. Enable development of integration models that guide/support reasoning. 4. Accommodate all stakeholders’ needs and their respective viewpoints. It is important to note that ontologies, in general, are not unique for a particular problem domain and, ultimately, all ontologies are transformed into useful representations that help answer questions and assist in reasoning in the problem space. An instantiated integration model links the properties to each other and to properties associated with the behaviors, requirements, architectures, and allocations that are commonly held within a model of the technical characteristics of the SoS. It also includes a “V&V” entity that captures pertinent V&V metrics and artifacts. As shown in Table 4, several core concepts are involved in SoSI. These concepts provide the semantic underpinnings for defining and managing SoSI. Practically speaking, the SoSI ontology offers a way of unifying the concerns in SoSI. The characteristics of the SoSI ontology and each concept in this ontology are described next.

12

A. M. MADNI AND M. SIEVERS

TABLE 4 System/Industry

REPRESENTATIVE C&A Representative C&A

DoD Ballistic Missile Defense System (BMDS) (acknowledged SoS)

DoD approved communications protocols, local test and maintenance schedules, event reporting, backup, disaster recovery, formal documentation of failures and corrective actions

Internet (collaborative SoS)

Internet Engineering Task Force (IETF) imposed protocols and protocol implementations, working groups, communication standards, security, and so forth.

Banking (collaborative SoS)

Government regulations, international treaties and agreements, internal bank policies, backup and disaster recovery policies, central bank regulation, legal constraints, ISO standards, OCC standards, business codes, and licensing. C&A imposed by national and international banking authorities to enable the exchange of funds among banks around the world.

SoSI Ontology SoSI ontology includes both artifacts (documentation) and metrics (measurements used to assess integration success). SoSI concepts are related to the SoSI domain through relationships (associations) such as SoSI has stakeholders, has CA, and has integration resources. In an instantiated SoSI model, data property assertions assign values to properties such as “documentation,” and “metrics.” For example, a model could assign, “Protocol Document D,” to the documentation property meaning, “My SoS Integration has Documentation ¼ Protocol Document D,” where My SoS Integration is an instance of SoS integration. Data property assertions are not restricted to single entities. The assigned values are allowed to be a class or any element derived from a class. This characteristic enables defining, for example, a class structure for the metrics property of SoS integration. If this class is named, “My Metrics,” then My SoS Integration has Metrics ¼ My Metrics allows defining multiple metrics, and assigning values to each. Object property assertions represent relationships between instantiations of the compositional elements in Fig. 3. These assertions form the logical network that links elements of the SoSI model. As an example, a stakeholder defined as a “manager” could have a property assertion, “has cost report,” that is linked to the data property earned value, which is contained in the concept, C&A. That is, Stakeholder_Manager has Cost Report ¼ Earned Value in which

Fig. 3 Viewpoint and view for two stakeholders: Project manager and customer.

The network formed by object property assertions can be expected to grow quite large and complicated. Therefore, as with any complicated model, it is advantageous to define viewpoints that capture specific concerns and characteristics of stakeholders. Viewpoints are associated with views that organize the model according to the viewpoint and focus on specific content that addresses stakeholder concerns. For example, one viewpoint for SoSI might be Schedule Risk. This viewpoint captures the concerns associated with issues that could delay procuring, developing, or maintaining the SoS. This viewpoint is of interest to and could be shared by a number of stakeholders including the Project Manager, System Scheduler, and the Maintenance Supervisor. An example of a view that conforms to this viewpoint might be the availability of physical resources as defined in the instance of the Integration Resources model element.

SoS Stakeholders

Stakeholders are individuals, organizations, or external systems with an interest in the SoS, or in a component system that could potentially become part of or participate in the SoS. Stakeholder "concerns" are those issues that are important to specific stakeholders. Some stakeholders can exert "influence" on some aspect of SoS integration and can, in turn, be influenced by other aspects. The concerns and influences of stakeholders define the "viewpoints" that characterize specific aspects and relationships (within the model) represented as views that conform with a particular viewpoint. There are typically a number of viewpoints for a given model that can be associated with one or more stakeholders. For example, a project manager (stakeholder) is typically concerned about schedules, costs, and risks. Figure 3 shows an instance of a viewpoint for this stakeholder that has a view defined as earned value analysis that depends on the expenditures and schedule values in the integration resources element. Other stakeholders (e.g., the project customer), who may also want to view/review an earned value report, can also share the earned value viewpoint.
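As a concrete illustration of the earned value view, the sketch below computes the standard earned value indices from the schedule and expenditure values held in Integration Resources; the dollar figures are hypothetical.

```python
# Illustrative earned value calculation from schedule and expenditure data.
# PV = planned value, EV = earned value, AC = actual cost (all figures hypothetical).
def earned_value_report(pv: float, ev: float, ac: float) -> dict:
    return {
        "schedule_variance": ev - pv,  # negative -> behind schedule
        "cost_variance": ev - ac,      # negative -> over budget
        "spi": ev / pv,                # schedule performance index
        "cpi": ev / ac,                # cost performance index
    }

# Example: $4.0M of work planned to date, $3.5M of work completed, $3.8M spent.
print(earned_value_report(pv=4.0e6, ev=3.5e6, ac=3.8e6))
```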


SoSI requires defining stakeholders, and then managing their needs and expectations. These needs and expectations can often be in conflict. Thus, managing them is as much a part of SoSI as the process of building and testing the SoS. All stakeholders are influenced by and can influence one or more stakeholders. Invariably, stakeholders monitor metrics associated with system parameters that relate to their concerns. Some stakeholders (e.g., designers, testers) also provide artifacts. The interactions, associations, and semantics that link stakeholders represent communications pathways, checks and balances, and testable assertions.

Integration Resources and Influence of External Factors

Because they are closely related, integration resources (or enablers) and external factors can be addressed together. Integration resources define the "what," while external factors define influences that affect the "what." That is, Integration Resources are the physical resources (equipment, facilities, materials, etc.), integration plans, required interfaces to external entities, and the schedule and budget necessary for integration. External influences contribute to integration resources but often may not be under the control of the SoS integrator. Thus, for example, suppose that an SoS plans to use a particular protocol defined by an international standards organization. Although the SoS integrator may have some influence on the decisions of the international body, it may not be able to change aspects of the protocol that are inconvenient or inefficient for the particular mechanisms planned for the SoS. In this case, an international body becomes an external influence on the design and integration of the SoS.

SoS Architecture

The architecture of an SoS defines the component systems and their interfaces. The dynamic nature of an SoS implies that the constituent systems and interfaces may vary based on capabilities needed to satisfy mission goals. Moreover, certain functions may exist in more than one constituent system. These systems are incorporated within the SoS as needed to accommodate, for example, local maintenance schedules, availability, performance requirements, and interconnection traffic. While the specific components within the SoS structure need not be fixed, the mechanisms available to link and unlink systems need to conform to defined rules.

Figure 4 shows an exemplar SoS comprising "m" systems interconnected through two communications fabrics. In this example, system m and system m+2 may use either of the two fabrics. These fabrics do not necessarily have the same semantic and syntactic interfaces. This fact complicates SoS integration and validation. The example in Fig. 4 also implies that connecting systems 1 and n, for example, requires cooperation from system m. In virtual and collaborative SoS, there is no central authority with the knowledge of available capabilities and their ability to be integrated at any point in time.

Fig. 4 An example of an SoS comprising two distinct communications fabrics.
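A hypothetical illustration of the kind of rule check implied by Fig. 4: two member systems may exchange data only if they share a fabric or a bridging system sits on a fabric of each. The fabric membership data below are invented for illustration.

```python
# Hypothetical check of whether two member systems can exchange data, given the
# communication fabric(s) each system is attached to (cf. Fig. 4); example data only.
FABRIC_MEMBERSHIP = {
    "system_1": {"fabric_A"},
    "system_n": {"fabric_B"},
    "system_m": {"fabric_A", "fabric_B"},  # attached to both fabrics
}

def can_communicate(a: str, b: str, members=FABRIC_MEMBERSHIP) -> bool:
    # Direct path: the two systems share at least one fabric.
    if members[a] & members[b]:
        return True
    # Indirect path: some bridging system sits on a fabric of each endpoint.
    return any(members[a] & m and members[b] & m
               for name, m in members.items() if name not in (a, b))

print(can_communicate("system_1", "system_n"))  # True, but only via system_m
```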

SoS integrators need an understanding of whether nondeterminism is an issue for users of the SoS in order to make informed decisions regarding the mechanisms used.

NASA's Deep Space Network (DSN; http://deepspace.jpl.nasa.gov/dsn) is an example of a Directed SoS structure as defined by DoD. It is important to note, however, that using the criteria in Table 2, a directed SoS could not exist, because once management is centralized, an SoS becomes just another system. The DSN is an international network of radio antennas and data relays used to communicate with Earth-orbiting satellites and interplanetary spacecraft, as well as to carry out radio and radar astronomy observations. There are currently three stations in the DSN, located at Goldstone, California; Madrid, Spain; and Canberra, Australia (Fig. 5). Each station is locally managed and maintained. However, for certain NASA missions, they come together to provide communication with the spacecraft and to coordinate communication handover as the spacecraft goes out of view of one station and into the view of another. Station activities are coordinated by a centralized operations center in accordance with spacecraft visibility. Data collection and control activities are planned by the centralized operations center and automatically switched as signal strength declines at one station and increases at another. Although there are variations in when the stations assume communication, related to Earth orbit and spacecraft trajectory, there are very well tested models that can predict what each station does and when. There is also a fixed set of functions that each station needs to provide while in contact and while waiting for contact. The SoS is carefully tested well in advance, and simulated tracking and hand-offs are performed to assure that all resources are working as expected well before the need arises.
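The automatic switching described above can be pictured as a simple selection rule; the sketch below is purely illustrative and does not reflect the actual DSN scheduling or handover software.

```python
# Purely illustrative handover rule (not actual DSN software): remain on the
# current station until another visible station's predicted signal strength exceeds it.
def select_station(current: str, signal_strength: dict[str, float]) -> str:
    best = max(signal_strength, key=signal_strength.get)
    return best if signal_strength[best] > signal_strength.get(current, 0.0) else current

# Example with invented values: Goldstone is setting while Canberra rises into view.
strengths = {"Goldstone": 0.2, "Madrid": 0.0, "Canberra": 0.7}
print(select_station("Goldstone", strengths))  # -> "Canberra"
```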

Fig. 5 70-m antenna at the Goldstone Tracking Station (http://deepspace.jpl.nasa.gov/dsn/gallery/goldstone3.html).

While critical missions may use more than one station to assure uninterrupted communication, other missions might have only one station assigned. Typically, each station serves multiple missions and, when not acting as a command and telemetry relay, can be repurposed for astronomy observation. The SoS is carefully controlled by a time-based activity schedule that dictates the configuration and services that a station must perform in support of flight missions, as well as nonmission time that is scheduled by science teams.

The ARPANET, sponsored by the U.S. Advanced Research Projects Agency, evolved from work done at MIT Lincoln Labs, UCLA, Stanford, UC Santa Barbara, the University of Utah, and Bolt, Beranek and Newman (BBN). The original ARPANET was a centrally managed, directed SoS but over time evolved into an unplanned, collaborative SoS: the Internet. The Internet itself has evolved from a chaotic collection of disparate resources available only to researchers and dedicated enthusiasts to one that is somewhat organized by web services and search engines. In the early days of the Internet, users knew how to access the resources they needed and made use of well-documented protocols for making connections and exchanging information. In many cases, users depended on personal communication with other users to find resources, learn semantic interfaces, and solve problems. The modern Internet is supported by application layers, object


and semantic standards, and message transport protocols that hide many of the connectivity details and also assure communication compliance. From an SoSI perspective, the modern Internet exists by virtue of the protocols developed and stable intermediate services that provide the links between systems. Regardless of its management and determinism, a key structural theme for SoSI structures is stable and robust communications. This characteristic implies that SoS architects should exert as much control as possible over the systems and standards that form the basis of SoSI [Rechtin, 1991].

Certification and Accreditation (C&A)

C&A defines the records, tests, regulations, rules, policies, and other such considerations that are binding on the SoS [Mayk and Madni, 2006]. For example, a company may write a policy that defines how components of a system must be grounded. All products from that company need to adhere to that policy to ensure that integration does not fail due to grounding-related incompatibilities. In an SoS, C&A may take the form of international standards and protocols rather than corporate or government regulations. Compliance failure is not officially policed, but systems found lacking are kept from participating in the SoS until they become compliant or are made compliant. A centrally managed SoS may have formal requirements on a number of local operations, maintenance, testing, and scheduling. The form that C&A takes depends on the particular use of the SoS and the impact of failure.

Mechanisms

Mechanisms are processes, procedures, tests, and inspections that are required for integration. They are generally motivated by safety and verification concerns. At the most basic level, mechanisms assure that appropriate safety precautions have been taken (e.g., checks performed before attaching a cable to a connector). At higher levels, procedures are put in place to ensure that enabling a given function for the first time does not adversely affect or cause the failure of other components, subsystems, or systems. It is important to note that mechanisms tend to be somewhat specific to given industries and products. Table 5 presents some common SoS integration mechanisms. These mechanisms are fundamentally applicable to any integration process. It is also worth noting that each mechanism has associated documentation that gets filed (i.e., persistently stored) in the development project database. These documents serve as evidence that certain required processes were in fact performed. As important, they become a source of information that can be useful for diagnosing anomalous behavior after an SoS has been deployed.


TABLE 5 KEY SoSI MECHANISMS FOR SoS CATEGORIES

SoS Type: Virtual SoS (deliberate/on-the-fly)
Integration Mechanism: Integration is largely accomplished through adherence to published standards for naming, navigation, and document structures; the SoS evolves with its environment; there is no authority to enforce compliance with governing standards.
Example: WWW; control is exerted through documented standards for resource sharing, navigation, and documentation; Web sites choose to abide/not abide by the standards; the system is controlled by forces that assure cooperation and compliance with core standards.

SoS Type: Collaborative SoS
Integration Mechanism: No centralized organization to enforce compliance with integration standards. Component systems voluntarily collaborate to fulfill a larger purpose. Agreements are enforced by acceptance/rejection of service requests among collaborators. The primary integration mechanism is agreements documentation.
Example: Elements are computer networks and major computer sites; some networks exchange information using documented protocols; protocol adherence is largely voluntary with no central authority with coercive power.(a)

SoS Type: Acknowledged SoS
Integration Mechanism: Collaborative, with a designated manager to enforce collaboration; the manager dictates accepted protocols and semantics for collaboration and can require evidence of adherence to interface specs.
Example: Air Force's Air and Space Operations Center; Navy's Naval Integrated Fires Counter Air Capability.

SoS Type: Directed SoS
Integration Mechanism: Built for specific purposes and centrally managed; component systems operate independently but their services are co-opted by the centralized manager; component systems are tightly coupled with the centralized function through layered architectures.
Example: JPL's Deep Space Network; comprises a number of stations with the ability to carry out science investigations; they come together when needed to support a space flight mission.

(a) Coercive power emerges through agreements among major sites to block traffic from sites observed to misbehave.


The banking industry, for example, is governed by a wide variety of local, federal, and international rules, regulations, treaties, and laws. There are ISO standards that define currency codes, securities identifiers, Eurobond identifiers, business codes, a standard e-business ontology, licensing, and so forth. The US Office of the Comptroller of the Currency publishes a number of regulations and may take enforcement action when laws, rules, or regulations are violated. There are also a number of regulations published by the US Consumer Financial Protection Bureau (CFPB), the Federal Reserve Bank (FRB), and the Federal Deposit Insurance Corporation (FDIC), as well as regulations defined by each country the bank does business with. Also, banks may have internal policies and procedures for managing and protecting customer information and assets. Although banks are integrated as a collaborative SoS, required C&A imposes tight restrictions on allowable integration mechanisms to ensure the safety and security of the bank and its customers, as well as mechanisms that conform to government regulation. The Bank Secrecy Act (BSA) of 1970, for example, requires financial institutions in the United States to detect and prevent money laundering. There are five reports that a financial institution has to file for transactions in excess of $10,000. Given that most transactions are managed electronically, each financial institution must provide a means to integrate information potentially from multiple sources and trigger a warning when any BSA condition occurs.

Risk and Risk Management

Risk is an evaluation of the likelihood of a given obstacle and the impact of that obstacle if the risk occurs. Risk management involves processes associated with evaluating risk and determining whether and when risk mitigation is needed. Risk management activities are an integral part of SoSI initiatives. Risk management during SoSI subsumes acceptance testing, deployment, transition from the existing system to the new SoS, operations, maintenance, logistics, training, and upgrades. SoSI needs integration plans, which are derived from development plans, engineering plans, and test plans. NASA, for example, has issued a Risk Management Handbook (NASA-SP-2011-3422) that includes the informational flow and roles shown in Fig. 6.

Risk items are typically inserted in a table that places risk likelihood on one axis and risk consequence on the other. The intersection of a likelihood and a consequence defines the criticality of the risk. A color code is used in the matrix, in which high-likelihood, high-consequence risks are coded red, intermediate risks are coded yellow, and low risks are coded green. Each risk is assigned an action that indicates what, if anything, should be done to mitigate it. Additionally, the risk trend (increasing, decreasing, no change) is tracked. The flow in Fig. 6 and the risk table are applicable to a centrally managed SoS such as the DSN, in which a Flight Director for a specific flight mission can articulate a set of objectives and values that are analyzed and adjudicated. Risk items are difficult to identify, and even more difficult to track or mitigate, in collaborative or virtual SoS. This is the case because of the dynamic nature of the SoS, the lack of specific goals, and little or no insight into the systems expected to participate in carrying out a particular capability. In this latter environment, one might almost want to assume the worst case, that is, that the likelihood of the risk occurring is high and the consequence is that the capability is not available. In noncritical applications this condition might be acceptable, since the payoff when the capability does work is high. Of course, if a multibillion dollar spacecraft is heading to Mars, Flight Directors want very high assurance that no "red" or "yellow" risks are outstanding and that all "green" risks are under control.

Fig. 6 NASA Risk Information Flow and Functional Roles (NASA SP-2011-3422, http://www.hq.nasa.gov/office/codeq/doctree/NHBK_2011_3422.pdf).
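A minimal sketch of the likelihood-consequence scoring described above; the 5 x 5 scale and the thresholds are representative choices, not taken from the NASA handbook.

```python
# Representative 5x5 risk matrix scoring; the thresholds are illustrative only.
# Likelihood and consequence are each scored from 1 (low) to 5 (high).
def risk_color(likelihood: int, consequence: int) -> str:
    score = likelihood * consequence
    if score >= 15:
        return "red"      # high likelihood and high consequence: mitigate now
    if score >= 6:
        return "yellow"   # intermediate risk: watch and plan mitigation
    return "green"        # low risk: accept and track the trend

print(risk_color(likelihood=4, consequence=5))  # -> "red"
print(risk_color(likelihood=2, consequence=2))  # -> "green"
```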


Requirements and Interface Definitions

Requirements are physical, logical, functional, nonfunctional (safety, reliability, survivability, etc.), and performance agreements between stakeholders that must be satisfied by the completed implementation. Interface definitions document the physical and logical details (semantic and syntactic) needed for interconnecting components.

Configuration Management (CM)

CM typically comprises processes and documentation that control the performance, capabilities, and composition of the SoS (Table 6). CM assures that expected behaviors are present in the SoS and that these behaviors agree with their associated artifacts. Global CM is possible for an SoS that is centrally managed and, to some extent, for locally managed component systems. In general, though, systems within an SoS have no obligation to publish or maintain their configuration. Consequently, the SoS configuration changes as new systems join, member systems leave, local maintenance is performed, and systems elect to temporarily opt out of the SoS entirely, or in part.

TABLE 6 KEY ELEMENTS OF CONFIGURATION MANAGEMENT

Key Element: CM planning
Definition: Defines the tools, entities, responsibilities, and relevant documentation that capture how the SoS configuration will be managed.

Key Element: Identification
Definition: Essentially the "parts list" covering the tracked CIs, ownership, versions, models, and so forth.

Key Element: Change control
Definition: Defines the change process, responsibilities, and what documentation accompanies the changes.

Key Element: Configuration status accounting
Definition: Metrics associated with the SoS, CI changes, and the state of its documentation.

Key Element: Configuration verification
Definition: Establishes the consistency between the SoS and its documentation.
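Because member systems join and leave dynamically, even a lightweight configuration record supports the status accounting and verification elements in Table 6. The sketch below, with invented fields and system names, suggests what such a record might look like for a centrally managed SoS.

```python
# Illustrative configuration status accounting for a centrally managed SoS:
# track which member systems are currently joined and what version each runs.
from datetime import datetime, timezone

sos_configuration: dict[str, dict] = {}

def record_join(system: str, version: str, interfaces: list[str]) -> None:
    sos_configuration[system] = {
        "version": version,
        "interfaces": interfaces,
        "joined_at": datetime.now(timezone.utc).isoformat(),
        "status": "active",
    }

def record_leave(system: str) -> None:
    # Keep the record rather than deleting it, so configuration history stays auditable.
    sos_configuration[system]["status"] = "left"

record_join("ground_station_A", "3.2.1", ["telemetry", "command"])
record_leave("ground_station_A")
print(sos_configuration)
```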

Verification and Validation (V&V)

Verification comprises the processes (test, analysis, inspection, demonstration, etc.) needed for assuring that the SoS was built correctly. Validation comprises processes that evaluate whether the right SoS was built. Verification typically focuses on "proving" requirement satisfaction, while validation examines whether the use cases defined for the SoS are achievable. V&V may be difficult when the SoS is loosely defined and exhibits emergent behavior.

VI. SoSI MODEL FOR A DIRECTED SoS

In this section we apply our integration ontology to a hypothetical SoS. A number of studies have been conducted over the years that evaluate the options for sending a human crew to Mars. For example, [Hoffman and Kaplin, 1997] is a comprehensive NASA study that explores all aspects of travel to and survival on Mars. More recently, [Shisko et al., 2015] looks at an SoS architecture for surface activities and support of Mars colonization. We use a simplified version of the latter SoS for illustrating the ontology. The reference architecture comprises three primary systems: Surface Operations, Supply Chain, and Ground Support (Fig. 7). Surface Operations comprises life support, habitat, communications, navigation, work and leisure, in-situ resource utilization, and surface mobility systems. The Supply Chain comprises logistics, propellant depots, interplanetary internet, propulsion, ascent/descent, deep space habitat, and launch vehicle systems that deliver supplies to the

Fig. 7 Simplified context of Mars mission SoS.

Surface Operations. Ground Support comprises ground networks, research, manufacturing, and deep space communication systems that support Surface Operations. Table 7 shows representative examples of each ontology term for the hypothetical Mars mission.

TABLE 7 INSTANTIATION OF INTEGRATION ONTOLOGY FOR MARS MISSION

Artifacts: Mission plans, habitat assembly plans, equipment user manuals, requirements, CONOPS, interface definitions, contingency plans, mission assurance analyses.

Metrics: Mass margin, surface power margin, propellant margin, communications bit error rate, uplink data rate, downlink data rate.

Stakeholders:
- Surface crew (concerns: safety, work schedule, leisure schedule, comfort, family contact, ...)
- Supply crew (concerns: safety, work schedule, leisure schedule, comfort, family contact, mission duration, ...)
- Ground scientists (concerns: surface schedules, health of surface science equipment, quantity and quality of returned data, ...)
- Others

Tailoring and Reuse: Resupply vehicle.

Certification and Accreditation: NASA manned mission safety requirements, space radiation protection requirements, surface and supply crew nutritional requirements.

Structure: Each major system comprises a number of subsystems and assemblies; e.g., Surface Operations comprises a habitat, which is composed of life support, crew quarters, leisure areas, work areas, and power generation.

Integration Resources: Example: Surface Operations will require tools and test equipment for assembling the habitat and checking that it is structurally sound, leak free, and stable on the surface. Specialists may be needed for the assembly process and for performing tests.

Architecture: Example: systems within the habitat are loosely connected by WiFi. Surface mobility connects to habitat communications through an RF link.

Mechanisms: Example: the habitat arrives in several launches and is deposited on the Mars surface prior to crew arrival. The crew arrives in a descent vehicle that also serves as an ascent vehicle for the returning crew. The crew inventories the habitat shipments and, in coordination with ground support, determines whether to proceed with assembly. Given the go-ahead, the crew assembles the outer wall of the habitat and verifies its integrity. Once verified, the internal rooms and equipment are installed.

External Influences: Examples: the Mars environment influences the habitat characteristics and the location of suitable materials for in-situ resource usage.

Risk and Risk Management: Example: there is a risk that a supply mission is delayed; the risk is mitigated by assuring that the colony has sufficient supplies to last at least two launch windows (approximately 18 months between windows).

Requirements and Interface Definitions: Examples: requirements on the time needed for habitat assembly and test, and requirements on contingency procedures should assembly problems arise.

Configuration Management: Example: the surface crew may need to swap out a failed component; surface and ground documentation is updated when the replacement is connected.

Verification and Validation: Occurs at several levels and in multiple locations. All launch and supply vehicles undergo comprehensive ground tests. Additional V&V is performed by the supply crew in-flight and by the surface crew during and after habitat assembly.

VII. INTEROPERABILITY IS A CROSS-CUTTING ISSUE

Interoperability is the ability of distinct systems to share semantically compatible information and then process and manage the information in semantically compatible ways, enabling users to perform desired tasks [Madni and Sievers, 2012; Zeigler and Mittal, 2008]. The emphasis on distinct systems that were not designed to be integrated with each other is implicit in the concept of interoperability. Interoperability is intended to create a capability that serves an important human purpose and/or satisfies a mission requirement. Interoperability implies far more than getting systems to communicate with each other. It requires that the organizations involved employ some degree of compatible semantics and common interpretations of the information they exchange. The Levels of Conceptual Interoperability Model (LCIM) is a key advance in this regard [Tolk and Miguira, 2003]. A 2006 NRC report on defense modeling, simulation, and analysis lays out a strategy for meeting the interoperability challenge as well [National Research Council, 2006].

Interoperability is a cross-cutting concern that is beyond the scope of a single system development effort or organization [Naudet et al., 2010]. It is also important to realize that systems are often designed and implemented before a recognized need for their interoperation exists. Such a need, when it arises, extends interoperability beyond the original scope of the system. Even so, there are internal concerns that need to be addressed by any system if some day it is called upon to interoperate with others. These include adherence to standards, choice of information processing procedures and algorithms, the validity criteria

surrounding the representation and processing of information, interfaces to other systems and to their users, and the nonfunctional quality attributes such as reliability, availability, security, privacy, and information assurance.

For results to be meaningful and valid, interoperable systems need to share common/compatible semantics. This requirement implies that interoperating systems must have compatible mechanisms for exchanging, representing, modifying, and updating semantic descriptions of information items. As important, an interoperable system needs to process information in ways that are meaningful to the other systems with which it interoperates. To the extent that the meaning and form of information is dynamic (i.e., can change over time), these systems need to be able to dynamically modify their information processing approaches and, possibly, their representations. Since interoperability is a cross-system issue, it involves other cross-cutting concerns (e.g., security) in achieving interoperability solutions. The cross-cutting nature of interoperability has other important consequences. First, interoperability cannot be implemented piecemeal. Second, interoperability has a unique and crucial coordination role relative to the other cross-cutting concerns [Rothenberg,


2008]. Third, interoperability cannot be added as an afterthought without incurring substantial costs and diminished effectiveness.

In sum, when a system participates within an SoS, it is necessary to ensure that it understands the data, processing, and policies of the other systems in the SoS with which it interoperates, and that they, in turn, understand its data, processing, and policies. This requirement creates a need to explicitly represent and share the subset of system semantics needed to interoperate with these other systems. This need, which is rooted in the SoS construct, is rarely acknowledged or resourced within individual system design efforts, especially for pre-existing systems that were designed without interoperability considerations in mind.

SoS Interoperability

It is difficult enough to implement interoperability when the system is being designed with interoperability in mind. It is even more difficult when this is not the case. Table 8 presents five interoperability cases in increasing order of difficulty.

TABLE 8 SoS INTEROPERABILITY CASES

1) New system developed as part of an intentionally designed new SoS
   - An overall authority should be empowered and resourced to perform SoS integration across all systems in the SoS.
   - Each individual system development effort should be required and resourced to adhere to interoperability guidelines mandated or recommended by this authority.

2) New systems are designed to work with an existing, intentionally designed SoS
   - Each new system should be required and resourced to adhere to whatever interoperability guidelines exist for that SoS.

3) New system developed to be part of an existing SoS that was not intentionally designed as such
   - Such an SoS may not have any interoperability guidelines.
   - New guidelines may have to be developed retroactively by whatever authority controls the SoS or by collaboration with developers, maintainers, and users of other systems in the SoS.

4) An existing system, not intended to be part of any SoS, is subsequently required to interoperate with other systems in an existing or new, intentionally designed SoS
   - The existing system must be retrofitted to conform to the interoperability guidelines of the SoS.

5) An existing system that is not part of any SoS is required to interoperate with other systems in an ad hoc SoS
   - Not only is the SoS not designed intentionally as an SoS, but also no new system development effort is underway to incorporate interoperability as a new concern.
   - Instead, existing systems in the SoS must be made to interoperate with minimal redesign/modification.
   - This is demanding and risky; it requires retrofitting pervasive, cross-cutting, semantic interoperability into specific aspects of multiple independent, existing systems not designed to interoperate or be part of an SoS.
   - Unfortunately, it is a common occurrence in the real world to require existing systems to interoperate in ways never envisioned by their designers.
   - Without some degree of formal SoS integration, coupled with significant, resourced redesign and reimplementation of existing systems, such initiatives are unlikely to succeed.

The first two cases in Table 8 are demanding but straightforward, since they involve the design and implementation of a new system that is to be made interoperable with an intentionally designed SoS. The last three cases, however, are increasingly problematic, since they involve various combinations of retrofitting existing systems and achieving interoperability within ad hoc, nonintentionally designed SoS. Many real-world initiatives exemplify the last, most challenging case.

For an SoS, interoperability often serves as a surrogate for integration when independently developed, standalone systems are combined to provide a new capability in an SoS. In many cases, the SoS is not intentionally designed as an SoS. Rather, it is constituted and dynamically configured as needed. While an ad hoc SoS of this kind can create useful new capabilities, it can place significant stress on the component systems which, in general, were not designed to work with each other. Even for an intentionally designed SoS, it is challenging to make its component systems interoperate. Clearly, when an ad hoc SoS is created out of existing standalone systems, the challenge is considerably harder. Even so, interoperability is easier to achieve in such cases than integration, and in fact it can offer some of the same advantages as integration while providing potentially greater flexibility and scalability. In all cases, to make systems work together effectively, it is essential to recognize that interoperability is a cross-cutting concern, which must be implemented pervasively both within and across systems.

Interoperability and Semantics

System interoperability needs to address both syntactic interoperability and semantic interoperability. Syntactic interoperability is the ability of two or more systems to communicate and exchange data. In this case, specified data formats and communication protocols are key. Standards such as Extensible Markup Language (XML) and Structured Query Language (SQL) offer a means to provide syntactic interoperability. Syntactic interoperability is a prerequisite to semantic interoperability.
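As a minimal illustration of syntactic interoperability through an agreed XML format, the sketch below serializes and parses a simple message with Python's standard library; the message structure and field names are invented for illustration.

```python
# Minimal sketch of syntactic interoperability: two systems agree on an XML
# message format (the element and attribute names here are invented).
import xml.etree.ElementTree as ET

def build_status_message(system: str, downlink_rate_mbps: float) -> str:
    root = ET.Element("status", attrib={"system": system})
    ET.SubElement(root, "downlink_rate_mbps").text = str(downlink_rate_mbps)
    return ET.tostring(root, encoding="unicode")

def parse_status_message(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    return {
        "system": root.get("system"),
        "downlink_rate_mbps": float(root.findtext("downlink_rate_mbps")),
    }

message = build_status_message("relay_1", 2.0)
print(parse_status_message(message))  # both sides can read the agreed format
```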


Semantic interoperability is the ability of two or more systems to automatically interpret the meaning of exchanged information and to accurately produce useful results for the end users of both systems. For semantic interoperability, any two systems within an SoS have to agree on a common information exchange reference model that can be used by both. However, a standard model that has been adopted by an SoS community to facilitate interoperability among systems in that community is clearly preferable. Regardless, the content of an information exchange request must be unambiguously defined; that is, what is sent is consistent with what is understood. Specifically, for humans to collaborate and tools to work together, it is imperative that the information exchanged is both correct and meaningful [Mayk and Madni, 2006]. This requires compatibility among concepts, terms, processes, and policies. Such compatibility is essential for semantic interoperability.

The compatibility requirements for semantic interoperability encompass: terminology and controlled vocabularies; data definitions; units of expression; computational methods and assumptions used to produce results; conceptual and functional models; key policies (e.g., access, authentication, authorization, security, transparency, accountability, privacy); business process models; atomic transactions; interface definitions and procedure invocation; state and mode definitions and compatibility; and clear definition of limits and restrictions. If terminology and data are not aligned, there exists the likelihood of misinterpretation of exchanged information. If processes and procedures are incompatible, then it may be difficult or impossible to combine them to create new, integrated processes that implement new (composite) capabilities. If computational algorithms and information processing are not semantically compatible, the results are likely to be meaningless or, worse yet, misleading. If policies (e.g., privacy, access) are misaligned, interactions may be infeasible or in conflict with specific organizational policies.

From the foregoing, it is evident that semantic interoperability is desirable (and sometimes essential) for meaningful interactions among systems, applications, and organizations. It is also the case that semantic interoperability is a substantial, potentially costly undertaking (as a design goal) because of increased validation and verification complexity. Oftentimes, organizations or systems may have good business/technical reasons to maintain distinct semantics. In such cases, to assure interoperability, such distinctions need to be made explicit so that mechanisms can be developed that automatically map one set of semantics to another. It is worth noting that the mappings may not be one-to-one and may include semantics from one set that have no equivalent semantics in the other. Also, organizations may not wish to expose their internal semantics to system users (e.g., human users, other organizations). Therefore, to enable system users to understand and interact with a "black box" effectively, an organization must either modify its internal semantics, match external user semantics, or map internal semantics to external user semantics.
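Where two systems keep distinct internal semantics, the explicit mapping discussed above can be automated; the sketch below uses invented term names and a unit conversion to show one direction of such a mapping, including the case of a term with no agreed equivalent.

```python
# Illustrative one-directional semantic mapping between two systems that use
# different terminology and units (all names and conversion factors are invented).
TERM_MAP = {"dl_rate_bps": "downlink_data_rate_mbps",
            "ul_rate_bps": "uplink_data_rate_mbps"}
UNIT_FACTORS = {"downlink_data_rate_mbps": 1.0e-6,  # bits/s -> megabits/s
                "uplink_data_rate_mbps": 1.0e-6}

def translate(record: dict) -> dict:
    translated = {}
    for term, value in record.items():
        if term not in TERM_MAP:
            # The mapping need not be one-to-one; flag terms with no equivalent.
            raise KeyError(f"no agreed mapping for term '{term}'")
        target = TERM_MAP[term]
        translated[target] = value * UNIT_FACTORS.get(target, 1.0)
    return translated

print(translate({"dl_rate_bps": 2_000_000}))  # {'downlink_data_rate_mbps': 2.0}
```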


However, developing explicit semantics requires articulating the key assumptions and tacit knowledge shared by a community of practitioners with a common purpose and mindset. Invariably, these practitioners tend to be unfamiliar with formal methods for knowledge representation. Semantic representation approaches include Semantic Web 2.0 and XML-based tools such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL). However, current semantic representation tools do not provide mechanisms to support interoperability. Therefore, an SoS should include a semantic interoperability development process model to facilitate the development of the requisite semantics. Such process models can enable organizations to explicitly represent and align their processes with those of other organizations with whom they wish to interoperate.

"Designing In" vs "Retrofitting" Interoperability

Practically speaking, interoperability is achieved in one of two ways. It can be designed in (at system inception), or it can be retrofitted (after the fact). Clearly, designing interoperability into a system is easier, more effective, and less expensive. Yet the need for interoperability is not always apparent when a system is designed, leading to the need to achieve interoperability after the fact. While retrofitting interoperability may be worthwhile, it may come at an enormous cost. The cross-cutting nature of interoperability requires it to be infused into several aspects of a system (e.g., external interfaces, user interface, use of standards and communication protocols, internal processing, policies and procedures for data handling, and data access privileges). Modifying these aspects of an existing system to make it interoperable will typically require substantial effort. Also, since interoperability depends on shared semantics, retrofitting interoperability can require major redesign of the system's representation and algorithms, as well as realignment of policies. These considerations can potentially pose serious obstacles.

System Architecture Informs Interoperability Design

Ultimately, interoperability depends on the architecture of the SoS [Chen et al., 2008]. This architecture can be explicit or implicit, formal or ad hoc. When pre-existing or independently designed systems are required to interoperate in the absence of a common or reference architecture, the linkages between any two independently designed systems are invariably designed and implemented in ad hoc fashion, producing an ad hoc integration architecture. Ad hoc architectures typically lack coherence and scalability. Therefore, it is usually desirable to develop open architectures. However, in practice, the cost and schedule of the initial development might not seem justifiable to managers and customers. So, at a minimum, interoperability should be part of the design trade space and given serious consideration in defining the system's life cycle. Ideally, by adopting a reference architecture and explicit architectural framework,


and assuring that each system conforms to the framework, it becomes possible to achieve significantly greater interoperability among disparate systems. However, it is often impractical to enforce such a framework for pre-existing systems. In such cases, the result is an ad hoc SoS. In general, the architecture of an SoS, whether predesigned or ad hoc, can enhance or inhibit the ability of the component systems of the SoS to interoperate. In other words, some architectures promote interoperability, while others don't. A Service-Oriented Architecture (SOA) is an example of an architectural framework that promotes interoperability between its component services so long as they comply with the SOA paradigm and its tenets.

Interoperability and Reuse

Reuse is a key motivation for interoperability. The ability to compose an SoS out of reusable component systems has the potential to make the resulting SoS cheaper, more reliable, and easier and faster to develop. It also has the potential to make the resulting SoS more consistent, easier to maintain and support, and more flexible and scalable. Reusability and interoperability converge in strategies such as SOA, in which the behavior and interfaces of components are formally defined, thereby allowing them to interoperate with each other and also be reused as components of multiple systems. A reusable component must provide a function that is of general value to other components. In some cases, this can be achieved by offering a single, well-defined capability that is of general use, such as looking up names in a directory or registry. The need to balance the generality of reusable components against their ability to be customized (i.e., customizability) is a key challenge. There is no general strategy to ensure that components will achieve this balance. Therefore, each component must attempt to find its own "sweet spot" that balances generality against customizability in ways that depend on its function and its relationship to other components.

Moreover, as noted above, the services in an SOA are typically designed (or at least packaged) as reusable, interoperable components rather than as standalone systems in their own right. These services are envisioned as components of a larger system, rather than existing systems that are to be combined into an SoS. It is, therefore, unclear that the reusability of SOA services translates into reusability of interoperable systems in an SoS. Also, interoperability achieved through SOA can produce a testing nightmare unless strict constraints are placed on the interfaces, timing, and functionality. It might also become necessary to stipulate that SOA services are either stateless or at least have strict limitations on states and rules for state transition. Also, transactions need to be atomic to prevent side effects if the occurrence of faults prevents their completion.

Interoperability Pros and Cons

Interoperability offers a number of advantages, including: (a) increased flexibility, by allowing mixing and matching of systems; (b) creation of new capabilities, by composing new functions from existing ones; and (c) increased


cost-effectiveness, by allowing reuse of existing systems and capabilities [Rothenberg, 2008]. The mixing and matching of systems enables performing unanticipated/unprecedented tasks from new combinations of existing functions. This can be accomplished on the fly (for nonrepetitive tasks) or a priori in a principled manner for repetitive or recurring tasks. Similarly, interoperability can reduce the cost of creating new capabilities by allowing existing systems to be reused in multiple ways for multiple purposes. An unheralded advantage of interoperability is that it hides overall system complexity from users by creating the illusion of an integrated system.

Interoperability is usually accomplished through a common user interface, uniform semantics, and uniform policies and procedures. However, in the real world this is not always the case. In some cases, interoperability is achieved through somewhat "clumsy" means. In fact, one might question whether such systems can be claimed to be truly interoperable. One example of clumsy integration is that employed on the International Space Station (ISS). The ISS employs a low-speed 1553 interface between itself and any experiment bolted on to it. While there is also a high-speed Ethernet interface for downlink from an experiment, there is no direct uplink path. Thus, if an experiment needs the high-speed link, crew members have to put the data on a thumb drive and walk it to another computer that can perform data uplink over the Ethernet. While one could argue that this is an interoperable system, it is clearly not automated, which is typically what is implied by true interoperability.

Despite obvious advantages, interoperability also has some disadvantages stemming from the increase in technical complexity and "open" system design [Rothenberg, 2008]. In particular, issues of privacy and security arise when systems are made interoperable. As important, costs can escalate from having to make systems interoperable. Finally, interoperability adds technical complexity to system design in that new interoperability requirements are now imposed that the designer has to satisfy. Even so, the benefits of interoperability invariably outweigh the costs.

VIII. HUMAN-SYSTEMS INTEGRATION CHALLENGES IN SoS INTEGRATION

Humans within an SoS have to continually adapt to changes in configuration, modes, and levels of autonomy of the SoS based on changing mission requirements [Madni, 2010]. These behaviors tend to dramatically complicate the lives of humans and often result in human error. This is not surprising in that while humans exhibit adaptivity, the adaptation tends to be slow, not always complete, and occasionally not possible [Madni, 2010, 2011]. Also, humans tend to be poor at multitasking and context-switching, especially when context-switching frequency exceeds a maximum threshold. While several advances have been made in human systems integration [Booher, 2003], existing methods, processes, and tools are inadequate for integrating humans with adaptable SoS [Madni, 2010, 2011]. To fill this gap, research is needed to create new methods, processes, and tools that enable the development of dynamic HSI strategies for SoS. To this


end, HSI research needs to integrate, build on, and extend the existing body of knowledge in HSI to address the needs of adaptable systems. The existing body of knowledge in HSI is quite fragmented and addresses diverse topics such as information overload, dynamic attention re-allocation in multitasking and context switching situations, distributed decision making and team coordination stress, ability to back up automation when needed, and cognitive bias in shared human-machine decision making. Integrating this fragmented body of knowledge for a variety of use cases is a promising starting point for realizing the goal of integrating humans with adaptable systems.

IX. CONCLUDING REMARKS

An SoS is a composition of components that are themselves systems. The two characteristics that make an SoS different from other complex systems are the operational and managerial independence of its component systems. These characteristics imply that each component system has the ability to operate independently as well as contribute to the larger SoS as and when the need arises. An SoS tends to exhibit evolutionary development as intermediate systems are developed that perform useful functions and are integrated into larger systems. As such, an SoS typically evolves through stable intermediate forms [Maier and Rechtin, 2009]. Advances in system of systems integration (SoSI) have become critical today as SoS continue to grow in scale and complexity, and as humans continue to assume a variety of roles in SoS [Madni, 2010; Brown and Eremenko, 2009; Hively and Loebl, 2004; Rodriguez and Merseguer, 2010].

In this chapter, we have discussed the different definitions of SoS, key challenges in SoSI, and an underlying semantic model for SoS integration. We have identified external factors as potentially having a significant impact on SoSI. We have discussed SoSI in terms of requirements and interface definitions, emphasized the importance of C&A, and presented various forms that C&A can take based on a variety of factors. We identified the various stakeholders in SoSI along with their concerns, influences, metrics, and needed resources. We also presented various mechanisms (i.e., processes, procedures, tests, and inspections) needed for successful SoSI. We discussed interoperability in terms of advantages/disadvantages, examples, implications for SoS, and designing in versus retrofitting interoperability. In conclusion, we remind readers that SoSI continues to be a fertile area of research and development. This chapter should serve as a starting point for systems engineers who have the responsibility for planning and executing SoSI initiatives.

REFERENCES

Ackoff, R. L. (1971), "Toward a System of Systems Concept," Management Science, Vol. 17, No. 11, pp. 661–671.


Baldwin, W. C., Ben-Zvi, T., and Sauser, B. J. (2012), “Formation of Collaborative Systems of Systems through Belonging Choice Mechanisms,” IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, Vol. 42, No. 4, pp. 793–801. Baum, L. E., and Petrie, T. (1966), “Statistical Inference for Probabilistic Functions of Finite State Markov Chains,” The Annals of Mathematical Statistics, Vol. 37, No. 6, pp. 1554–1563. Biffl, S., Schatten, A., and Zoitl, A. (2009), “Integration of Heterogeneous Engineering Environments for the Automation Systems Lifecycle,” INDN 2009: 7th IEEE International Conference on Industrial Informatics, pp. 576–581. Bongard, J. (2009), “Biologically Inspired Computing,” IEEE Computer, Vol. 42, No. 4, pp. 95–98. Bonjour, M., and Falquet, G. (1994), “Concept Bases: A Support to Information Systems Integration,” Advanced Information Systems Engineering, Lecture Notes in Computer Science, Vol. 811, Springer, New York, pp. 242–255. Booher, H. R. (ed.) (2003), Handbook of Human Systems Integration, Wiley, New York. Brown, O. C., and Eremenko, P. (2009), “Value-Centric Design Methodologies for Fractionated Spacecraft: Progress Summary from Phase I of the DARPA System F6 Program,” AIAA SPACE 2009 Conference and Exposition, 14–17 September. Browning, T. R. (2001), “Applying the Design Structure Matrix to System Decomposition and Integration Problems: A Review and New Directions,” IEEE Transactions on Engineering Management, Vol. 48, No. 3. Cabri, G. (2007), “Agent-Based Plug-and-Play Integration of Role-Enabled Medical Devices,” In: De Mola, G. and Letizia Leonardi, L. (ed.) HCMDSS-MDPnP: Joint Workshop on High Confidence Medical Devices, Software, and Systems and Medical Device Plug-and-Play Interoperability, 7 July. Chen, D., Doumeingts, G., and Vernadat, F. (2008), “Architectures for Enterprise Integration and Interoperability: Past, Present and Future,” Computers in Industry, Vol. 59, No. 7, pp. 647–659. Chowdhury, M. W., and Iqbal, M. Z. (2004), “Integration of Legacy Systems in Software Architecture,” SAVCBS: Specification and Verification of Component-based Systems Workshop at SIGSOFT. Cummins, F. A. (2002), Enterprise Integration: An Architecture for Enterprise Application and Systems Engineering, John Wiley & Sons, New York. Dahmann, J. S., Lane, J., and Rebovich, G., Jr. (2008), “Systems Engineering for Capabilities,” CrossTalk—The Journal of Defense Software Engineering. Hively, L. M., and Loebl, A. S. (2004), Horizontal System of Systems Integration via Commonality, Oakridge National Lab, Oakridge, TN. Hoffman, S., and Kaplin, D. (eds) (1997), Human Exploration of Mars: The Reference Mission of the NASA Mars Exploration Study Team, NASA Special Publication 6107, NASA, Washington DC. Kahn, M. O., Sievers, M., and Standley, S. (2012), “Model-based Verification and Validation of Spacecraft Avionics,” Infotech@Aerospace Conference. Kapurch, S. J. (ed.) (2010), NASA System Engineering Handbook, DIANE Publishing, Collingdale, PA. Kotov, V. (1997), “Systems of Systems as Communicating Structures,” HPL-97-124, Hewlett Packard Company, Palo Alto, CA.


Krygiel, A. J. (1999), Behind the Wizard’s Curtain: An Integration Environment for a System of Systems, CCRP Publication Series. Madni, A. M. (2010), “Integrating Humans with Software and Systems: Technical Challenges and a Research Agenda,” INCOSE Journal of Systems Engineering, Vol. 13, No. 3. Madni, A. M. (2011), “Integrating Humans With and Within Complex Systems: Challenges and Opportunities,” (Invited Paper). “People Solutions,” CrossTalk—The Journal of Defense Software Engineering, May/June. Madni, A. M. (2012a), “Adaptable Platform-Based Engineering: Key Enablers and Outlook for the Future,” INCOSE Journal of Systems Engineering, Vol. 15, No. 1. Madni, A. M. (2012b), “Elegant Systems Design: Creative Fusion of Simplicity and Power,” INCOSE Journal of Systems Engineering, Vol. 15, No. 3. Madni, A. M., and Jackson, S. (2009), “Towards a Conceptual Framework for Resilience Engineering,” IEEE Systems Journal, Special issue on Resilience Engineering, No. 132. Madni, A. M., and Moini, A. (2007), “Viewing Enterprises as Systems-of-Systems (SoS): Implications for SoS Research,” Journal of Integrated Design and Process Science, Vol. 11, No. 2, pp. 3–14. Madni, A. M., and Sievers, M. (2013), “Systems Integration: Key Perspectives, Experiences, and Challenges,” INCOSE Journal Systems Engineering, Vol. 16, No. 4. Maier, M. W. (1990), “Architecting Principles for Systems-of-Systems,” Systems Engineering, Vol. 2, No. 1, pp. 1–30. Maier, M. W., and Rechtin, E. (2009), The Art of Systems Architecting, 3rd edn., CRC Press, Boca Raton, FL. Manthorpe, W. H. J. Jr. (1996), “The Emerging Joint System of Systems: A Systems Engineering Challenge and Opportunity for APL,” Johns Hopkins APL Technical Digest, Vol. 17, No. 3, pp. 305–313. Mayk, I., and Madni, A. M. (2006), “The Role of Ontology in System-of-Systems Acquisition,” Proceedings of the 2006 Command and Control Research and Technology Symposium, San Diego, CA, June 20–22. Mosterman, P. J., Ghidella, J., and Friedman, J. (2005), “Model-Based Design for System Integration,” Conference in Design. National Research Council. (2006), “Committee on Modeling and Simulation for Defense Transformation,” Defense Modeling, Simulation and Analysis: Meeting the Challenge: Meeting the Challenges, Washington, DC, The National Academies Press. Naudet, Y., Latour, T., Guedria, W., and Chen, D. (2010), “Toward a Systemic Formalization of Interoperability,” Computers in Industry, Vol. 61, pp. 176–185. Neches, R., and Madni, A. M. (2012), “Towards Affordably Adaptable and Effective Systems,” INCOSE Journal Systems Engineering, Vol. 15, No. 1. Nilsson, E. G., Nordhagen, E. K., and Oftedal, G. (1990), “Aspects of Systems Integration,” Proceedings of the First International Conference on Systems Integration, Morristown, NJ, April 23–26. Office of the Undersecretary of Defense for Acquisition. (2008), “Technology, and Logistics (OUSD AT&L),” Systems Engineering Guide for Systems of Systems, Washington, DC. Rabiner, L. (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings IEEE, Vol. 77, No. 2, pp. 257–286.


Rechtin, E. (1991), System Architecting: Creating and Building Complex Systems, Prentice Hall, Upper Saddle River. Rehage, D., Caol, U., Merkel, M., and Vahl, A. (2004), “The Effects on Reliability of Aircraft Systems based on Integrated Modular Avionics,” Computer Safety, Reliability, and Security, Lecture Notes in Computer Science, Vol. 3219/2004, pp. 224–238. Rodriguez, R. J., and Merseguer, J. (2010), “Integrating Fault-Tolerant Techniques into the Design of Critical Systems,” Architecting Critical Systems, Springer, Berlin, pp. 33–51. Rothenberg, J. (2008), Interoperability as a Cross-Cutting Concern, RAND Corporation, Santa Monica, CA. Sage, A. P., and Cuppan, C. D. (2001), “On the Systems Engineering and Management of Systems of Systems and Federations of Systems,” Information, Knowledge, Systems Management, Vol. 2, No. 4, pp. 325–345. Shisko, R., et al. (2015), “An Integrated Economics Model for ISRU in Support of a Mars Colony-Initial Status Report,” AIAA SPACE 2015: Conference and Exposition. Siemieniuch, C. E., and Sinclair, M. A. (2006), “Systems Integration,” Applied Ergonomics, Vol. 37, pp. 91–110. Stavridou, V. (1999), “Integration in Software Intensive Systems,” The Journal of Systems and Software, Vol. 48, pp. 91–104. Tolk, A., and Miguira, J. A. (2003), “The Levels of Conceptual Interoperability Model,” Fall Simulation Interoperability Workshop. Vuletid, M., Pozzi, L., and Ienne, P. (2005), “Seamless Hardware-Software Integration in Reconfigurable Computing Systems,” Design and Test of Computers, IEEE, Vol. 22, No. 2. Zeigler, B. P., and Mittal, S. (2008), “Towards a Formal Standard for Interoperability in M&S/System of Systems Integration,” Symposium on Critical Issues in C4I.

CHAPTER 2

Advances in Sociotechnical Systems

William B. Rouse, Michael J. Pennock, and Joana Cardoso
School of Systems and Enterprises, Stevens Institute of Technology, Hoboken, New Jersey

I. INTRODUCTION

A sociotechnical systems perspective is important when behavioral and social phenomena play a major role in the success of a technical system, typically an engineered system. In such situations, a complicated engineered system becomes part of a complex sociotechnical system. This chapter discusses the origins of this concept, contemporary perspectives, and methods and tools for addressing such systems.

Sociotechnical systems are pervasive. Our systems for energy, finance, healthcare, and transportation, as well as delivery of food and water, are laced with behavioral and social phenomena that have enormous impacts on the functioning and performance of these systems. Consequently, the engineering of these systems has to account for how people and organizations will use, and perhaps misuse, these systems. Without such an accounting, first-rate engineering will not necessarily lead to success.

In general, we need to understand the physical, human, economic, and social phenomena that affect the functioning and performance of sociotechnical systems [Rouse, 2015]. We also need strategies, methods, and tools for representing this understanding and, ideally, for using these representations computationally to better design these systems. Pursuit of this goal has a long history and is rife with both fundamental and practical challenges, which are addressed in this chapter.

This chapter proceeds as follows. The next section outlines the origins of sociotechnical systems research. The subsequent section provides contemporary views, particularly of the problem of jointly optimizing both the social and technical systems. The methodological challenges of such pursuits are explored in some detail. The next section addresses methods and tools for modeling sociotechnical systems, including discussion of a range of modeling paradigms and associated software tools. We conclude by summarizing the state of knowledge and practice in this area.


II. ORIGINS OF SOCIOTECHNICAL SYSTEMS RESEARCH

The roots of the concept of sociotechnical systems lie in the political, economic, and social convulsions of World War I, the Great Depression, and World War II. The extremes of these periods prompted reflection and research on the nature of work, management, and control. Understanding of the problems of these periods later led to the generalization of possible solutions to broader contexts and less angst-ridden circumstances.

A. PATH TO WORLD WAR II

At the end of the nineteenth century, nationalism in Europe was alive and well, as countries did their utmost to uphold their territorial interests. As the British, French, and German Empires claimed new colonies and built up their military capabilities, tensions increased. By the turn of the century, these empires had started to forge alliances: the Central Powers (Germany, Austria–Hungary, and the Ottoman Empire) and the Allies (Britain, France, and Russia). These alliances further increased tensions in Europe, bringing it to the brink of conflict. In June 1914, when a Serbian nationalist assassinated the heir to the Austro-Hungarian throne, Archduke Franz Ferdinand, Austria–Hungary declared war on Serbia. Russia joined Serbia and declared war on Austria–Hungary. The allied powers were drawn into the conflict, with Britain and France supporting Serbia and Germany protecting Austria–Hungary [Showalter, 2015].

The scientific and technological investments during World War I were substantial. While some of these advances, such as trench warfare and machine guns, had already been seen during the American Civil War, the availability and sophistication that such military technologies had reached became a major factor in the high number of casualties. Russia's participation in the war resulted in a major death toll at the hands of an industrialized Germany. In 1917, Russia's political and economic instability culminated in the February Revolution and the Bolshevik Revolution, which ended centuries of Russian Imperial control and led to the formation of the Soviet Union. Under the administration of Vladimir Lenin, a radical leader of the Bolshevik Party, Russia made peace with Germany and became the first Marxist state in the world.

In June 1919, World War I officially ended with the signing of the Treaty of Versailles, which reassigned German boundaries and required Germany to pay substantial reparations. The Allies led the negotiations and limited Germany's participation. The terms of the treaty, however, greatly displeased the German Workers' Party, or Nazi Party, founded in the same year. Its anti-Semitic beliefs soon captivated Adolf Hitler, who later became the party's leader [Showalter, 2015].

Following the war, the economic prosperity in North America and Europe greatly changed both the social and political environment. In Germany, however, the scenario was visibly different, as the reparations established by the Treaty of Versailles had put the country and its people under a huge debt burden.


On October 29, 1929, the crash of the New York Stock Exchange quickly and completely transformed the prosperity and optimism of the 1920s into a new decade of severe worldwide economic depression. Its devastating consequences included high rates of unemployment and poverty and a considerable decrease in international trade, tax revenues, and profits [Romer, 2015].

B. WORLD WAR II

As the economic depression dragged on, authoritarian regimes gained strength, most notably the Nazi Party in Germany. In 1933, Adolf Hitler was appointed Chancellor of Germany and, soon after, Führer (in Nazi Germany, the Führer was the uncontested, sole leader of the country). From the mid- to late 1930s, Hitler strengthened the country's military forces beyond the limits imposed by the Treaty of Versailles and signed strategic treaties with both fascist Italy and Japan against the Soviet Union. This was the first step in Hitler's master plan: dominate Europe, invade the Soviet Union, and use that immense territory for the expansion of the Aryan race—Lebensraum. In 1938 and 1939, Germany occupied Austria and invaded Czechoslovakia, respectively. Then, on September 1, 1939, Hitler's troops invaded Poland from the west, which prompted France and Great Britain, who had promised military support to Poland, to declare war on Germany: World War II had begun [Royde-Smith, 2015].

Between 1940 and 1941, the war moved west, giving Germany a tactical advantage over the countries it invaded. Much of this dominance resulted from Germany's warfare strategy of breaking through the enemy's lines of defense with tanks and planes in lightning attacks—a tactic known as blitzkrieg [Royde-Smith, 2015]. A key element of these attacks was the Panzerdivision, a self-contained, organized armored division consisting of a tank brigade and supporting battalions and units. The Panzerdivision represented one of the earliest influences on sociotechnical systems thinking [Trist, 1981].

In June 1944, the Allies began a colossal invasion of Europe with the deployment of thousands of soldiers in Normandy. Intensive aerial bombardment took place in February 1945 before the Allies invaded Germany. On May 8, 1945, an increasingly Soviet-occupied Germany finally surrendered. The defeat of fascism in Europe by the Allied forces and the mass murder of Jews by Nazi forces highlighted the need for social studies of National Socialism as a regime of terror [Gerhardt, 2011].

C. GROUNDBREAKING STUDIES DURING AND AFTER WORLD WAR II

Motivated by the rise of National Socialism in Germany and Communism in Russia, the American sociologist Talcott Parsons had, by the early 1930s, begun his studies on the structure of social action. His notes show that he associated dictatorship with social disorganization early on [Gerhardt, 2011]. In 1937, Parsons published The Structure of Social Action, a critical analysis of four main theories of social phenomena that, as a whole, provided the theory for the establishment of National Socialism as a successful structure for social action [Parsons, 1937; Gerhardt, 2011].


Despite its importance, however, the book did not receive much attention until the end of World War II. Parsons' Action Theory, a general study of society based on the principle of voluntarism (i.e., action based on free will) and the dismissal of positivism, was published in 1951 [Parsons and Shils, 1951]. Positivism assumes that society operates according to physical laws and rejects any kind of introspective and intuitive knowledge.

World War II also shed light on Ludwig von Bertalanffy's work on a system theory of life. A few years before the war (1934), von Bertalanffy concluded that the essence of life could not be explained by its distinct processes (Lebensvorgänge) but rather by a specific order among all the processes. As such, seeing the organism as a whole was fundamental to the study of biology. Important system concepts were derived from Bertalanffy's understanding of the organism: the open system in flux equilibrium, hierarchization, the primary activity of the organism, and, finally, equifinality. In spite of being used implicitly since the early 1930s, the concept of a system was only presented in 1945 as a set of elements in interaction [Drack, 2009]. The formalization of systems theory as a field of study preceded the Systems Movement [Rouse, 2015].

Operations research was another field of study that emerged from World War II. Inspired by Patrick Blackett's success with operational research in Britain, Washington recruited a team of civilian scientists in 1942 to study and suggest improvements to antisubmarine warfare. Philip Morse, an outstanding physicist, led a team of highly motivated scientists who, in a short period of time, delivered impressive results: by accompanying the US Navy in its field activities, these scientists were able to ensure accurate data gathering and to learn, in real time, about naval operations as well as the success or failure of the Operations Research team's recommendations. After the war, Morse and other scientists were committed to expanding the Operations Research field to nonmilitary applications [Little, 2002].

Yet another very important field spawned by World War II research was cybernetics—the study of human–machine interaction guided by the principles of feedback, control, and communication. Norbert Wiener, a distinguished mathematician, began working on this subject while addressing a pressing problem for the National Defense Research Committee in 1940. This research project aimed at predicting the flight path of an evading aircraft. Wiener's solution to the problem, which relied on predicting the future value of a pseudo-random function from its statistical history, was too complex to be of practical use during the war; however, contemporary theories of optimal estimation and signal processing built upon his work. In 1942, Wiener explored the idea of having the human operator as a feedback element of a system, extending his work on control and communication to biological, physiological, and social systems. Wiener's book Cybernetics: Or, Control and Communication in the Animal and the Machine, published in 1948, laid the foundation for the field of cybernetics [Mindell, 2000].


D. ERA OF TASK-DOMINATED VIEWS

Increasing mechanization during the 1950s and 1960s failed to produce a corresponding increase in productivity. By the turn of the decade, the British mining industry was greatly affected by internal industry conflicts, high rates of turnover, and absenteeism. The National Coal Board, concerned by this grim scenario, asked the Tavistock Institute to conduct a comparative study of low-producing, low-morale mines versus those where both productivity and morale were high. The Tavistock Institute ran two main field projects: the first focused on the interactions among a group of people at different levels of a single organization; the second addressed the question of innovative work habits and organizational structures as a way to increase productivity. The Haighmoor seam, whose levels of productivity and absenteeism were unexpectedly good, was chosen as the experimental site [Trist, 1981].

Ken Bamforth, one of the postgraduate fellows at the Tavistock Institute and a former miner at Haighmoor, revisited the site in search of the insights behind the seam's productivity. At that time, the long-wall mining method was prevalent, resulting in one-man/one-task roles and external control and supervision. At the Haighmoor seam, however, improved roof control made viable the short-wall mining technique characteristic of earlier, unmechanized periods. Bamforth noticed that unexpected work practices were being employed at the seam: cohesive, self-regulating working groups would alternate responsibilities and shifts with minimal supervision, with remarkable levels of collaboration and engagement. Bamforth's observations would soon lead to the emergence of sociotechnical systems at the Tavistock Institute. von Bertalanffy's system theory—in particular the principles of interdependency, openness, and boundary conditions for social systems—set the context for further investigations. In spite of the positive results at Haighmoor, there was a clear need for more studies at other sites and also at the organizational and macrosocial levels [Trist, 1981].

Significant knowledge on the topic was gathered, and the potential returns for organizations became clear. Still, fears over the full mechanization of the British mines, one of the Division Board's top priorities, seem to have prevented the adoption of these practices. The industry ignored the facts and kept its focus on one-man/one-task units, inflexible top-down management, and external regulation. Further studies on self-regulating groups in fully mechanized mines and on the design of sociotechnical systems for the best technology available followed. The conclusions, similar to those previously found, did not, however, alter the National Coal Board's directives, whose focus was now on closing nonprofitable mines. The Tavistock Institute extended these studies to other countries and industries, such as textiles and automotive manufacturing, only to draw similar conclusions. Despite sporadic signs of interest in new modes of operation, resistance to change was clear across industries—a consequence of the dominant mindset in the postwar period: competition over collaboration [Trist, 1981].


Fred Emery, an Australian social scientist working at the Tavistock Institute, was determined to prove the emergence of a new work paradigm. His detailed studies in 1959 proposed the first generalized model of the dimensions of sociotechnical systems [Emery, 1959]. The relationship between the social and technical areas, initially framed as finding the best possible match between the two, was later revised to a joint optimization of the social and technical systems [Ackoff and Emery, 1972; Trist, 1981]. In 1962, while working on the Norwegian Industrial Democracy project, Emery suggested that little progress could be made on sociotechnical systems unless major changes at the macrosocial level occurred. Norway's environment seemed to provide a great experimental platform for the validation of Emery's hypothesis: the economy had slowed down, major companies had gone bankrupt or had been acquired by multinational firms, and modernization was sparse compared to other Scandinavian countries. Emery, in collaboration with the Technical University of Norway, was asked to look at directors' performance and their influence on workers. The first studies revealed that, despite being valued as an enhancement to democratic control, the presence of the directors had little effect on workers' performance and morale. The follow-up recommendation was to secure direct participation of workers in the decisions made at their own levels. Unsurprisingly, the diffusion of this practice into Norwegian industry did not follow [Trist, 1981].

At the end of the decade, this diffusion would take place in Sweden, where a new generation of well-educated Swedes refused to take on unskilled jobs, as evidenced by the absenteeism and turnover rates for such positions. Immigration from Southern Europe created new social problems, and Swedish managers and unions tailored the Norwegian approach for their own purposes. At the macrosocial level, attention turned to the representation of workers on Swedish boards of management.

E. MATURING OF SOCIOTECHNICAL SYSTEMS

By the beginning of the 1970s, the increasing interest in systems was clear. In 1971, C. West Churchman suggested the inclusion of ethical values into operating systems [Rouse, 2015]: he believed that problems should be tackled using a systems approach and that one should not restrict oneself to the problem at hand, but should also explore its underlying premises and the impact that the proposed actions have on humankind [Balderston et al. 2004]. His vast work was centered on the connections among planning, systematic thinking, action, and ethics. No less important was Churchman's contribution to the advancement of Operations Research. Russell Ackoff, one of Churchman's doctoral students, was central to the development of Operations Research methods, concepts, and techniques [Ackoff, 1956]. Their famous book Introduction to Operations Research, published in 1957 and coauthored with Leonard Arnoff, synthesized their main insights on the field [Churchman et al. 1959].


Interestingly, later in his career Ackoff was recognized for his criticism of so-called technique-dominated Operations Research and his advocacy of more participative approaches [Rouse, 2015]. In his opinion, the purpose of Operations Research should be the search for the best decision possible with respect to the whole organization [Churchman et al. 1959]. As such, an interdisciplinary approach to decision-making, as well as an interdisciplinary team, was fundamental. As Operations Research focused more and more on optimization and objectivity, its applicability became limited to narrow-scope problems—the fundamental point of criticism in Ackoff's opinion [Ackoff, 1979]. His open criticism of the course of the discipline was published in a series of papers in the 1960s and 1970s [Kirby and Rosenhead, 2005].

As Ackoff distanced himself from the field of Operations Research, he embraced the Systems Movement. Ackoff believed that the principles of reductionism and mechanism, and the associated analytical thinking, were being supplemented by the practices of expansionism and teleology and their associated systems-based thinking [Ackoff, 1974]. To put it simply, a transition from the machine age to the systems age was under way. And, in this new age, one of the most important concepts would be Ackoff's idea of a purposeful system: one whose members, having both individual and collective purposes, agree on collectively formulated objectives [Ackoff and Emery, 1972].

As the main ideas behind sociotechnical systems matured, a transition from the old to the new organizational paradigms ensued. The old technocratic and bureaucratic principles were replaced by socioecological and participative principles. These principles equipped systems with the flexibility needed to adapt to uncertain and complex environments, something that technocratic bureaucracies failed to accomplish. Between 1978 and 1981, more and more plants with fewer levels, functions, and staff were developed. Problem-solving became a collective responsibility, highlighting the principle of collaboration as opposed to the competitive mindset that prevailed during the postwar period. People's intrinsic values, such as the need for variety (as distinct from novelty) and challenge, continuous learning, discretion and autonomy, recognition and support, and a meaningful social contribution, were identified as and linked to the ideal characteristics of industrial jobs. The adoption of self-regulated autonomous groups became a synonym for efficacy and flexibility; external supervision was, therefore, only used for issues outside the group's control. Macrosocial developments such as the microprocessor revolution, decentralization, technological choice, and networks were also studied and placed within the context of sociotechnical systems. Clearly, social aspects were being considered from the very early stages of new plant developments, in a quest for joint optimization of the technical and social parts [Trist, 1981].

F. TRANSITION TO CONTEMPORARY VIEWS

During the 1990s and 2000s, the concept of system of systems received considerable attention. A system can be termed a system of systems when its components have their own purposes and continue to pursue them if separated from the overall system; moreover, the components are managed in relation to their own purposes rather than those of the whole [Maier, 1998].


Despite its rather recent popularity, the term system of systems was actually coined in 1971 by Russell Ackoff [Ackoff, 1971]. In his work Towards a System of Systems Concepts, Ackoff recognized that organizations could be perceived as systems. He suggested that systems may or may not vary over time, resulting in static, dynamic, or homeostatic systems. He clarified changes in the concept of a system in terms of reactions, responses, and acts; system outcomes, he explained, are not the response variables themselves but, instead, the consequences of system responses. He also derived a behavioral classification of systems as state-maintaining, goal-seeking, multi-goal-seeking, or purposive. Finally, Ackoff discussed the ideas of adaptation and learning and suggested that organizational systems may be variety decreasing or variety increasing [Ackoff, 1971; Rouse, 2015].

More recently, social studies have received considerable attention. Researchers strive to understand and model individual behaviors and their impact on systems and organizations. In 1991, Kathleen Carley studied how individual behavior affects the stability of a group. Her work built upon the premise that interaction leads to shared knowledge and vice versa [Carley, 1991]. Carley's organizational studies also include the impact of high turnover rates on organizations: she suggests that, even though teams learn faster than hierarchies, the latter are less affected by high turnover rates than the former [Carley, 1992]. In 2001, Carley, in collaboration with Wallace, worked on the development of computational models and techniques in the context of organizational analysis [Carley and Wallace, 2001]. Also studying individual and collective behavior, Rebecca Goolsby suggests that the structure of individual interactions points toward an understanding of collective patterns [Goolsby, 2009].

Another increasingly popular trend is the study of social media as a platform for communication, wide-scale interaction, and collective sharing of information. One such example is the use of social media to inform people during natural disasters. Jeannette Sutton suggests that the use of social media in the disaster arena is becoming commonplace in spite of authorities' reservations about the legitimacy of the information being shared [Sutton et al. 2007]. Rebecca Goolsby adds to this discussion with her study of ad hoc communities, which resort to social media to generate community crisis maps [Goolsby, 2010]. The increasing use of social media and other online platforms leads to the availability of large-scale, digitized information on social behaviors and social phenomena. At the same time, the combination of computational and social sciences allows us to analyze and derive theoretical conclusions from a vast array of information on social phenomena. These conclusions, in turn, may be used as input to policy decisions. A recent manifesto on Computational Social Science addresses the goals and challenges of this field and its likely impact on science, technology, and society [Conte et al. 2012].


III. CONTEMPORARY VIEWS OF SOCIOTECHNICAL SYSTEMS

If we view the core objective of the analysis of sociotechnical systems as the joint optimization of the social components and the technical components, then what is of interest is any development that furthers that objective. In recent years, the increasing availability and decreasing cost of substantial computational power, coupled with the increasing availability of large datasets made possible through modern information systems, have opened up (or at least made more practical) new avenues of research and analysis in sociotechnical systems. In particular, computational power allows for the development of large-scale simulations of sociotechnical systems, while the increasing availability of data allows these simulations to be parameterized and validated. This has led to an apparent increase in interest in the development of models that are composed of different views or abstractions of the system of interest. This is sometimes referred to as multilevel, multimethod, or multiresolution modeling.

While such models would seem to facilitate the joint optimization of a sociotechnical system, the study of complex systems has revealed the limitations of optimization in the face of the inherent complexity of social systems. In particular, the tendency of social systems to change and adapt means that any "optimal" solution has the potential to be quite fragile. Consequently, while multilevel approaches to modeling sociotechnical systems will certainly generate insights, a risk-based approach is necessary to deal with the substantial risks of model error.

To explore these trends in greater detail, we will first consider the emergence of computational social science. Then we will discuss how the integration of computational social modeling with traditional computational models of physical phenomena can be applied to study sociotechnical systems. Next, we will consider the implications of social complexity and how it limits the inferences we can draw from multimethod modeling. Finally, we will discuss how managing the resulting model risk pushes us away from optimizing sociotechnical systems and toward influencing them.

A. COMPUTATIONAL SOCIAL SCIENCE

While mathematical modeling has long been the standard in the physical sciences, the extensive application of mathematical modeling to the social sciences has been challenging due to the inherent complexity of social systems. This is not to suggest that social scientists do not construct mathematical models. They do. Economics, for instance, is extremely mathematical. The issue, however, is that to make such models tractable, one often has to focus on equilibrium behavior and make substantial simplifying assumptions. Since social systems are seldom in equilibrium, such models miss a substantial portion of social system behavior.

In recent decades, the increasing availability of inexpensive computational power and the development of agent-based simulation have allowed social scientists to explore nonequilibrium behavior in greater depth. As noted earlier, this is sometimes referred to as computational social science.


These simulations allow social scientists to study how the interactions among members of a social system give rise to the emergent behaviors observed in the real world. Another useful feature of agent-based models is that they can be both micro- and macro-validated [Moss and Edmonds, 2005]. The simulated behavior of individual agents can be compared to accounts of individual human behavior, and the aggregate output of the simulation can be compared with the aggregate behavior of the population. In addition, agent-based models have been shown to produce statistical characteristics similar to those of empirical time series data of social phenomena (e.g., leptokurtosis) [Moss and Edmonds, 2005]. Given these advantages, agent-based models have been used to experiment with a number of population-based phenomena. Representative examples include civil violence [Epstein, 2002], domestic water consumption during a drought [Moss and Edmonds, 2005], stock markets [Palmer et al. 1994; LeBaron et al. 1999], traffic and transportation [Chen and Cheng, 2010], and traffic during urban evacuations [Chen and Zahn, 2008].

The emergence of widespread data collection (i.e., big data) and social media has further increased the potential of computational social science. This wealth of data not only provides a means to parameterize and validate simulations of social systems, but it also provides insights into the structure and behavior of social systems. The previously cited work of Conte et al. [2012] extensively discusses both the potential impact on computational social science and the associated challenges. They also highlight the challenge of multilevel interactions in social systems and the associated challenge of understanding downward causation through the levels. We will discuss some of the practical difficulties associated with multilevel modeling in the next few sections.
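Before turning to multilevel modeling, the following minimal sketch may help make the agent-based paradigm concrete. It is purely illustrative—a simple threshold-imitation rule of our own choosing, not any of the models cited above: each agent adopts a behavior once enough randomly sampled peers have adopted it, and the aggregate adoption curve is the macro-level observable against which such a model could, in principle, be validated.

```python
import random

# Illustrative agent-based model: agents adopt a behavior when the fraction
# of sampled peers who have already adopted exceeds a personal threshold.
class Agent:
    def __init__(self, threshold):
        self.threshold = threshold  # micro-level rule parameter
        self.adopted = False

def run(num_agents=500, peers=8, steps=50, seed=1):
    rng = random.Random(seed)
    agents = [Agent(rng.uniform(0.0, 1.0)) for _ in range(num_agents)]
    for a in rng.sample(agents, num_agents // 20):   # seed a few early adopters
        a.adopted = True
    history = []
    for _ in range(steps):
        for a in agents:
            sample = rng.sample(agents, peers)       # randomly sampled "neighbors"
            frac = sum(n.adopted for n in sample) / peers
            if frac >= a.threshold:                  # micro rule: imitate peers
                a.adopted = True
        # macro observable: aggregate adoption level for macro-validation
        history.append(sum(a.adopted for a in agents) / num_agents)
    return history

if __name__ == "__main__":
    print(run()[-1])
```

Even in this toy setting, the micro rules (individual thresholds) and the macro output (the adoption curve) are distinct quantities, which is what makes separate micro- and macro-validation possible.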

B. MULTILEVEL MODELING

In the area of sociotechnical systems, we are interested in the simultaneous consideration of both the technical system and the social system. While computational modeling has a long history of application to technical systems, social systems have been more problematic. Consequently, there was a tendency to limit the representation of human behavior in technical systems to highly constrained scenarios. For example, an aircraft pilot must behave in certain ways or else the aircraft will crash.

At first glance, the emergence of computational social science would seem to be a remedy to this problem. For any given social system, one could combine a computational model of the technical system with a computational model of the social system. In other words, we could build a multilevel model. Finding the joint optimal solution for a sociotechnical system then becomes an exercise in computational optimization. At minimum, a heuristic-based search should be able to find a reasonable solution. As it turns out, the situation is not quite so simple, but first let us consider some applications of multilevel models to real-world sociotechnical problems.


Park et al. [2012] developed a multilevel simulation to examine policy alternatives for an employer-based prevention and wellness program. Their model consists of four levels: ecosystem, organization, process, and people. The ecosystem level consists of a set of parameters that can be adjusted to support "what if" analyses, including the economic inflation rate, the healthcare inflation rate, and the payment system employed. The organization level also consists of a set of decision parameters that can be adjusted to explore policy options, including the entering age, the risk thresholds for medical conditions, and the program length per participant. The process level is modeled using a discrete event simulation of the process that patients follow as they move through the care system. Finally, the people level is modeled using an agent-based approach to represent patient behavior. Park and his colleagues then used this simulation to examine the impact of different operational policies on the outcomes of the prevention and wellness program. One of the findings was that the financial objectives of the employer and the healthcare provider are in conflict and should not be independently optimized because "if either loses significantly, the system becomes dysfunctional."

Another example, SPLASH, is an effort by IBM Research to develop a framework for loosely coupling models from different domains to support decision-making regarding complex sociotechnical systems [Barberis et al. 2012]. While SPLASH is not intended purely for healthcare applications, the team presented an example model that considered the impact that the placement of a health food store had on obesity. In the example, four different models are composed: VISUM, a proprietary off-the-shelf transportation model; an agent-based simulation of buying and eating; a discrete event simulation of exercise; and a differential-equation-based model of body mass index (BMI). SPLASH mainly focuses on achieving a methodologically valid combination of different modeling formalisms by developing methods to coordinate inputs and outputs, synchronize time, and so on.

Finally, Yu et al. [2011] developed a computational model to analyze enterprise transformation. Their model consisted of four interacting components: management, production, market, and social network. They then used this model to analyze decision-making under different market conditions. What they found was that under a high level of market uncertainty, an enterprise's best option is to preserve resources, a behavior that was in fact observed in corporate actions following the financial crisis of 2007–2008.

C. BUILDING MULTILEVEL MODELS

While seemingly reasonable in principle, building multilevel models has proven to be challenging. The issue is that simply exchanging data between two models is not sufficient to achieve valid results. Rather, one needs what Tolk [2003] terms conceptual interoperability.


In the modeling and simulation community, interoperability is sometimes considered in terms of modeling ontologies. The term ontology originated in philosophy and refers to the study of what exists. The concept has been adapted in the domain of computer science, as well as the domain of modeling and simulation, to refer to a formal specification of concepts and the relationships between them within a particular knowledge domain [Hofman, 2013]. Hofman [2013] notes two types of ontologies in the world of modeling and simulation: methodological and referential. Methodological ontologies refer to specifications of model formalisms, for example, discrete event simulation. Referential ontologies capture knowledge about the parts of the real world that are being modeled, for example, machines, workers, raw materials, power, etc. In other words, a methodological ontology should capture techniques for modeling, and a referential ontology should capture what objects and phenomena in the world are being modeled. The challenges in multilevel modeling of sociotechnical systems arise when one tries to establish consistent referential ontologies.

To better illustrate the use of ontologies for composition, imagine you wish to develop a simulation, composed of "off-the-shelf" models, that could be used to train a pilot to fly an aircraft. Constructing a referential ontology would mean capturing the salient physical features of the aircraft, the pilot, and the relationship between them. One might imagine that there would be different referential ontologies for different classes of aircraft. For example, an airliner behaves very differently in the air than a fighter aircraft. As for methodological ontologies, the flight characteristics of the aircraft might be modeled using differential equations, while the arrival of sudden events such as in-flight emergencies might be modeled using a discrete-event formalism.

Suppose that you have two off-the-shelf models: one that models the discrete arrival of in-flight emergencies and one that models the continuous behavior of the aircraft in flight. Your ability to successfully compose these two models depends on their consistency with the appropriate methodological and referential ontologies. First, let us consider the methodological ontologies. If your flight model conforms to an approach called the differential equation system specification (DESS) and your emergency event model conforms to an approach called the discrete event system specification (DEVS), then there is a valid mechanism for combining the two modeling formalisms. Zeigler et al. [2000] devoted an entire book to explaining how to accomplish this composition. As Zeigler et al. [2000] point out, when we combine formalisms in this manner, we are essentially creating a new formalism that subsumes them. So by employing well-defined methodological ontologies and following certain rules, we can create a new, methodologically valid formalism. But this is not the whole picture. If the referential ontologies are misaligned, composition may still fail.
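As a purely illustrative sketch of this kind of composition (not an implementation of the DEVS/DESS formalisms, and with all rates and dynamics invented), the following code couples a continuous-state model—a simple differential equation for altitude integrated with Euler steps—with discrete in-flight emergency events arriving as a Poisson process:

```python
import random

# Toy hybrid simulation: continuous altitude dynamics plus discrete,
# randomly arriving emergency events. All parameters are illustrative.
def simulate(duration=3600.0, dt=1.0, emergency_rate=1/1800.0, seed=2):
    rng = random.Random(seed)
    altitude, climb_rate = 0.0, 5.0                 # continuous state (m, m/s)
    next_event = rng.expovariate(emergency_rate)    # time of next discrete event
    t, events = 0.0, []
    while t < duration:
        if t >= next_event:                         # discrete transition: emergency
            climb_rate = -2.0                       # e.g., the crew starts a descent
            events.append(t)
            next_event = t + rng.expovariate(emergency_rate)
        # continuous transition: d(altitude)/dt = climb_rate, Euler step
        altitude = max(0.0, altitude + climb_rate * dt)
        t += dt
    return altitude, events

if __name__ == "__main__":
    final_altitude, emergencies = simulate()
    print(final_altitude, len(emergencies))
```

Even in this toy setting, the composition is only meaningful if the event model and the flight model refer to the same kind of aircraft—the referential alignment issue taken up next.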


For example, if the in-flight emergency model represents the incidents that one would expect while flying a fighter and the flight model represents the behavior of an airliner, the final composed simulation is not going to be representative of what the pilot would experience in the real world. While this example is a fairly obvious case of misalignment chosen for illustrative purposes, one can imagine more subtle misalignments that could be difficult to detect without a full understanding of the referential ontologies. For example, what if, instead, the two models represented two different variants of the same airliner? Would they be close enough to compose? Consequently, it is entirely possible to combine two off-the-shelf models in a methodologically valid way but have the composition fail because it does not represent the real world in a consistent fashion. Sufficient alignment of both methodological and referential ontologies is necessary to achieve interoperability.

As we add more social phenomena into the picture, the situation becomes even more challenging. For example, it has been observed that culture affects how flight crews handle emergencies (e.g., Western crews tend to be more heterarchical while Asian flight crews tend to be more hierarchical). When we are trying to train the pilot, should we have a culture-specific model of how the other members of the flight crew will respond?

Hofman [2013] observes that "For many technical domains and artificial systems, ontologies will be able to ensure the interoperability of simulation components developed for a similar purpose under a consensual point of view of the world." So it would seem that the use of standardized ontologies could be a useful means to facilitate the composition of modeling formalisms. Unfortunately, when humans come into play, the situation becomes more challenging. Hofman also notes that "...it might be difficult, if not impossible, to capture many purposeful human actions in formal systems—including referential ontologies." He goes further to state that "...in many socio-technical and most social domains the specification of such 'well defined' domain ontologies (referential ontologies) will be impossible. ... Hence, in these cases, there is no easy mapping possible between referential and methodological ontologies. ... This mapping, if possible at all ... would not be a technical matter, but a challenging and subjective task of selection."

The difficulties with creating referential ontologies in the sociotechnical domain that Hofman cites can be summarized as follows: human beings are capable of dealing with logical contradictions by bypassing formal logic, but a formal ontology is designed to be a logical representation. Furthermore, there are many possible representations of a social system, with little agreement among experts in the domain. To complicate matters further, the appropriate representation often depends on the problem being addressed, which entangles what is being modeled with how it is being modeled. This leads Hofman to assert that there is no independence between referential and methodological ontologies in the social domain.

While this may seem to be a fairly technical argument about building simulations, it actually has some fundamental implications for how we can understand sociotechnical systems, which we will address in the next section.


It is important to note that there are ongoing research efforts trying to address this problem using ontologies [see Hofman et al. 2011, 2013; McGinnis et al. 2011; Partridge et al. 2013; Tolk and Miller, 2011]. In particular, some are applying a branch of mathematics known as model theory in an attempt to create a rigorous understanding of the situation (see Tolk et al. [2011, 2013] and Diallo et al. [2013] for more information). But as Pennock and Rouse [2014] indicate, there is not likely to be any general solution to this problem, and the best we can hope for is to understand when we can and when we cannot build multilevel models of sociotechnical systems. It is important to point out, though, that this does not mean that one cannot draw any inferences about a sociotechnical system just because one cannot computationally combine models. Rather, we can still use disconnected models to inform strategies beyond just joint optimization of the sociotechnical system.

D. COMPLEXITY IN SOCIAL SYSTEMS

In recent decades, a field called complexity science has emerged. While there are various interpretations of complexity, social systems would almost invariably be considered complex regardless of the precise definition. What is interesting from the point of view of sociotechnical systems is how the findings of complexity science inform the modeling and study of sociotechnical systems. This also has implications for achieving conceptual interoperability when building multilevel models of them. While a full discussion of the entire field of complexity is outside the scope of this chapter, for our purposes it is sufficient to highlight some of the key properties of complex systems as described by Helbing and Lämmer [2008]:

• History-dependence
• Multiple local optima
• Instability and cascading effects
• Guided self-organization is better than control

In short, if we accept that social systems are complex, then the implication is that they are 1) difficult to model accurately and 2) difficult to optimize. This would seem to call into question the objective of joint optimization of sociotechnical systems. If we have trouble modeling and optimizing the social component of a sociotechnical system alone, how are we supposed to jointly optimize it with the technical component? In particular, Helbing and Lämmer's observation that guided self-organization is better than control is a reflection of that problem. It has been observed that complex systems often seem to resist deliberate attempts to change them while changing dramatically following a seemingly insignificant trigger event. Consequently, trying to optimize a complex system can be counterproductive. Rather, it is sometimes better to guide a complex system's natural tendency to adapt.


If we accept this notion, then the next logical question would be how to guide self-organization. Presumably we would like to build models of a sociotechnical system to explore the efficacy of different guidance mechanisms, but now we are back to the problem that it is difficult to build models of complex systems. Let us explore this issue in a little more depth.

Casti [1985] considers one possible measure of complexity to be the number of independent models required to represent a system. Consequently, the more complex the social system, the greater the number of models required to represent it. This assertion would seem to be consistent with Hofman's observations about the proliferation of referential ontologies within social domains. The implication is that when we attempt to model a complex sociotechnical system, we may be forced to represent it using multiple, possibly incompatible models.

In their work "Social Science as the Study of Complex Systems," Harvey and Reed [1997] proposed a hierarchy of ontological complexity in social systems (see Table 1). This hierarchy starts from the "regularities of the physical universe" at the bottom and ends with the "evolution of social systems" at the top. The complexity of each layer is greater than the one below. According to Harvey and Reed, the consequence is "several epistemological breaks as we move along this abstraction dimension." In other words, as we move from the physical sciences to the social sciences, the increasing complexity limits what we can know. To accommodate this, Harvey and Reed proposed levels of modeling abstraction that map to the appropriate ontological levels. Their modeling levels include predictive modeling, statistical modeling, iconological modeling, structural modeling, ideal type modeling, and historical narratives. It is important to note that in their analysis, predictive modeling only applies to the lower, more physical levels of the ontological hierarchy. It also means that a model of a social system may be fundamentally incompatible with a model of a technical system. It should not be surprising that a physics-based model is not computationally composable with a historical narrative. The point here, however, is that there are many cases short of this extreme in which the models in question may seem superficially compatible but are ontologically incompatible at a deeper level. For these reasons, Pennock and Rouse [2014] assert that computationally connecting theoretical constructs from different disciplines, as one would like to do when jointly optimizing a sociotechnical system, will only work for a small subset of cases.

There is one final impact of complexity that is relevant to this discussion. Earlier we mentioned the increasing availability of large-scale datasets as a driver in the study of sociotechnical systems. These provide a potentially valuable resource for understanding the structure of sociotechnical systems as well as for parameterizing and validating models of sociotechnical systems. However, the complexity of many sociotechnical systems complicates the extraction and validation of such models from data.

TABLE 1  HIERARCHY OF COMPLEXITY IN SOCIAL SYSTEMS VS MODELING APPROACHES

Hierarchy of complexity in social systems (bottom to top): regularities of the physical universe; evolution of living phenotypes; ecological organization of living phenotypes; ecological emergence of designed organizations; social infrastructure of organizations; division of labor in productive activities; distribution of material rewards and esteem; intraorganizational allocation of roles and resources; personal conformity and commitment to roles and norms; interorganizational allocations of power and resources; evolution of dominant and contrary points of view; cultural dominance and subcultural bases of resistance; competition and conflict among social systems; evolution of social systems over decades, centuries, etc.

Modeling approaches (in order of increasing abstraction): predictive modeling, statistical modeling, iconological modeling, structural modeling, ideal type modeling, historical narratives.

Source: Adapted from Harvey and Reed (1997).


In particular, many complex sociotechnical systems were not designed. Rather, they emerged. For example, one may need to infer an emergent process from millions of transaction records. Extracting causal models from datasets collected from complex systems is a nontrivial endeavor for all of the reasons discussed previously. In fact, some researchers believe that complex systems do not have clear-cut cause-and-effect relationships [Poli, 2013]. In recognition of this challenge, DARPA has initiated the Big Mechanism program. Its objective is to develop methods to extract causal pathways from individually studied phenomena [DARPA, 2015]. One such problem they are considering is the extraction of causal pathways for cancer through the analysis of the cancer biology literature.

What we can draw from this discussion of complexity in social systems is that efforts to develop multilevel models of sociotechnical systems could incur a great deal of model risk (i.e., the risk that the model is incorrect). Consequently, attempting to jointly optimize a computational model of a sociotechnical system risks generating an invalid and potentially fragile solution. A more promising approach is to attempt to guide the adaptive tendencies of the sociotechnical system while carefully managing the associated model risk.

E. MANAGING MODEL RISK

An important aspect of model risk for sociotechnical systems is that, given their complexity and adaptive nature, it may be difficult, if not impossible, to capture empirical probability distributions of such risks. Even worse, there are likely to be unknown unknowns. Under such circumstances, one approach to managing risk is to employ dynamic strategies.

Here a concept from the management literature is instructive. Snowden and Boone [2007] developed the Cynefin framework to provide a context-dependent approach to decision-making for leaders. The framework consists of five domains: simple, complicated, complex, chaotic, and disordered. Depending on the domain in which leaders find themselves, they should employ the associated approach to problem-solving. Employing the wrong approach for the context is likely to be counterproductive. Implicit in this framework is the notion of switching strategy based on how much one can know and predict in the current situation (i.e., it is easier to understand and predict in the simple domain than in the chaotic domain). This is consistent with Harvey and Reed's idea that one must switch modeling approaches to accommodate epistemological breaks. The takeaway here is that one should switch strategies as the level of model risk changes.

To that end, Pennock and Rouse [2014] introduced a strategy framework that enables selection among four courses of action: optimize, adapt, hedge, and accept (see Fig. 1). The intent is to apply dynamic strategies to compensate for model risk. Ideally, one would like to find the optimal solution to a problem, but this typically requires a relatively well-behaved, tractable, and accurate model of the system you wish to optimize.


Fig. 1  Problem-solving strategy framework [Pennock and Rouse, 2014]. (Decision flow, starting from the problem description: Are the objectives, dynamics, and constraints measurable and tractable? Yes: optimize. No: Is the enterprise response time less than the external response time? Yes: adapt. No: Are multiple, viable alternative futures describable? Yes: hedge. No: accept.)

If that is the case, then there is nothing wrong with optimizing. Unfortunately, as was discussed above, many sociotechnical systems do not fit in this category.

If the model uncertainty is too high to support an optimize strategy, the next best strategy is the adapt strategy. A decision- or policy-maker attempting to influence a sociotechnical system monitors the state of the system and makes changes as the situation evolves. For this strategy to be feasible, the decision-maker's response time needs to be shorter than the sociotechnical system's response time. Modeling efforts can support an adapt strategy by identifying potential scenarios, key variables to monitor, and potential decision thresholds.

If the decision-maker cannot respond faster than the system, then the next best strategy is to hedge. Hedging is equivalent to buying insurance. The decision-maker invests resources to mitigate the risk and to account for cases where sudden change is not possible without prior investment. A classic example is technology investment. Predicting whether or not a new technology will be successful is a risky proposition. Yet in many cases, companies cannot follow an adapt strategy because, by the time they figure out which technology is successful, their competitors have already captured the market. Consequently, many companies invest small amounts in multiple technologies to give them the option to use one or more of them if they end up being successful. Modeling efforts can support a hedge strategy by exploring potential scenarios and identifying sensitive aspects of the system that a decision-maker might want to hedge.

Of course, if it is not possible to predict an event and it is not possible to respond to it if it does occur, then the only option is to accept. Here the idea is not to waste resources on a wild guess. Doing so would deplete resources that could otherwise be allocated to a more manageable risk. A slightly facetious example would be defending against an invasion of aliens from outer space.


We have no basis for estimating the probability of such an invasion, and if it did occur, we probably would not be able to do much about it. So it is not particularly advisable to invest in a defense system to repel an alien invasion; spending the resources on improving healthcare would probably be a better way to go. Analogous situations can occur for more mundane events as well.

Whether we are thinking about dealing with the uncertainties inherent in sociotechnical systems from the Cynefin standpoint or from the optimize, adapt, hedge standpoint, the issue is that, despite increasing capabilities to model sociotechnical systems, it is unlikely that we can reduce sociotechnical systems to engineering optimization problems in the foreseeable future. Consequently, it is important to develop and analyze models of sociotechnical systems in such a way as to influence sociotechnical systems rather than control them. To do otherwise risks creating fragile systems.
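To make the logic of Fig. 1 concrete, the sketch below encodes the optimize–adapt–hedge–accept cascade as a simple function. The predicate names and example values are our own hypothetical stand-ins, not part of the published framework.

```python
# Illustrative encoding of the strategy-selection cascade sketched in Fig. 1.
def select_strategy(tractable_model: bool,
                    enterprise_response_time: float,
                    external_response_time: float,
                    alternative_futures_describable: bool) -> str:
    if tractable_model:
        return "optimize"   # objectives, dynamics, constraints measurable and tractable
    if enterprise_response_time < external_response_time:
        return "adapt"      # monitor the system and respond as it evolves
    if alternative_futures_describable:
        return "hedge"      # invest modestly across multiple plausible futures
    return "accept"         # neither prediction nor timely response is possible

# Example: a slow-responding enterprise facing an intractable model but with
# several describable futures would be advised to hedge.
print(select_strategy(False, 18.0, 6.0, True))  # -> "hedge"
```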

IV. METHODS AND TOOLS

The previous section outlined and elaborated the challenges and difficulties of pursuing the "holy grail" of joint optimization of social and technical systems. This quest does indeed seem daunting. However, we now know the particular hurdles we have to clear. Thus, the conclusion is not that all such models are impossible. Instead, we now know what needs careful attention if we are to avoid deluding ourselves that we know more than we do. From this perspective, consider what paradigms, methods, and tools we can bring to bear on modeling endeavors.

A. ARCHITECTURE OF PUBLIC–PRIVATE ENTERPRISES

The architecture of public–private enterprises shown in Fig. 2 [Rouse, 2009; Rouse and Cortese, 2010; Grossman et al. 2011] provides a framework for discussing alternative methods and tools for addressing sociotechnical systems. This architecture summarizes the range of phenomena of interest in these systems.

Starting at the lowest level of Fig. 2, the efficiencies that can be gained at this level (work practices) are limited by the nature of the next level (delivery operations). Work can only be accomplished within the capacities provided by available processes. Further, delivery organized around processes tends to result in much more efficient work practices than functionally organized business operations. However, the efficiencies that can be gained from improved operations are limited by the nature of the level above, that is, system structure. Functional operations are often driven by organizations structured around these functions, for example, manufacturing and service. Each of these organizations may be a different business with independent economic objectives. This may significantly hinder process-oriented thinking. And, of course, potential efficiencies in system structure are limited by the ecosystem in which these organizations operate. Market maturity, economic conditions, and government regulations will affect the capacities (processes) that businesses (organizations) are willing to invest in to enable work practices (people), whether these people be employees, customers, or constituencies in general.


Fig. 2  Architecture of public–private enterprises.

Economic considerations play a major role at this level [Rouse, 2010a, b]. These organizational realities have long been recognized by researchers in sociotechnical systems [Emery and Trist, 1973], as well as in work design and system ergonomics [Hendrick and Kleiner, 2001]. We need methods and tools that enable computational explorations of these realities, especially by stakeholders without deep disciplinary expertise in these phenomena.

B. LEVELS OF MODELING

Identification of alternative representations of phenomena within the architecture of public–private enterprises shown in Fig. 2 leads to Table 2. We can use this architecture to organize thinking about how to select models for each of the layers in the sociotechnical enterprise.

The ecosystem models draw heavily upon macroeconomic and social system models. These models are often highly aggregated and used to predict overall economic metrics such as GDP growth and inflation. We seldom need, for instance, to model each economic actor's decisions to predict GDP growth and inflation.

ADVANCES IN SOCIOTECHNICAL SYSTEMS

TABLE 2

LEVELS OF MODELING, EXAMPLE ISSUES, AND POTENTIAL MODELS

Level Ecosystem

Organizations

Processes

People

55

Issues

Models

GDP, supply/demand, policy

Macroeconomic models

Economic cycles

Dynamic system models

Intrafirm relations, competition

Network models

Profit maximization

Microeconomic models

Competition

Game theory

Investment

DCF, options

People, material flow

Discrete-event models

Process efficiency

Learning models

Workflow

Network models

Consumer behavior

Agent-based models

Risk aversion

Utility models

Perception progression

Markov, Bayes models

inflation. Nevertheless, these variables can be very important. For example, higher levels of inflation mean that downstream healthcare savings due to upstream prevention and wellness will be more valuable, although this does depend on the discount rate employed. At the organization level, models of interest often include microeconomic models and constructs from decision theory and finance theory. At this level, we are concerned with maximizing profits, making investment decisions, and dealing with competitors. Of course, all of these issues and decisions occur within the context of the “rules of the game” determined at the ecosystem level. The returns on investments in the process level below are also influenced by the potential payoffs from process improvements and new capacities. Models employed at the process level often involve substantial interactions of physical and human phenomena. This is where many work activities and tasks are accomplished to deliver capabilities and associated information. Simulation of these processes and their evolution over time, in part due to production learning, is often how the theories associated with these phenomena are instantiated and studied. The people level draws upon the phenomena such as humans’ perceptions, decisions and actions, as well as, for example, how human health and possible disease states evolve. At this level, humans are both the agents of action and the objects of action through natural physical processes. It is also important to understand that the people level is both enabled and constrained by the capabilities and information provided by the process level.
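To make the interplay between inflation and the discount rate concrete, the following minimal Python sketch (not from the chapter; the savings amount, horizon, and rates are purely illustrative assumptions) computes the present value of a future healthcare saving whose nominal size grows with inflation while the nominal discount rate is held fixed. In practice the discount rate itself tends to move with inflation, which is the caveat noted above.

```python
# Illustrative sketch: present value of a downstream healthcare saving under
# assumed inflation and a fixed nominal discount rate (all numbers hypothetical).
def present_value(real_saving_today, years, inflation, discount_rate):
    nominal_saving = real_saving_today * (1 + inflation) ** years  # saving grows with inflation
    return nominal_saving / (1 + discount_rate) ** years           # discount back to today

low = present_value(1000.0, 10, inflation=0.02, discount_rate=0.05)
high = present_value(1000.0, 10, inflation=0.06, discount_rate=0.05)
print(round(low, 2), round(high, 2))  # higher inflation -> larger present value at a fixed discount rate
```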


C. REPRESENTATION TO COMPUTATION

We need to translate the model-based representations of phenomena into computational form. There are several common computational frames.

1. DYNAMIC SYSTEMS

The dynamic systems frame is concerned with transient and steady-state responses as well as stability of dynamic systems. Central constructs are stocks, flows, feedback, error, and control. Such systems are represented using differential or difference equations. Both have continuous states, but the former involves continuous transitions while the latter involves discrete transitions. Various elements of these models may be stochastic in nature, for example, disturbances and measurement noise. A range of tools can be used to solve such equations (see below), but the essence of the computation with all approaches is the relationship between future system states and past system states over time.

2. DISCRETE EVENT SYSTEMS

The discrete event frame is concerned with steady-state responses in terms of average waiting time and time in the system, as well as average number of entities waiting and number of entities in the system. Central constructs are flows, capacities, and queues, as well as allocations of resources and time-based metrics. Such systems are represented using Markov chains with discrete states and transitions occurring in continuous time. Also important are probability distributions associated with arrival flows (e.g., Poisson) and service processes (e.g., exponential). A variety of tools can be used to compute the responses of discrete event systems (see below). As with dynamic systems, the essence of the computation with all approaches is the relationship between future system states and past system states, in this case averaged over time.

3. AGENT-BASED SYSTEMS

The agent-based frame focuses on large numbers of independent entities and the emergent responses of the collection of entities over time. Central constructs, for each agent, are information use, decision-making, and adaptation over time. Such systems are represented by the sets of information sampling rules and decision-making rules assigned to each agent. These systems may also incorporate network models of relationships among agents. A range of tools can be used to compute the responses of agent-based systems (see below). The essence of all these approaches is simulation to compute the evolving state of each agent and collective patterns of behaviors. Of particular importance with such models is the notion that the observed emergent behaviors are not explicitly specified by the agents’ rules.
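As a minimal, hedged illustration of the dynamic systems frame described above, the Python sketch below iterates a first-order difference equation with proportional feedback and a stochastic disturbance; the model, its gain, and its noise level are invented for illustration rather than drawn from any of the chapter’s applications.

```python
import random

# Minimal sketch of the dynamic systems frame: a stock updated by a
# difference equation with proportional feedback control and noise.
def simulate(steps=50, target=100.0, gain=0.2, noise_sd=2.0, seed=1):
    random.seed(seed)
    state = 0.0
    trajectory = []
    for _ in range(steps):
        error = target - state                      # control error
        inflow = gain * error                       # feedback-driven flow
        disturbance = random.gauss(0.0, noise_sd)   # stochastic disturbance
        state = state + inflow + disturbance        # future state from past state
        trajectory.append(state)
    return trajectory

print(simulate()[-1])  # the stock settles near the target despite disturbances
```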


4. OPTIMIZATION-BASED FRAME

Another computational frame can overarch one or more of the above three frames. Beyond simply controlling a dynamic system, we may want to optimally control it. Beyond predicting the queuing time in a discrete event system, we may want to optimally allocate resources to minimize some criterion that differentially weights the queuing times of different types of entities. Beyond observing the emergent behaviors of an agent-based system, we may want to design optimal incentive systems that maximize the likelihood of desirable emergent behaviors. Thus, the problem to be solved using the models discussed above may be determination of the optimal control strategy, the optimal allocation of resources, or the optimal incentive system.

Such aspirations are pervasive in operations research and systems engineering, as well as economics, finance, etc. The holy grail is the “best” answer that maximizes benefits and minimizes costs. We must keep in mind, however, that all of these pursuits must inherently solve their optimization problems in the context of models of the systems of interest rather than the complex reality of the actual systems. Indeed, the achievement of the “best” answer can only be proven within the mathematical representations of the system of interest. To the extent that the mathematical models of the system are good approximations, the best answer may turn out to be “pretty good.”

When there are humans in the system, we have to deal with not only our model of the system but also with humans’ models of the system. This is often addressed using the notion of “constrained optimality.” Succinctly, it is assumed that people will do their best to achieve task objectives within their constraints, such as limited visual acuity, reaction time delays, and neuromotor lags. Thus, predicted behaviors and performance are calculated as the optimal behavior and performance subject to the constraints limiting the humans involved. If these predictions do not compare favorably with subsequent empirical measurements of behaviors and performance, one or more constraints have been missed [Rouse, 1980, 2007].

Determining the optimal solution for any particular task or tasks requires assumptions beyond the likely constraints on human behavior and performance. Many tasks require understanding of the objectives or desired outcomes and inputs to accomplish these outcomes, as well as any intervening processes. For example, drivers need to understand the dynamics of their vehicles. Pilots need to understand the dynamics of their aircraft. Process plant operators need to understand the dynamics of their processes. They also need to understand any tradeoffs between, for example, accuracy of performance and energy required, human and otherwise, to achieve performance. This understanding is often characterized as humans’ “mental models” of their tasks and context. To calculate the optimal control of a system, or the optimal detection of failures, or the optimal diagnoses of failures, assumptions are needed regarding humans’ mental models.
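The constrained-optimality idea can be sketched computationally. In the hedged Python fragment below, the cost function, its weights, and the bound standing in for reaction-time and neuromotor limits are all invented for illustration; the predicted “human” choice is simply the cost-minimizing gain restricted to the achievable range.

```python
from scipy.optimize import minimize_scalar

# Hypothetical cost: higher control gain reduces tracking error but costs effort.
def cost(gain, error_weight=1.0, effort_weight=0.05):
    tracking_error = 1.0 / (1.0 + gain)   # illustrative error model
    effort = gain ** 2                    # illustrative effort model
    return error_weight * tracking_error + effort_weight * effort

# Unconstrained ideal vs. constrained-optimal prediction (the bound is an assumed human limit).
ideal = minimize_scalar(cost, bounds=(0.0, 10.0), method="bounded")
predicted = minimize_scalar(cost, bounds=(0.0, 1.0), method="bounded")
print(round(ideal.x, 2), round(predicted.x, 2))  # predicted behavior sits at the human constraint
```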


If we assume well-trained humans who agree with and understand task objectives, we can usually argue that their mental models are accurate, for example, reflect the actual physical dynamics of the vehicle. The key point is that optimizing many types of systems requires that we develop models of humans’ models of reality. This is tractable if there are enough constraints on humans’ choices and behaviors. Without sufficient constraints, however, the whole notion of optimality can become extremely ambiguous and often useless.

5. SUMMARY

Most of the applications of the above computational frames have involved modeling and representation of the “physics” of the environment, infrastructure, vehicles, etc. These are certainly important elements of many overall multilevel models. However, the greatest challenges in developing such models for the types of problems addressed in this chapter are modeling and representation of the behavioral and social activities and performance throughout the system, especially when it cannot be assumed that the human elements of the system will behave in accordance with the objectives and “rules of engagement” of the overall system. There is a wealth of possible representations of behavioral and social phenomena, but these approaches are nonetheless subject to much more human variability than is experienced with physical systems, especially systems that were designed or engineered. This is not a cause for despair, but a caution that human variability may be as important, or more important, than average human responses. This variability can, for example, result in deteriorating overall system performance despite the fact that humans, on average, are performing just as expected.

D. COMPUTATIONAL TOOLS

There is a wide range of commercially available software tools for creating computational instantiations of the theories and models discussed above. Many of these packages have been available for quite some time, have been updated regularly, and have sizable user groups and online forums for support.

Mathematica and MATLAB are both commercial software packages for supporting representation and solution of a wide range of equations. This, of course, requires that one explicitly derive the equations to be solved. There is a range of other software packages that support one or more of the theories or paradigms discussed earlier and, in effect, derive the needed equations from a graphic portrayal of the model.

Simcad Pro and VisSim are both commercial software packages for supporting continuous simulation. They have block-diagram-oriented graphical user interfaces.


They both provide options for numerical integration techniques, as they are solving continuous differential equations rather than discrete difference equations.

Arena and Simio are both commercial software packages for supporting discrete event simulation of network models. These models typically involve queuing of entities that arrive at entry nodes, wait for service, and are then serviced and passed on to the next node until the whole servicing process is complete. These packages include capabilities for analysis of simulation runs and graphic portrayal of results.

NetLogo and Repast both support agent-based modeling and simulation for large numbers of agents. Model developers provide behavioral rules to each agent, and the simulation enables seeing the overall system behaviors that emerge from the collective actions of all agents. Both tools are free.

Stella and Vensim are both commercial software packages for supporting systems dynamics modeling using Jay Forrester’s framework [Sterman, 2002]. These tools have easy-to-use graphical interfaces and a variety of associated analytical capabilities.

AnyLogic is a commercial software package that simultaneously supports systems dynamics modeling, discrete event simulation, and agent-based simulation. The ability to pass data among these three representations enables multilevel modeling of complex phenomena. Its easy-to-use graphical interface enables rapid prototyping of initial models.

R is a statistical package that also supports simulation but does not provide functionality tailored to systems dynamics modeling, discrete event simulation, or agent-based simulation. Nevertheless, one can embed equations from these paradigms in an R simulation. R is free.

ModelCenter is a commercial software package for integrating outputs from independent software models. It does not provide functionality tailored to systems dynamics modeling, discrete event simulation, or agent-based simulation. It also includes capabilities to support team projects.
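To show the basic pattern that agent-based tools such as NetLogo and Repast support, here is a minimal Python sketch that does not use either tool; the ring network, adoption threshold, and seeding fraction are invented solely for illustration. Each agent follows a purely local imitation rule, and the aggregate adoption level emerges from the collective updates rather than from any rule that specifies it directly.

```python
import random

# Minimal agent-based sketch: agents on a ring adopt a behavior when enough of
# their neighbors have adopted it; the overall adoption level is an emergent,
# collective outcome of purely local rules.
def run(n_agents=100, threshold=0.34, seed_fraction=0.05, steps=30, seed=7):
    random.seed(seed)
    adopted = [random.random() < seed_fraction for _ in range(n_agents)]
    for _ in range(steps):
        snapshot = adopted[:]                          # synchronous update
        for i in range(n_agents):
            neighbors = [snapshot[(i - 1) % n_agents], snapshot[(i + 1) % n_agents],
                         snapshot[(i - 2) % n_agents], snapshot[(i + 2) % n_agents]]
            if sum(neighbors) / len(neighbors) >= threshold:
                adopted[i] = True                      # local rule: imitate neighbors
    return sum(adopted) / n_agents

print(run())  # aggregate adoption is not written into any individual agent's rule
```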

E. INTEGRATED APPROACHES TO SOCIOTECHNICAL SYSTEMS

Our overall goal in this section has been to discuss a range of general representations and commercially available computational tools for addressing sociotechnical systems. It is important to indicate, however, that a variety of targeted integrated approaches have been developed. They are well represented by the other four chapters in this section of the book.

Madni and Sievers [2016] discuss system-of-systems integration, which involves interfacing and enabling the interactions of component systems to create the needed system-of-systems capability to accomplish mission goals or satisfy business needs. They argue that a layer of complexity is introduced when the system-of-systems has to exhibit certain quality attributes such as adaptability and resilience in the face of contingencies and disruptions in the operational environment.


Jackson [2016] describes an approach to creating a system that will anticipate, avoid, withstand, survive, and recover from threats that may be human-made (such as terrorist attacks), natural (such as earthquakes), or internal (such as latent faults). The domains examined included fire protection, rail, aviation, and power distribution.

Friedenthal and Oster [2016] demonstrate how the OMG Systems Modeling Language (OMG SysML™) and a model-based systems engineering (MBSE) approach can be applied to the mission, system specification, and design of a small spacecraft.

Griendling and Mavris [2016] present a methodology and set of enabling techniques for performing technology infusion studies in the early phases of conceptual design, especially in cases where the designer is considering maturing technologies in parallel with the development of an aircraft concept. This methodology gives designers and decision-makers a meaningful way to trade the benefits and costs of technology infusion with the ability to meet and exceed design requirements. The methodology is illustrated through a case study of a large passenger transport aircraft.

These integrated approaches are of particular value when the problem of interest involves phenomena that can be appropriately represented by the models and computational methods chosen by the originators of these approaches.

V. CONCLUSIONS

The notion of sociotechnical systems has evolved substantially in the last eight decades as it has been developed and refined. The initial goal was simply to attract attention to behavioral and social phenomena central to many technical systems. Aspirations then evolved to the idea that the social and technical elements of systems could be jointly optimized. This motivated a rich and ambitious portfolio of research endeavors. More recently, we have come to realize that the behavioral and social phenomena make systems very complex, not just highly complicated. We can only occasionally optimize, hopefully are otherwise able to adapt, and may have to hedge against alternative futures. Acceptance is the last choice, but it may be a better strategy than investing resources in optimizing, adapting, and hedging when those strategies will inherently fail. Nevertheless, we can do a lot. We just cannot do everything.

REFERENCES

Ackoff, R. (1956), “The Development of Operations Research as a Science,” Institute for Operations Research and the Management Science, Vol. 4, No. 3, pp. 165–295.
Ackoff, R. (1971), “Towards a System of Systems Concepts,” Management Science, Vol. 17, No. 11, pp. 661–671.
Ackoff, R. (1974), “The Systems Revolution,” Long Range Planning, Vol. 7, pp. 2–20.
Ackoff, R. (1979), “The Future of Operational Research is Past,” The Journal of the Operational Research Society, Vol. 30, No. 2, pp. 93–104.
Ackoff, R., and Emery, F. (1972), On Purposeful Systems, Aldine-Atherton, Chicago, IL.
Balderston, F., Epstein, E., and Koenigsberg, E. (2004), In Memoriam: C. West Churchman, University of California, http://senate.universityofcalifornia.edu/inmemoriam/cwestchurchman.htm [retrieved 27 March 2015].
Barberis, N., Haas, P. J., Kieliszewski, C., Li, Y., Maglio, P., Phoungphol, P., Selinger, P., Sismanis, Y., Tan, W., and Terrizzano, I. (2012), Splash: A Computational Platform for Collaborating to Solve Complex Real-World Problems, IBM Research, Almaden, http://researcher.watson.ibm.com/researcher/view_project.php?id=3931 [retrieved 28 October 2013].
Carley, K. (1991), “A Theory of Group Stability,” American Sociological Review, Vol. 56, pp. 331–354.
Carley, K. (1992), “Organizational Learning and Personnel Turnover,” Organization Science, Vol. 3, No. 1, pp. 20–46.
Carley, K., and Wallace, W. (2001), “Computational Organization Theory: A New Perspective,” Encyclopedia of Operations Research and Management Science, edited by S. Gass and C. Harris, Kluwer Academic Publishers, Norwell, MA.
Casti, J. L. (1985), “On System Complexity: Identification, Measurement, and Management,” Complexity, Language, and Life: Mathematical Approaches, edited by J. L. Casti and A. Karlqvist, Springer-Verlag, Berlin, pp. 144–173.
Chen, B., and Cheng, H. H. (2010), “A Review of the Applications of Agent Technology in Traffic and Transportation Systems,” IEEE Transactions on Intelligent Transportation Systems, Vol. 11, No. 2, pp. 485–497.
Chen, X., and Zhan, F. B. (2008), “Agent-Based Modelling and Simulation of Urban Evacuation: Relative Effectiveness of Simultaneous and Staged Evacuation Strategies,” Journal of the Operational Research Society, Vol. 59, No. 1, pp. 25–33.
Churchman, C., Ackoff, R., and Arnoff, E. (1959), Introduction to Operations Research, John Wiley & Sons, Hoboken, NJ.
Conte, R., Gilbert, N., Bonelli, G., Cioffi-Revilla, C., Deffuant, G., Kertesz, J., Loreto, V., Moat, S., Nadal, J., Sanchez, A., Nowak, A., Flache, A., San Miguel, M., and Helbing, D. (2012), “Manifesto of Computational Social Science,” The European Physical Journal Special Topics, Vol. 214, No. 1, pp. 325–346.
DARPA (2015), Big Mechanism Program, http://www.darpa.mil/Our_Work/I2O/Programs/Big_Mechanism.aspx [retrieved 1 March 2015].
Diallo, S. Y., Padilla, J. J., Gore, R., Herencia-Zapana, H., and Tolk, A. (2013), “Toward a Formalism of Modeling and Simulation using Model Theory,” Complexity, Vol. 19, No. 3, pp. 56–63.
Drack, M. (2009), “Ludwig von Bertalanffy’s Early System Approach,” Systems Research and Behavioral Science, Vol. 26, pp. 563–572.
Emery, F. (1959), Characteristics of Socio-Technical Systems, Tavistock Documents No. 527, London, UK.
Emery, F., and Trist, E. (1973), Toward a Social Ecology, Plenum Press, London.
Epstein, J. M. (2002), “Modeling Civil Violence: An Agent-Based Computational Approach,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 99, Suppl. 3, pp. 7243–7250.
Friedenthal, S., and Oster, C. (2016), “Applying SysML and a Model-Based Systems Engineering (MBSE) Approach to a Small Satellite Design,” This volume.
Gerhardt, U. (2011), Talcott Parsons—An Intellectual Biography, Cambridge University Press, Cambridge, UK.
Goolsby, R. (2009), “Lifting Elephants: Twitter and Blogging in Global Perspective,” Social Computing and Behavioral Modeling, Springer, Heidelberg, pp. 1–7.
Goolsby, R. (2010), “Social Media as Crisis Platform: The Future of Community Maps/Crisis Maps,” ACM Transactions on Intelligent Systems and Technology, Vol. 1, No. 1, p. 7.
Griendling, K., and Mavris, D. (2016), “A Systems Engineering Approach and Case Study for Technology Infusion for Aircraft Conceptual Design,” This volume.
Grossman, C., Goolsby, W. A., Olsen, L., and McGinnis, J. M. (2011), Engineering the Learning Healthcare System, National Academy Press, Washington, DC.
Harvey, D. L., and Reed, M. (1997), “Social Science as the Study of Complex Systems,” Chaos Theory in the Social Sciences: Foundations and Applications, edited by L. D. Kiel and E. Elliot, The University of Michigan Press, Ann Arbor.
Helbing, D., and Lämmer, S. (2008), “Managing Complexity: An Introduction,” Managing Complexity: Insights, Concepts, Applications, edited by D. Helbing, Springer-Verlag, Berlin, pp. 1–15.
Hendrick, H. W., and Kleiner, B. M. (2001), Macroergonomics: An Introduction to Work System Design, Human Factors and Ergonomics Society, Santa Monica, CA.
Hofman, M. (2013), “Ontology in Modeling and Simulation: An Epistemological Perspective,” Ontology, Epistemology, and Teleology for Modeling and Simulation, edited by A. Tolk, Springer, Heidelberg, pp. 59–87.
Hofman, M., Palii, J., and Mihelcic, G. (2011), “Epistemic and Normative Aspects of Ontologies in Modeling and Simulation,” Journal of Simulation, Vol. 5, No. 3, pp. 135–146.
Jackson, S. (2016), “Engineering Resilience into Human-Made Systems,” This volume.
Kirby, M., and Rosenhead, J. (2005), “IFORS’ Operational Research Hall of Fame: Russell L. Ackoff,” International Transactions in Operational Research, Vol. 12, No. 1, pp. 129–134.
LeBaron, B., Arthur, W. B., and Palmer, R. (1999), “Time Series Properties of an Artificial Stock Market,” Journal of Economic Dynamics and Control, Vol. 23, No. 9, pp. 1487–1516.
Little, J. D. C. (2002), “Philip M. Morse and the Beginnings,” Operations Research, Vol. 50, No. 1, pp. 146–148.
Madni, A. M., and Sievers, M. (2016), “System of Systems Integration: Fundamental Concepts, Challenges and Opportunities,” This volume.
Maier, M. (1998), “Architecting Principles for Systems-of-Systems,” Systems Engineering, Vol. 1, No. 4, pp. 267–284.
McGinnis, L., Huang, E., Kwon, K. S., and Ustun, V. (2011), “Ontologies and Simulation: A Practical Approach,” Journal of Simulation, Vol. 5, No. 3, pp. 190–201.
Mindell, D. A. (2000), Cybernetics—Knowledge Domains in Engineering Systems, http://21stcenturywiener.org/wp-content/uploads/2013/11/Cybernetics-by-D.A.-Mindell.pdf [retrieved 20 March 2015].
Moss, S., and Edmonds, B. (2005), “Sociology and Simulation: Statistical and Qualitative Cross-Validation,” American Journal of Sociology, Vol. 110, No. 4, pp. 1095–1131.
Palmer, R., Arthur, W. B., Holland, J. H., LeBaron, B., and Tayler, P. (1994), “Artificial Economic Life: A Simple Model of a Stockmarket,” Physica D: Nonlinear Phenomena, Vol. 75, No. 1, pp. 264–274.
Park, H., Clear, T., Rouse, W. B., Basole, R. C., Braunstein, M. L., Brigham, K. L., and Cunningham, L. (2012), “Multilevel Simulations of Health Delivery Systems: A Prospective Tool for Policy, Strategy, Planning, and Management,” Service Science, Vol. 4, No. 3, pp. 253–268.
Parsons, T. (1937), The Structure of Social Action, The Free Press, Glencoe, IL, https://archive.org/stream/structureofsocia00pars#page/n7/mode/2up [retrieved 12 March 2015].
Parsons, T., and Shils, E. (1951), Toward a General Theory of Action: Theoretical Foundations for the Social Sciences, Transaction Publishers, New Brunswick, NJ.
Partridge, C., Mitchell, A., and de Cesare, S. (2013), “Guidelines for Developing Ontological Architectures in Modeling and Simulation,” Ontology, Epistemology, and Teleology for Modeling and Simulation, edited by A. Tolk, Springer, Heidelberg, pp. 22–57.
Pennock, M. J., and Rouse, W. B. (2014), “The Challenges of Modeling Enterprise Systems,” Proceedings of the 4th International Engineering Systems Symposium, Hoboken, NJ, 9–11 June 2014.
Poli, R. (2013), “A Note on the Difference between Complicated and Complex Social Systems,” Cadmus, Vol. 2, No. 1, pp. 142–147.
Romer, C. (2015), Great Depression, Encyclopedia Britannica, http://www.britannica.com/EBchecked/topic/243118/Great-Depression [retrieved 10 March 2015].
Rouse, W. B. (1980), Systems Engineering Models of Human-Machine Interaction, North Holland, New York.
Rouse, W. B. (2007), People and Organizations: Explorations of Human-Centered Design, Wiley, New York.
Rouse, W. B. (2009), “Engineering Perspectives on Healthcare Delivery: Can We Afford Technological Innovation in Healthcare?” Journal of Systems Research and Behavioral Science, Vol. 26, pp. 1–10.
Rouse, W. B. (2010a), “Impacts of Healthcare Price Controls: Potential Unintended Consequences of Firms’ Responses to Price Policies,” IEEE Systems Journal, Vol. 4, No. 1, pp. 34–38.
Rouse, W. B. (ed.) (2010b), The Economics of Human Systems Integration: Valuation of Investments in People’s Training and Education, Safety and Health, and Work Productivity, Wiley, New York.
Rouse, W. B. (2015), Modeling and Visualization of Complex Systems and Enterprises: Explorations of Physical, Human, Economic, and Social Phenomena, Wiley, Hoboken, NJ.
Rouse, W. B., and Cortese, D. A. (eds.) (2010), Engineering the System of Healthcare Delivery, IOS Press, Amsterdam.
Royde-Smith, J. (2015), World War II, Encyclopedia Britannica, http://www.britannica.com/EBchecked/topic/648813/World-War-II [retrieved 10 March 2015].
Showalter, D. (2015), World War I, Encyclopedia Britannica, http://www.britannica.com/EBchecked/topic/648646/World-War-I [retrieved 10 March 2015].
Snowden, D. J., and Boone, M. E. (2007), “A Leader’s Framework for Decision Making,” Harvard Business Review, Vol. 85, No. 11, pp. 68–76.
Sterman, J. D. (2002), “Systems Dynamics Modeling: Tools for Learning in a Complex World,” California Management Review, Vol. 43, No. 4, pp. 8–25.
Sutton, J., Palen, L., and Shklovski, I. (2007), “Backchannels on the Front Lines: Emergent Uses of Social Media in the 2007 Southern California Wildfires,” Proceedings of the 5th International ISCRAM Conference.
Tolk, A. (2003), “The Levels of Conceptual Interoperability Model,” Proceedings of the Fall Simulation Interoperability Workshop, IEEE CS Press, Orlando, FL, September.
Tolk, A., and Miller, J. A. (2011), “Enhancing Simulation Composability and Interoperability using Conceptual/Semantic/Ontological Models,” Journal of Simulation, Vol. 5, No. 3, pp. 133–134.
Tolk, A., Diallo, S. Y., Padilla, J. J., and Herencia-Zapana, H. (2011), “Model Theoretic Implications for Agent Languages in Support of Interoperability and Composability,” Proceedings of the Winter Simulation Conference, pp. 309–320.
Tolk, A., Diallo, S. Y., Padilla, J. J., and Herencia-Zapana, H. (2013), “Reference Modelling in Support of M&S—Foundations and Applications,” Journal of Simulation, Vol. 7, No. 2, pp. 69–82.
Trist, E. (1981), The Evolution of Socio-Technical Systems—A Conceptual Framework and an Action Research Program, Ontario Ministry of Labour, Toronto, ON.
Yu, Z., Rouse, W. B., and Serban, N. (2011), “A Computational Theory of Enterprise Transformation,” Systems Engineering, Vol. 14, No. 4, pp. 441–454.
Zeigler, B. P., Praehofer, H., and Kim, T. G. (2000), Theory of Modeling and Simulation: Integrating Discrete Event and Continuous Complex Dynamic Systems, Academic Press, New York.

CHAPTER 3

Engineering Resilience into Human-Made Systems

Scott Jackson*
Burnham Systems Consulting, Greater Los Angeles Area, California

This chapter describes the approach to creating a system that will anticipate, avoid, withstand, survive, and recover from threats that may be human-made, such as terrorist attacks, natural, such as earthquakes, or internal, such as latent faults. There is no doubt that creating such an approach is complex and difficult. The task is made difficult by the many ways that a system could encounter the threat, the many ways that the system could fail, and the many ways that the system could recover. Nevertheless, this chapter presents a step-by-step path for accomplishing such a goal. This path requires an understanding of abstract principles and the ability to convert abstract principles into concrete solutions and vice versa. Authoritative sources confirm that these abstract principles are simplified replicas of concrete solutions and are linked to the concrete solutions by their dominant characteristics. The domains examined included fire protection, rail, aviation, and power distribution. However, these principles are universal and apply to any application domain. This conversion allows both the validation of these principles and the identification of concrete solutions that are inspired by the abstract principles.

The first step is the identification of the many possible abstract principles that could be employed in enhancing the resilience of the system. The second step is to identify the many states the system could pass through on the way from a normal operating state to a completely recovered state, a partially functioning state, an agreed final state, or a decommissioned state. Next the analyst must identify the many transitions that the system could pass through from state to state on the way to the final state. Then the analyst can identify the principles that could be employed to accomplish each transition. Finally, the analyst must understand how to convert these abstract principles into concrete solutions. Once the analyst has identified one or more concrete solutions, it would be possible to evaluate the effectiveness of these solutions using simulations or analytical techniques. Figure 1 illustrates this path.

*Principal Engineer; Doctoral candidate, University of South Australia.

Copyright © 2013 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.


Fig. 1 The path to a resilient system.

The chapter also has two additional sections: the first section describes how the concept of resilience applies in the aviation domain and the second section explains how resilient systems can be created in the heat of a crisis.

I. OVERVIEW OF RESILIENCE

A. RESILIENCE BACKGROUND

The concept of resilience has existed for centuries. The Oxford English Dictionary (OED) [1973] defines resilience as the ability to “bounce back” following a disturbance. For example, in materials science, a spring could be said to be resilient if it returns to its original state after being deformed. In psychology, a human being could be said to be resilient if he or she can return to a normal state following a traumatic experience, such as an accident. In ecology, a lake can be said to be resilient if it returns to its normal state after being polluted. These were the definitions that dominated the study of resilience before the advent of systems engineering, in which it was desired to create a system with human involvement that would “bounce back.”

B. DEFINITION OF RESILIENCE IN AN ENGINEERING CONTEXT

More recently, researchers have investigated the meaning of resilience in an engineering context. That is, they ask what the characteristics of a system must be that will enable it to anticipate, survive, and recover from a disturbance.


Systems may be aircraft, spacecraft, or civil infrastructure systems. External disturbances may be terrorist attacks, hurricanes, or earthquakes. Internal disturbances may be undetected latent flaws, such as software errors. There have been many definitions of resilience in an engineering context. Most of these definitions are similar, but the one by Haimes [2009] is typical: “Resilience is the ability of the system to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks.” From a governmental perspective, the White House [2011], for example, defines resilience to be “the ability to adapt to changing conditions and withstand and rapidly recover from disruption due to emergencies.” In a military context, according to Richards [2009], it refers to the ability of a system, usually a technological system, to recover following a hostile attack.

Figure 2 is a simplified view of how a system could pass through a number of states before recovering from the disruption or degrading to a state of decommissioning. A more comprehensive view of these states will be shown in a later section.

C. MEANING OF ENGINEERED IN A RESILIENCE CONTEXT

When we say that we are going to engineer resilience into an engineered system, it is important to remember that the word engineered should be interpreted in a broad sense. That is, engineered does not refer only to the classical definition of engineering as a physics-based process, although that type of engineering is included as well. Checkland [1999] uses the word in the broader sense in which you can, for example, engineer a meeting or a political agreement. Hollnagel et al. [2006] also used the word in the broader sense when they coined the term resilience engineering. In their book, the emphasis is on organizational systems rather than on technological systems. In short, the word engineer can be interpreted in this context to mean to put something together intelligently.

D. PROACTIVE VS REACTIVE RESILIENCE

Many researchers define resilience in two ways: proactive resilience and reactive resilience.

Fig. 2 Resilience disruption cycle.


Proactive resilience has to do with scenarios in which the system takes steps before the encounter with the threat. These steps include the anticipation of the threat, detection of the threat, and corrective measures to avoid the threat or to minimize damage. Hollnagel et al. [2006] have adopted this definition. Other researchers, for example Haimes [2009], define resilience only to reflect events after the encounter with the threat. These researchers define the events before the encounter to be part of the protection phase. In the end, it does not matter which definition is used provided the analysis is consistent in the assumptions used and considers all the factors both before and after the encounter with the threat. This chapter employs the proactive perspective in order to include all the principles to be discussed later.

E. RESILIENCE AND SAFETY

Another frequently asked question is: What is the difference between resilience and safety? The short answer is that safety is the process for avoiding the loss of people or products, while resilience is the process for anticipating a disruption and achieving the partial or total restoration of the functionality of the system in question. Furthermore, some of the systems of interest in resilience analysis are not safety critical; they can better be categorized as service critical. For example, an electrical power system is a service critical system in which providing electrical power is the service and the primary function.

Leveson compares safety and resilience. She says [Leveson, 1995] that safety is the “freedom from accidents or losses.” She says [Leveson et al., 2006] that, on the other hand, resilience is “avoiding failures or losses, as well as responding appropriately after the fact.” Furthermore, traditional safety standards, such as the Department of Defense standard [DoD, 2012], only address the design of the system. They do not address the organizational contributions to safety as resilience does. Other publications, for example, Systems Engineering for Commercial Aircraft [Jackson, 1997], have addressed organizational safety.

F. RESILIENCE IN A SYSTEMS THEORY CONTEXT

It is reasonable to ask whether the study of resilience is consistent with the basic principles of systems theory. Various sources, including Jackson [2015], have documented these principles; the following sections summarize the principles and explain how they are relevant to the resilience of engineered systems.

1. PRINCIPLE OF FUNCTION

The INCOSE Fellows [2006] state that a basic property of a system is that it needs to perform a function. This principle is consistent with the concept of resilience because the purpose of resilience is to enhance the system’s ability to retain or restore all or part of its function following a disruption.


2. PRINCIPLE OF SATISFICING

According to Adams [2011], the principle of satisficing recognizes the fact that complex systems are not amenable to optimization and therefore good enough is acceptable. This principle is particularly applicable to resilience because in most cases restoration to a partial but good enough state is acceptable.

3. PRINCIPLE OF INTERACTION

This principle is closely related to the principle of cohesion. According to Adams [2011], it states that the elements of a system must interact with each other to qualify as a system. This principle is required to allow the system to interact as a whole to restore its functionality either fully or in part.

4. PRINCIPLE OF HOLISM

According to Adams [2011], the principle of holism requires that all the elements of a system work together to perform any function. That is, individual components cannot perform system functions. Hence, in order for a system to restore its functionality, which is the essence of resilience, the principle of holism must be invoked.

5. PRINCIPLE OF EMERGENCE

According to Adams [2011], emergence is the property of a system that cannot be attributed to any single element. The most important property that meets this definition is resilience itself. That is to say, multiple elements are required to restore functionality either fully or partially.

6. PRINCIPLE OF CONTROL

According to Adams [2011], “Control is the method by which we ensure that the systems operations are regulated so that it will continue to meet the expectations of its designers and move in the direction of its goals.” Control is essential to resilience because the system itself must take some sort of action to restore functionality or to achieve a satisfactory end state.

7. PRINCIPLE OF VIABILITY

According to Adams [2011], “for a system to be viable, it needs to adapt on a short-term basis, but keep its integrity in the long term.” Hence, viability is the combination of adaptation and long-term integrity, both of which are essential to the resilience of the system.


8. PRINCIPLE OF PARSIMONY

The principle of parsimony simply states that systems should be kept as simple as possible to prevent undesirable interactions from resulting in failure. Rechtin [1991], for example, underscores the importance of simplicity in architecting a system. Lack of complexity will be seen to be a key principle in resilience, which has the effect of keeping the system at a distance from its safety boundary.

G. IS RESILIENCE MEASURABLE?

The measurability of resilience is a subject of debate. Haimes [2009], for example, states that resilience is not measurable. He states that there are just too many dimensions. For example, there are too many ways for the system to fail and too many ways for it to recover, and so forth. What Haimes means by this is that the degree of resilience cannot be predicted before a system is built because, for any given system, the designer does not know the types of threats and so forth that will challenge its viability. On the other hand, once a system is built, a threat is identified, and a scenario is defined, the designer can simulate this scenario and determine just how the system might recover; in that sense, resilience is measurable. At the time of this writing, the study of measurability is still in its infancy, and much more study and research need to be done.

H. IS RESILIENCE AN “ILITY”?

So how does resilience fit into the context of systems engineering? In systems engineering there are three types of requirements: functional requirements, performance requirements, and constraints, often called “ilities” to refer to fields such as reliability and safety. Resilience falls in the latter category. However, it is not a separate and distinct discipline from the other “ilities.” Its execution depends on the others to a great extent, for example, reliability, survivability, changeability, and robustness. Safety, for example, depends on reliability, and so forth. Ricci et al. [2014] provide a process for incorporating “ilities” into a system design. This process requires much iteration of the design.

II. DESIGN OF A RESILIENT SYSTEM

This section shows how a designer creates a resilient system by using a set of abstract principles that have been identified and validated. The designer then maps a path through a series of states from a normal operating state to a final decommissioned state, a satisfactory partially functional state, or a fully functional state. The next step is to determine how these principles, when implemented into concrete solutions, can guide the system through these states.


Finally, the designer can create solutions from the abstract principles using the rule of dominant characteristics that links the abstract principles to the concrete solutions. The designer then needs to test or simulate the concrete solutions for cost-effectiveness and other factors to determine which concrete solutions are appropriate. This latter step may result in many iterations of the design, as pointed out by Ricci et al. [2014]. The system of interest can be a standalone system, an SoS as described in Chapter 1, or a sociotechnical system as described in Chapter 2.
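One hypothetical way to keep track of this mapping, sketched in Python below, is a transition table whose entries list the abstract principles a designer might invoke for each state-to-state transition; the state names and principle assignments here are illustrative placeholders rather than the chapter's definitive set.

```python
# Hypothetical sketch: states, transitions, and candidate abstract principles.
# State names and principle assignments are illustrative only.
transitions = {
    ("normal", "degraded"):        ["absorption", "drift correction"],
    ("degraded", "partial"):       ["restructuring", "physical redundancy"],
    ("partial", "recovered"):      ["reparability", "cross-scale interaction"],
    ("partial", "decommissioned"): ["loose coupling"],
}

def principles_along(path):
    """List the principles a designer could invoke along a chosen state path."""
    return [(a, b, transitions.get((a, b), [])) for a, b in zip(path, path[1:])]

for step in principles_along(["normal", "degraded", "partial", "recovered"]):
    print(step)
```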

A. RESILIENCE PRINCIPLES

To create a resilient system, one must start with a set of abstract principles that have been identified by many authoritative sources. These abstract principles are not solutions but are simplified replicas of the solutions. These principles are universal and apply to all domains since they are abstract and can inspire designs in various domains. This section explains how resilience principles have been identified and validated by authoritative case studies and how principles are dependent on one another. The word principle is used in this chapter in a broad sense, that is, to include the rules commonly known as heuristics, or principles learned from experience as defined by Rechtin [1991]. As a matter of fact, most of the principles discussed in this chapter are indeed heuristics.

B. ABSTRACT PRINCIPLES VS CONCRETE SOLUTIONS

First, it is necessary to determine how concrete solutions and abstract principles are linked. Locke [1999] and other authoritative sources have defined the relationship between abstractions and concrete entities. According to Lonergan [1992], an abstraction is a simplified replica of a concrete entity. The characteristics that an abstraction and a concrete entity have in common are known as the dominant characteristics. For example, physical redundancy is an abstract principle. The dominant characteristic of physical redundancy is that the system, or other entity, must have two or more independent branches. The dominant characteristic of two independent branches will also apply to concrete solutions. In one domain, typical concrete solutions may include redundant communications systems. In another domain, they may include redundant structures in a building or redundant branches of any system. Figure 3 shows the relationship between abstract principles and concrete solutions. In this example, both the abstract principle and the concrete solution represent a redundant communications system.

Hence, the relationship between abstract and concrete entities is both vertical and horizontal. The vertical dimension links the characteristics of the abstract and the concrete entities. The horizontal dimension is the applicability to all domains and system types. Figure 4 illustrates the two-dimensional perspective of abstract principles.
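This two-way linkage can be pictured as a simple lookup keyed by dominant characteristic, as in the hypothetical Python sketch below; the entries only restate the physical-redundancy example above and are not an exhaustive catalog.

```python
# Hypothetical sketch of the rule of dominant characteristics: an abstract
# principle and its concrete solutions share the same dominant characteristic,
# so the mapping can be traversed in either direction.
dominant_characteristics = {
    "physical redundancy": "two or more independent, identical branches",
}
concrete_solutions = {
    "two or more independent, identical branches": [
        "redundant communications systems",
        "redundant structural load paths in a building",
    ],
}

def solutions_for(principle):
    """Infer candidate concrete solutions from an abstract principle."""
    characteristic = dominant_characteristics[principle]
    return concrete_solutions.get(characteristic, [])

print(solutions_for("physical redundancy"))
```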


Fig. 3 Relationship between abstract principles and concrete solutions.

C. PRIMARY PRINCIPLES AND SUPPORT PRINCIPLES

Abstract principles to enhance resilience have appeared in the works of many experts. Jackson [2010], referring to these principles as heuristics, published a list of them. Later Jackson and Ferris [2013] elaborated on this list. Finally, Jackson [2015] added more principles and arrived at the list that appears in Table 1. All of these principles are statements of the characteristics of systems that will enable a system to anticipate, avoid, withstand, and recover from a disruption caused by both human-made and natural threats. As stated earlier, these principles are simplified replicas of the concrete solutions that will enhance the resilience of the system. Table 1 describes the dominant characteristic of each principle and the support principles. Support principles are principles that accomplish the same goals as the primary principles but are narrower in scope. In some cases, Jackson and Ferris inferred these principles from the general analysis and observations of the sources consulted.

D. DEPENDENCY OF PRINCIPLES

Jackson and Ferris [2013] found that it was not sufficient to apply these principles singly; rather, the designer must apply them in combination with other principles. The other principles are called dependent principles. The reason for this dependency is that virtually all the principles exhibit some degree of vulnerability. The vulnerability of a principle is not the same as the vulnerability of a system. The vulnerability of a system is a weakness in the system that makes it vulnerable to failure.

Fig. 4 Two-dimensional applicability of abstract principles.

TABLE 1  PRIMARY PRINCIPLES AND SUPPORT PRINCIPLES AND DOMINANT CHARACTERISTICS

Primary principle: Absorption. The system shall be capable of withstanding the design level disruption. Sources: Hollnagel et al. [2006], ASIS Standard [Rijpma, 1997]; the ASIS standard has a management focus.
  Support principles:
  - Margin. The design level shall be increased to allow for an increase in the disruption. Source: Hollnagel et al. [2006]
  - Hardening. The system shall be resistant to deformation. Source: Richards [2009]
  - Context spanning. The system shall be designed for both the maximum disruption level and the most likely disruption. Source: Madni and Jackson [2009]
  - Limit degradation. The absorption capability shall not be allowed to degrade due to aging or poor maintenance. Source: Derived

Primary principle: Restructuring. The system shall be capable of restructuring itself. Sources: Hollnagel et al. [2006], Madni and Jackson [2009], ASIS Standard [Rijpma, 1997]
  Support principles:
  - Authority escalation. Authority to manage crises shall escalate in accordance with the severity of the crisis. Source: Maxwell and Emerson [2009]
  - Regroup. The system shall restructure itself after an encounter with a threat. Source: Raveh [2008]

Primary principle: Reparability. The system shall be capable of repairing itself. Source: Richards [2009]

Primary principle: Drift correction. When approaching the boundary of resilience, the system can avoid the threat or perform corrective action; action can be taken against either real-time or latent threats. Sources: Hollnagel et al. [2006], ASIS Standard [Rijpma, 1997]
  Support principles:
  - Detection. The system shall be capable of detecting an approaching threat. Source: Derived from Jackson and Ferris [2013]
  - Corrective action. The system shall be capable of performing a corrective action following a detection. Source: Derived from Jackson and Ferris [2013]
  - Independent review. The system shall be capable of detecting faults that may result in a disruption at a later time. Source: Haddon-Cave [2009]

Primary principle: Cross-scale interaction. Every node of a system should be capable of communicating, cooperating, and collaborating with every other node. Sources: Hollnagel et al. [2006], ASIS Standard [Rijpma, 1997]
  Support principles:
  - Knowledge between nodes. All nodes of the system should be capable of knowing what all the other nodes are doing. Source: Billings [1997]
  - Human monitoring. Automated systems should understand the intent of the human operator. Source: Billings [1997]
  - Automated system monitoring. The human should understand the intent of the automated system. Source: Billings [1997]
  - Intent awareness. All the nodes of a system should understand the intent of the other nodes. Source: Billings [1997]; Madni uses the terms “predictability” and “inspectability”
  - Informed operator. The human should be informed as to all aspects of an automated system. Source: Billings [1997]
  - Cross-scale impediment. There should be no administrative or technical obstacle to the interactions among elements of a system. Source: Derived from case studies

Primary principle: Functional redundancy. There should be two or more independent and physically different ways to perform a critical task. Sources: Leveson [1995], Madni and Jackson [2009]; Leveson uses the term “design diversity”

Primary principle: Physical redundancy. The system should possess two or more independent and identical legs to perform critical tasks. Sources: Leveson [1995], Madni and Jackson [2009]; Leveson uses the term “design redundancy”

Primary principle: Defense in depth. The system should be capable of having two or more ways to address a single vulnerability. Source: Derived from Reason [1997]

Primary principle: Neutral state. Human agents should delay in taking action to make a more reasoned judgment as to what the best action might be. Source: Madni and Jackson [2009]

Primary principle: Human in the loop. There should always be a human in the system when there is a need for human cognition. Source: Madni and Jackson [2009]
  Support principles:
  - Automated function. It is preferable for humans to perform a function rather than automated systems when conditions are acceptable. Source: Billings [1997]
  - Reduce human error. Standard strategies should be used to reduce human error. Sources: Derived from Billings [1997] and Reason [1997]
  - Human in control. Humans should have final decision-making authority unless conditions preclude it. Sources: Billings [1997], Madni and Jackson [2009]; Madni uses the term “human backup”

Primary principle: Reduce complexity. The system should not be more complex than necessary. Sources: Madni and Jackson [2009], derived from Perrow [1999]; Madni uses the term “complexity avoidance” and emphasizes reduction in design errors
  Support principle:
  - Reduce variability. The relationship between the elements of the system should be as stable as possible. Sources: Marczyk [2012], Rechtin [1991]; Rechtin recommends “stable substructures”

Primary principle: Reduce hidden interactions. Potentially harmful interactions between elements of the system should be reduced. Sources: Derived from Leveson [1995] and Perrow [1999]

Primary principle: Modularity. Sources: Madni and Jackson [2009], Perrow [1999]; Madni uses the term “graceful degradation”

Primary principle: Loose coupling. The system should have the capability of limiting cascading failures by intentional delays at the nodes. Source: Perrow [1999]
  Support principle:
  - Containment. The system will ensure that failures cannot propagate from node to node. Source: Derived in Jackson and Ferris [2013]

principle that results from the fact that most principles are heuristics, which, according to Rechtin [1991], are the rules based on experience and are therefore not scientific principles. Therefore, there is a lack of certainty that the principle, when implemented, will achieve its goal. When one principle exhibits this vulnerability, another principle is required to compensate for the vulnerability of the first principle. In addition to heuristics, scientifically accepted principles, such as physical redundancy, may also exhibit vulnerabilities. Ripjma [1997] shows that a system with two identical branches may be vulnerable if there is an undetected latent fault in both branches. It follows then that if Principle A has a dependent Principle B, then Principle B also has a dependent principle. The net result is a finite chain of dependent principles. Table 2 shows the primary principles and their dependent principles. Only the first-tier dependent principles are shown.

E. EVALUATING RESILIENCE PRINCIPLES FROM AUTHORITATIVE CASE STUDIES

The method of evaluating the principles and support principles was to examine case studies in which domain experts in authoritative reports posed solutions that were judged to make the systems more resilient. These case studies generally covered disasters in which the systems, generally infrastructure systems, were deficient in some way. However, the posed solutions were generally concrete solutions; hence, the challenge was to interpret these solutions in terms of the abstract principles listed in Table 1.

So how can an analyst pose concrete solutions that can be derived from abstract principles? The answer lies in the concept of dominant characteristics discussed earlier. As Fig. 3 shows, the concrete solutions and the abstract principles have the same dominant characteristics.

TABLE 2  CHAIN OF DEPENDENT PRINCIPLES

Primary Principle          | Dependent Principles (first tier)
Absorption                 | Limit degradation, latent drift correction, margin, context spanning
Physical redundancy        | Functional redundancy, human in the loop
Human in the loop          | Reduce human error
Reduce complexity          | Restructuring, reduce variability
Restructuring              | Cross-scale interaction
Reparability               | Cross-scale interaction
Modularity                 | Absorption
Loose coupling             | Modularity, cross-scale interaction, physical redundancy
Drift correction           | Cross-scale interaction, corrective action
Neutral state              | Cross-scale interaction
Cross-scale interaction    | Absorption, physical redundancy, functional redundancy, cross-scale impediment, human in the loop
Reduce hidden interactions | Reduce complexity, cross-scale interaction

This relationship is a two-way relationship: the designer can infer concrete solutions from abstract principles, and the analyst can infer abstract principles from concrete solutions. In the authenticated case studies, as documented by Jackson [2015], there were six ways that the domain experts posed concrete solutions:

- Recommendations. In many reports, the domain experts recommended specific solutions to the systems to make them more resilient. These solutions were the most direct sources of posed concrete solutions.
- Implemented solutions. Many reports documented solutions that the authorities implemented immediately after the event described.
- Planned solutions. In many cases, the authorities documented solutions that they planned for future implementation.
- Observations. In many cases, the domain experts made observations about the deficiencies of the existing system at the time of the event. These observations usually implied a posed solution.
- Improvised solutions. These are the solutions that first responders implemented in the heat of the disaster. Some of these solutions were successful and others were not. When the solution was successful, the report implied a posed solution.
- Failures. In many cases, the reports documented failures that implied a posed solution because of their absence in the existing system. Petroski [2006], for example, documents what designers can learn from failures.

This analysis compared these abstract principles derived from the case studies to the ones identified in Table 1 and found a correspondence. This correspondence constituted the validation of the principles. Table 3 presents a list of example posed solutions and implied principles from each one of these case studies. The following list shows case studies for which the domain experts provided posed concrete solutions. The purpose of this list is to provide a representative set of cases that imply abstract principles. It is important, once again, to point out that the case studies did not directly imply abstract principles. The abstract principles resulted from the derivation using the rule of dominant characteristics described before. The case studies covered four domains: fire prevention, aviation, rail, and power distribution. In addition, there is much overlap among these case studies. For example, although the selected domain for the Inangahua earthquake was rail, due to an overturned train, the case study also had much to say about how emergency management systems are organized. Following are the case studies, their sources, and the primary domain of interest:

1906 San Francisco earthquake and fires W Primary domain: fire protection W Authoritative source: Hansen and Condon [1989]

.

1968 Inangahua, New Zealand earthquake W Primary domain: rail W Authoritative source: New Zealand Ministry of Civil Defense [1968]

.

1911 Triangle Fire in New York W Primary domain: fire protection W Authoritative source: New York Factory Investigating Commission (NYFIC) [1912]

.

2001 World Trade Centre attacks W Primary domain: fire protection W Authoritative source: 11 September Commission [2004]

.

2003 US – Canada blackout W Primary domain: power distribution W Authoritative source: US-Canada Power System Outage Task Force [2004]

.

2006 Nimrod accident W Primary domain: aviation W Authoritative source: Haddon-Cave [2009]

.

2008 Metrolink accident W Primary domain: rail W Authoritative source: National Transportation Safety Board (NTSB) [2010]

• 2005 London Bombings
  Primary domain: rail
  Authoritative source: 7 July Review Committee Report [2006]

• 2001 United 93 attack
  Primary domain: aviation
  Authoritative source: 11 September Commission [2004]

• 2010 Eyjafjallajökull Icelandic volcano eruption
  Primary domain: aviation
  Authoritative source: International Volcanic Ash Task Force (IVATF) [2010]

TABLE 3  POSED SOLUTIONS AND IMPLIED PRINCIPLES

Case | Example Solution | Solution Type | Implied Abstract Principle
World Trade Center | Single frequency communications | Recommendation | Cross-scale interaction
World Trade Center | Improvised deployment of generators | Improvised solution | Restructuring
Metrolink | Positive train control | Recommendation | Drift correction
San Francisco | Communications by horseback | Improvised solution | Functional redundancy
San Francisco | Triple redundant water system | Implemented solution | Physical redundancy
Eyjafjallajökull Icelandic Volcano | Volcanic ash detection | Planned solution | Drift correction
United 93 | Direct communication with pilots | Observation | Cross-scale impediment (support principle)
Triangle Fire | Fireproof buildings | Recommendation | Limit degradation (support principle)
Nimrod | Improved airworthiness | Recommendation | Independent review (support principle)
London Bombings | Improved crisis management | Implemented solution | Restructuring
Inangahua | Improved crisis management | Recommendation | Restructuring
US-Canada Blackout | Intentional delays in power control at nodes | Recommendation | Loose coupling
US-Canada Blackout | Combination of redundant functions and intentional delays | Recommendation | Defense in depth

The domain experts in these case studies posed solutions that are found in Table 3.
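As a minimal illustration of how the Table 3 evidence can be organized for analysis, the following Python sketch stores each posed solution as a record and groups the records by the principle they imply. The PosedSolution class and its field names are illustrative choices, not terminology from the source, and only a few Table 3 rows are transcribed.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class PosedSolution:
    case: str                # case study (Table 3, first column)
    solution: str            # concrete posed solution
    solution_type: str       # recommendation, implemented, planned, observation, improvised, failure
    implied_principle: str   # abstract principle inferred via dominant characteristics

# A few rows of Table 3, transcribed as records.
EVIDENCE = [
    PosedSolution("World Trade Center", "Single frequency communications",
                  "recommendation", "cross-scale interaction"),
    PosedSolution("Metrolink", "Positive train control", "recommendation", "drift correction"),
    PosedSolution("San Francisco", "Communications by horseback",
                  "improvised solution", "functional redundancy"),
    PosedSolution("US-Canada Blackout", "Intentional delays in power control at nodes",
                  "recommendation", "loose coupling"),
]

def by_principle(records):
    """Group the case-study evidence by the abstract principle it implies."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.implied_principle].append(rec.case)
    return dict(groups)

if __name__ == "__main__":
    print(by_principle(EVIDENCE))
```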

F. EXAMPLE PRINCIPLES ILLUSTRATED IN CASE STUDIES

The case studies in Table 3 illustrate various principles for which the authors of the authoritative reports posed solutions. Following are some example cases and the principles they address.

1. WORLD TRADE CENTER

Following the 2001 World Trade Center attacks, the various agencies found that they could not communicate with each other because they were all operating on different radio frequencies. The 9/11 Commission [2004] strongly recommended that standards be established to enable the agencies to operate on a single frequency. This is an example of the cross-scale interaction principle.

2. METROLINK

One of the major causes of the 2008 Metrolink accident near Los Angeles was that the Metrolink train could not detect another train approaching. One of the recommendations of the National Transportation Safety Board [2010] was the installation of positive train control. This is an example of the drift correction principle.

3. SAN FRANCISCO EARTHQUAKE AND FIRES

According to Hansen and Condon [1989], the US Army could not communicate within its own ranks or with other agencies because the telephone lines had been damaged during the 1906 earthquake and fires. The Army then implemented the improvised solution of sending messages by horseback. This is an example of the functional redundancy principle. Also, in the same disaster, the earthquake left the city without any water whatsoever for extinguishing fires. Since that time, according to the San Francisco Fire Department [2011], the city has built a triple redundant water system. This is an example of the physical redundancy principle.

4. EYJAFJALLAJÖKULL ICELANDIC VOLCANO

Following the eruption of the Eyjafjallajökull Icelandic volcano in 2010, there was an extensive interruption of airline traffic due to volcanic ash in the air. The BBC [2010] reported a project between the airline EasyJet and the aircraft company Airbus to develop a device called AVOID (Airborne Volcanic Object Image Detector) to detect volcanic ash and allow pilots to avoid an encounter with an ash cloud. This is an example of the drift correction principle.

5. UNITED 93

According to the 9/11 Commission [2004], as part of the same 2001 attacks, a group of terrorists hijacked United Airlines Flight 93. Federal Aviation Administration (FAA) protocol prevented the FAA from contacting the pilot directly. The report strongly criticized this deficiency. This failure to communicate is an example of the cross-scale impediment support principle.

6. TRIANGLE FIRE

In 1911, a fire broke out in a garment factory in New York, killing many workers, mostly women. The New York Factory Investigating Commission [1912] noted that the building had been categorized as "fireproof." However, many aspects of its construction made the building vulnerable to fire. The commission addressed these deficiencies in its report and recommended improvements. This is an example of the limit degradation support principle of the absorption principle.

7. NIMROD

In 2006, a Royal Air Force Nimrod aircraft caught fire and crashed over Afghanistan, killing the entire crew. The Haddon-Cave report [2009] found major deficiencies in the oversight of the design and manufacture of the aircraft. The major recommendation was an overhaul of the airworthiness process. This is an example of the independent review support principle of the drift correction principle.

8. LONDON BOMBINGS

In the London bombings of 2005, the 7 July Review Committee [2006] noted the many difficulties the various agencies of the city had in cooperating with each other. The net result was a recommendation for a complete restructuring of the crisis management system. This is an example of the restructuring principle. Also, as part of the restructuring, the committee noted the difficulties in communication and recommended a new communications system. This is another example of the cross-scale interaction principle.

9. INANGAHUA

Similarly, following the 1968 Inangahua earthquake in New Zealand, the Ministry of Civil Defense found many flaws in the emergency response system and recommended changes. This is an example of the restructuring principle.

10. U.S.-CANADA BLACKOUT

Following the blackout of 2003, the U.S.-Canada Power System Outage Task Force [2004] recommended that higher standards be applied to the human operators at the nodes of the power system. These standards included recommendations for the operators to execute intentional delays when performing their jobs; these delays constitute the loose coupling principle. In addition, the Task Force recommended redundancy of "all critical functions," an example of the physical redundancy principle. The combination of the loose coupling principle and the physical redundancy principle becomes the defense in depth principle.

G. GAP ANALYSIS ACROSS DOMAINS

An important question is whether the appropriate principles are actually implemented in the domains when they are needed. When an authoritative source, such as those listed above, poses a solution that is lacking in the existing system, that lack is called a gap. This section presents a gap analysis for the posed solutions and implied principles in the domains examined. As stated earlier, the principles listed in Table 1 are domain independent; however, the importance of these principles and their implementation do vary from domain to domain. The domains examined are fire protection, rail, aviation, and power distribution. Figure 5 presents a gap analysis for these domains. The analysis is not statistically significant, since only 10 case studies are represented; the results are merely representative. The histogram of Fig. 5 shows that the two principles most often lacking in the selected case studies are the absorption and cross-scale interaction principles. Other principles frequently lacking include redundancy (both physical and functional), defense in depth, neutral state, loose coupling, and drift correction. The lesson to be learned from these results is that, before other principles are invoked, the system needs to be able to withstand a threat (absorption) and to remain together as a system (cross-scale interaction); in general, the systems that failed did not meet these two criteria.

Fig. 5  Histogram of gaps in the implementation of principles across selected domains. (From Jackson [2015].)
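The gap counts behind a histogram such as Fig. 5 can be produced directly from the posed-solution records. The sketch below is illustrative only; the list of gaps is a hypothetical subset, and the counts do not reproduce the actual data behind Fig. 5.

```python
from collections import Counter

# Hypothetical subset of (case study, principle found lacking) pairs,
# in the spirit of the gap analysis; not the actual data behind Fig. 5.
GAPS = [
    ("World Trade Center", "cross-scale interaction"),
    ("Metrolink", "drift correction"),
    ("San Francisco", "absorption"),
    ("San Francisco", "physical redundancy"),
    ("US-Canada Blackout", "loose coupling"),
    ("US-Canada Blackout", "defense in depth"),
    ("Triangle Fire", "absorption"),
]

def gap_histogram(gaps):
    """Count how often each principle appears as a gap across the case studies."""
    return Counter(principle for _, principle in gaps)

if __name__ == "__main__":
    for principle, count in gap_histogram(GAPS).most_common():
        print(f"{principle:25s} {'#' * count}")
```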

H. IMPORTANCE OF PRINCIPLES ACROSS DOMAINS

This section answers the question of how the gaps vary from domain to domain. Of course, gaps and importance are not the same thing: if there are gaps, then those principles are important, but other principles should also be present in each domain. One can call the remainder nongap principles because the owners of these systems generally implement them already. The domain most compliant with the principles is aviation, since that domain is so heavily regulated. Table 4 presents an evaluation of the importance of the individual principles in each of the four domains of interest. For this table the following code applies:

• P: a primary principle for this domain, important to the functionality of the systems of interest.

• S: a secondary principle for this domain, of importance on an occasional basis.

• N: a principle that is unimportant or of negligible importance for this domain.

The judgments in this table were based on the following criteria:

• The principles were found to be lacking when domain experts judged them to be necessary (see the gap analysis above).

• The principles were found to be consistent with the types of systems and the functionality required in each domain.

• The principles were seen to be necessary in the context of the case studies.

TABLE 4  DOMAIN IMPORTANCE MATRIX

Principles | Rail | Fire Protection | Aviation | Power Distribution
Absorption | P | P | P | P
Physical redundancy | P | P | P | P
Functional redundancy | P | P | P | P
Defense in depth | P | S | P | P
Reduce complexity | S | P | P | P
Reduce hidden interactions | S | P | P | P
Human in the loop | P | P | P | P
Restructuring | N | P | S | N
Reparability | P | P | P | P
Modularity | S | P | S | P
Neutral state | N | P | S | N
Loose coupling | N | N | S | P
Drift correction | P | P | P | S
Cross-scale interaction | P | P | P | P

Source: Jackson [2015].
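Table 4 can also be encoded as a simple lookup, for example to list the primary principles of a given domain. A minimal sketch, with the matrix transcribed from Table 4; the function name primary_principles is an illustrative choice, not from the source.

```python
# Table 4 transcribed: principle -> codes for (rail, fire protection, aviation, power distribution).
DOMAINS = ("rail", "fire protection", "aviation", "power distribution")
MATRIX = {
    "absorption":                 ("P", "P", "P", "P"),
    "physical redundancy":        ("P", "P", "P", "P"),
    "functional redundancy":      ("P", "P", "P", "P"),
    "defense in depth":           ("P", "S", "P", "P"),
    "reduce complexity":          ("S", "P", "P", "P"),
    "reduce hidden interactions": ("S", "P", "P", "P"),
    "human in the loop":          ("P", "P", "P", "P"),
    "restructuring":              ("N", "P", "S", "N"),
    "reparability":               ("P", "P", "P", "P"),
    "modularity":                 ("S", "P", "S", "P"),
    "neutral state":              ("N", "P", "S", "N"),
    "loose coupling":             ("N", "N", "S", "P"),
    "drift correction":           ("P", "P", "P", "S"),
    "cross-scale interaction":    ("P", "P", "P", "P"),
}

def primary_principles(domain: str):
    """Return the principles rated P (primary) for the given domain."""
    col = DOMAINS.index(domain)
    return [p for p, codes in MATRIX.items() if codes[col] == "P"]

if __name__ == "__main__":
    print(primary_principles("power distribution"))
```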

To illustrate the importance of these principles, the following examples should provide some insight:

• The human in the loop principle is primary in all domains. Examples include the role of the pilot in the aviation domain and the operators at the nodes of power grids in the power distribution domain.

• The modularity principle is primary in the power distribution domain because of the potential for modular power sources.

• The restructuring principle is primary in the fire protection domain because of the need to adapt the organization to the crisis at hand.

• The Metrolink case study illustrated the need for the drift correction principle in the rail domain.

• The cross-scale interaction principle is important in all domains, but it is particularly important in the fire protection domain because of the need for reliable communications among first responder units.

• The loose coupling principle is important in the power distribution domain because of the importance of preventing cascading failures.

I. STATE-TRANSITION ANALYSIS

The next step is to determine the states through which the system may pass on its journey from a normal operating state (State A or State B) to a final agreed diminished state (State F), to decommissioning (State G), or back to the operational state (State A). Jackson et al. [2015] determined that there are seven such states, shown in Fig. 6. In addition, Fig. 6 shows 28 ways in which the system can transition from state to state. Two of these states (States A and B) are normal operating states, and two (States F and G) are terminal states. Transitioning from an operational state to a terminal state may require passing through intermediate states, may occur directly, or the system may return to its original state. The designer may wish to consider all the paths from a normal operational state to a terminal state. Of course, State A may also be a final state if the system returns to its original condition. This analysis may seem onerous, but it is necessary for determining the design of a resilient system. Furthermore, identifying the possible abstract principles, as the previous section showed, is only part of the process; identifying one or more concrete solutions is an additional, and necessary, step, described in the next section.

Figure 6 shows that there are 24 ways a system can transition from a normal operating state, while passing through intermediate states, to a final state. Each state represents a different level of functionality of the system. For example, the system can pass from State A to State D using Transition 8 and then from State D to State F using Transition 14. Each transition will require a different concrete solution derived from different principles, and the analyst will need to decide how to reconcile these solutions: they can either be the same solution or a combination of two solutions. In addition to these 24 ways, there are four ways (Transitions 25 through 28) in which the system can transition directly from a normal operating condition (State A or B) to a final condition (State G or F). Only one concrete solution is required to accomplish each of these four transitions; in general, these solutions address the graceful degradation characteristic. The next section explains how the analyst can identify solutions to accomplish each transition. This process involves identifying the abstract principles appropriate to each transition.

Fig. 6  Resilience states and transitions. (From Jackson et al. [2015].)
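The states and a subset of the transitions named in this chapter can be represented as a small directed graph, which makes it straightforward to enumerate candidate paths from a normal operating state to a terminal state. The sketch below includes only the transitions explicitly identified in the text (for example, Transition 8 from A to D and Transition 14 from D to F); it does not reproduce the full 28-transition graph of Fig. 6.

```python
# States: A, B normal operating; C nonfunctional; D partially functional;
# E damaged but functional; F agreed diminished; G decommissioned.
# Only transitions explicitly named in the text are listed; Fig. 6 defines all 28.
TRANSITIONS = {
    2:  ("B", "A"),   # drift corrected, return to normal operation
    4:  ("A", "B"),   # threat detected, heightened awareness
    8:  ("A", "D"),   # partial loss of functionality
    14: ("D", "F"),   # decline to the agreed diminished state
    19: ("A", "E"),   # damaged but functional
}
TERMINAL = {"F", "G"}

def paths(start, graph=TRANSITIONS, visited=None):
    """Enumerate loop-free paths (as lists of transition numbers) from `start` to a terminal state."""
    visited = visited or {start}
    for number, (src, dst) in graph.items():
        if src != start or dst in visited:
            continue
        if dst in TERMINAL:
            yield [number]
        else:
            for tail in paths(dst, graph, visited | {dst}):
                yield [number] + tail

if __name__ == "__main__":
    # With this subset, State A reaches State F via Transitions 8 and then 14.
    print(list(paths("A")))
```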


J. SELECTING ABSTRACT PRINCIPLES TO ACCOMPLISH EACH TRANSITION

Now that the designer knows from Fig. 6 which transitions will enable the system to pass from state to state, he or she needs to know which abstract principles, when converted to concrete solutions, are appropriate for these transitions and suitable for the system in question. Not all principles will accomplish all transitions. To know which principles will accomplish a transition, the designer needs to know exactly what the principles will accomplish when they are converted to concrete solutions. These accomplishments are embodied in the goals, or characteristics, of the principles. In addition, the selection of principles will depend on the characteristics of the system being addressed. In many cases, more than one abstract principle may accomplish a given transition, and the designer needs to evaluate all the abstract principles that may apply. In some cases the choice may be obvious, since it may depend on the specific type of system being designed. Hence, the initial screening of the principles depends on two factors: first, the objective implied in the definition of the principle and, second, the type of system being addressed.

The goals are reflected in the characteristics of the principles. There are five characteristics of importance: robustness, buffering, adaptability, tolerance, and interaction. Table 5 defines these characteristics, groups the principles according to them, and states the goal that each characteristic intends to achieve. In addition, it is necessary to understand the reasoning behind each principle; the table is helpful in determining which principles apply to specific transitions. The reader will notice that the cross-scale interaction principle supports all of the characteristics because the system must remain a whole system during the course of the encounter with the threat. In most cases the selection of a principle will be based on the need to improve or restore the functionality of a system. In some cases a principle will be selected in order to control the decline in functionality to an agreed diminished state (State F) or to a decommissioned state (State G). US Airways Flight 1549, described by Pariès [2011], is an example of a controlled decline in functionality.

TABLE 5  PRINCIPLES AND ASSOCIATED CHARACTERISTICS AND GOALS

Characteristics | Primary Principles | Goals
Robustness | Absorption, physical redundancy, functional redundancy, cross-scale interaction | Ability to withstand a threat within the design limit across a broad range of conditions and retain functionality
Adaptability | Restructuring, reparability, drift correction, cross-scale interaction | Ability to restructure in the face of a threat
Tolerance | Human in the loop, reduce complexity, loose coupling, neutral state, reduce hidden interactions, defense in depth, modularity, functional redundancy, cross-scale interaction | Ability to operate at a distance from the safety boundary, degrade gracefully, and retain some functionality
Interaction | Cross-scale interaction | Ability of the system to maintain relationships between elements and act as a whole before, during, and after an encounter with a threat

To determine whether a principle, when implemented as a concrete solution, supports a given transition, the answer to one or more of the following questions should be yes:

• Will the concrete solution enable the system to withstand a threat?

• Will the concrete solution assist the system in achieving a gradual degradation in functionality?

• Will the concrete solution ensure that the system remains partially functional? For the purpose of this analysis, partial functionality includes all levels of functionality, including full functionality.

• Will the concrete solution ensure that the system remains at a distance from its safety boundary, that is, its point of failure?

• Will the concrete solution return the system to full functionality following an encounter with a threat?

• Will the concrete solution allow the system to degrade to an agreed diminished state?

• Will the concrete solution allow the system to detect the approach of a threat? For the purpose of this analysis, a fully functional system may have a latent flaw.

• Will the concrete solution correct for a possible encounter with a threat? For the purpose of this analysis, correction can include either complete avoidance of the threat or other corrections that minimize the reduction in functionality caused by the threat.

• Will the concrete solution allow the system to degrade to a nonfunctional state?

• Will the concrete solution allow the system to degrade to a decommissioned state? For the purpose of this analysis, systems in both nonfunctional and decommissioned states remain complete systems.
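One way to make this screening concrete is to encode the Table 5 grouping so that, given the characteristic a transition calls for, the candidate principles can be listed automatically. This is a minimal sketch: the dictionary simply transcribes Table 5, and the function name candidates_for is an illustrative choice, not from the source.

```python
# Table 5 transcribed: characteristic -> principles that support it.
CHARACTERISTIC_PRINCIPLES = {
    "robustness": ["absorption", "physical redundancy", "functional redundancy",
                   "cross-scale interaction"],
    "adaptability": ["restructuring", "reparability", "drift correction",
                     "cross-scale interaction"],
    "tolerance": ["human in the loop", "reduce complexity", "loose coupling",
                  "neutral state", "reduce hidden interactions", "defense in depth",
                  "modularity", "functional redundancy", "cross-scale interaction"],
    "interaction": ["cross-scale interaction"],
}

def candidates_for(characteristic: str, excluded=()):
    """List candidate principles for the characteristic a transition calls for,
    dropping any principles the system cannot accommodate (e.g., no humans)."""
    return [p for p in CHARACTERISTIC_PRINCIPLES[characteristic] if p not in excluded]

if __name__ == "__main__":
    # A fully automated system cannot use human-focused principles.
    print(candidates_for("tolerance", excluded={"human in the loop", "neutral state"}))
```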

The following sections discuss the logic of each principle in terms of the questions just asked and the characteristics in Table 5.

1. ABSORPTION PRINCIPLE

This principle ensures that the system will withstand all threats within the design limit. It is therefore a basic principle associated with the robustness characteristic. The margin support principle ensures that the system will withstand threats to a specified degree beyond the design limit. The context spanning support principle ensures that the system will withstand both the highest level threat and the most frequently encountered threat. The limit degradation support principle ensures that the system will not be vulnerable due to aging or lack of maintenance. Table 6 lists the transitions most applicable to this principle.

2. PHYSICAL REDUNDANCY PRINCIPLE

The purpose of this principle is to ensure that one branch of a system with two or more identical branches survives and remains functional when another branch is damaged by a threat. With this principle the system maintains its original functionality even if one branch is damaged. This principle also supports the robustness characteristic, since it allows the system to retain its functionality. Table 6 describes the applicable transitions.

3. FUNCTIONAL REDUNDANCY PRINCIPLE

Similar to physical redundancy, this principle ensures that one branch of a two-branch system remains functional when the other branch is damaged. However, this principle does not ensure that the system will maintain its full functionality, because one branch is unlikely to be as capable as the other. In the end, the system retains partial functionality; for the purpose of this analysis, the term partial functionality includes all levels of functionality, including full functionality. This principle satisfies the criteria for supporting the tolerance characteristic, since it limits the decline in functionality of the system. One rationale for implementing this principle is that it is less vulnerable to latent faults than the physical redundancy principle is. This principle has been found to be most useful for creating improvised systems in crisis environments, as Sect. G explains for the San Francisco earthquake and fires case study. It is the primary principle for assuring that the system can transition from a normal operating state, State A, to a partially functional state, State D, through Transition 8. Table 6 describes the applicable transitions associated with this principle.

TABLE 6  RULES FOR MAPPING TRANSITIONS AND PRINCIPLES

Rule | Applicable Transitions
1) Any transition resulting in a partially functional state → modularity, physical redundancy, functional redundancy | 7, 8
2) Any transition to a lower (less functional or nonfunctional) state → all Tolerance principles | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
3) Any transition to the same state → all Robustness principles | 1, 3, 10, 12, 18
4) Any transition resulting in an increase or restoration of functionality → all Adaptability principles | 2, 5, 20
5) Any transition resulting in heightened awareness → drift correction, human in the loop | 4
6) The human in the loop principle can execute the following other principles: drift correction, neutral state, loose coupling, functional redundancy | All relevant transitions for the named principles
7) The cross-scale interaction and defense in depth principles apply to all transitions | All transitions

4. DEFENSE IN DEPTH PRINCIPLE

The concept of defense in depth is that two or more principles will be required to address a given system vulnerability. This principle is independent of the transitions involved; therefore, it applies to all transitions, as shown in Table 6. For the purpose of this analysis, it is assumed that any two principles will suffice for implementing the defense in depth principle; when the designer considers the actual physical characteristics of a system, the number of applicable principles will be more limited. Conceptually, the defense in depth principle could support any of the characteristics in Table 5, since it requires two other principles to achieve its goals.


However, it is shown in Table 5 as supporting the tolerance characteristic, because the second principle that completes it adds an extra degree of resilience beyond the first.

5. HUMAN IN THE LOOP PRINCIPLE

The purpose of this principle is to ensure that humans are incorporated as elements of the system when necessary. The premise is that humans are more capable of assessing the situation and taking appropriate action than automated systems. In addition, humans qualify as a redundant branch of the system as part of the functional redundancy principle. Because of its flexibility, this principle can assist in a wide variety of transitions. For example, it can perform the drift correction function when assisting the system in the transfer from normal operations in State A to the heightened awareness State B through Transition 4. It can also transfer the system from normal operations to the partially functional State D through Transition 8, and from State A it can assist the system to the agreed diminished State F or the decommissioned State G. From State C, the nonfunctional state, it can assist in transitioning the system to the partially functional State D through Transition 18, and it can enable the system to degrade to the decommissioned State G or to the agreed diminished State F. From the partially functional State D, it can help the system stay in State D through Transition 18, transition to the agreed diminished State F through Transition 17, or transition to the decommissioned State G through Transition 15. From the damaged but functional State E, it can assist the system to partial functionality through Transition 22 or to decommissioning through Transition 24. Finally, from the heightened awareness State B, it can assist the system either to the agreed diminished State F or to the decommissioned State G. Table 6 describes the transitions for this principle and shows that it can restore the system to full functionality. This principle supports the tolerance characteristic, as shown in Table 5, because it has the role of limiting the decline in the functionality of the system; in practice, it performs the same role as the functional redundancy principle, which supports the same characteristic.

6. REDUCE COMPLEXITY PRINCIPLE

The premise of this principle is that systems with many components, many interfaces, and nonlinear interactions between the components are more likely to fail than simpler systems. Systems with humans and software generally fall into this category. Hence, reducing this complexity generally reduces the likelihood of failure. This principle is protective in nature; that is, its main purpose is to protect a system against disruption when it is approaching its safety boundary. In this way, it is complementary to the absorption principle. Table 6 describes the transitions that apply to it.


This principle supports the tolerance characteristic, since it has the function of increasing the distance of the system from its safety boundary. The principle is included because most experts conclude that complexity reduces the effectiveness and functionality of the systems and domains discussed in this chapter. For example, regarding the 2003 U.S.-Canada blackout, the U.S.-Canada Power System Outage Task Force [2004] states that "It is not practical to expect operators will always be able to analyze a massive, complex system failure and to take the appropriate corrective actions in a matter of a few minutes." The implication is that if the system had been less complex, the operators would have been able to control the failure. Other researchers, on the other hand, argue that in some instances complexity can be beneficial. One such researcher is Taleb [2014]; however, Taleb's examples of beneficial complexity are almost completely restricted to natural systems and natural processes. One such natural process is evolution, of which he says that "evolution proceeds by undirected, convex bricolage or tinkering, inherently robust. . . ." Thus, we have to leave the door open to the possibility that, if human-made systems can be created in the same way, complexity can be beneficial. The developer should be aware of this possibility.

7. LOOSE COUPLING PRINCIPLE

This principle is another example of seeking ways to keep the system as far as possible from its safety boundary following an encounter with a threat. That is, its purpose is to allow the system to degrade as gradually as possible until it reaches a lower level of functionality or loses functionality altogether. In the literature, for example the U.S.-Canada Power System Outage Task Force [2004], the actions that execute this principle are normally performed by human operators at the nodes of a system and consist of intentional delays in performing the control functions. These actions could conceivably be performed by automated systems, but that is not normally the case. In the instance of power grids, the actions consist of load sharing and off-loading. They are normally required after a major disruption in power has occurred; the goal is therefore to prevent further disruption in the form of cascading failures. This principle can also be used to prevent partially functional and damaged but functional states from declining even further. Like the reduce complexity principle, this principle supports the tolerance characteristic, since it has the function of increasing the distance of the system from its safety boundary. Table 6 summarizes the transitions associated with this principle.

8. NEUTRAL STATE PRINCIPLE

This principle, similar in concept to the loose coupling principle, has the goal of preventing a further decline in functionality following an encounter with a threat. It performs this function by allowing the human operator to delay actions until the operator has assessed the situation. For this reason, the states and transitions are exactly the same as for the loose coupling principle, as shown in Table 6. This principle also supports the tolerance characteristic, since it has the function of increasing the distance of the system from its safety boundary.

9. REDUCE HIDDEN INTERACTIONS PRINCIPLE

This principle is also similar in concept to the reduce complexity principle. It has to do with the harmful interactions that may occur among the elements of a system. These interactions occur most often when the system is complex; hence, reducing complexity will, at the same time, reduce the likelihood of such interactions. However, hidden interactions may be caused by factors beyond complexity. Leveson [1995], for example, states that in some cases interactions may be caused by a lack of rigorous project management. The states and transitions associated with this principle are the same as those for the reduce complexity principle, as shown in Table 6. This is another principle that supports the tolerance characteristic, since it has the function of increasing the distance of the system from its safety boundary.

10. RESTRUCTURING PRINCIPLE

This principle allows the system to be restructured to maintain or restore functionality, or it can be used to arrest the decline in functionality. The restructuring can occur before, during, or following the threat event. The restructuring of the crisis management system in both the Inangahua and London bombings cases is an example of restructuring before a crisis. The restructuring of the power system in New York following the World Trade Center attacks is an example of restructuring after the threat event. In neither case does the restructuring return the system to a normal operating state, but only to a partially functional state, State D, or to a nonfunctional state, State C. It may not seem logical to transition to a nonfunctional state, but the rationale is that this principle can accomplish that transition in a more controlled way. It could also return the system to a damaged but functional state, State E. In short, this principle could be involved in achieving all states except the normal operating state, State A, or the heightened awareness state, State B. In addition, this principle could be involved in returning the system to any state that may have existed before the threat event. Returning the system to a normal operating state, State A, from the heightened awareness state, State B, shows that this principle can be used as part of the corrective action required in the drift correction principle. This principle supports the adaptability characteristic because it presents a variety of ways the system could be reconfigured to achieve resilience. Table 6 presents a summary of the transitions associated with this principle.

11. REPAIRABILITY PRINCIPLE

This principle allows the system to recover all or part of its functionality by means of a repair capability. The repair can be performed either autonomously or with the aid of a human. This principle supports the adaptability characteristic because it allows the system to adapt to changing conditions by means of repair. Table 6 shows the transitions to which this principle applies.

12. DRIFT CORRECTION PRINCIPLE

This is the basic principle that allows the system to detect a threat when it is approaching its safety boundary and to correct for it. Detection can be performed either by an automated device or by a human. Correction can allow the system either to avoid the threat completely or to encounter it. Transitions 2 and 4 are the basic transitions that allow this principle to operate: Transition 4 detects the threat while the system is in the normal operating state, State A, whereas Transition 2 returns the system from the heightened awareness state, State B, back to State A. However, the corrective action may take the system in several other directions. For example, Transition 11 may take it to a nonfunctional disrupted state, State C, or to a partially functional disrupted state, State D; Transition 27 will take the system to an agreed diminished state, State F; and Transition 28 will take it to a decommissioned state, State G. This principle supports the adaptability characteristic because it allows the system to correct for impending threats. Table 6 shows the transitions associated with this principle.

13. MODULARITY PRINCIPLE

This principle is one of the most widely used and understood principles. The basic premise is that if a system consists of many nodes, such as a power distribution system, and one node is damaged, the remaining nodes will continue to function. If several nodes are damaged and cannot function, the assumption is that at least one node will remain undamaged and can continue to function. It is not assumed that the undamaged nodes will continue to provide the complete functionality that the entire system provided, but they will provide part of that functionality. The state of interest following the encounter with the threat is the partially functional disrupted state, State D. To arrive at this state, the system can pass through Transitions 7, 8, 10, 11, and 18. Among these, Transitions 10 and 18 involve starting at State D and returning to it; regardless of which principle brought the system back to State D, the system ends in State D, which reflects the modularity principle.

Transition 7 begins with a nonfunctional disrupted state, State C, and proceeds to the partially functional disrupted state, State D. In this transition, it is assumed that State C represents the entire system in a nonfunctional state, that is, none of the nodes is functional. The system is assisted through Transition 7 by means of some other principle, for example the repairability principle, but arrives at State D, where the modularity principle dominates. Transitions 8 and 11 are more straightforward: they assume that the threat damages one or more nodes of the system and leaves the rest fully functional in State D. This principle supports the tolerance characteristic because it allows the system to decline to a partially functional state without losing all functionality. Table 6 summarizes the transitions associated with this principle.

14. CROSS-SCALE INTERACTION PRINCIPLE

The rationale behind this principle is that it is the basic principle ensuring that the system remains a system and complies with its basic requirement of interaction in support of all of the characteristics described above. This means that all of the elements of the system have a relationship with all the other elements. It is assumed that interaction remains even when the system is damaged or is headed toward an agreed diminished state (State F) or decommissioning (State G), and that it is maintained when the system is partially functional (State D) or damaged but functional (State E). Therefore, this principle applies to all states and transitions listed in Table 6. This principle supports the interaction characteristic because it is the one principle that supports the relationships among all the components of the system and permits the system to remain a system. For the purpose of this analysis, it is assumed that the system remains an intact system even when approaching the decommissioning state (State G); in reality, the system may have been damaged beyond the point of possessing the properties of a system, a situation outside the scope of this analysis. At this point, the designer is only one step from determining candidate concrete solutions. This step is described in the following section.

a. Rule-based Mapping

To ensure that the mapping of transitions and principles is complete, the project established a set of seven rules. These rules reflect the character of each principle as implemented in a specific transition. Table 6 describes these rules and the implied transitions; the transition-to-principle mapping is shown in the analysis of each principle above. The logic of Table 6 is based on the description of each pair of states and the transition needed to allow a system to move from one to the other. For example, State A is a normal operational state and State D is a partially functional state. The transition from State A to State D (Transition 8) would be a partial loss in functionality. The appropriate characteristic is tolerance, which has the goal of degrading gracefully. This transition can be achieved with the modularity principle, provided the system itself can accommodate that principle and its physical manifestations. Other principles that can achieve this goal are physical redundancy and functional redundancy. Table 7 maps the characteristics to the principles and transitions.

TABLE 7  MAPPING OF CHARACTERISTICS, PRINCIPLES, AND TRANSITIONS

Characteristics | Principles | Transitions
Robustness | Absorption | 1, 3, 10, 12, 18
Robustness | Physical redundancy | 1, 3, 7, 8, 10, 12, 18
Robustness | Functional redundancy | 1, 3, 7, 8, 10, 12, 18
Adaptability | Drift correction | 2, 4, 5, 20
Adaptability | Restructuring | 2, 5, 20
Adaptability | Reparability | 2, 5, 20
Tolerance | Human in the loop | 2, 4, 5, 6, 8, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Modularity | 6, 7, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Neutral state | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Loose coupling | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Reduce complexity | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Reduce hidden interactions | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Tolerance | Functional redundancy | 6, 8, 11, 13, 14, 15, 16, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28
Interaction | Cross-scale interaction | All
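Table 7 lends itself to a simple reverse lookup: given a transition number, list the candidate principles and the characteristic each supports. The sketch below transcribes only a few rows of Table 7, and the function name candidate_principles is an illustrative choice, not from the source.

```python
# A few rows of Table 7: (characteristic, principle) -> applicable transitions.
# "all" marks principles, such as cross-scale interaction, that apply to every transition.
TABLE_7 = {
    ("robustness", "absorption"): {1, 3, 10, 12, 18},
    ("robustness", "physical redundancy"): {1, 3, 7, 8, 10, 12, 18},
    ("adaptability", "drift correction"): {2, 4, 5, 20},
    ("tolerance", "modularity"): {6, 7, 8, 11, 13, 14, 15, 16, 17, 19,
                                  21, 22, 23, 24, 25, 26, 27, 28},
    ("interaction", "cross-scale interaction"): "all",
}

def candidate_principles(transition: int):
    """Reverse lookup: which (characteristic, principle) pairs cover this transition?"""
    return [key for key, transitions in TABLE_7.items()
            if transitions == "all" or transition in transitions]

if __name__ == "__main__":
    # Transition 8 (State A to State D) is covered here by physical redundancy,
    # modularity, and the always-applicable cross-scale interaction.
    print(candidate_principles(8))
```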

III. CREATING A CONCRETE SOLUTION

The final step is to create a concrete solution that corresponds to the identified abstract principles for each transition. That is to say, in the preceding steps the designer identified several abstract principles for each transition in Fig. 6. Each abstract principle will then inspire a concrete solution based on the dominant characteristics described in Table 1, while the abstract principles themselves are based on the goals described in Table 5. The designer may have already narrowed the number of principles under consideration by restricting them to those compatible with the physical constraints applicable to the particular system type. That is to say, in many circumstances the goal may be to create a new system from a previous system or to create a completely new system; either way, some abstract principles may not be compatible with the design assumptions or constraints of the existing system. The simplest example is a completely automated system with no humans, for which the designer can dispense with any principles that are human focused. The human in the loop principle is the most obvious; other candidates for dismissal are the neutral state and loose coupling principles.

The designer must also select the possible paths through which the system may pass on its way from a nominal state to a final state. Figure 7 illustrates several such possible paths, showing a typical path from a nominal state to the concrete solutions described in this chapter.

Fig. 7  Resilience paths to concrete solutions.

This chart has several features (a code sketch of the chain follows the list):

• An example path is from State A (nominal operation) to State D (partially functional disrupted state).

• Another path shown is from State D (partially functional disrupted state) to State F (agreed diminished state). This path is a continuation of the first path, so the combination of the two constitutes a complete path from nominal operation to the diminished state.

• Each path requires a different transition. From Fig. 6, these are Transitions 8 and 14. Each transition is defined by the change in states associated with it.

• Each transition may result in more than one abstract principle. The principles identified here are called (arbitrarily) Principles 1, 2, 3, and 4. All of the principles identified here must satisfy the characteristics described in Table 5.

• Each principle may be converted into one or more concrete solutions using the rule of dominant characteristics. The solutions identified here are arbitrarily called Solution 1 through Solution 8. The designer can then evaluate each of these solutions using simulations and other standard engineering techniques. Of course, the process may be greatly simplified if some principles are found not to apply to the current system or scenario; in that case, those principles can be eliminated from consideration. For example, if the system is not capable of executing delays in control actions, then both the neutral state and loose coupling principles will be eliminated. In the ideal case, only one concrete solution will remain as a logical solution.

• The process described here may result in two or more separate concrete solutions, but ideally only one. This follows from the fact that the path from A to D will result in one concrete solution, perhaps Solution 3, whereas the path from D to F will result in another, perhaps Solution 5. The analyst will need to decide whether these two solutions are really the same solution or whether both need to be implemented; if the latter, the analyst will need to decide whether the two solutions are compatible. In the best case, perhaps a concrete solution derived from the functional redundancy principle will suffice for both paths.
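The chain that Fig. 7 describes, from path to transitions to candidate principles to candidate concrete solutions, can be sketched as a small pipeline. The code below is illustrative only: the transition-to-principle and principle-to-solution mappings are hypothetical placeholders, not the actual content of Fig. 7.

```python
# Hypothetical placeholder mappings in the spirit of Fig. 7; in practice the
# transition-to-principle mapping comes from Table 7 and the principle-to-solution
# step applies the rule of dominant characteristics.
PRINCIPLES_FOR_TRANSITION = {
    8:  ["physical redundancy", "modularity", "functional redundancy"],
    14: ["functional redundancy", "loose coupling"],
}
SOLUTIONS_FOR_PRINCIPLE = {
    "physical redundancy": ["duplicate load path"],
    "modularity": ["independent power nodes"],
    "functional redundancy": ["backup system of a different design"],
    "loose coupling": ["intentional delays at control nodes"],
}

def candidate_solutions(path_transitions):
    """For each transition on the path, collect the concrete solutions its principles inspire."""
    per_transition = {}
    for t in path_transitions:
        solutions = []
        for principle in PRINCIPLES_FOR_TRANSITION.get(t, []):
            solutions.extend(SOLUTIONS_FOR_PRINCIPLE.get(principle, []))
        per_transition[t] = solutions
    return per_transition

if __name__ == "__main__":
    # The example path A -> D -> F uses Transitions 8 and 14; the analyst must then
    # decide whether one solution can serve both transitions.
    print(candidate_solutions([8, 14]))
```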

For example, as the figure shows, to transition from State A, the nominal operational state, to State D, the partially functional disrupted state, through Transition 8, the designer may choose either the physical redundancy principle or the modularity principle. The physical redundancy principle will leave the system partially functional if only one branch of the system is disabled; the modularity principle will leave the system partially functional if at least one node of the system is left in a functional state. Hence, multiple principles are possible, and selecting the appropriate principles for each transition requires an intimate understanding of the principles. At this point, the designer may wish to consider the physical limitations imposed by the current system being examined; this analysis may reduce the number of candidate principles the designer will have to analyze. To select the appropriate concrete solutions, the designer will, of course, employ the same dominant characteristics as the candidate abstract principles. In the end, the designer will conduct quantitative analysis or simulations to select the single concrete solution appropriate to the system in question. Criteria for selection may include the following (a screening sketch follows the list):

The concrete solution must be compliant with the dominant characteristics of the abstract principle.

.

The solution will achieve the goal of the characteristic of the principle.

.

The recovery of the system will be acceptable.

.

The solution will be cost-effective based on development and operation.

.

The solution will be applicable to the system type; this criterion may have been considered in the selection of the candidate principles.

.

The solution will comply with constraints imposed by the domain of interest.

.

The solution will be compatible with concrete solutions determined from other transitions in the same path from the initial to the final state.
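These criteria can be treated as a simple screening checklist applied to each candidate concrete solution before the more expensive simulation studies. This is a minimal sketch; the criterion labels, the CandidateSolution class, and the check function are illustrative choices, and a real evaluation would replace the boolean flags with analysis results.

```python
from dataclasses import dataclass, field

CRITERIA = [
    "matches dominant characteristics of the principle",
    "achieves the goal of the characteristic",
    "acceptable recovery",
    "cost-effective in development and operation",
    "applicable to the system type",
    "complies with domain constraints",
    "compatible with solutions for other transitions on the path",
]

@dataclass
class CandidateSolution:
    name: str
    # Boolean stand-ins for the criteria; in practice each flag would be the
    # outcome of an engineering analysis or simulation.
    checks: dict = field(default_factory=dict)

def unmet_criteria(candidate: CandidateSolution):
    """Return the criteria the candidate fails, treating missing entries as unmet."""
    return [c for c in CRITERIA if not candidate.checks.get(c, False)]

if __name__ == "__main__":
    checks = {c: True for c in CRITERIA}
    checks["cost-effective in development and operation"] = False
    print(unmet_criteria(CandidateSolution("duplicate load path", checks)))
```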

A. THOUGHT EXPERIMENT FOR ANALYZING ABSTRACT PRINCIPLES

With a complete description of the principles provided and the concept of transitions explained, the following thought experiment can provide some insight into how the analyst can convert abstract principles into concrete solutions and then evaluate the extent to which each principle can enhance the resilience of an engineered system. The steps are as follows (a code sketch of the experiment follows the list):

Step 1. Select an abstract principle from Table 1; the abstract principle of physical redundancy will suffice for this thought experiment. Remember that this principle requires the system to possess at least two identical branches, each one capable of sustaining the nominal load if the other is damaged.

.

Step 2. Convert this abstract principle into a typical concrete solution according to the process described in Sect. III. An example is the redundant structures recommended by NIST [2005]. This process can be greatly simplified if the designer can restrict the concrete solutions to the ones that are pertinent to the system and the scenario at hand. For example, if the system has no human components, it is not necessary to consider the human in the loop principle. Ideally, only one concrete solution will be suitable.

.

Step 3. Imagine a threat, such as the aircraft that attacked the World Trade Center in 2001.

100

S. JACKSON

.

Step 4. Imagine the threat damaging only one of the two branches of the structure. Historically, all branches of the structure were destroyed, but in this hypothetical example only one branch was damaged leaving the other with its normal capability.

.

Step 5. Consult Fig. 6 that shows the relevant states and transitions. The total structure will begin in State A when it encounters the threat. The impact of the threat will take the system along Transition 19 to State E where it leaves the system damaged but functional, the second branch having been damaged, but the primary branch maintaining normal functionality.

• Step 6. Examine each principle for its relevance to the problem at hand. For example, if the system under consideration contains human elements, the human in the loop principle will be important; if the system has no humans, this principle can be ignored.

This exercise can be performed with any principle for a system that encounters a threat of any type, leaving the system in a final state after it has traversed an appropriate transition.
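Steps 1 through 5 of the thought experiment can be expressed in a few lines of code: a two-branch structure absorbs a threat that disables one branch and, because the surviving branch can carry the nominal load, the system follows Transition 19 from State A to State E. The class and method names below are illustrative choices, not terminology from the source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TwoBranchStructure:
    """Physically redundant structure: two identical branches, each able to carry the nominal load."""
    branch_ok: list
    state: str = "A"                   # State A: normal operation
    transition: Optional[int] = None   # last transition taken, per Fig. 6

    def strike(self, damaged_branch: int) -> str:
        """Apply a threat that disables one branch (Step 4) and update the state (Step 5)."""
        self.branch_ok[damaged_branch] = False
        if any(self.branch_ok):
            # One branch survives with full capability: Transition 19 to State E
            # (damaged but functional).
            self.state, self.transition = "E", 19
        else:
            # Both branches lost: nonfunctional disrupted state (State C);
            # the transition number for this case is not named in the text.
            self.state, self.transition = "C", None
        return self.state

if __name__ == "__main__":
    structure = TwoBranchStructure(branch_ok=[True, True])
    print(structure.strike(damaged_branch=1))  # -> E
```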

B. RESILIENCE IN THE AVIATION DOMAIN

Resilience is particularly important in the aviation domain, especially commercial aviation. Regulatory agencies have already mandated many resilience principles in that domain, and manufacturers have implemented them. Before the advent of systems engineering, the term resilience was used widely with reference to humans, namely the operators. However, this section shows that the resilience of the aircraft itself, that is, its hardware and software, is equally important. This section discusses the basic principles of resilience and shows how they can be applied to the design of an aircraft system consisting of hardware, software, and humans. It also describes incidents in which the principles were either applied, as in the case of US Airways Flight 1549 and the Miracle on the Hudson, or not applied. Jackson [2015] provides further elaboration on these principles.

1. DEFENSE IN DEPTH PRINCIPLE

The basic idea of defense in depth is that the system should have two or more ways to address a single vulnerability. As explained in the Resilience in Everyday Life section, defense in depth calls for each principle to have a backup principle, so that a second principle will correct for any "leaks" caused by inadequacies in the primary principle. This section illustrates the principle using a well-known case study. Defense in depth was particularly evident in the Miracle on the Hudson incident, in which the pilot of US Airways Flight 1549 was able to ditch the aircraft in the water after an encounter with a flock of geese rendered the propulsion system powerless, as described by Pariès [2011]. Five principles are important in this case study: drift correction, absorption, functional redundancy, human in the loop, and, of course, defense in depth. It must be remembered, though, that principles are not tantamount to concrete solutions; all principles must be rendered in the form of concrete solutions as described in the section Creating a Concrete Solution.

The first important principle in this case study is drift correction. The first step, using the drift correction principle to rid the runway of geese, failed. As described in Table 1, drift correction is the principle that allows the system to anticipate a disruption and correct for it. The second important principle is absorption; the use of this principle to ensure that the engines could continue running also failed. As described in Table 1, absorption is the principle that allows the system to withstand a disruption. However, in the resilience context, absorption alone will not make a system resilient. In the US Airways case, absorption is the capability of the engines to absorb the bird strike, which failed. So if both drift correction and absorption fail, then we must call on the defense in depth principle to ask what the next step must be if the engines cannot absorb the threat. Jackson and Ferris [2013] describe this principle and many others mentioned in this section. In commercial aircraft, as in other domains, the absorption requirement is rarely large enough to account for all threats. In this case, the bird strike requirement accounts for a large number of birds; however, bird strikes larger than the requirement do occur, although not frequently. Similarly, lightning strike requirements do not account for the largest possible lightning strike. Hence, there is always a small probability that these requirements will be exceeded, as they were in this case.

Final recovery depended on two key principles. The first is functional redundancy, which ensured that the aircraft was capable of continued power and control. Functional redundancy, also called physical diversity by Leveson [1995] and also described in Table 1, differs from physical redundancy in that the system has two physically different and independent branches, each of which performs the same function but perhaps to a different level of performance. The second important principle is human in the loop, which allowed the pilot to ditch the aircraft safely. Human in the loop, described in Table 1, is the principle that calls for human intervention when necessary. It might be argued that the pilot in this case could have made other decisions that might have saved the entire aircraft. Be that as it may, there is no doubt that the decisions he made did result in saving the lives of the 155 persons on board, complying with the satisficing principle of systems theory as described by Adams [2011].

Fig. 8  Defense in depth in the Miracle on the Hudson case study.

Figure 8 illustrates these principles in this case study. This sequence of principles illustrates the concept of dependency discussed in the section called Dependency of Principles. That is to say, the functional redundancy and human in the loop principles were necessary because of the vulnerability of the drift correction and absorption principles. In this case, functional redundancy and human in the loop became the dependent principles as defined in the Dependency of Principles section.

Defense in depth is also important in the prevention of collisions between aircraft in flight. The first layer belongs to air traffic control (ATC), which is responsible for tracking the aircraft by radar and making sure that they are at their assigned altitudes, which should be different for each aircraft; this layer uses the drift correction design principle. The second layer involves the Traffic Collision Avoidance System (TCAS), described below, which warns the pilot of an impending collision with another aircraft; this layer also employs the drift correction design principle. The pilot, however, executes the third layer of defense in depth using the human in the loop principle to perform a corrective maneuver, so the human in the loop design principle is required as well. Table 1 describes all these principles. Hence, both the pilot and the technological systems are required to enhance resilience in this case study.

2. LIMIT DEGRADATION SUPPORT PRINCIPLE

A pervasive threat to the absorption capability of an aircraft is the degradation of that capability due to poor maintenance or aging. This threat requires the invocation of the limit degradation support principle described by Jackson and Ferris [2013]. Poor maintenance and aging are constant threats to aircraft, both commercial and military. One of the more notable examples was the crash of the Nimrod aircraft, which caught fire due to the failure of fuel seals, as reported by the Royal Air Force Board of Inquiry (RAF) [2007]; these seals had not been replaced during maintenance. The Nimrod, a military derivative of the commercial Comet aircraft, crashed in Afghanistan in 2006. The same incident resulted in an independent investigation by Haddon-Cave [2009], who recommended a completely overhauled airworthiness process. This recommendation focused on a rigorous independent review process, as reflected in the independent review support principle discussed in Table 1. The idea behind the independent review support principle is that persons from independent organizations should evaluate all possible risks and take action to prevent them from materializing.

3. MARGIN SUPPORT PRINCIPLE

It is common practice in design to add a margin to the absorption capability, especially in the commercial aircraft domain. This margin allows for uncertainties in loads and in the capability of the aircraft. For most structural components the margin is 50 percent. Woods [2006] lists margin as a characteristic of resilience.

4. PHYSICAL REDUNDANCY PRINCIPLE

Physical redundancy is a widely accepted design principle. Some aircraft employ the two-load-path design philosophy, a form of physical redundancy as described in Table 1: if part of the structure, for example the skin, fails due to metal fatigue or any other reason, the rest of the structure will be able to sustain the loads that are incurred. The physical redundancy principle has an inherent vulnerability, namely, that undetected or latent faults in one leg of the system may exist in the other leg as well, since both legs are identical. Software is especially prone to this type of vulnerability. That is why the functional redundancy principle should be used when possible, or as a backup to the physical redundancy principle.

5. FUNCTIONAL REDUNDANCY PRINCIPLE

Functional redundancy is a more robust principle than physical redundancy. The functional redundancy principle, described in Table 1 and called design diversity by Leveson [1995], avoids most of the vulnerabilities of physical redundancy. Its dominant characteristics are that the system must have at least two physically different and independent branches. An aircraft having both mechanical and fly-by-wire (FBW) control systems is an example of functional redundancy.

6. HUMAN IN THE LOOP PRINCIPLE

The human in the loop principle, discussed by Madni and Jackson [2009], is one of the most widely accepted principles in the commercial aviation domain. This principle asserts that humans should always be employed as system elements when there is a need for human cognition. The fact that human pilots continue to be the dominant mode of aircraft control attests to this assertion. Human dominance in ATC is further evidence of its validity.


The human in the loop principle is also widely used in the space domain. This principle was responsible for the safe landing of Apollo 11 on the moon when the software systems were overloaded, as described by National Geographic News [2010].

7. AUTOMATED FUNCTION SUPPORT PRINCIPLE

Billings [1991] elaborates on the human in the loop principle by saying that when there is a choice between a human and an automated system performing a given function, the task should always be performed by the human, provided the task is within the time limitations and capability of the human. The premise is that the human has a broader understanding of the environment than the automated system does and would be able to anticipate emergency situations that the automated system could not handle. Of course, having the human perform the task is not always desirable; it depends on whether the human can perform the task or not. This rule generally applies to tasks in a cockpit environment. The designer will need to balance this principle against the possible advantages of the flight envelope protection system discussed in Jackson [2015]. As discussed in that book, the purpose of that system is to prevent the pilot from allowing the aircraft to exceed its flight envelope.

8. REDUCE HUMAN ERROR SUPPORT PRINCIPLE

Statistics show that human error is the primary cause of aircraft accidents, and it is axiomatic that human error cannot be eliminated. However, there are numerous accepted ways to reduce and minimize human error; Reason [1990] presents a comprehensive list of these methods. Hence, even though the human in the loop principle calls for humans to be the primary backup, even this principle is vulnerable to failure. The reduce human error support principle minimizes that possibility.

9. REDUCE COMPLEXITY PRINCIPLE

Complex systems are subject to highly erratic operation and sometimes failure. The reduce complexity principle states that the complexity of the system should be reduced as much as possible. Like defense in depth and reduce hidden interactions, the purpose of this principle is to move the system away from its safety boundary as much as possible. The phenomenon of complexity is prevalent in both the aircraft system and the larger system, the large-scale supply chain system. One of the effects of complexity is the phenomenon of hidden interactions, discussed in the section on the reduce hidden interactions principle below.


Many experts point to three primary contributors to complexity; Marczyk [2012], for example, identifies the three discussed below.

The first contributor is the number of elements of the system, in our case the number of parts of the aircraft. The number of elements by itself does not create complexity; however, it is an important multiplier. It is possible for a system to have many components and not be complex. Such a system is called complicated rather than complex. The example often given, for example by Giachetti [2010], is a clock whose parts fit together perfectly. It is the lack of a perfect fit that makes the system complex; this lack of fit is called variability, as discussed below.

The second contributor is the number of interfaces. Some authors, for example Carson [2012], consider this to be the primary contributor. The number of interfaces is an aspect entirely within the purview of the aircraft designer, in particular the aircraft architect, who determines the number of subsystems and the relationships among them. The methodology the designer might use is called cluster analysis, as described by Hitchins [1993]. The premise of cluster analysis is that the parts of the aircraft with the greatest functional binding are candidates to form an identifiable subsystem, and such a grouping will have the fewest interfaces crossing the subsystem boundary. As an example, the designer may want to decide whether to make guidance and navigation two software modules or one; cluster analysis will help make that decision by minimizing the number of interfaces.

The third contributor is the variability in the relationships among the parts of the aircraft. Jackson [2015] describes this factor in more depth. In short, this factor has to do with the variability in the quality of the data or other parameters that are exchanged among the parts of the aircraft. Jackson [2015] also discusses the larger supply chain system, where variability is even more important.
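As a rough illustration of the cluster-analysis idea mentioned above for the second contributor, the sketch below counts how many interactions cross a proposed subsystem boundary for two candidate groupings. The functions and interactions are a hypothetical toy example, not data from any aircraft program, and a real cluster analysis would of course be far more elaborate.

```python
# Minimal sketch: compare candidate groupings by the number of cross-boundary interfaces.
# Functions and interactions are hypothetical and for illustration only.

interactions = [            # pairs of functions that must exchange data
    ("guidance", "navigation"),
    ("guidance", "nav_filter"),
    ("navigation", "nav_filter"),
    ("guidance", "autopilot"),
    ("navigation", "air_data"),
]

def cross_boundary_interfaces(grouping, interactions):
    """Count interactions whose two endpoints fall in different subsystems."""
    owner = {func: name for name, funcs in grouping.items() for func in funcs}
    return sum(1 for a, b in interactions if owner[a] != owner[b])

merged = {"guidance_navigation": {"guidance", "navigation", "nav_filter"},
          "other": {"autopilot", "air_data"}}
split = {"guidance": {"guidance"},
         "navigation": {"navigation", "nav_filter"},
         "other": {"autopilot", "air_data"}}

print("one G&N module :", cross_boundary_interfaces(merged, interactions))  # 2 interfaces
print("two G&N modules:", cross_boundary_interfaces(split, interactions))   # 4 interfaces
```

The grouping with the greater functional binding inside each subsystem leaves fewer interfaces crossing subsystem boundaries, which is the outcome cluster analysis seeks.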

10. REDUCE VARIABILITY SUPPORT PRINCIPLE

As described above, variability among interactions is one of the primary contributors to complexity. Hence, reducing complexity requires reducing variability. Variability can be seen in its simplest technical terms, such as the variability of an electrical current between two components: if there are wide fluctuations in this current, then there will be variability and resultant complexity. Of course, real complexity is a composite of all the fluctuations of different parameters all over the aircraft. Jackson [2015] also discusses the variability of parameters between the various elements of the supply chain, which likewise contributes to its complexity. Hence, rigorous management of all of these parameters will minimize the complexity of both the aircraft and the supply chain. This variability can actually be measured by the quantity known as information entropy, defined by Shannon's equation [1948]. Shannon originally intended this equation to apply only to information systems, but it is now accepted as applicable to systems in general.
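As a minimal sketch of how such a measurement might look, the example below bins samples of an exchanged parameter (a made-up electrical current, purely for illustration) and computes the Shannon entropy H = -Σ p_i log2 p_i of the resulting distribution; a steadier exchange yields a lower entropy, that is, less variability.

```python
# Minimal sketch: information entropy as a measure of parameter variability.
# The sampled current values are invented for illustration.
from collections import Counter
from math import log2

def shannon_entropy(samples, bin_width=0.5):
    """H = -sum(p_i * log2(p_i)) over a histogram of the samples."""
    bins = Counter(round(x / bin_width) for x in samples)
    total = sum(bins.values())
    return -sum((n / total) * log2(n / total) for n in bins.values())

steady_current = [5.0, 5.1, 4.9, 5.0, 5.0, 5.1, 4.9, 5.0]
erratic_current = [5.0, 7.4, 2.1, 9.8, 0.3, 6.6, 3.9, 8.2]

print(f"steady : H = {shannon_entropy(steady_current):.2f} bits")   # 0.00: negligible variability
print(f"erratic: H = {shannon_entropy(erratic_current):.2f} bits")  # 3.00: high variability
```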




11. RESTRUCTURING PRINCIPLE

Restructuring is an important resilience principle. Woods [2006] refers to this rule as the restructuring principle. He states that in order for a system to be resilient, it must be capable of restructuring itself. Although it may seem unreasonable for an aircraft to restructure itself during operation, it has happened. The most obvious example is the Sioux City DC-10 crash of 1989. In this case, the pilot was not able to control the aircraft when the control system was damaged. The pilot managed to maintain some degree of control by using the propulsion controls. Most, but not all, of the passengers survived. This change of controls was tantamount to reorganizing the control system of the aircraft. However, the FAA did not see fit to mandate propulsion control. Nevertheless, this incident did illustrate the principle of restructuring and its value.

12. DRIFT CORRECTION PRINCIPLE

The drift correction principle, discussed by Dekker et al. [2008], is the primary principle in the preevent phase of disruption. We saw above that drift correction was a primary consideration in the US Airways Flight 1549 case, even though this principle alone was not sufficient to save the aircraft. It is the principle that enables the system either to anticipate or to detect its drift towards an unsafe condition and to take corrective action. Drift correction can either be in real time or be a long-term anticipation of a problem, which may involve latent conditions not evident to the operators. Reason [1997] describes latent conditions in which flaws in the system are hidden until a catastrophic failure occurs. The concept of latent flaws brings to mind a well-known axiom of system safety, paraphrased as follows: just because your aircraft has not had an accident does not mean it is safe.

One of the most well-known threats to aircraft safety is bird strikes, as discussed by Pariès [2011]; bird strikes were a major factor in the US Airways Flight 1549 case discussed above in the context of the defense in depth principle. So the anticipation of this threat and steps to thwart it are an example of long-term drift correction. Skybrary [2013] points out that in many parts of the world mammals, such as reindeer, are also threats to aircraft because they can unexpectedly walk onto runways. Many corrective actions have been taken, such as reducing the bird population. However, as was seen in the Flight 1549 case, this was not effective, hence the need for defense in depth as discussed above.

Typical of latent flaws in the aircraft domain is the degradation of the aircraft due to poor maintenance or aging, as described above for the Nimrod case in the discussion of the limit degradation support principle. Corrective actions for these flaws require frequent and detailed inspections and oversight of the maintenance operations to detect them. For some types of flaws, for example structural cracks due to fatigue, special equipment may be necessary for detection. Hence, these inspections would constitute the implementation of the drift correction principle.


Regarding real-time drift correction, some existing methods already fall into that category, for example the Terrain Awareness and Warning System (TAWS) described by Skybrary [2012]. This system warns of an approaching mountain or any other geological feature that may be a hazard to the aircraft and allows the pilot to perform a corrective maneuver. Another example of drift correction is the Traffic Collision Avoidance System (TCAS), a mandated system that uses a transponder to warn the pilot of a possible collision with another aircraft. Another interesting example of real-time drift correction is the volcanic ash detector, called AVOID, which is in the planning stage, as reported by BBC News [2010]. This is a joint project between the airline EasyJet and Airbus. This device would warn the pilot if the volcanic ash density is becoming too high and thus allow the pilot to take an alternative route. This development hinged on the adoption of a criterion for the maximum particle size of volcanic ash through which an aircraft could fly, as also reported by the BBC [2010]. The development followed the eruption of the Icelandic volcano Eyjafjallajökull in 2010, which disrupted air traffic. It is not known whether this device will be mandatory on aircraft, but in any case, some airlines may voluntarily choose to install it.
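The warning systems above share a common pattern: monitor a parameter, project its trend, and alert while there is still time for the pilot to act. The sketch below is a generic, hypothetical drift monitor in that spirit; it is emphatically not the TAWS or TCAS logic, which is far more sophisticated, but it shows the basic shape of real-time drift correction.

```python
# Minimal sketch of generic real-time drift correction: project the recent trend of a
# monitored parameter and warn before it crosses a safety threshold. Illustrative only;
# real systems such as TAWS or TCAS use far more elaborate logic.

def drift_warning(history, threshold, lookahead_s, sample_period_s=1.0):
    """Return True if a linear projection of the latest trend crosses the threshold
    within lookahead_s seconds."""
    if len(history) < 2:
        return False
    rate = (history[-1] - history[-2]) / sample_period_s     # units per second
    projected = history[-1] + rate * lookahead_s
    return projected >= threshold

readings = [100.0, 104.0, 109.0, 115.0]                           # parameter drifting upward
print(drift_warning(readings, threshold=140.0, lookahead_s=5.0))  # True: warn and correct now
print(drift_warning(readings, threshold=200.0, lookahead_s=5.0))  # False: margin remains
```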

13. CROSS-SCALE INTERACTION PRINCIPLE

The cross-scale interaction principle finds its roots in systems theory, specifically the principle of interaction described by Hitchins [1993]. This principle is an adaptation of the cross-scale interaction concept described by Woods [2006]. That is to say, for a system to be a system (and an aircraft is a system) there must be a relationship among all the parts. We have seen much evidence of the relationships among the parts of the aircraft in all the interfaces discussed by Jackson [1997]. Of primary interest is the relationship between the pilot and the aircraft automation. Billings [1997] lays out his set of principles (which he calls requirements) for all of these relationships. All of Billings' principles are in fact heuristics, that is, principles based on his own experience, observations, and common sense.

14. INFORMED OPERATOR SUPPORT PRINCIPLE

The informed operator support principle, as formulated by Billings [1997], states that the operator, that is, the pilot, should be completely knowledgeable about the operations of the automated system. In other words, the automated system should not perform any operation that the pilot does not understand. This support principle is particularly important in the implementation of flight envelope protection, as described by Jackson [2015]. Billings argues that this principle is most important when the aircraft is not operating in normal conditions. The value of this principle is evident in those cases in which the failure to observe it resulted in a catastrophic event.


For example, according to Zarboutis and Wright [2006], the aircraft in the Nagoya incident of 1994 was in a go-around mode while the pilot desired to land. This miscommunication between the pilot and the automated system led to the ultimate crash and loss of life.

15. KNOWLEDGE BETWEEN NODES SUPPORT PRINCIPLE

Another Billings [1997] principle, the knowledge between nodes support principle, is similar to the informed operator support principle but broader in scope. This principle pertains to any combination of elements on the aircraft, not just the pilot and the automated system. The Nagoya case study illustrates this principle as well.

16. HUMAN MONITORING SUPPORT PRINCIPLE

This principle is the reverse of the informed operator principle; it states, according to Billings [1997], that the automated system should monitor the human pilot. According to Billings, one of the most important aspects of human monitoring that the automated system should perform is to know when the human has made an incorrect data entry.

17. AUTOMATED SYSTEM MONITORING SUPPORT PRINCIPLE

This principle, according to Billings [1997], goes beyond the informed operator principle; it states that the human should know the intent of automated system actions. Once again, this principle seems to have been absent in the Nagoya incident. If the automated system does not behave in accordance with pilot intentions, this principle will have been violated.

18. CROSS-SCALE IMPEDIMENT SUPPORT PRINCIPLE

The cross-scale impediment support principle states that there should be no administrative or technical impediments to communication or cooperation among the nodes of a system. This support principle can be inferred from various case studies documented by Jackson and Ferris [2013]. Among the more notable examples of when these impediments existed was the 9/11 Commission's [2004] concern that the FAA could not communicate directly with the pilot of United 93 that a terrorist threat was imminent. This lack of communication was the result of protocols in the FAA at that time.

19. REDUCE HIDDEN INTERACTIONS PRINCIPLE

Perrow [1999] states that many failures result from hidden interactions arising from excessive complexity. It is from this observation that the reduce hidden interactions rule is inferred.


There are many ways that interactions can occur on aircraft and cause failures. Prominent among these are the effects of electromagnetic interference (EMI) when electrical devices, such as solenoids or generators, are located near data lines. A dramatic example of hidden interactions was the Helios 522 aircraft, in which a malfunctioning pressurization system and the automated flight control system combined to cause the crew and all the passengers to die of hypoxic hypoxia, as described by Dekker et al. [2008].

So how can hidden interactions be reduced? At first it might seem like a formidable task, and it can be. The most tedious method would be a mapping of all possible interactions, as suggested by Jackson [2010]. This method can be expedited by exploring specific categories of interactions one at a time, such as EMI. In short, this would be a fruitful area to investigate with model-based systems engineering (MBSE), as described in Chapter 4. A more global approach is to reduce the complexity of the entire aircraft in accordance with the reduce complexity rule discussed above. However, even this approach would involve examining the variability of the parameters between the individual elements. Chapter 5 describes how technology can be infused in the conceptual design of an aircraft. Finally, Leveson et al. [2006] suggest that the problem and the solution might be managerial in nature. That is to say, closer cooperation between different groups would reduce the chance of inconsistencies in the design approaches that would cause hidden interactions. Within systems engineering such a managerial approach exists; it is called integrated product teams (IPTs). An IPT has members from all technical disciplines working on the same product, for example, the wing or fuselage. The premise is that this close working relationship would result in fewer inconsistencies and hence fewer hidden interactions.
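As a sketch of what mapping interactions one category at a time might look like, the toy example below pairs hypothetical equipment items with their installation zones and flags co-located pairs in which an EMI emitter sits near an EMI-susceptible data line. The component names and zones are invented for illustration; a real review would draw on installation and wiring data.

```python
# Minimal sketch: flag potential hidden EMI interactions by co-location.
# Components and zones are hypothetical placeholders.

components = [
    # (name, installation zone, emits EMI, susceptible to EMI)
    ("generator_feeder", "e1_bay",    True,  False),
    ("flap_solenoid",    "wing_root", True,  False),
    ("air_data_bus",     "e1_bay",    False, True),
    ("cabin_lighting",   "cabin",     False, False),
]

def potential_emi_interactions(components):
    """Return (emitter, victim) pairs installed in the same zone."""
    pairs = []
    for name_a, zone_a, emits_a, _ in components:
        for name_b, zone_b, _, susceptible_b in components:
            if name_a != name_b and zone_a == zone_b and emits_a and susceptible_b:
                pairs.append((name_a, name_b))
    return pairs

print(potential_emi_interactions(components))   # [('generator_feeder', 'air_data_bus')]
```

Each flagged pair is a candidate hidden interaction to be examined, shielded, or relocated; repeating the scan for other categories (thermal, hydraulic, software timing) builds up the interaction map the text describes.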

IV. IMPROVISED SOLUTIONS AS RESILIENT SYSTEMS IN CRISIS ENVIRONMENTS

In Section II.E, one of the categories of solutions posed by domain experts was one called improvised solutions. Improvised solutions are systems that are created in the heat of a disaster by first responders using whatever skills they may have and whatever resources may be at hand. So the fundamental question is: how can a system created with limited resources by people with limited skills ever qualify as a resilient system? The purpose of this section is to show that, contrary to widespread opinion and intuition, improvised and resilient systems can indeed be created in the heat of a crisis situation if, first, the right assets are available; second, the first responders have adequate skills; and third, the conditions are right. This section will show how this has been done in the past and what the process is for accomplishing it in the future.


Some of the associated questions are:

• What is the process for creating valid systems in an improvised manner in a crisis environment?
• What resources should be at hand to make this process work?
• When are these improvised systems successful and when are they unsuccessful?
• Are there authenticated case studies in which successful improvised systems were created?
• Similarly, what were the circumstances in which the improvised systems were unsuccessful?
• What qualities should first responders have to create successful improvised systems?

It may seem, at first, that if the designers put resources at hand, the system would not really be improvised but rather planned. This is not really the case; it is tantamount to having a hammer or saw available if you need them. In the context of infrastructures, resources might include alternative sources of water and power and sandbags. So yes, a certain degree of planning is required. This section shows that there are two requirements for achieving a successful improvised system: First, the system must pass through specified “phases of creation” to be described below. Secondly, and simultaneously, the improvised system must possess one or more of the characteristics that meet the abstract principles listed in Table 1. In short, first responders can create successful improvised systems if the proper resources and skills are available and “prestaged” for their use. This section will elaborate on the planning needed for the creation of successful improvised systems.

A. CAN IMPROVISED SYSTEMS BE SUCCESSFUL?

An observer new to the topic of resilience might ask: can a system developed in a crisis environment be successful; that is, does not a truly successful system need to be planned well in advance? The answer is yes, such systems can be successful if the conditions are right, the material is available, and the right people are creating the system. By successful, it is meant that an improvised system might recover part of the functionality of the preevent system. Furthermore, successful systems have been developed and deployed in crisis environments; we will describe some of them later in this section. This is not to say that developers can always build successful systems. There are some situations in which the developers reach the limit of their capability and the limits of the environment.


So planning does not necessarily mean that the developers need to know the exact nature of the threat, or that they need to know exactly what materials they need, or that they know what the exact design of the improvised system will be. They do, however, need to be skilled in using the materials at hand and converting them into a resilient system.

B. PHASES OF CREATION

To transition from an initial to a final state, a system must pass through a series of phases of creation. The phases of creation described in this chapter are adapted from the modes of being described by Palmer [2009]. The final state will be either successful or unsuccessful. In theory the system may pass through these phases in any order. However, for our purposes the order of interest runs from the system at rest in the Initial phase to a fully Operational phase. On the way through these phases the system will also pass through one or more of the following phases: the Decision phase, the Deployment phase, and the Limiting phase, which are described below.

1. THE INITIAL PHASE

The Initial phase is particularly relevant to the system as it exists before the crisis occurs. In this chapter the entities in the Initial phase usually are a collection of systems or entities that exist before they are used and before the original system has been disrupted. They are as follows:

1. The delivered system before it is used.

2. The original working on-site system prior to disruption in normal use. This is the system before the disruptive event occurs and before on-site system developers have taken any steps to change the system or create an improvised system.

3. The plans the operators or first responders of the original system employ to operate the system in an undisrupted environment. There can also be contingency plans for dealing, in tractable ways, with a system that has left its normal thresholds; such plans are used to bring the system back under control and into normal operation.

4. Prestaged assets. These are system components, either physical assets, humans, or procedures, that are not part of the original system. So prestaged assets can be either physical or conceptual entities. Haimes [2008] refers to these assets as prestaged supplies and discusses the example of prestaged water supplies, but the term can refer to other assets, such as electrical generators. Prestaged assets can be either stationary, that is, in storage, such as sandbags, or assets that are in use, such as roads. In the latter case the capacity of the asset should be sufficient to handle the crisis beyond its normal use. In the World Trade Center case described by Mendonça and Wallace, the prestaged assets were the portable generators that the power authority had kept in storage.


5. Resilience principles. These are the abstract principles that are listed in Table 1 and can be implemented with concrete solutions in the heat of a crisis.

2. DECISION PHASE

The Decision phase is when the developers are making important decisions. It turns out that there are more possibilities to be potentially realized in this phase of creation than in any other phase. Especially prominent outcomes in this phase are the improvised designs that are created in response to disasters by those dealing with them. This is because the on-site system developers have to make quick decisions about how to solve the problem at hand and what resources are available to do it, the so-called "prestaged assets" or other systems that are immediately present and may be used for needed parts. In the World Trade Center case described by Mendonça and Wallace [2006], the developers made the decision to deploy the portable generators around the city.

3. DEPLOYMENT PHASE

Deploying the assets and creating the new system can often be the most difficult of the phases of creation. Developers can find the challenges of deployment overwhelming, and deployment is not always successful, as will be seen in the Limiting phase described below. In the Deployment phase developers may turn to extraordinary means to achieve success or even to survive. In the World Trade Center case described by Mendonça and Wallace [2006], the developers needed to deploy the portable generators around the island of Manhattan and connect them to the electrical distribution system. This deployment took about five hours.

4. OPERATIONAL PHASE

The Operational phase is the phase the system is in when it is being used. In the context of this chapter the emergent properties are assumed to arise in the Operational phase. These emergent properties can also belong to the system that has been created by the on-site system developers after the disruption has occurred. These emergent properties are not necessarily those of the original, as-delivered functioning system. They may emanate from parts of the original system and from the prestaged assets among the elements of the Initial phase. They may also emanate from elements taken from other systems for parts, or from added elements not part of the original design that are custom made to fill some unforeseen gap in the original system or to replace standard parts that have broken down and are no longer available. In the context of this section it is an improvised system with the desirable emergent properties.


This system can also be considered an unexpected upgrade or unplanned enhancement to the original system. But since changes have been made, it can now be considered a system with emergent properties. This is also the phase in which the on-site system developer sees the limitations of the improvised system and knows under what conditions it can or cannot be employed.

5. LIMITING PHASE

The Limiting phase is the phase that represents the impossibility of any solution to the problem at hand. In the next section the Limiting phase will represent the impossibility of resolving the crises encountered in the authoritative case studies examined, for example, the impossibility of putting out a fire without water, which is what happened following the 1906 San Francisco earthquake and fires described by Hansen and Condon [1989].
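Because the phases of creation are essentially states that an improvised system passes through, and the chapter's references include a state-machine model of resilience [Jackson, Cook, and Ferris, 2015], a minimal sketch of the idea is shown below. The transition table is an assumption made for illustration only, since the text notes that in theory the phases may be traversed in any order.

```python
# Minimal sketch: the phases of creation expressed as a simple state machine.
# The allowed transitions are illustrative assumptions; in theory any order is possible.

ALLOWED = {
    "Initial":     {"Decision"},
    "Decision":    {"Deployment", "Limiting"},
    "Deployment":  {"Operational", "Limiting"},
    "Operational": set(),   # success: improvised system in use
    "Limiting":    set(),   # failure: no feasible solution with the assets at hand
}

def run(phases, start="Initial"):
    """Walk a sequence of phases, checking each transition against the table."""
    state = start
    for nxt in phases:
        if nxt not in ALLOWED[state]:
            raise ValueError(f"unexpected transition {state} -> {nxt}")
        state = nxt
    return state

# World Trade Center generators: a successful improvisation.
print(run(["Decision", "Deployment", "Operational"]))   # Operational
# San Francisco hand-pulled fire wagons: an unsuccessful one.
print(run(["Decision", "Deployment", "Limiting"]))      # Limiting
```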

C. PHASES OF CREATION AND ABSTRACT PRINCIPLES

This section shows that on-site system developers, extenders, enhancers, rebuilders, or "bricoleurs"* in historical contexts have created improvised solutions and composed systems with properties that enhance their resilience. In each reviewed case the improvised system was enhanced in some way that sought to capture some aspect of the dominant characteristics of a previously identified abstract principle from Jackson and Ferris [2013]. These bricoleurs appear to be operating in the five phases of creation described above.

D. RESILIENCE PRINCIPLES AND CASE STUDIES

This section discusses only five of the approximately 34 principles described in Table 1 to illustrate how the resilience of systems can be enhanced. The five principles are cross-scale interaction, functional redundancy, containment, restructuring, and human in the loop. Containment is actually a support principle and limiting condition of the more well-known abstract resilience principle of loose coupling. For the abstract resilience principles listed in Table 1, these "common qualities" are the dominant characteristics that systems in any domain may possess. They are also the dominant characteristics possessed by the concrete solutions, or the "specific examples" of the definition. The concrete solutions in this chapter are the improvised solutions that have been observed in the case studies and possess emergent properties. Table 1 shows the dominant characteristics associated with each of the five principles.

*One who performs ". . . the construction or creation of a work from a diverse range of things that happen to be available, or a work created by such a process" (http://en.wikipedia.org/wiki/Bricolage).


E. INTEGRATION OF PHASES OF CREATION AND ABSTRACT RESILIENCE PRINCIPLES TO CREATE RESILIENT SYSTEMS

To understand how system developers create improvised designs with emergent properties, it is necessary to understand how the phases of creation and the abstract resilience principles are integrated. To begin with, a system exists in its normal operational phase. This system can be a power system, a water system, or simply an urban building or group of buildings, such as the World Trade Center. Before the system developers create the improvised design with emergent properties, the following conditions exist:

• The system encounters a threat. This threat may be a terrorist threat or a natural phenomenon, such as an earthquake.
• The system exists in an environment that may consist of other infrastructure systems, including communications, first responder capability, and so forth.
• Prestaged assets exist that may be used in creating the improvised system with emergent properties. The section above on the Initial phase describes typical prestaged assets, such as generators, water, and so forth. The abstract resilience principles themselves are prestaged assets.

As time progresses following the threat event, the system developers will follow two converging paths to create an improvised design with emergent properties: passage through the five phases of creation described above and the selection of abstract resilience principles. Of course, the transition from phase to phase can be in any order; therefore, the developer must decide which order of phases to use. In a typical scenario the process will normally begin with the Initial phase, in which the system is operating normally with the prestaged assets available. The developer must also decide which abstract resilience principle to use in the design of the improvised system with emergent properties. The developer will normally make this decision in the Decision phase, since that is the phase in which decisions are made. The selection of the best principle will not determine the final design, since the principles are abstract, and abstract principles only provide a simplified version of the final design based on the dominant characteristics previously identified and described in Table 1. With sufficient time, the system developer might be able to compare choices of concrete improvised designs. However, in crisis environments this is rarely the case; the system developer in the role of bricoleur must use whatever is at hand. For example, in the case of the World Trade Center attacks, the first responders had little choice except to use the generators that were available, as described below. In the end, in the best scenario, the system developer will have chosen a satisfactory improvised design that delivers, at least partially, the performance of the original preevent system. In the worst-case scenario the developers may find themselves in the Limiting phase, in which the selected designs simply do not work.


Perhaps in a future event they will have chosen prestaged assets that are more suitable. However, as will be shown below, system developers did create satisfactory emergent improvised designs that were resilient.

F. CASE STUDIES FOR IMPROVISED SYSTEMS

Three case studies illustrate the concrete improvised systems that were achieved by attempts to use the principles described in Table 1. In all three case studies the elements of the systems were understood to be in one or more of the phases of creation described above. As the crisis occurred in the operational system, the way those charged with bringing it back to normal operation apprehended the system changed, and out of that change in the phase of creation, improvised changes to the system were effected that would not have been considered under normal operation. Many times these changes were not planned in the original design but were fabricated on the spot to address pressing needs that arose from the crisis and were unforeseen by the original system designers. All three case studies illustrate that the humans in the system, that is, the on-site system operators and improvised developers, were not merely outside entities creating emergent systems but rather were integral components of the systems.

G. 1906 SAN FRANCISCO EARTHQUAKE AND FIRES

The San Francisco earthquake and fires of 1906 illustrate the importance of the functional redundancy, human in the loop, and cross-scale interaction principles and the containment support principle. This case also illustrates the conditions under which the improvised systems were either successful or unsuccessful. In this case study the water system in San Francisco was rendered inoperable by the earthquake in that city, as described by Hansen and Condon [1989]. This situation resulted in at least two improvised solutions, one successful and the other unsuccessful.

Because of the lack of water, the fire department was unable to extinguish the fires that resulted from the earthquake. Hence, the existing fire protection system was the prestaged asset in the Initial phase before the earthquake; it ceased to be ready to hand in the actual disaster and was thus useless for its intended purpose. One set of on-site system developers set out to contain the fires by using dynamite along major thoroughfares of the city, as described by Hansen and Condon [1989]. The dynamite in storage constituted a prestaged system in the Initial phase as defined above; however, the system developers' decision to use the dynamite was made while they were in the Decision phase. The applicable principle for this operation is the containment principle described in Table 1. The dominant characteristic of the containment principle is that the system has a barrier between the point of disruption and the rest of the system. Of course, humans were required to deploy and detonate the dynamite; hence, the human in the loop principle was implemented. The point is that the preplanned prestaged asset of water for firefighting failed, and so some other way to stop the fires from spreading had to be tried.


The idea of using dynamite was to deny the fires oxygen so that they would not jump the street barriers within the city. The same principle is used in other environments, for example, setting backfires to prevent the spread of fires in brush areas.

The second improvised solution in the San Francisco case study had to do with the fact that the horses that normally pulled the fire wagons bolted and were unavailable to pull the wagons, as described by Hansen and Condon [1989]. The plans to use the horses, and the horses themselves, were part of the Initial phase elements. The improvised system consisted of fire fighters trying to pull the water wagons; they reached this conclusion while they were in the Decision phase. As might be expected, this method proved to be unsuccessful. The realization that this method would not work came to them while they were in the Limiting phase, i.e., overwhelmed by the task of moving the wagons by human power alone. The primary principle to be fulfilled was the cross-scale interaction principle described in Table 1, the principle responsible for assuring an interaction among the nodes. In order to ensure this principle, the system in this case was required to fulfill the functional redundancy principle. Functional redundancy is one way to ensure that cross-scale interaction is maintained, as described in Table 1. The dominant characteristics of functional redundancy are 1) the system contains two or more branches that are physically different and 2) the branches are independent. In this case the prestaged assets of the Initial phase were the group of fire fighters who were trying to pull the wagons. However, this was one case in which the on-site system developers concluded that the improvised solution was not feasible. Although humans were used to pull the fire wagons, and thus the human in the loop principle was implemented, the improvised system was not successful. The point was that there was at that time no prestaged asset other than humans to pull the fire engines when the horses failed due to their unreliability in the crisis situation; the fire wagons were designed only to be pulled by horses and were too heavy to be pulled by the humans available in the time needed to reposition them. Trucks did not yet exist, and steam-powered engines were not readily available, illustrating that not all attempts at creating improvised solutions are successful.

H. 1968 NEW ZEALAND INANGAHUA EARTHQUAKE

The important principles for the 1968 New Zealand Inangahua earthquake were the functional redundancy and the human in the loop principles. In 1968 a cargo train in New Zealand was derailed by the earthquake of that year, as related by the New Zealand Ministry of Civil Defence [1968]. Cargo carriers, the on-site system developers, implemented a system of delivering cargo by road. All of these elements were present and in the Initial phase. The new road-based cargo delivery was the emergent improvised system that replaced the existing disrupted railway transportation system.


As explained in Section IV.1, a prestaged asset can either be in storage or in use before the disruptive event occurs. If it is in use, its capacity must be sufficient to handle the demand after the event. In this case the increased demand is for the cargo that was previously being transported by rail. The applicable principle in this case was functional redundancy, as described in Table 1. Since the prestaged asset system of roads was physically different from the system of the railway, the definition of functional redundancy is met. This is one case, in contrast to the San Francisco case, in which the functional redundancy solution was successful. Of course, more humans and different vehicles were required to deliver the cargo by road; hence the human in the loop principle was employed. The humans who made these decisions were the on-site system developers who created the improvised solution with the emergent property of resilience while they were in the Decision phase.

I. 2001 WORLD TRADE CENTER ATTACKS

The 2001 World Trade Center attacks illustrated the importance of the restructuring and the human in the loop principles. Much of the island of Manhattan was left without power following the attacks of 11 September 2001, as related by Mendonça and Wallace [2006]. Power was restored in a remarkable way by deploying portable electrical generators throughout the impacted area. This restoration of power constituted the emergent property.

Before the attack the existing electrical grid system was operational and in the Initial phase with respect to the power consumers. There was also an existing plan on how to operate the original system in the absence of a disruption. Furthermore, the "prestaged asset" of extra, separately stored generators was part of the logistical support meta-system. The extra generators were meant to work as a stopgap resource to plug holes in the existing grid one at a time. All of these elements were part of the Initial phase. The electrical grid, from the point of view of normal operations, was in the Operational phase, as there is constant switching of load while energy requirements change dynamically. Operators were experiencing the increased load switching needed before the system itself went down, which is what sent them, and the emergency workers who needed power, into the Limiting phase.

One of the resilience principles was the restructuring principle described in Table 1. The restructuring principle is dependent on the cross-scale interaction principle, as illustrated in Fig. 10 and described in Jackson and Ferris [2013]. The logic is simply that a system cannot restructure itself unless its elements can communicate with each other. It is not relevant why these generators were available as prestaged assets; it is probable that the power company kept them there for the more routine power failures from which New York, like other areas, suffers from time to time.


The point is that the generators were there and available, and the on-site crisis responders and improvised emergency system operators knew they were there and knew they were elements of the prestaged assets of the Initial phase. Following the attacks and the loss of power, the on-site system developers had to consider their options. This situation put them in the Decision phase. So they conceptualized a solution in which the available generators would be deployed en masse throughout the city instead of one at a time. Part of the solution was a joint operation among major groups including the power company, the police, and the U.S. Army. This solution was in perfect alignment with the restructuring principle described in Table 1. Next, the on-site developers deployed these generators as they had planned. This new system became the improvised system with the emergent property of resilience, replacing the original electrical grid system. It now existed first in the Deployment phase and then in the Operational phase, replacing the normal electrical generators, and perhaps transformers, that were inoperable. This new system was generating needed power. The on-site developers referred to in this case study are, of course, humans; hence, the human in the loop principle was applicable to this case study.

J. SUMMARY OF PROCESSES OF THE REALIZATION OF RESILIENCE FROM IMPROVISED SOLUTIONS

Table 8 summarizes the process of the design of systems using the phases of creation described above and the principles described in Table 1. This table unites the phases of creation and the resilience principles. First, it lists the phases of creation as they are experienced by the on-site system developers. For each of the phases of creation the elements of that phase are described in the context of the case studies. Each case study is described in terms of the principles which enabled the system to recover from the disruption, except for the one case in which recovery was not possible.

K. COMMENTS ON THE ANALYSIS OF IMPROVISED SYSTEMS

The question very often asked is whether resilient systems have to be totally improvised or whether steps can be taken in advance to ensure that the most effective systems are indeed implemented. Couldn't prestaged assets be put in place in advance in case of an impending disaster? This is a reasonable question. Before we answer it, it is also reasonable to ask whether or not putting prestaged assets in place in advance constitutes planning, so that the systems are not really improvised. The answer is yes, to a certain degree. However, these steps are logically necessary, and how they are labeled is of minor importance; one could call them preplanning. The problem is that the nature of disasters is so unpredictable that it is hard to decide which assets to prestage and at what cost, or to be confident that the assets will be appropriate and will not be lost in the event itself.

TABLE 8  SUMMARY OF PROCESS OF CREATING RESILIENT DESIGNS FROM IMPROVISED SOLUTIONS

Design Phases
• Initial Phase: physical entities and abstractions
• Decision Phase: improvised system implemented in crisis
• Deployment Phase: exploring alternative potentials beyond those given in the system as designed and operated prior to the disaster; humans transform the tools in their hands to try to meet the emergency that threatens them
• Limiting Phase: the overwhelming of humans in situations beyond their control and which have gotten out of hand

Abstract Principles and Case Studies

Containment (San Francisco)
• Initial Phase: existing fire system; existing plans; standing reserve (dynamite); principle of containment; principle of human in the loop
• Decision Phase: conception of dynamite fire break as a system
• Deployment Phase: implemented dynamite fire break system
• Limiting Phase: realization of the limits of the existing water system with no pressure

Functional Redundancy (San Francisco)
• Initial Phase: existing fire system; existing plans; principle of functional redundancy; principle of internode interaction; principle of human in the loop
• Decision Phase: conception of system to pull fire wagons by hand
• Deployment Phase: attempted system to pull fire wagons by hand
• Limiting Phase: realization of limits to pulling fire wagons by hand

Functional Redundancy (Inangahua)
• Initial Phase: existing rail system; existing plans; standing reserve (cars and trucks), roads; principle of functional redundancy; principle of human in the loop
• Decision Phase: conception of system to deliver cargo by road
• Deployment Phase: implemented system of delivering cargo by road instead of rail
• Limiting Phase: realization of the limits of the rail line when blocked by a wrecked train

Reorganization (World Trade Center)
• Initial Phase: existing power system; existing plans; standing reserve (generators); principle of reorganization; principle of internode interaction; principle of human in the loop
• Decision Phase: conception of reorganized power system
• Deployment Phase: implemented system of en masse deployed generators to provide power
• Limiting Phase: realization of limits on fuel capacity


Many times the worst-case scenarios considered during design are overwhelmed by actual disasters. In addition, real disasters may present a need for assets that may not be available. Many prestaged assets would probably go unused even if a disaster occurred, because they are of the wrong type or are not positioned so that they are readily usable. The simple answer is that there are certain basic needs that it would be wise to supply in the event of an earthquake or hurricane; however, the cost of doing so might be prohibitive. Whether it is wise to build shelters for a possible attack that might be fended off by other means is less certain.

Having a prestaged asset of trained disaster relief and emergency responders is probably the most important resource one may have in preparation for a disaster. But if the logistics system is not able to aid these responders, then their effectiveness is diminished. It is best if systems are designed to be as resilient as possible, but no matter how well planned the system responses to possible disaster are, there will be unimagined scenarios played out in actual disasters. Humans will have to cope with these by entering higher phases of creation and producing spontaneous, self-organized responses appropriate to the system breakdowns discovered on the ground as the emergency unfolds. The key is that, no matter how resilient we make our systems, in the crisis they will be used by humans who have switched into higher modes of situational awareness beyond circumspective concern, such as the Decision phase, in response to the crisis. Not only are the human teams working in a heightened state of effectiveness and efficiency due to the flow mode of consciousness that they share, but they are also seeing the technological infrastructure through new eyes as they search for ways to deal with the adverse effects of the disaster using that infrastructure. In that state developers explore new possibilities, never thought of previously, for the design of solutions to problems. Building systems with an awareness that they will not be seen in just the Initial or Operational phase in an emergency would be very helpful to those who are trying to figure out how to reconfigure the existing available resources to meet the contingencies of a dire situation in the best way possible, given the time available and the spatial organization of the elements that are available to meet the emergency.

Improvised organizations with emergent properties are a slightly different matter. Regarding the World Trade Center attacks, Tierney, for example, states the following:

Although technically subject to a wide variety of principles and authorities, the networks that emerged to handle response-related demands operated in a relatively decentralized fashion, especially in the immediate aftermath of the attacks. In contrast with more hierarchically-arranged groupings, response networks were loosely-coupled and operated in a semi-autonomous manner. This is not to imply that systems of coordination were lacking. Such systems were present. The city’s emergency operations Center was organized into functional work groups composed of organizations that were responsible for related and complementary tasks-for example, law enforcement, transportation, utilities, and human needs. [Tierney, 2003]


The implication of this passage is that the restructuring was more of a naturally occurring phenomenon, and it may provide insight into the power restoration described in this chapter. There is another important implication in this passage as well, namely, that restructuring works best when lines of communication and coordination are kept open. This aspect can be planned and implemented in advance. The communication and coordination aspects are in agreement with the cross-scale interaction principle described in Table 1. This principle states that lines of communication, cooperation, coordination, and collaboration should always be maintained between the nodes of a system. This conclusion illustrates the interdependency, or linking, among the principles.

In addition, Scanlon [1999] provides considerable support for the restructuring principle and for the emergent properties that may be associated with it. He describes four categories of organizations: I) established organizations carrying out regular tasks, II) expanding organizations with regular tasks, III) extending organizations that take on nonregular tasks, and IV) emergent organizations taking on nonregular tasks. Of the four groups, the last three can be considered restructuring types. Based on the 1998 ice storm disaster in Canada, Scanlon concludes that type IV is the most effective. Furthermore, Scanlon claims that such organizations are more effective if they are planned in advance. Scanlon does not explain how one plans an improvised organization in advance. It is reasonable to assume that certain organizations are given the assignment and training for taking on "nonregular" tasks. With this assignment they can prepare for how to handle any task that is nonregular. This training will probably consist of strategies for handling different situations, knowledge of what prestaged assets may be available, and knowledge of the design limits and resilience of the systems under their authority and their connections to other systems under other authorities.

Following two major disasters, the London bombings of 2005 and the Inangahua earthquake of 1968, two major reports address the problem of crisis management organization and how to prepare in advance for major disruptions like these. The reports are those of the Ministry of Civil Defence [1968] for Inangahua and the 7 July Review Committee [2006] for the London bombings. Both reports laid the framework for a restructuring of the crisis management infrastructure. The essence of this restructuring was an emphasis on interactions among the various agencies. The only way this can be accomplished is with a rigorous implementation of the cross-scale interaction principle. This means that communications, cooperation, collaboration, command and control, and information management have to be in place and survivable. However, it is now apparent that, sometimes via Twitter and other social network public messaging services, spontaneous organizations of citizens, some located worldwide and some local, offer to help in disasters; social networking media is thus becoming a way for spontaneous organizations to arise and respond when authorities cannot.

The one thing that is common to all the case studies is the presence of humans to create the on-site improvised systems with emergent properties.


The human in the loop principle was at work in all these cases. Although it is conceivable that nonhuman systems could produce improvised systems, in practice humans usually play this role, as they did in these case studies. As stated in Table 1, the premise of the human in the loop principle is that humans are more capable than automated systems of recognizing unanticipated threats and dealing with them. Hence, if there is a single thing that can be done in advance, it is to have highly trained and skilled front-line disaster personnel, that is, a prestaged asset of trained bricoleurs, on hand in all disaster situations.

REFERENCES

7 July Review Committee. (2006), "Report of the 7 July Review Committee," Greater London Authority, London, pp. 124–141.
9/11 Commission. (2004), "9/11 Commission Report," National Commission on Terrorist Attacks, pp. 213, 294.
Adams, K. M. (2011), "Systems Principles: Foundation for the SoSE Methodology," International Journal of Systems Engineering, Vol. 2, No. 2/3, pp. 120–155.
BBC. (2010), New Rules to Aid Ash Flight Chaos, Vol. 2011, London. http://news.bbc.co.uk/2/hi/uk_news/8688517.stm.
BBC News. (2010), EasyJet to Trial Volcanic Ash Detection System, Vol. 2011, London. http://www.bbc.co.uk/news/10234553.
Billings, C. (1991), Aviation Automation: A Concept and Guidelines, National Aeronautics and Space Administration (NASA), Moffett Field, CA.
Billings, C. (1997), Aviation Automation: The Search for a Human-Centered Approach, Lawrence Erlbaum Associates, Mahwah, NJ.
Carson, R. S., and Sheeley, B. (2012), "Functional Architecture as the Core of Model-Based Systems Engineering," National Defense Industrial Association Systems Engineering Conference, The Boeing Company, San Diego, CA, pp. 1–22.
Checkland, P. (1999), Systems Thinking, Systems Practice, John Wiley & Sons, New York.
Dekker, S., Hollnagel, E., Woods, D. D., and Cook, R. (2008), "Resilience Engineering: New Directions for Measuring and Maintaining Safety in Complex Systems," Final Report, Lund University School of Aviation, Lund, Sweden.
DoD. (2012), Department of Defense Standard Practice: System Safety, Department of Defense, Washington, DC.
Giachetti, R. E. (2010), Design of Enterprise Systems: Theory, Architecture, and Methods, CRC Press, Boca Raton, FL.
Haddon-Cave, C. (2009), An Independent Review into the Broader Issues Surrounding the Loss of the RAF Nimrod MR2 Aircraft XV230 in Afghanistan in 2006, The House of Commons, London.
Haimes, Y. (2008), "Homeland Security Preparedness: Balancing Protection with Resilience in Emergent Systems," Systems Engineering, Vol. 11, No. 4, pp. 287–308.
Haimes, Y. Y. (2009), "On the Definition of Resilience in Systems," Risk Analysis, Vol. 29, No. 43, pp. 498–501.


Hansen, G., and Condon, E. (1989), Denial of Disaster, Cameron and Company, San Francisco, CA.
Hitchins, D. K. (2007), Systems Engineering: A 21st Century Systems Methodology, John Wiley & Sons, Hoboken, NJ.
Hollnagel, E., Woods, D. D., and Leveson, N. (eds.) (2006), Resilience Engineering: Concepts and Precepts, Ashgate Publishing Limited, Aldershot, UK.
INCOSE Fellows. (2006), "A Consensus of the INCOSE Fellows," Vol. 2016, International Council on Systems Engineering, http://www.incose.org/AboutSE/WhatIsSE.
IVATF. (2010), Summary of Iceland's Observational Network Used to Monitor the Eyjafjallajökull Volcanic Ash Plume at the Source, International Volcanic Ash Task Force, Montreal.
Jackson, S. (1997), Systems Engineering for Commercial Aircraft, Ashgate Publishing Limited (in English and Chinese), Aldershot, UK.
Jackson, S. (2010), Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions, John Wiley & Sons, Hoboken, NJ.
Jackson, S. (2015), "Overview of Resilience and Theme Issue on the Resilience of Systems," Insight, Vol. 18, No. 1, pp. 7–9.
Jackson, S., and Ferris, T. (2013), "Resilience Principles for Engineered Systems," Systems Engineering, Vol. 16, No. 2, pp. 152–164.
Jackson, W. S. (2015), "Evaluation of Resilience Principles for Engineered Systems," Engineering, University of South Australia, Adelaide, Australia.
Jackson, S. (2015), Systems Engineering for Commercial Aircraft: A Domain Specific Adaptation, Ashgate Publishing Limited, Aldershot, UK.
Jackson, S., Cook, S. C., and Ferris, T. (2015), "A Generic State-Machine Model of System Resilience," Insight, Vol. 18, No. 1, pp. 14–18.
Leveson, N. (1995), Safeware: System Safety and Computers, Addison Wesley, Reading, MA.
Leveson, N., Dulac, N., Zipkin, D., Cutcher-Gershenfeld, J., Carroll, J., and Barrett, B. (2006), "Engineering Resilience into a Safety-Critical System," Resilience Engineering: Concepts and Precepts, Ashgate Publishing Limited, Aldershot, UK.
Locke, J. (1999), An Essay Concerning Human Understanding, Vol. 2013, Pennsylvania State University, University Park, PA.
Lonergan, B. (ed.) (1992), Insight: A Study of Human Understanding, University of Toronto Press, Toronto.
Madni, A., and Jackson, S. (2009), "Towards a Conceptual Framework for Resilience Engineering," Institute of Electrical and Electronics Engineers (IEEE) Systems Journal, Vol. 3, No. 2, pp. 181–191.
Marczyk, J. (2012), Complexity Reduction, Email communication, Como, Italy, p. 1.
Maxwell, J., and Emerson, R. (2009), "Observations of the Resilience Architecture of the Firefighting and Emergency Response Infrastructure," Insight, International Council on Systems Engineering, pp. 45–46.
Mendonça, D., and Wallace, W. (2006), "Adaptive Capacity: Electric Power Restoration in New York City Following the 11 September 2001 Attacks," Second Resilience Engineering Symposium, Mines Paris, Juan-les-Pins, France, pp. 209–219.
Ministry of Civil Defence. (1968), "Report on the Inangahua Earthquake, New Zealand," Department of Internal Affairs, Wellington, pp. 9, 58.

124

S. JACKSON

National Geographic News. (2010), “Moon Landing Facts: Apollo 11 at 40.” Vol. 2012, http://news.nationalgeographic.com/news/2009/07/090715-moon-landing-apollofacts.html. NIST. (2005), “Federal Building and Fire Safety Investigation,” Final Report on Collapse of the World Trade Center Towers, NIST NCSTAR 1, Vol. 1, National Institute of Standards and Technology, Washington DC, pp. 203– 204. NTSB. (2010), Collision of Metrolink Train 111 with Union Pacific Train LOF65-12 Chatsworth, California September 12, 2008, National Transportation Safety Board, Washington, DC. NYFIC. (1912), “Preliminary Report of the New York Factory Investigating Commission,” New York Factory Investigating Commission, New York. OED. (1973), The Shorter Oxford English Dictionary on Historical Principles, 4th ed., Oxford University Press, Oxford. Palmer, K. D. (2009), “Emergent Design: Explorations in Systems Phenomenology in Relation to Ontology, Hermeneutics and the Meta-Dialectics of Design,” Ph.D. Thesis, Vol. 2014, University of South Australia, Adelaide. Parie`s, J. (2011), “Lessons from the Hudson,” Resilience Engineering in Practice: A Guidebook (Ashgate Studies in Resilience Engineering Series), Ashgate Publishing Limited, Farnham, Surrey. Perrow, C. (1999), Normal Accidents: Living With High Risk Technologies, Princeton University Press, Princeton, NJ. Petroski, H. (2006), Success Through Failure: The Paradox of Design, Princeton University Press, Princeton, NJ. RAF. (2007), Proceedings of a Board of Inquiry into an Aircraft Accident, Royal Air Force. Raveh, A. (2008), “Regroup Heuristic,” Comment during tutorial on resilience ed., Utrecht, the Netherlands. Reason, J. (1990), Human Error, Cambridge University Press, Cambridge. Reason, J. (1997), Managing the Risks of Organisational Accidents, Ashgate Publishing Limited, Aldershot, UK. Rechtin, E. (1991), Systems Architecting: Creating and Building Complex Systems, CRC Press, Englewood Cliffs, NJ. Ricci, N., Fitzgerald, M. E., Ross, A. M., and Rhodes, D. H. (2014), “Architecting System of Systems with Ilities: An Overview of the SAI Method,” Conference on Systems Engineering Research 2014, University of Southern Californa, Redondo Beach, CA. Richards, M. G. (2009), “Multi-Attribute Tradespace Exploration for Survivability,” Engineering Systems, Ph.D. Dissertation, Massachusetts Institute of Technology, Cambridge, MA. Rijpma, J. A. (1997), “Complexity, Tight Coupling and Reliability: Connecting Normal Accidents Theory and High Reliability Theory,” Journal of Contingencies and Crisis Management, Vol. 5, No. 1, pp. 15 –23. San Francisco Fire Department. (2011), Water Supply System, Vol. 2011, San Francisco Fire Department, San Francisco, CA. Scanlon, J. (1999), “Emergent Groups in Established Frameworks: Ottawa Carleton’s Response to the 1998 Ice Disaster,” Journal of Contingencies and Crisis Management, Vol. 7, No. 1, pp. 30 – 37.

ENGINEERING RESILIENCE INTO HUMAN-MADE SYSTEMS

125

Shannon, C. E. (1948), “A Mathematical Theory of Communication,” The Bell System Technical Journal, Vol. XXVII, No. 3, pp. 379– 423. Skybrary. (2012), “SE 120: Terrain Awareness and Warning System.” Vol. 2014, Eurocontrol. Skybrary. (2013), “Non Avian Wildlife Hazards to Aircraft,” Bird Strike, Vol. 2014, Eurocontrol. http://www.skybrary.aero/index.php/Non_Avian_Wildlife_Hazards_ to_Aircraft?utm_source=SKYbrary&utm_campaign=ee75eb4949-SKYbrary_ Highlight_01_01_2014&utm_medium=email&utm_term=0_e405169b04-ee75eb4949276526209. Taleb, N. N. (2014), Antifragile: Things that Gain from Disorder, Random House, New York. Tierney, K. J. (2003), Conceptualising and Measuring Organisational and Community Resilience: Lessons From the Emergency Response Following the September 11, 2001 Attack on the World Trade Centre, University of Delaware, Newark, Delaware. The White House. (2011), “Presidential Policy Directive,” https://www.dhs.gov/xlibrary/ assets/presidential-policy-directive-8-national-preparedness.pdf [retrieved 24 April 2016]. US-Canada Power System Outage Task Force. (2004), “Final Report on the August 14, 2003 Blackout in the United States and Canada: Causes and Recommendations,” Washington-Ottawa. Woods, D. D. (2006), “Essential Characteristics of Resilience,” Resilience Engineering: Concepts and Precepts, Ashgate Publishing Limited, Aldershot, UK, pp. 21 – 34. Zarboutis, N., and Wright, P. (2006), “Using Complexity Theories to Reveal Emerged Patterns that Erode the Resilience of Complex Systems,” Second Symposium on Resilience Engineering, Juan-les-Pins, France, pp. 359– 368.

CHAPTER 4

Applying SysML and a Model-Based Systems Engineering Approach to a Small Satellite Design

Sanford Friedenthal
SAF Consulting, Reston, Virginia

Christopher Oster
Lockheed Martin Corporation, Cherry Hill, New Jersey

I. INTRODUCTION
This chapter demonstrates how the OMG Systems Modeling Language (OMG SysML™) and a model-based systems engineering (MBSE) approach are applied to the design of a small spacecraft. The example used to demonstrate the approach is inspired by the spacecraft design from the FireSat II example in Space Mission Engineering: The New SMAD [Wertz et al. 2011].

This chapter begins with a brief introduction to MBSE and SysML. It explains MBSE and contrasts this approach with a more traditional document-based systems engineering approach to highlight the potential value MBSE can bring to the design of a system. The chapter then introduces SysML to highlight its capability to represent systems. The remainder of the chapter describes the application of MBSE using SysML to the mission and system specification and design of a small spacecraft, with emphasis on the architecture design. This chapter ends with a short summary.

The term spacecraft, used throughout this chapter, refers to any man-made vehicle that is intended to travel in space beyond the Earth's atmosphere. A satellite is a particular kind of spacecraft or a natural object that orbits the Earth or any other celestial body.

A. MBSE OVERVIEW
MBSE is an approach to systems engineering in which the model of the system is a primary artifact of the systems engineering process. The system model describes the system in terms of its specification, design, verification, and analysis information. This model is managed and controlled throughout the system lifecycle to provide a consistent, precise, and traceable definition of the system.




Copyright © 2014 by the authors. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.


This MBSE approach is often contrasted with a more traditional document-based approach, where similar information is captured in document-based artifacts such as text documents, spreadsheets, and drawings. In a document-based approach, the documents are managed and controlled throughout the system lifecycle. However, it is difficult to maintain consistency, precision, and traceability when the information is distributed across these different artifacts. The model-based approach can provide significant benefits over the document-based approach, including enhanced quality and productivity in system specification and design and a shared understanding of the system. In particular, an MBSE approach can do the following:
1. Increase precision of the system specification and design, resulting in reduced downstream errors.
2. Improve traceability between system requirements, design, analysis, and verification information to enhance system design integrity.
3. Improve maintenance and evolution of the system specification and design baseline to leverage this information throughout the system lifecycle and enable its reuse across projects.
4. Provide a shared understanding of the system to reduce miscommunication among the development team.

MBSE is not new, but its emphasis has changed over the years. Systems engineering has traditionally included the use of models of all kinds. However, the current emphasis for MBSE is on building an integrated system model that provides multiple views of the system. Evolving modeling standards such as SysML are enabling MBSE to mature and achieve more widespread adoption within the aerospace and defense industry and other industries. Model-based approaches are pervasive in many other disciplines, such as mechanical design, electrical design, control systems design, and software design. MBSE is often placed in the broader context of model-based engineering [MBE NDIA Report, 2011]. In this context, the system model is intended to integrate with other models used by systems engineers and other disciplines. The system model serves as an integrating framework that spans the different disciplines and domains.

B. KINDS OF MODELS
A model is a physical, mathematical, or otherwise logical representation of a system, entity, phenomenon, or process [DoD, 1998]. A model describes various aspects of a system. For example, a three-dimensional (3D) geometric model created in a computer-aided design (CAD) tool describes the geometrical layout of a system. An analytical model of a system may describe quality characteristics of the system, such as its reliability, using closed-form equations, or the dynamics of the system using simulations that provide discrete-time numerical solutions to differential equations.

A model should be developed for a purpose: to enable understanding of the system, entity, phenomenon, or process being modeled. The fidelity of the model may vary depending on its purpose. In some cases, a low-fidelity model may be sufficient to satisfy its intended purpose. For example, a low-fidelity geometric model may be sufficient to support trade study analysis during a conceptual design phase. Another simple example is a spring-mass model, where the spring constant and a first-order damping coefficient may be sufficient to answer certain questions, whereas second-order effects such as velocity coefficients may be necessary to understand the more detailed dynamics.

A system model can be referred to as a logical model. It describes logical relationships among the elements of the system and its environment. A simple example is a block diagram that describes the interconnection between the system components. This is neither a geometric model nor an analytical model that produces numerical results. However, it describes important aspects of the system. A logical model can be used to assess various qualities of the system, such as its interface compatibility with other systems or between its components.

To fully benefit from a shared system model, this model must integrate with many other kinds of models and data sources. In the previous example, the mechanical components represented in a system model can be related to the same components represented in a CAD model. Similarly, key parameters in the system model, such as mass, can be related to the corresponding parameters in other analytical models. Typically, one of the models is designated as the authoritative source, such that when the source changes, any models with information that depends on the source may need to change as well. The correspondence between shared information in different models must be maintained, along with the identification of the source. Model integration approaches are the focus of many industry standards and related activities.
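To make the fidelity trade concrete, the sketch below simulates the spring-mass example just described in Python. It is illustrative only: the mass, stiffness, and damping values are assumed for demonstration and do not come from this chapter.

```python
# Illustrative sketch: a low-fidelity spring-mass-damper model.
# The spring constant, damping coefficient, and mass values are assumed
# for demonstration; they do not come from the chapter.

def simulate(m=1.0, k=4.0, c=0.4, x0=1.0, v0=0.0, dt=0.001, t_end=10.0):
    """Integrate m*x'' + c*x' + k*x = 0 with semi-implicit Euler."""
    x, v, t = x0, v0, 0.0
    history = []
    while t <= t_end:
        a = -(k * x + c * v) / m      # acceleration from spring force + damping
        v += a * dt                   # update velocity first (semi-implicit)
        x += v * dt                   # then position
        history.append((t, x))
        t += dt
    return history

if __name__ == "__main__":
    trace = simulate()
    # Print a coarse sample of the displacement history
    for t, x in trace[::2000]:
        print(f"t = {t:5.2f} s, x = {x:+.4f} m")
```

A higher-fidelity variant would add further terms (for example, velocity-dependent coefficients) only when the questions being asked require them.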

C. SysML OVERVIEW
The Systems Modeling Language [OMG SysML™] is a general-purpose graphical modeling language for representing systems. It is intended to enable an MBSE approach to support the specification, design, analysis, and verification of systems that may include hardware, software, data, personnel, procedures, and facilities. SysML is an extension of the Unified Modeling Language (UML). It was adopted by the Object Management Group (OMG) and formally released as SysML v1.0 in 2007. The current version as of the time of this writing is SysML v1.4, and the language continues to evolve through the OMG technology adoption process. SysML can be used to capture multiple aspects of the system, including its requirements, structure, behavior, and parametric relationships. These are often


called the four pillars of SysML. Specifically, SysML can be used to describe the following:
1. The system breakdown as a hierarchy of subsystems and components.
2. The interconnection between systems, subsystems, and components.
3. The behavior of the system and its components in terms of the actions they perform, and their inputs, outputs, and control flows.
4. The behavior of the system in terms of a sequence of message exchanges between its parts.
5. The behavior of the system and its components in terms of their states and transitions.
6. The properties of the system and its components, and the parametric relationships between them.
7. The text-based requirements which specify the mission, system, and components, and their traceability relationships with design, verification, and analysis.

The following section summarizes some key SysML concepts to help understand the example in the remainder of the chapter.

SysML Diagrams
The nine kinds of SysML diagrams are shown in Fig. 1. These diagrams are used to present the different aspects of a system as described previously. The diagrams include the Requirement Diagram, four kinds of Behavior Diagrams, two kinds of Structure Diagrams, the Parametric Diagram, and the Package Diagram. Each SysML diagram has a frame, a header, and a content area, as shown in Fig. 2. The frame is the enclosure for the diagram content.

Fig. 1 The nine kinds of SysML diagrams.


Fig. 2 Each SysML diagram has a diagram frame with a diagram header that can display fields which include the kind of diagram and the diagram name.

The diagram header is shown at the top left of the diagram and describes information about the diagram. The header includes the following four fields:
1. Field 1: a short descriptor identifying the kind of diagram, such as bdd for block definition diagram.
2. Fields 2 and 3: the kind of model element and the name of the model element that the frame represents.
3. Field 4: the name of the diagram, indicating the purpose of the diagram.

1. STRUCTURE DIAGRAMS

Defining Blocks and their Relationships on Block Definition Diagrams
SysML includes the concept of a block as a general-purpose construct to represent a system, external system, or component. The component may be hardware, software, data, a facility, or a person. More generally, the block can represent any logical or physical entity, including things that flow, such as water or information. A block contains features that represent its properties, functions, interfaces, and states. A simple example of a block is shown in the block definition diagram (bdd) in Fig. 3. In the figure, the Spacecraft is a block that has a Mass of 150 kilograms and performs a function called collect observation data. It also has a port called lv electrical i/f that provides the electrical interface to the Launch Vehicle. The Spacecraft can be specified to have many other properties, functions, interfaces, and other kinds of features. Each kind of feature is depicted in a unique block compartment. The diagram header in this figure designates the kind of diagram as a bdd, which is short for block definition diagram. The name of the diagram is Example of a block. The 2nd and 3rd fields in the diagram header indicate the diagram frame represents the Structure Package. The significance of packages is explained later in this section.

Relationships between Blocks
The block can be related to other blocks using whole-part, reference, generalization/specialization, and connection relationships.


Fig. 3 An example of a block with some of its features.
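Although SysML blocks are modeling constructs rather than code, the following Python sketch offers a loose analogy for how a block bundles value properties, operations, and ports. The Block and Port classes are invented here for illustration and are not part of any SysML tool API.

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str                     # e.g., "lv electrical i/f"

@dataclass
class Block:
    name: str
    values: dict = field(default_factory=dict)       # value properties, e.g., mass
    operations: list = field(default_factory=list)   # functions the block performs
    ports: list = field(default_factory=list)        # interaction points

# Loosely mirrors the Spacecraft block of Fig. 3
spacecraft = Block(
    name="Spacecraft",
    values={"mass_kg": 150},
    operations=["collect observation data"],
    ports=[Port("lv electrical i/f")],
)
print(spacecraft)
```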

Whole-Part Relationship
A whole-part relationship, also known as a composite relationship, defines a hierarchy of blocks. A partial decomposition of the Spacecraft into subsystems is shown in Fig. 4. The Spacecraft in this diagram is the same Spacecraft that was shown in the previous diagram, but its features are not shown. The black diamond designates the Spacecraft as the whole end, and the arrow designates the subsystems at the part ends of the whole-part relationships. Each subsystem is designated with the keyword "subsystem." The guidance, navigation, and control (GN&C) subsystem is further decomposed into a Reaction Wheel and GN&C SW. The keywords designate that these elements are hardware and software components.

Fig. 4 Partial decomposition of Spacecraft into its subsystems.
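As a rough analogy (not a SysML implementation), the whole-part relationship and its multiplicity can be pictured as object composition with a cardinality check, as in the hypothetical Python sketch below; the class and field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical analogy: whole-part (composite) relationships as object composition.
# The GN&C subsystem "owns" its reaction wheels; the multiplicity of 2 from Fig. 4
# is enforced here with a simple check.

@dataclass
class ReactionWheel:
    id: int

@dataclass
class GNC:
    reaction_wheels: List[ReactionWheel] = field(default_factory=list)

    def __post_init__(self):
        # Multiplicity "2" on the part end: exactly two reaction wheels expected
        if len(self.reaction_wheels) != 2:
            raise ValueError("GN&C requires exactly 2 Reaction Wheels")

@dataclass
class Spacecraft:
    gnc: GNC
    # ... other subsystems (Propulsion, Power) would be parts as well

sc = Spacecraft(gnc=GNC([ReactionWheel(1), ReactionWheel(2)]))
print(len(sc.gnc.reaction_wheels), "reaction wheels")
```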


Multiplicity
On the block diagram in Fig. 4, the GN&C subsystem is composed of 2 Reaction Wheels, as indicated by the number 2 next to the part end of the relationship. This is called the multiplicity. More generally, the multiplicity has a lower bound and an upper bound, such as 2..4. An unlimited upper bound is referred to with an asterisk (*). A lower bound of zero means the component part is optional. If no multiplicity is shown on the part end, the default multiplicity is one.

Referencing a Block
It is sometimes useful to aggregate a set of components into a logical whole even though the component may already be part of another whole-part relationship. For example, a software component may be considered part of another subsystem, but logically aggregated into a Software subsystem. As shown in Fig. 5, the Software subsystem logically aggregates the GN&C SW and the Power Mgmt SW using a reference association, designated by a white diamond on the whole end. This is contrasted with the whole-part relationships designated with the black diamond between the GN&C and Power subsystems and the software. This enables the components to be aggregated in different ways for different purposes.

Fig. 5 The Software subsystem logically aggregates the software components using the reference association.

Specializing a Block from a More General Block
Another SysML relationship between blocks is the specialization relationship.


In particular, a more general block contains features, such as properties, functions, and interfaces, that a more specialized block can inherit. The more specialized block can then add its own unique features. This avoids having to define common features for each specialization and thereby facilitates reuse. A simple example is shown in Fig. 6, in which the Propulsion subsystem has two specializations representing a Mono-Propellant Propulsion and a Bi-Propellant Propulsion variant. Both specializations inherit the mass and fuel mass properties, as designated by the caret symbol (^). In addition, the Bi-Propellant Propulsion adds a unique property to specify oxidizer mass. The specialized subsystems can also redefine features of the more general subsystem. As shown in the parts compartment of each block, the more general Propulsion subsystem has one to two tanks, whereas the Mono-Propellant Propulsion has 1 tank and the Bi-Propellant Propulsion has 2 tanks.

Fig. 6 The mono-propellant propulsion and bi-propellant propulsion subsystems inherit common features from the more general propulsion subsystem.
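The generalization/specialization relationship has a loose parallel in class inheritance. The hypothetical Python sketch below mirrors the structure of Fig. 6: the specializations inherit mass and fuel mass, the bi-propellant variant adds an oxidizer mass property, and each variant redefines the number of tanks. The class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tank:
    capacity_kg: float = 0.0

@dataclass
class Propulsion:                      # general block: 1..2 tanks, mass, fuel mass
    tanks: List[Tank] = field(default_factory=list)
    mass_kg: float = 0.0
    fuel_mass_kg: float = 0.0

@dataclass
class MonoPropellantPropulsion(Propulsion):
    def __post_init__(self):
        self.tanks = self.tanks or [Tank()]          # redefined to exactly 1 tank

@dataclass
class BiPropellantPropulsion(Propulsion):
    oxidizer_mass_kg: float = 0.0                    # specialization adds a property
    def __post_init__(self):
        self.tanks = self.tanks or [Tank(), Tank()]  # redefined to exactly 2 tanks

bi = BiPropellantPropulsion(mass_kg=45.0, fuel_mass_kg=30.0, oxidizer_mass_kg=50.0)
print(len(bi.tanks), "tanks; inherited mass =", bi.mass_kg, "kg")
```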


Interconnecting Parts on an Internal Block Diagram
The system, subsystems, and components can be interconnected on an internal block diagram (ibd), as shown in the simplified example in Fig. 7. This diagram shows the connection and the flow of Electrical Power from the Launch Vehicle to the Spacecraft. This connection is further delegated to the Power Subsystem, which distributes the power to other subsystems such as GN&C. The general name for an interconnected element is a part. A port is an interaction point on a block or part that enables its interface to be specified. The lines connecting the ports are called connectors. A connector can connect parts directly or to ports on the parts. The frame of the diagram represents the higher level block that composes the Spacecraft and Launch Vehicle, which in this case is the Mission Context block, as indicated in the diagram header. When a part is shown as a dashed rectangle instead of a solid rectangle, that part is referenced by the higher level block.

Fig. 7 The internal block diagram shows the connection and flow of Electrical Power between the Launch Vehicle, Spacecraft, and some of its subsystems.

Definition vs Usage
SysML facilitates model reuse by distinguishing the definition of an entity from the usage of that entity in a particular context. In Fig. 8, the Avionics subsystem is composed of a primary and a backup computer. Both computers have the same definition but are identified as different usages. A usage in SysML is a part, and a definition is a block. In this example, primary and backup are parts, and the On-Board Computer is the block. The two parts are shown in the ibd in Fig. 9. The colon notation is used to designate the two parts as primary:on-board computer and backup:on-board computer.
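The definition-versus-usage distinction resembles the split between a class and its instances in programming. The following Python sketch is a hypothetical analogy, not SysML semantics: one definition (OnboardComputer) is used twice, as the primary and backup parts of the Avionics context.

```python
from dataclasses import dataclass

# Hypothetical analogy: a block is like a class (the definition) and a part is like
# a named attribute holding an instance of that class (the usage in a context).

@dataclass
class OnboardComputer:        # one definition (the block)
    serial: str

@dataclass
class Avionics:               # context that uses the definition twice
    primary: OnboardComputer
    backup: OnboardComputer

avionics = Avionics(primary=OnboardComputer("SN-001"),
                    backup=OnboardComputer("SN-002"))
# Same definition, two distinct usages:
print(type(avionics.primary) is type(avionics.backup))   # True
print(avionics.primary is avionics.backup)               # False
```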

Fig. 8 Different parts with the same definition.

Fig. 9 Referring to parts on an internal block diagram using the colon notation as partName:blockName.

Behavior Modeling
There are three different kinds of behavior representations in SysML: activity diagrams, sequence diagrams, and state machine diagrams. Use case diagrams are often used to support behavior modeling as well. Each of these diagrams is summarized below.

Using Activity Diagrams to Model Input, Output, and Control Flow
An activity diagram transforms a set of inputs to outputs through a controlled sequence of actions. Figure 10 shows a portion of an activity diagram to Provide Electrical Power. The enclosing frame represents the activity. This activity contains actions. Each action is performed by a component of the Power Subsystem, which is designated by swim lanes. Inputs and outputs to and from the activity are designated as rectangles on the frame, and the inputs and outputs to and from each action are shown as small rectangles on the action. The output of one action connects to the input of another through an object flow. The Perform Mission activity shown in Fig. 11 represents a typical mission scenario. The control flows are indicated by dashed lines with no rectangles on either end. The activity execution semantics are based on the flow of tokens: an action cannot execute until a token is available on all of its inputs. This enables the modeler to precisely specify a system behavior.



Fig. 10 Activity diagram showing its inputs and Power Subsystem actions to Provide Electrical Power.

There are several kinds of control nodes to control the flow of inputs, outputs, and control. These include the initial node and the activity final node, which designate the start and termination of the activity, as shown in Fig. 11. When a token is available on the initial node, the activity begins execution, and when a token reaches the activity final node, the activity terminates its execution. There are other control nodes, such as a decision node, which designates which outgoing flow executes based on the input satisfying a specified guard condition, and a join node, which cannot execute until a token on each incoming flow has arrived. There are many specialized kinds of actions. A signal is sent from an activity using a send signal action, and a signal is received by an activity using an accept event action. The sending and receiving of the signal correspond to events that can control the flow of execution. Actions can also call another activity to further elaborate its behavior. This can also be presented as a hierarchy of activities, similar to a traditional functional decomposition. Activity diagrams include many other features to precisely specify a behavior, such as the ability to interrupt selected actions based on the arrival of a signal, and streaming and time-continuous inputs and outputs.


Fig. 11 Activity diagram showing the actions and control flow to Perform Mission.
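The token-flow rule, that an action cannot execute until a token is available on all of its inputs, can be illustrated with a toy interpreter. The Python sketch below uses simplified action and flow names loosely based on Fig. 10; it is a thought experiment, not how a SysML tool executes activities.

```python
# Toy sketch of the token-flow rule: an action may execute only when a token is
# available on every one of its inputs. Action and flow names are illustrative.

actions = {
    "Generate Power":   {"inputs": ["solar radiation"],    "outputs": ["raw power"]},
    "Store Power":      {"inputs": ["raw power"],          "outputs": ["stored power"]},
    "Condition Power":  {"inputs": ["stored power"],       "outputs": ["conditioned power"]},
    "Distribute Power": {"inputs": ["conditioned power"],  "outputs": ["electrical power"]},
}

tokens = {"solar radiation"}            # the activity's initial input token

fired = True
while fired:
    fired = False
    for name, spec in actions.items():
        # an action fires once all of its input tokens are present,
        # and only if it has not already produced its outputs
        if all(t in tokens for t in spec["inputs"]) \
                and not all(o in tokens for o in spec["outputs"]):
            tokens.update(spec["outputs"])
            print(f"executed: {name}")
            fired = True

print("final tokens:", sorted(tokens))
```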



Fig. 12 A sequence diagram used to represent a mission timeline.

Using Sequence Diagrams to Model Message Exchange between Parts
A sequence diagram is used to describe behavior as a sequence of messages between parts. It can also represent a timeline of events, such as the simplified mission timeline shown in Fig. 12. The parts are represented as lifelines at the top of the figure, and the messages are sent from one part to another as indicated by the lines with arrowheads. Time advances down the vertical axis. In this simplified timeline, Launch Ops sends a Power On signal to the Spacecraft and then sends a launch signal to the Launch Vehicle. The time between the Spacecraft Separation Complete event and the Solar Array Deployed event is constrained to occur between tmin and tmax. The semantics of a message can correspond to a signal that is sent from one part to another, or to a sending part requesting a behavior of a receiving part, similar to calling a subroutine in software. Because of these semantics, sequence diagrams are often used to represent the interaction between software components. They can also be used to describe other system-level behavior, although activities provide a more robust capability for modeling continuous behavior and behaviors involving more complex flows of data and control.
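A duration constraint such as the {tmin..tmax} bound in Fig. 12 can be checked against a recorded event timeline. In the Python sketch below, the event names follow the figure, but the timestamps and the bound values are assumed purely for illustration.

```python
# Toy sketch: checking the {tmin..tmax} duration constraint from the mission
# timeline in Fig. 12. Event names follow the figure; the times and bounds are
# assumed values for illustration.

timeline = [
    ("Power On", 0.0),
    ("Launch", 60.0),
    ("Spacecraft Separation Complete", 560.0),
    ("Solar Array Deployed", 900.0),
]

t_min, t_max = 120.0, 600.0   # assumed allowable window in seconds

events = dict(timeline)
elapsed = events["Solar Array Deployed"] - events["Spacecraft Separation Complete"]
print(f"elapsed = {elapsed:.0f} s,",
      "within constraint" if t_min <= elapsed <= t_max else "constraint violated")
```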


Using State Machine Diagrams to Model the State History of a Block
A state machine diagram describes the discrete states of a block and the transitions from one state to another. A state represents a condition of a block, such as the off state or on state of a system or component. The transition between the off and on states may be triggered by the receipt of a signal, such as when a user turns the power on or off. A simple example of a state machine diagram for the Payload Sensor is shown in Fig. 13. In addition to defining the states, one can also define entry, exit, and do behaviors that occur when the system or component is within a particular state. These represent the behavior that occurs upon entry to the state, upon exit from the state, or while being in the state, respectively. The Payload Sensor has a do behavior in its on state called Sense Thermal Emissions. Behaviors can also be defined on transitions between states. For example, a signal can be sent on a transition, which triggers a state transition in another block. The entry, exit, do, and transition behaviors can be elaborated by activity diagrams or sequence diagrams, or with behaviors that are specified in code, called opaque behaviors. With an appropriate MBSE method, SysML can be used to integrate activities, state machines, and sequence diagrams to emphasize different aspects of the system behavior.

Fig. 13 State machine diagram showing the on and off states of the Payload Sensor, and the behavior that occurs while in the on state.
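A minimal sketch of the Payload Sensor state machine is shown below in Python, assuming just the off and on states, signal-triggered transitions, and a do behavior that runs only while the sensor is on. It is an analogy for the semantics described above, not generated from a SysML model.

```python
# Toy state machine for the Payload Sensor (Fig. 13): off/on states, signal-triggered
# transitions, and a "do" behavior while in the on state. A sketch, not a SysML engine.

class PayloadSensor:
    def __init__(self):
        self.state = "off"            # initial state; power = 0

    def signal(self, name):
        if self.state == "off" and name == "Sensor On Command":
            self.state = "on"
        elif self.state == "on" and name == "Sensor Off Command":
            self.state = "off"

    def step(self):
        # "do" behavior runs only while the block is in the on state
        if self.state == "on":
            print("sensing thermal emissions...")

sensor = PayloadSensor()
sensor.step()                          # off: nothing happens
sensor.signal("Sensor On Command")
sensor.step()                          # on: do behavior executes
sensor.signal("Sensor Off Command")
print("final state:", sensor.state)
```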


Using Use Case Diagrams to Model the Goals of a System
A use case diagram is generally used to describe the goals of a system, such as those associated with the mission objectives in Fig. 14. It consists of the use case (i.e., the goal), the system called the subject, and the external systems called actors that participate in achieving the goal. The actors in this figure include the Forest Service, the Operator, and the Fire Department. A goal of the Operator is to Provide Forest Fire Data in Real Time to the Fire Department. This supports the broader goal of the Forest Service to Detect and Monitor Forest Fires in the United States and Canada, as indicated by the include relationship. The actors can be depicted as either stick figures or rectangles. Once a use case is defined, activity diagrams, sequence diagrams, and/or state machine diagrams are used to elaborate the behavior needed to realize the goals.

Fig. 14 A use case diagram depicts the goals of the external users (i.e., actors) that the system is intended to support.

Modeling Text-Based Requirements and their Relationship to Design, Analysis, and Verification
A text requirement can be expressed in SysML and then related to other parts of the model to establish traceability to the system design, analysis, and test cases. This extends traditional requirements management approaches, which primarily focus on traceability relationships between requirements. The trace relationships can be navigated to provide robust impact assessments. The requirements relationships include satisfy, verify, refine, derive, and others. An example of traceability between a requirement and the corresponding design, analysis, and test case is shown in Fig. 15. The Sensing requirement for the Payload Sensor specifies the required resolution and sensitivity. This requirement is satisfied by the Resolution and Sensitivity of the Payload Sensor, as indicated by the satisfy relationship from the Mid Range IR Scanner. The required resolution and sensitivity are derived from the Spacecraft system requirements for Probability of Detection and orbit altitude. The rationale for the derivation is based on the Sensor Performance Analysis. The test case called Verify Sensor Resolution is used to verify whether the Payload Sensor satisfies its Resolution requirement.

Fig. 15 The Sensing requirement is derived from other requirements with supporting rationale. This requirement is also satisfied by the Mid Range IR Scanner and is verified by a test case.
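The satisfy and verify relationships can be thought of as navigable links from a requirement to design elements and test cases. The Python sketch below is a hypothetical illustration of that idea; the 3-milliradian threshold and the 2.714 mrad design value echo Fig. 15, while the class and function names are invented.

```python
# Toy sketch of requirement traceability: a design element satisfies a requirement,
# and a test case verifies it. Everything except the figures quoted from Fig. 15
# is illustrative.

class Requirement:
    def __init__(self, rid, text):
        self.rid, self.text = rid, text
        self.satisfied_by, self.verified_by = [], []

sensing = Requirement("80.3.1", "Resolution shall be less than 3 milliradians.")

mid_range_ir_scanner = {"resolution_mrad": 2.714}     # design value shown in Fig. 15
sensing.satisfied_by.append("Mid Range IR Scanner")   # satisfy relationship

def verify_sensor_resolution(design):                 # verify relationship (test case)
    return design["resolution_mrad"] < 3.0

sensing.verified_by.append(verify_sensor_resolution)

for test in sensing.verified_by:
    print(f"Requirement {sensing.rid}:",
          "PASS" if test(mid_range_ir_scanner) else "FAIL")
```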


Requirements and their relationships can also be presented in requirements tables, which provide a compact way to present the requirements and their relationships.

Modeling Parametric Constraints on System Properties to Support Engineering Analysis
As noted previously in the structural modeling section, a block can contain value properties that represent important quantifiable characteristics of the system or its components, such as mass, power consumption, size, reliability, or cost. In addition to capturing the key properties of a system, SysML enables the capture of constraints, such as Newton's law (e.g., Force = Mass × Acceleration). The parametric diagram in Fig. 16 shows a constraint called Orbit Analysis Model with some of its input and output parameters designated as squares flush with the inside boundary. These parameters include key orbit parameters and selected mission effectiveness measures for revisit time and coverage. The Orbit Analysis Model in the figure is a proxy for an analytical model that can perform the computation. The analytical model can be thought of as a set of equations. The parameters of the analysis model are bound to the corresponding properties of the mission context and spacecraft, as indicated by the binding connectors with the "equal" keyword. The parameter names in the constraint and the property names in the parts do not have to match, although they do in this example, enabling the analytical model to be reused with other design models.
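The idea of a constraint block bound to design properties can be illustrated with a small computation. The Python sketch below evaluates one plausible equation inside an orbit analysis, the orbital period from the semi-major axis via Kepler's third law; the 700 km altitude is an assumed value, not the FireSat II design point.

```python
import math

# Toy sketch of a parametric constraint: a constraint computes a derived quantity
# from bound design properties. Here, orbital period from semi-major axis.
# The altitude value is assumed for illustration.

MU_EARTH = 3.986004418e14          # m^3/s^2, Earth's gravitational parameter
R_EARTH = 6378.137e3               # m, equatorial radius

def orbit_period_constraint(semi_major_axis_m):
    """Constraint: T = 2*pi*sqrt(a^3 / mu)."""
    return 2 * math.pi * math.sqrt(semi_major_axis_m ** 3 / MU_EARTH)

# Design property on the spacecraft's orbit, bound to the constraint parameter
spacecraft_orbit = {"semi_major_axis_m": R_EARTH + 700e3}

period_s = orbit_period_constraint(spacecraft_orbit["semi_major_axis_m"])
print(f"orbital period = {period_s / 60:.1f} minutes")
```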




Fig. 16 A parametric diagram binds the parameters of the constraint to the properties of the design.

The binding relationship enables the integration between the system design model and multiple analytical models that perform the analysis. In this way, SysML provides a means to identify critical properties, reconcile the critical design properties with the analysis models, and trace the analysis results to requirements.

Organizing the Model with Packages
The SysML model contains model elements. Each model element represents something expressible in SysML, such as a system, a component, a component feature such as its property, function, or interface, or a relationship between components, such as a whole-part relationship. A system model can get very large as details are added.


An important aspect of modeling with SysML is the need to manage the model. This begins with a well-defined model organization using packages that contain the model elements. A package is like a folder and provides a way to organize the model into logical groupings. For example, packages can contain the Requirements, Structure, Behavior, and Parametrics, as shown in Fig. 17. In this example, the Structure package contains nested packages for the different levels of design, which include Mission, System, and Subsystem. The other packages for Requirements, Behavior, and Parametrics can contain similar nested packages for each level of design. This model organization also includes a Support Elements package that contains other model elements that cross-cut different levels of design, such as a units library and items that flow through the system. In this way, the model is organized into a tree of packages, where each package may contain other nested packages and model elements. There is no single way to organize the model, any more than there is a single way to organize what goes into your cabinets and drawers. However, a model that is organized based on the product structure and the four pillars (requirements, structure, behavior, and parametrics) is a reasonable place to start.

Fig. 17 The package diagram describes the model organization in terms of a set of nested packages.

Diagram vs Model
Building the system model consists of populating the packages with model elements that describe the system and environment of interest. This is accomplished by adding elements to, modifying elements in, and deleting elements from the model, using the SysML diagrams as a means for inputting and visualizing the data. Each diagram is a view that presents a subset of the information contained in the model.



For example, an ibd may show the connection between the components and/or subsystems. This diagram is not typically used to present other information contained in the model, such as the functions the components perform. Each diagram highlights different aspects of the system. The model is a tree of model elements, and each model element has a unique position in the tree. However, any model element may appear on zero, one, or many diagrams. For example, the Spacecraft block appeared on multiple diagrams, but it is uniquely defined within the model tree. When a model element is modified, the change to the element is reflected in all diagrams in which the element appears.

Learning SysML and MBSE
There is a lot to learn to apply MBSE effectively, including a modeling language, an MBSE method, and a modeling tool. Each of these has its own complexities. SysML is an expressive language that can be used in many ways. It can achieve its intended purpose, or it can be a frustrating effort that accomplishes little. Like any language, it takes time and practice to become proficient in its application. There are several references [Friedenthal et al. 2014; Delligatti, 2013] that describe the language and how it can be used. It is important to note that the model is only as good as the data that goes into it and should be subject to ongoing peer review by domain experts. In addition, it is important to apply a disciplined process using proven MBSE practices and guidelines.

D. SYSTEM MODELING TOOL
A system modeling tool is used to build the system model. It provides the basic capabilities to create, modify, delete, and view the model. A modeling tool also provides model checking to ensure that the models conform to the language specification or other validation rules. A simple example is that a modeling tool should permit a system element to satisfy a requirement, but not allow a requirement to satisfy a system element. There are many other capabilities that different modeling tools offer, such as the abilities to execute an activity diagram, to autogenerate documentation, to query the model, and to exchange data via the tool's application program interface (API). The tool should also support standards that enable integration with other modeling tools and data sources. Each tool has its strengths, weaknesses, and price point. Different vendor tools should be carefully evaluated to select a tool and environment that meet your individual, project, and organizational needs.

A typical modeling tool user interface is shown in Fig. 18. The interface includes the following key elements:
1. A browser that depicts the model containment tree.
2. A diagram area to create, modify, and view the diagrams.


3. A palette with the diagram symbols used to create the diagrams.
4. A toolbar with the menu options that provide the tool functionality.

Fig. 18 A typical system modeling tool interface that includes the diagram area, browser, palette, and toolbar.

The tool typically provides a dialogue box to view the detailed information associated with a particular model element. For example, the dialogue box for a block may include its properties, functions, ports, and its relationships with other model elements. In addition, the modeling tool provides mechanisms to navigate the model, such as locating a model element in the browser. There are many other features available to exploit the rich dataset that a system model contains.

II. SPACECRAFT SYSTEM DESIGN EXAMPLE
This section provides an example of how a model-based approach can be applied to specify, design, analyze, and verify a complex system such as a spacecraft. The example leverages the spacecraft design and assumptions from the FireSat II example in Space Mission Engineering: The New SMAD [Wertz et al. 2011]. It demonstrates the application of a model-based approach with SysML by using model artifacts to capture selected aspects of the mission and system specification and design. A reader of this section should refer to The New SMAD for the detailed specification and design, rationale, and assumptions.


This section first introduces a simplified MBSE method and then highlights the application of this method to the FireSat II example.

A. MBSE METHOD OVERVIEW
The top level activities for a simplified MBSE method are highlighted in the Mission and System Specification and Design process in Fig. 19. The activities are consistent with a typical systems engineering process, such as those found in the ISO/IEC 15288 standard on system life cycle processes [ISO/IEC, 2008].


Fig. 19 The Mission and System Specification and Design process is applied to the design of a spacecraft using a model-based approach.


Applying this method results in a system model that is captured in modeling artifacts which represent the mission and system specification and design information. This method is applied iteratively to evolve the system model in its breadth, depth, and fidelity throughout the system development lifecycle. The method is applied during the early phases of design to develop the system conceptual design and is then iterated in later phases to evolve the mission and system specification and design through preliminary and detailed design. The following summarizes the basic activities and modeling artifacts.

Plan the modeling effort to maximize value from the model within the project constraints.
1. Establish the purpose and goals of the modeling effort.
2. Define the scope of the model to meet its intended purpose.
3. Define the schedule for developing each modeling artifact.
4. Establish the configuration management approach for managing updates to the model.
5. Define how the modeling effort will be organized to develop and maintain the model.
6. Select the MBSE methodology and applicable modeling practices.
7. Select, acquire, and configure the modeling tools and environment.
8. Determine the training needs for the modeling effort.

Organize the model to facilitate navigation, access control, and reuse.
1. Define the package diagram to organize the data contained in the system model.
2. Establish modeling conventions.

Analyze mission and stakeholder needs to understand the stakeholder needs and develop the mission concept needed to satisfy them.
1. Identify the stakeholders and the problems to be addressed.
2. Define the top level use cases to represent the mission objectives.
3. Develop the mission requirements that support the mission objectives.
4. Create the domain model (e.g., bdd) to identify the system and external systems and users.
5. Define the measures of effectiveness (moe) that can be used to quantify the value of a proposed solution for the stakeholders.
6. Create the mission activity diagrams to represent the mission level behavior.


Specify system requirements, which include the required system functionality, interfaces, physical and performance characteristics, and other quality characteristics to support the mission objectives and effectiveness measures, and create the system context diagram (ibd) to specify the system external interfaces.
1. Elaborate the mission activity diagrams as needed to specify the system functional requirements.
2. Capture the system black box specification that specifies the system functions, interfaces, and key technical performance measures.
3. Capture text-based requirements that support the mission objectives and moe in a requirement diagram.

Synthesize alternative system architectures by partitioning the system design into subsystems and components that can satisfy the system requirements.
1. Decompose the system into its components on a bdd.
2. Define the interconnection among the parts using the ibd.
3. Define the interaction among the parts using activity diagrams.
4. Specify the components of each subsystem.

Perform Analysis to evaluate and select a preferred system solution that satisfies the system requirements and maximizes the effectiveness measures (Note: this is done concurrently with the previous activities; a small illustrative analysis sketch follows this method summary).
1. Capture the analysis context (bdd) to identify the mission and system analyses to be performed, such as orbit analysis, delta-V, mass properties, power, reliability, cost, and other critical properties.
2. Capture each analysis as a parametric diagram.
3. Perform the engineering analysis to determine the values of the system properties (Note: the analysis is performed using engineering analysis tools).

Maintain requirements traceability to ensure that the proposed solution satisfies the system requirements and stakeholder needs (Note: this is done concurrently with the previous activities).
1. Capture the traceability between the system requirements and the stakeholder needs (e.g., use cases, measures of effectiveness) on a requirements diagram (or table).
2. Capture the traceability between the system design and the system requirements on a requirements diagram (or table).
3. Identify test cases needed to verify the system requirements on a requirements diagram and capture the verification results.

Integrate and verify the system by defining the test environment and executing the test cases to verify that the system satisfies its requirements and to validate that the system satisfies the stakeholder needs.


1. Develop the verification context that maps test cases to test objectives and defines the verification environment.
2. Define how the test cases are performed (e.g., high-level test procedures) using an activity diagram.
3. Define the sequence of test cases to support test planning.

These activities are presented sequentially in the rest of the chapter but, in practice, are applied iteratively and in an order that depends on the objectives for each particular increment of design and development. More detailed examples of how SysML can be used to support a functional analysis and allocation method and an object-oriented systems engineering method (OOSEM) are included in the modeling examples in Chapters 16 and 17, respectively, of A Practical Guide to SysML [Friedenthal et al. 2014]. Many other systems engineering techniques can be applied to the design of a spacecraft and integrated with this end-to-end MBSE method, such as those described in other chapters of this book. These include techniques used in the ROSETTA approach, such as quality function deployment (QFD) for requirements elicitation and analysis (Chapter 5); design of experiments (DoE) for performing design sensitivity analysis and design optimization (Chapter 5); robust and resilient design techniques to ensure that a spacecraft can operate and recover from failure and/or off-nominal conditions (Chapter 3); and system of systems integration and analysis techniques, including interoperability analysis, to understand how to integrate the spacecraft into its system of systems context, which includes the launch vehicle, the ground station and its operators, external users, and other mission elements (Chapter 1).
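As a concrete illustration of the Perform Analysis activity noted above, the Python sketch below rolls up a dry mass from an assumed equipment list and estimates available delta-V with the rocket equation. Every number in it is a placeholder, not a FireSat II value.

```python
import math

# Illustrative sketch of the kind of engineering analysis the Perform Analysis
# activity refers to: a mass roll-up over the system breakdown and a delta-V
# estimate from the Tsiolkovsky rocket equation. All numbers are assumed.

subsystem_mass_kg = {         # dry-mass equipment list (assumed values)
    "Payload": 28.0,
    "GN&C": 12.0,
    "Power": 35.0,
    "Propulsion (dry)": 15.0,
    "Structure": 30.0,
    "Avionics": 10.0,
}
propellant_mass_kg = 20.0
isp_s = 220.0                 # assumed specific impulse of a mono-propellant thruster
g0 = 9.80665                  # m/s^2

dry_mass = sum(subsystem_mass_kg.values())
wet_mass = dry_mass + propellant_mass_kg
delta_v = isp_s * g0 * math.log(wet_mass / dry_mass)   # rocket equation

print(f"dry mass = {dry_mass:.1f} kg, wet mass = {wet_mass:.1f} kg")
print(f"available delta-V = {delta_v:.1f} m/s")
```

In an MBSE workflow, the mass and propulsion properties used here would be drawn from, and reconciled with, the value properties captured in the system model.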

B. PLAN THE MODELING EFFORT
The purpose of this activity is to develop a plan that maximizes the value of the model to its stakeholders within the project constraints. It is preferred practice to develop a plan for each project phase in terms of design increments or capability deliveries within the phase. The overall plan for each phase includes the selection of the MBSE methodology, tools, and training, and how to organize the modeling effort to achieve the project objectives. The increment plans focus on the increment objectives and on delivering specific modeling artifacts within the increment schedule. These plans are integrated with the overall project plans.

The purpose of the modeling effort for each project phase and design increment must be clear. The broad goals may be to ensure a consistent and cohesive architectural description that spans multiple discipline and subsystem views, to support requirements flowdown that ensures the quality of the mission, system, subsystem, and component requirements, and to support verification. The specific modeling goals for a design increment support the critical information needs to create or update selected modeling artifacts such as the mass equipment list, system block diagram, interface specifications, and selected analyses and trade studies.


The scope of the effort defines the breadth, depth, and fidelity of the model to meet its intended purpose. For example, during an early phase of design, the model may support development of the system breakdown and mass estimates and may not require detailed modeling of component interfaces and software specifications.

The modeling schedule supports the project schedule. The longer term schedule identifies which modeling artifacts are needed to support each project milestone, and the increment schedules define intermediate milestones to develop the modeling artifacts, including peer reviews and increment baselines.

The configuration management approach is established to manage updates to the model. During the early phase of design, the control on the baseline is less constrained than in later phases. Typically, the configuration management process enables different users to check out parts of the model, make changes, and check the model back in.

The organization and roles of the people involved in the development and use of the model evolve as the modeling practice matures on the project. During the early phases of a project, a small core modeling team often defines the modeling practice and performs the modeling to ensure that the modeling practices are adhered to. The broader team members provide input to the core team and review the modeling artifacts to ensure that their inputs are accurately reflected in the model. As the modeling practice matures on the project, custom interfaces such as web browsers may be developed to enable the broader team to enter data directly into the model and view data from the model in ways that are meaningful to them. The core modeling team continues to ensure the overall integrity of the model and adapts the modeling practices and tools to support the different needs of the project.

The MBSE methodology and applicable modeling practices are selected to achieve the model purpose and scope. The methodology may implement various aspects of the project organization's systems engineering process, such as the one used in this chapter. The specific modeling practices should be piloted and validated to ensure that they yield the desired results. For example, the practice for modeling failure modes should facilitate the failure modes and effects analysis (FMEA).

The modeling tools are often selected based on organizational considerations, since the organization makes substantial investments in maintaining its engineering tools and environments. However, there are often specific customizations and integrations with other tools that are needed to support the project-specific model purpose, scope, and practices.

The training needs for the modeling effort include who needs what training, when the training is needed, and how it will be delivered. For example, in the early phases of a project, the small core modeling team requires more intensive training in the modeling language, tool, methodology, and practices, whereas other members of the project team may only require training to enable them to interpret the modeling artifacts.


C. ORGANIZE THE MODEL
The purpose of this activity is to define the model structure to organize the mission and system specification and design data to facilitate navigation, access controls, and reuse. In a document-based approach, the documents needed to support the mission and system specification and design activities are identified and often organized in a document tree that is then reflected in the structure of a document repository. In this model-based approach, the system model is organized into a package structure as described in Sec. I.C. The traditional documents can still be generated by querying the system model and presenting the information in a way that is useful to the stakeholders.

Figure 20 shows the model organization for the spacecraft model as a package structure that consists of a top level package with nested packages contained within it. Each package contains model elements and is analogous to a folder that contains data. The model elements in different packages can be related to one another through model relationships.


Fig. 20 The model organization includes packages to capture specification and design data for the mission, Spacecraft, other mission elements, and supporting model elements.


The top level package is called the Spacecraft Mission Context, which contains nested packages that define the mission Requirements, Structure, Use Cases, Behavior, and Analysis. The Spacecraft package contains nested packages to define the Spacecraft specification and design, including the Black Box Specification, Physical Architecture, and Verification. The Ground Station, Launch Vehicle, and Launch Operations packages contain the model elements that describe them. The Support Elements package contains supporting information in the Interface Definitions, Value Types, and Viewpoints packages. The Interface Definitions package contains model elements that can be reused throughout the system, such as flows, signals, and port definitions. The Value Types package contains a library of quantities and units. The Viewpoints package contains the viewpoints that identify stakeholders and their concerns. These packages are initially empty containers but are populated with data as one applies the MBSE methodology. This iterative process enables continuous refinement of the information about the mission and system specification and design. The sections in the rest of the chapter describe the model elements that are contained in these packages.

Establish Modeling Conventions
This activity also includes establishing modeling conventions to be applied throughout the modeling effort. Examples include naming conventions for different kinds of elements, diagram naming conventions, and diagram layout conventions. For this example, the naming conventions include the following:
1. Blocks, Activities, and other Classifiers begin with upper case.
2. Parts, properties, and actions begin with lower case.
3. Port names are appended with i/f (for interface).
4. Activity names are defined as verb-noun.

Another example is to establish conventions for annotating the model. For example, each model element can include a text description that can be included in a glossary and used to support the generation of documentation from the model. Annotations are also used to capture design rationale. Other examples of modeling conventions include the selection of when to use a particular kind of port to specify the interfaces.
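Naming conventions like these can be spot-checked mechanically. The Python sketch below encodes a simplified interpretation of the four rules above; the sample element names are made up, and a real project would tailor the checks to its own conventions.

```python
import re

# Toy convention checker for the naming rules listed above. The rules encoded here
# are simplified interpretations; sample names are illustrative only.

def check(kind, name):
    if kind in ("block", "activity"):              # Classifiers begin with upper case
        ok = name[:1].isupper()
    elif kind in ("part", "property", "action"):   # usages begin with lower case
        ok = name[:1].islower()
    elif kind == "port":                           # port names end with "i/f"
        ok = name.endswith("i/f")
    elif kind == "activity_name":                  # verb-noun, e.g. "Provide Power"
        ok = bool(re.fullmatch(r"[A-Z][a-z]+( [A-Za-z&/]+)+", name))
    else:
        ok = True
    return ok

samples = [("block", "Spacecraft"), ("part", "primary"),
           ("port", "lv electrical i/f"), ("activity_name", "Provide Electrical Power")]
for kind, name in samples:
    print(f"{kind:14s} {name!r:30s} -> {'ok' if check(kind, name) else 'violates convention'}")
```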

D. ANALYZE MISSION AND STAKEHOLDER NEEDS
The purpose of this activity is to analyze the mission and stakeholder needs and to establish the mission objectives, measures of effectiveness, mission requirements, and mission concept that address those needs. The activity includes identifying the stakeholders and determining what problem the mission must address, which in this case is to reduce the damage caused by wildfires across the United States. The mission objectives, measures of effectiveness, and mission requirements are derived from the needs.


The mission concept, including the definition of the mission elements and their functions, is developed to satisfy the mission requirements and optimize the mission within the specified technical, cost, schedule, and regulatory constraints and risk thresholds.

Identify Stakeholders and their Concerns
Some of the FireSat II stakeholders are identified in Fig. 21. For this example, the primary stakeholders include the end users, in this case the Forest Service and the Fire Department, and the mission sponsor, who provides the funding and program oversight to address the stakeholder needs. The concepts of viewpoint and view are used to capture the stakeholders' concerns. A viewpoint specifies stakeholder concerns in terms of what the stakeholder cares about. A view conforms to a particular viewpoint by presenting the relevant information from the system model that addresses those concerns. A view is presented to the stakeholder as a diagram, a table, an entire document, or any other artifact. The format of the artifact can also be specified, such as a document in Word, PDF, or HTML.
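The viewpoint-and-view pattern can be illustrated outside of a modeling tool with the following sketch, in which a viewpoint records a stakeholder and its concerns, and a conforming view is produced by querying a toy stand-in for the system model. The stakeholder and the example concerns echo the FireSat II discussion; the data structures and the render_view function are a hypothetical Python illustration, not a SysML construct or a tool API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the viewpoint/view pattern: a viewpoint records a
# stakeholder and its concerns; a view is produced by querying the model for
# the information that addresses those concerns. The toy model content is
# drawn from the FireSat II example; the structures are not a tool API.

@dataclass
class Viewpoint:
    stakeholder: str
    concerns: list


# Toy stand-in for the system model: concern -> relevant model information.
toy_model = {
    "operate the spacecraft": [
        "Use case: Monitor and Maintain Health & Safety of FireSat II",
        "Command and telemetry definitions",
    ],
    "timely fire data": [
        "Use case: Provide Forest Fire Data in Near Real Time",
    ],
}

operator_viewpoint = Viewpoint(stakeholder="Operator",
                               concerns=["operate the spacecraft"])


def render_view(viewpoint, model):
    """Build a simple text view that conforms to the given viewpoint."""
    lines = [f"View for {viewpoint.stakeholder}"]
    for concern in viewpoint.concerns:
        lines.append(f"  Concern: {concern}")
        for item in model.get(concern, []):
            lines.append(f"    - {item}")
    return "\n".join(lines)


print(render_view(operator_viewpoint, toy_model))
```

A view for a different stakeholder would simply pair a different set of concerns with the same underlying model, which is the essence of the pattern: one model, many stakeholder-specific presentations.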

Fig. 21 The viewpoints capture the stakeholders and their concerns, and the conforming views present the specific information from the model to address their concerns.


Viewpoint and view are important architecture concepts that provide a means to explicitly capture and address a broad spectrum of stakeholder concerns in the model. The viewpoint and view concept provides a mechanism to query the model for the information that addresses these concerns and to present the resulting information in a way that is useful to each stakeholder. For FireSat II, the Forest Service is concerned about the loss of life and property caused by forest fires. The Fire Department is concerned about its ability to put out a fire in a timely manner to prevent the loss of life and property. The sponsor cares about the ability to satisfy the mission requirements within cost and schedule. The development contractor is concerned about establishing a feasible design that satisfies the requirements, and the Operator is concerned about his or her ability to operate the spacecraft to achieve the mission objectives. An Operator's view, for example, may focus on command and telemetry. As the project evolves, other stakeholders are identified and defined in the model in a similar way. Each set of stakeholder concerns can impose additional requirements and constraints on the mission and system. The stakeholders, viewpoints, and views are captured in the Viewpoints package shown in the model organization in Fig. 20.

Define the Mission Objective(s)
The mission objectives should relate directly to stakeholder value. For the New SMAD, a primary objective of the Forest Service is for FireSat II to "detect and monitor forest fires in United States and Canada using as little financial resources as possible," and an operator objective is to "provide forest fire data to the end user in real time, archive data, monitor, and maintain the health and safety of FireSat II." These objectives are reflected in the use case diagram in Fig. 22. As emphasized in The New SMAD, the mission and system specification and design process is intended to provide balanced solutions that meet these objectives within cost, schedule, and risk constraints.

Specify the Mission Requirements
The mission requirements are specified to support the mission objectives. The New SMAD specifies the mission requirements shown in Table 1, which is the source of the mission requirements for this example. The requirements are captured in the model and can be presented in both tabular and graphical views.
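As a small illustration of presenting requirements captured in the model as a tabular view, the sketch below holds a few of the mission requirements from Table 1 as plain records and prints them as a text table. The ids, names, and requirement text come from the table; the record type and the rendering code are a hypothetical Python sketch, not how any particular SysML tool stores or reports requirements.

```python
from dataclasses import dataclass

# Hypothetical sketch: a few of the FireSat II mission requirements (from
# Table 1, sourced from The New SMAD Table 3-4) held as plain records and
# rendered as a simple tabular view.

@dataclass
class Requirement:
    req_id: str
    name: str
    text: str


mission_requirements = [
    Requirement("34.1.2", "Resolution", "50 meter resolution"),
    Requirement("34.2", "Coverage",
                "Coverage of specified forest areas within the US at least twice daily"),
    Requirement("34.4", "Timeliness", "Interpreted data to end user within 5 minutes"),
    Requirement("34.7", "Mission Design Life", "8 years"),
]

# Render a minimal tabular view of the requirements.
print(f"{'Id':<8} {'Name':<22} Text")
for req in mission_requirements:
    print(f"{req.req_id:<8} {req.name:<22} {req.text}")
```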

[Figure: use case diagram "Mission Objectives [Mission Use Cases]" showing the «stakeholder» actors Forest Service, Fire Department, and Operator associated with the use cases Detect and Monitor Forest Fires in US and Canada, Provide Forest Fire Data in Near Real Time, Archive Data, and Monitor and Maintain Health & Safety of FireSat II, with an «include» relationship among the use cases. Reference: New SMAD p. 52.]

Fig. 22 The use case diagram reflects the mission objectives.


TABLE 1 THE SOURCE MISSION REQUIREMENTS ARE CAPTURED IN THE MODEL AS DEFINED IN TABLE 3-4 OF THE NEW SMAD

#  | Id     | Name                                | Text
1  | 34     | Mission Requirements-SMAD Table 3-4 |
2  | 34.1   | Performance                         |
3  | 34.1.1 | Weather                             | Work through light clouds
4  | 34.1.2 | Resolution                          | 50 meter resolution
5  | 34.1.3 | Geo-location Accuracy               | 1 km geolocation accuracy
6  | 34.2   | Coverage                            | Coverage of specified forest areas within the US at least twice daily
7  | 34.3   | Interpretation                      | Identify an emerging forest fire within 8 hours with less than 10% false positives
8  | 34.4   | Timeliness                          | Interpreted data to end user within 5 minutes
9  | 34.5   | Secondary Missions                  | Monitor changes in mean forest temperature to +/- 2 C
10 | 34.6   | Commanding                          | Commandable within 3 min of events; download units of stored coverage areas
11 | 34.7   | Mission Design Life                 | 8 years
12 | 34.8   | System Availability                 | 95% excluding weather; 24 hour maximum downtime
13 | 34.9   | Survivability                       | Natural environment only; not in radiation belts
14 | 34.10  | Data Distribution                   | Up to 500 fire-monitoring offices + 2,000 rangers worldwide (max of 100 simultaneous users)
15 | 34.11  | Data Content, Form, and Format      | Location and extent in lat/long for local plotting; avg temp for each 40 m2
16 | 34.12  | User Equipment                      | 10 x 20 cm display with zoom and touch controls, built-in GPS quality map
17 | 34.13  | Cost                                | Non-recurring
18 | 34.14  | Schedule                            |
19 | 34.15  | Risk                                |
20 | 34.16  | Regulations                         |
21 | 34.17  | Political                           |
22 | 34.18  | Environment                         |
23 | 34.19  | Interfaces                          |
24 | 34.20  | Development Constraints             |