Reliability Engineering and Computational Intelligence for Complex Systems: Design, Analysis and Evaluation (Studies in Systems, Decision and Control, 496) [1st ed. 2023] 3031409965, 9783031409967

This book offers insight into the current issues of the merger between reliability engineering and computational intelligence.


English · Pages: 233 [224] · Year: 2023


Table of contents :
Preface
Contents
Mathematical Methods for Reliability Engineering and Computational Intelligence
Experimental Survey of Algorithms for the Calculation of Node Traversal Probabilities in Multi-valued Decision Diagrams
1 Introduction
2 Reliability Analysis
2.1 Structure Function
2.2 Basic Reliability Measures
2.3 Series–Parallel Systems
3 Decision Diagrams
3.1 Node Traversal Probability
3.2 Depth-First Search
3.3 Breadth-First Search
4 Experimental Comparison
5 Conclusion
References
Reliability Analysis of Data Storage Using Survival Signature and Logic Differential Calculus
1 Introduction
2 Redundant Array of Independent Disks
3 Mathematical Background
4 Case Study
5 Conclusion
References
Digital Techniques for Reliability Engineering and Computational Intelligence
Software Tests Quality Evaluation Using Code Mutants
1 Introduction
2 Mutation Testing
2.1 Mutation Testing Metrics
2.2 Algorithm of Mutation Testing
3 Software Tests Quality Evaluation
4 Model Example
5 Conclusion
References
Hacking DCNs
1 Introduction
2 Previous Work
3 Background
4 Methodology
4.1 Classification Evaluation
4.2 Misclassification Label Prediction
5 Experimental Results
5.1 Age and Gender Sensitivity
5.2 Prediction of Label Changes
6 Discussion
7 Conclusion
References
Markov Model of PLC Availability Considering Cyber-Attacks in Industrial IoT
1 Introduction
2 PLC Architecture
3 Evaluation of the Dependability of the PLC Considering Dos-Attacks on Its Components
4 Simulation of the Markov Model of PLC Availability
5 Conclusion
References
Advanced Networking and Cybersecurity Approaches
1 Motivation
2 State-Of-The-Art
2.1 Firewall Techniques
2.2 Blockchain Techniques
2.3 CIDN Deployment
3 Network Planning with Segmenting Within a Campus LAN
3.1 Unsegmented Networks
3.2 Segmenting Best Practices
3.3 Conventional Cybersecurity Approaches
4 Foundations for Advanced Cybersecurity
4.1 Open Web Application Security Project
4.2 MITRE Corporation
4.3 SIEM Market
5 Honeypotting for Advanced Security
5.1 Honeypotting with Gateways and Firewalls
5.2 Honeypotting and Vulnerability Monitoring
5.3 Production Honeypots
5.4 Research Honeypots
5.5 Practical Honeypotting
6 Conclusion
References
Use Cases for Reliability Engineering and Computational Intelligence
Application of Machine Learning Techniques to Solve the Problem of Skin Diseases Diagnosis
1 Introduction
2 Theoretical Background
3 Research Methods
3.1 Input
3.2 Pre-processing with Sobel
3.3 Brightness Normalization
3.4 Pre-processing with PCA
3.5 CNN for Image Detection and Classification
4 Results
4.1 Input Data Preprocessing
5 Conclusions
References
Analyzing Biomedical Data by Using Classification Techniques
1 Introduction
2 Metabolomics
2.1 Analysis of Metabolomic Data
3 Datamining Techniques
3.1 Tools for Analyzing Metabolomics Data
3.2 Glioblastoma Multiforme Data Analysis
4 Decision Tree Induction
4.1 Experimental Settings
5 Conclusion
References
Wildfire Risk Assessment Using Earth Observation Data: A Case Study of the Eastern Carpathians at the Slovak-Ukrainian Frontier
1 Introduction
2 Risk Assessment Methodology
2.1 Approach Concept
2.2 Applying Earth Observation Data
2.3 Risk Evaluation
3 A Case Study of the Eastern Carpathians at the Slovak-Ukrainian Frontier
3.1 Study Area
3.2 Fuels Data
3.3 Earth Observation Data Time Series
3.4 Risk Map
4 Discussion
5 Conclusions
References
Digital Safety Delivery: How a Safety Management System Looks Different from a Data Perspective
1 Introduction
2 Data Analysis and BowTies
2.1 Time-Series Data Analysis
2.2 BowTies
2.3 Complex Barriers
3 Online Process Safety Performance Indicators and Safety Management Systems Using Big Data
3.1 Online PSPIs
3.2 Safety Management Systems with Big Data
4 Conclusion
References
Reliability Optimization of New Generation Nuclear Power Plants Using Artificial Intelligence
1 Introduction
1.1 New Generation Nuclear Power Plants
1.2 Risk
1.3 Probabilistic Risk Analysis
1.4 Artificial Intelligence. Evolutionary Algorithms
1.5 Contributions from Different Perspectives
1.6 The Focus on This Work
2 Definition of the Objective Function and Its Constraints
3 Definition of Artificial Intelligence
4 Results
5 Discussion and Conclusions
References
Algorithmic Management and Occupational Safety: The End Does not Justify the Means
1 Introduction
1.1 Reliability Engineering, Predictive-Based Safety and Algorithmic Management
1.2 Responsible Use of Algorithmic Management
2 The Many Forms of Algorithmic Management
2.1 Personal Protective Equipment and Computer Vision
2.2 Control on the Shop Floor
2.3 Safe Driving Behaviour
3 Guidelines for Responsible Algorithmic Management in Safety
3.1 The Conflicting Goals of Health and Productivity
3.2 Fighting the System
3.3 The Desired and Mandatory Transparency
3.4 What Laws Apply Now
4 Conclusion
5 Discussion
References
Technologies and Solutions for Smart Home and Smart Office
1 Motivation and the Aims of the Work
2 Challenges for Smart Home and Smart Office
3 System Integrators for Smart Office
3.1 Secure IoT Platforms for Smart Office
3.2 Scenario 1: Automatization Sensors via NB-IoT
3.3 Scenario 2: Energy-Efficient EnOcean Sensor Constellation
4 Platforms for Easy Smart Home Integration
4.1 Tuya IoT Development Platform
4.2 Home Assistant
4.3 Azure IoT Hub
4.4 Heterogeneous Automatization Example for Smart Home
4.5 Testing of MAKS PRO System Based on LoRaWAN for Deployment of Smart Homes and Smart Offices
5 Advanced Security for IoT and IIoT
6 Designing a Unique IoT System Using Edge/Cloud Computing and Artificial Intelligence
6.1 Configuring Data Collection and Analysis for the Designed IoT System
6.2 IoT System Testing
6.3 Testing an Intelligent IoT System for Temperature Forecasting in the Smart Office Server Room
7 Conclusion
References

Studies in Systems, Decision and Control 496

Coen van Gulijk · Elena Zaitseva · Miroslav Kvassay, Editors

Reliability Engineering and Computational Intelligence for Complex Systems Design, Analysis and Evaluation

Studies in Systems, Decision and Control Volume 496

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Systems, Decision and Control” (SSDC) covers both new developments and advances, as well as the state of the art, in the various areas of broadly perceived systems, decision making and control–quickly, up to date and with a high quality. The intent is to cover the theory, applications, and perspectives on the state of the art and future developments relevant to systems, decision making, control, complex processes and related areas, as embedded in the fields of engineering, computer science, physics, economics, social and life sciences, as well as the paradigms and methodologies behind them. The series contains monographs, textbooks, lecture notes and edited volumes in systems, decision making and control spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Coen van Gulijk · Elena Zaitseva · Miroslav Kvassay Editors

Reliability Engineering and Computational Intelligence for Complex Systems Design, Analysis and Evaluation

Editors

Coen van Gulijk
School of Computing and Engineering, University of Huddersfield, Huddersfield, UK
TNO Healthy Living and Work, Leiden, The Netherlands

Elena Zaitseva
Faculty of Management Science and Informatics, University of Žilina, Žilina, Slovakia

Miroslav Kvassay
Faculty of Management Science and Informatics, University of Žilina, Žilina, Slovakia

ISSN 2198-4182 ISSN 2198-4190 (electronic) Studies in Systems, Decision and Control ISBN 978-3-031-40996-7 ISBN 978-3-031-40997-4 (eBook) https://doi.org/10.1007/978-3-031-40997-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This volume is the second installment in a series of papers about the fusion of two scientific domains: Reliability Engineering and Computational Intelligence (RECI). Reliability engineering is an established domain that has a very good practical and scientific background for the analysis of reliability of systems. Computational intelligence is relatively new in reliability engineering, but it has been an equally well-established branch of research, with many groups around the world attempting to develop useful computational intelligence tools in different fields. Today, the continuous drive for digitalization causes reliability engineering and computational intelligence to merge. Combining the fields paves the way to progress on big data analytics, uncertain information evaluation, reasoning, prediction, modeling, optimization, decision-making, and, of course, more reliable systems.

The second international RECI workshop took place at the Delft University of Technology on November 13, 14, and 15, 2022. Many aspects of the merger of Reliability Engineering and Computational Intelligence were discussed from different perspectives (https://ki.fri.uniza.sk/RECI2022/about.html). Plenary lectures were delivered by Prof. Ah-Lian Kor on expressive cardinal representation; Prof. Piero Baraldi on computational intelligence for maintenance; and Prof. Frank Coolen on reliability evaluation with multi-state components using survival signatures. These lectures yielded excellent discussions about what the core knowledge for RECI might actually be. Out of 34 abstracts, 17 were selected for discussion and, after review by two or three peers, 13 were accepted for publication in this volume.

As in the first edition of the RECI workshop, three parts emerge from the papers: mathematical methods for RECI, digital techniques for RECI, and progress in RECI applications. This volume shows that mathematical techniques have developed steadily and purposefully in the sense that the work from this workshop follows from earlier work that was reported last time. It forms the core for the RECI domain. For the second part, digital techniques for RECI, some topics are the same and some topics differ. This workshop saw a focus on reliability of software (systems) in terms of hardware and software reliability and on cybersecurity. Where reliability of software and hardware was treated in the first installment of RECI, cybersecurity was not. So cybersecurity has entered the field, but in a very specific way: in the design of hardware and software for cybersecurity. Thus the emphasis is on technical systems for cybersecurity, not human factors or governance. The third part, applications, represents a varied group of use cases. That is to say, scientific progress is inspired by industry problems and the solutions developed for them. The approach is fundamentally different, but the developments are as challenging as those for the mathematical or digital techniques. As there is a plethora of industrial problems, the topics in this cluster vary widely, as they did in the first installment of the RECI workshop. Medical use cases were requested for the workshop and two papers have progressed into chapters. The editors would like to highlight these papers as they showcase value for life.

We thank Professor Pieter van Gelder for local organization and making the workshop a successful one. We thank all authors and reviewers for their excellent contributions and we thank our funders through the “Advanced Centre for Ph.D. Students and Young Researchers in Informatics”—ACeSYRI (610166EPP-1-2019-1-SK-EPPKA2-CBHE-JP), “University-Industry Educational Centre in Advanced Biomedical and Medical Informatics”—CeBMI (612462-EPP-1-2019-1SK-EPPKA2-KA), both of the European Union’s Erasmus+ programme, and “New Methods Development for Reliability Analysis of Complex System” (APVV-18-0027) of the Slovak Research and Development Agency.

Huddersfield, UK
Žilina, Slovakia
Žilina, Slovakia
May 2023

Coen van Gulijk Miroslav Kvassay Elena Zaitseva

Contents

Mathematical Methods for Reliability Engineering and Computational Intelligence

Experimental Survey of Algorithms for the Calculation of Node Traversal Probabilities in Multi-valued Decision Diagrams . . . . . 3
Michal Mrena and Miroslav Kvassay

Reliability Analysis of Data Storage Using Survival Signature and Logic Differential Calculus . . . . . 21
Patrik Rusnak, Peter Sedlacek, and Stanislaw Czapp

Digital Techniques for Reliability Engineering and Computational Intelligence

Software Tests Quality Evaluation Using Code Mutants . . . . . 39
Peter Sedlacek, Patrik Rusnak, and Terezia Vrabkova

Hacking DCNs . . . . . 49
Martin Lukac and Kamila Abdiyeva

Markov Model of PLC Availability Considering Cyber-Attacks in Industrial IoT . . . . . 61
Maryna Kolisnyk, Axel Jantsch, Tanja Zseby, and Vyacheslav Kharchenko

Advanced Networking and Cybersecurity Approaches . . . . . 79
Andriy Luntovskyy

Use Cases for Reliability Engineering and Computational Intelligence

Application of Machine Learning Techniques to Solve the Problem of Skin Diseases Diagnosis . . . . . 101
Eduard Kinshakov and Yuliia Parfenenko

Analyzing Biomedical Data by Using Classification Techniques . . . . . 117
J. Kostolny, J. Rabcan, T. Kiskova, and A. Leskanicova

Wildfire Risk Assessment Using Earth Observation Data: A Case Study of the Eastern Carpathians at the Slovak-Ukrainian Frontier . . . . . 131
Sergey Stankevich, Elena Zaitseva, Anna Kozlova, and Artem Andreiev

Digital Safety Delivery: How a Safety Management System Looks Different from a Data Perspective . . . . . 145
Paul Singh and Coen van Gulijk

Reliability Optimization of New Generation Nuclear Power Plants Using Artificial Intelligence . . . . . 159
Jorge E. Núñez Mc Leod and Selva S. Rivera

Algorithmic Management and Occupational Safety: The End Does not Justify the Means . . . . . 175
Thijmen Zoomer, Dolf van der Beek, Coen van Gulijk, and Jan Harmen Kwantes

Technologies and Solutions for Smart Home and Smart Office . . . . . 189
Andriy Luntovskyy, Mykola Beshley, Dietbert Guetter, and Halyna Beshley

Mathematical Methods for Reliability Engineering and Computational Intelligence

Experimental Survey of Algorithms for the Calculation of Node Traversal Probabilities in Multi-valued Decision Diagrams

Michal Mrena and Miroslav Kvassay

Abstract The structure function is one of the basic reliability analysis tools. Its purpose is to describe the topology of the system. In general, the structure function has the form of a discrete function. In addition to the structure function, the reliability analysis often considers component state probabilities. We use decision diagrams to effectively represent the structure function. The basic task in probabilistic analysis of the system is the calculation of the system state probabilities. For this purpose, we can utilize the structure of the decision diagram. In the paper, we compare two basic approaches to the calculation of various probabilities using the decision diagram. The first approach uses a breadth-first search traversal of the diagram while the other approach uses a depth-first search traversal. In the paper, we present an experimental comparison of the two approaches. We use the results to optimize computational methods for reliability engineering since the comparison shows that the former approach is more suitable for the calculation of system state probabilities and the latter approach is more suitable for the calculation of system availability. Keywords Availability · Binary decision diagram · Multi-valued decision diagram · Node traversal probability · System state probability · Unavailability

M. Mrena (B) · M. Kvassay
Faculty of Management Science and Informatics, University of Zilina, Zilina, Slovakia
e-mail: [email protected]
M. Kvassay
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_1

1 Introduction

A characteristic property of each system is that it is made of further indivisible parts—components of the system. The performance (state) of the components determines the performance (state) of the system. The simplest system type is Binary State System
(BSS) [1]. BSS and each of its components can be in one of two states, namely the failed state—denoted by the number 0, and the functional state—denoted by the number 1. However, many real systems operate on multiple levels. We call such a system Multi-State System (MSS) [2]. MSS can be further classified into homogeneous MSS and nonhomogeneous MSS. In homogeneous systems, the system and its components have the same number of states. We denote the number of states using the letter m. The number 0 denotes the failed state, the number m − 1 agrees with the perfectly functioning state, and intermediate numbers represent states such as “functioning with limitations” or “functioning”. In nonhomogeneous systems, different components, and the system itself can have a different number of states. This is characteristic of systems consisting of components of different natures. We denote the number of states of such a system using the letter m and the number of states of the i-th component by the letter m i , where i ∈ {1, 2, . . . , n} and n is the number of system components. We assume that a system in a state denoted by a higher number performs better—that is, the states are ordered. However, this does not apply in some situations. To illustrate, we refer to an example from [3], where the authors consider a diode that is in one of the states “functioning”, “failed because of a short circuit”, and “failed because of an open circuit”. In this case, it is not possible to define an ordering on failed states since we cannot tell which of the failures is “better”. BSSs as well as MSSs are usually composed of a huge number of interconnected components. Failures or degradations of the components have various effects on system operation. Furthermore, a lot of current systems contain elements of artificial intelligence that aim to replace some activities performed by a human or to allow the system to learn and improve its performance based on the results of previous tasks. All these elements rapidly increase the structural complexity of modern systems. This increase in structural complexity calls for the development of new approaches and methods that will be available to ensure the reliability of such systems. For practical usage, these approaches and methods should be transformed into software solutions. One such solution is the C++ TeDDy library [4], which aims to support reliability analysis of complex systems by providing methods for efficient representation of and manipulation with discrete functions. Discrete functions [5] are very typical in reliability analysis. One of the most obvious examples is structure function [1, 2], which defines how a system state depends on the states of its components. The structure function of a system with high structural complexity is usually very complicated. It has a lot of variables with many dependencies. To analyze the reliability of such a system, an efficient representation of this function is needed and tools for efficient manipulation with it are very important. In the TeDDy library, decision diagrams together with approaches of Multiple-Valued Logic (MVL) are used for this purpose. The decision diagram [6, 7] is a well-proven representation of a discrete function. It is a graph structure consisting of edges and internal and terminal nodes. They can be used for the representation of series–parallel systems as well as systems with various other topologies. 
A big advantage of decision diagrams is that we can directly associate various probabilities (including component state probabilities) with parts of the decision diagram. For the purposes of reliability analysis, we are most interested

in the probabilities associated with the terminal nodes. These can directly correspond to e.g., system state probabilities. However, the calculation of probabilities at terminal nodes is not a trivial task. Therefore, researchers have proposed several algorithms for this task. In general, the input of these algorithms is the decision diagram and the component state probabilities. The output is probabilities that characterize the entire system or the influence of individual components of the system. A common property of these algorithms is that they use an efficient traversal of the diagram to visit each node exactly once. This allows them to efficiently process even large diagrams. A traversal of the diagram can be realized in several ways. In this paper, we focus on comparing two principal approaches using the TeDDy library. The first approach uses a breadth-first search of the diagram, and the second approach uses a depth-first search of the diagram in a way we are familiar with from graph theory. Although both approaches ultimately produce the same result, we are interested in comparing their speeds and memory requirements with a different number of system states and the calculation of different reliability characteristics. Understanding the behavior of the two approaches in different use cases allows us to optimize computational methods used in different areas of reliability engineering.

2 Reliability Analysis

2.1 Structure Function

The structure function describes the topology of the system. It is a mapping from the states of system components to the state of the entire system. In the most general case of a nonhomogeneous MSS, it has the form of an integer function [8]:

$$\phi(x_1, x_2, \ldots, x_n) = \phi(\mathbf{x}) : \{0, 1, \ldots, m_1 - 1\} \times \{0, 1, \ldots, m_2 - 1\} \times \cdots \times \{0, 1, \ldots, m_n - 1\} \to \{0, 1, \ldots, m - 1\}, \quad (1)$$

where n is the number of components of the system, mi is the number of states of the i-th component for i ∈ {1, 2, ..., n}, the variable xi describes the state of the i-th component and x = (x1, x2, ..., xn) is the state vector. In the case of a homogeneous MSS, mi = mj = m for i, j ∈ {1, 2, ..., n}, and the structure function takes the form of a Multiple-Valued Logic (MVL) function [9]. In the simplest case of a two-state system, mi = mj = m = 2 for i, j ∈ {1, 2, ..., n}, and the structure function corresponds to a Boolean function. Since the last two cases are only special cases of the integer function, in the rest of the paper we will focus on the general case of the integer function, while the results can also be applied in the two more specific cases.
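As a concrete illustration of a structure function, the following C++ sketch evaluates two functions that appear later in this chapter (the parallel function h(x) = max(x1, x2, x3) and the series–parallel Boolean function f(x) = x1(x2 ∨ x3) from Fig. 2, written with min and max). The function names and the use of std::array are our own choices for illustration and are not part of the TeDDy library.

#include <algorithm>
#include <array>
#include <cstdio>

// Structure function of the integer function h(x) = max(x1, x2, x3),
// where x1, x2 are in {0, 1} and x3 is in {0, 1, 2}.
int phi_parallel(std::array<int, 3> const& x)
{
    return std::max({x[0], x[1], x[2]});
}

// Structure function of the Boolean function f(x) = x1 AND (x2 OR x3),
// written with min/max so that the same form also covers the multi-valued case.
int phi_series_parallel(std::array<int, 3> const& x)
{
    return std::min(x[0], std::max(x[1], x[2]));
}

int main()
{
    std::printf("%d\n", phi_parallel({1, 0, 2}));        // prints 2
    std::printf("%d\n", phi_series_parallel({1, 0, 1})); // prints 1
}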

2.2 Basic Reliability Measures

The structure function itself allows us to calculate the topological characteristics of the system. These characteristics assume that all system components are equally reliable which is advantageous in situations when we do not have information about the real reliability of components. One of the basic topological characteristics is the relative frequency of system states defined for state j as [10]:

$$Fr^{=j} = TD(\phi(\mathbf{x}) = j), \quad (2)$$

where φ(x) = j is a function that takes the value 1 at the points where the structure function φ(x) takes the value j and the value 0 at all other points. The notation TD(.) indicates the truth density—the relative number of points in which the function takes the value 1. Functions of the type φ(x) = j are called pseudo-logic functions [5] or Boolean-valued integer functions. Such functions are common in the calculation of various reliability characteristics. Characterizing a system based only on its topology may not always be accurate enough, since components might not be equally reliable. Therefore, in addition to information about the topology of the system, we also need information about the availability of individual components. We call this information the component state probabilities. We denote the probability that component i is in state s as:

$$p_{i,s} = \Pr\{x_i = s\}, \quad (3)$$

where i ∈ {1, 2, ..., n} and s ∈ {0, 1, ..., mi − 1}. The basic probability characteristic is system availability. Similarly, as with the relative frequency of states, we define it with respect to state j of the system as follows [1]:

$$A^{\ge j}(\mathbf{p}) = \Pr\{\phi(\mathbf{x}) \ge j\}, \quad (4)$$

where p = (p1, p2, ..., pn) is the vector of component state probabilities and the function φ(x) ≥ j takes the value 1 at the points in which the structure function φ(x) takes a value greater than or equal to j and the value 0 in all other points. The notation Pr{g} denotes the probability that the pseudo-logic function g takes the value 1. System availability with respect to state j agrees with the probability that the system is in state j or better. Another probabilistic characteristic, closely linked to availability, is the system state probability, defined for state j as follows [11]:

$$P^{=j}(\mathbf{p}) = \Pr\{\phi(\mathbf{x}) = j\}. \quad (5)$$

Consequently, we can define availability in terms of system state probabilities as:

$$A^{\ge j}(\mathbf{p}) = \sum_{h=j}^{m-1} P^{=h}(\mathbf{p}). \quad (6)$$

Furthermore, we can also define system state probability in terms of availability:

$$P^{=j}(\mathbf{p}) = \begin{cases} 1 - A^{\ge 1}(\mathbf{p}) & \text{if } j = 0 \\ A^{\ge j}(\mathbf{p}) - A^{\ge j+1}(\mathbf{p}) & \text{if } j \in \{1, 2, \ldots, m-2\} \\ A^{\ge m-1}(\mathbf{p}) & \text{if } j = m-1 \end{cases} \quad (7)$$

We can effectively use the two above relations in the calculation of one characteristic if the other one is known. This implies that the key procedure in probabilistic reliability assessment of complex systems with known structure functions is an efficient calculation of either system availability (4) or system state probabilities (5).
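As a minimal illustration of relations (6) and (7), the following C++ sketch converts between system state probabilities and availabilities. The function names and the three-state numbers in main are our own illustrative choices and are not taken from the TeDDy library.

#include <cstdio>
#include <vector>

// Availability with respect to state j from system state probabilities
// P[0], ..., P[m-1], following relation (6).
double availability(std::vector<double> const& P, int j)
{
    double a = 0.0;
    for (int h = j; h < static_cast<int>(P.size()); ++h)
    {
        a += P[h];
    }
    return a;
}

// System state probability of state j from availabilities A[1], ..., A[m-1]
// (index 0 is unused), following relation (7).
double state_probability(std::vector<double> const& A, int j, int m)
{
    if (j == 0)
    {
        return 1.0 - A[1];
    }
    if (j == m - 1)
    {
        return A[m - 1];
    }
    return A[j] - A[j + 1];
}

int main()
{
    // Illustrative three-state example (the probabilities are made up).
    std::vector<double> const P {0.1, 0.4, 0.5};
    std::printf("A>=1 = %f\n", availability(P, 1));          // 0.9
    std::printf("A>=2 = %f\n", availability(P, 2));          // 0.5
    std::vector<double> const A {0.0, 0.9, 0.5};             // A[j] = availability w.r.t. j
    std::printf("P=1  = %f\n", state_probability(A, 1, 3));  // 0.4
}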

2.3 Series–Parallel Systems

The nature of the system topologies described by the structure function varies for different types of systems. One of the topologies typical for many real-world systems is the series–parallel topology, which is a result of combining simple series and parallel topologies. Examples of series–parallel systems include systems with components connected in series having hot or cold spares [12] or different complex flow transmission networks [13]. A more specific example of a series–parallel system is an offshore wind farm analyzed in [14]. The reliability block diagram depicted in Fig. 1 shows a high-level view of the wind farm topology. The wind farm contains multiple wind turbine strings (in Fig. 1, there are three) connected in parallel. All the strings are connected in series with the system of collection cables. Each string consists of a series of individual wind turbines. The authors aim to compare different arrangements of wind turbines—the number of strings and the number of turbines in the strings. They use probabilistic characteristics of the system such as failure probability or system unavailability. Furthermore, they use the probabilistic characteristics to calculate measures such as average available output power which is a more comprehensible measure for comparison.

Let us consider a specific example of an analysis of a series–parallel wind farm system similar to the above-mentioned offshore wind farm. The wind farm consists of 12 wind turbines described by variables x1, x2, ..., x12. There are three types of turbines—each type having different reliability. In Table 1 we can see component state probabilities corresponding to the three types. In the example, we consider the system as a BSS, therefore the table contains probabilities of state 1 (denoted as pi,1) and 0 (denoted as pi,0) of the components. In the example, we consider two arrangements of wind turbines. The first arrangement consists of three strings each

Fig. 1 Topology of the offshore wind farm presented in [14]

Table 1 State probabilities of three wind turbine types

Type   pi,1   pi,0   Wind turbine
1      0.7    0.3    x8, x12
2      0.8    0.2    x2, x3, x5, x6, x10
3      0.9    0.1    x1, x4, x7, x9, x11

having four turbines and the second one consists of four strings each having three turbines. Our goal is to decide which of the arrangements is better in terms of system availability given that we place the turbines sequentially according to their index. We will show how to calculate the system availability of both arrangements. First, we calculate the availability of individual strings (denoted as Ai for the i-th string), which is a simple product of the availabilities of the turbines in the string. Then we calculate the availability of the entire system (denoted as A) by using the inclusion–exclusion principle. The following equations show the calculation for the first arrangement:

$$\begin{aligned}
A_1 &= p_{1,1} p_{2,1} p_{3,1} p_{4,1} = 0.9 \cdot 0.8 \cdot 0.8 \cdot 0.9 = 0.5184 \\
A_2 &= p_{5,1} p_{6,1} p_{7,1} p_{8,1} = 0.8 \cdot 0.8 \cdot 0.9 \cdot 0.7 = 0.4032 \\
A_3 &= p_{9,1} p_{10,1} p_{11,1} p_{12,1} = 0.9 \cdot 0.8 \cdot 0.9 \cdot 0.7 = 0.4536 \\
A &= A_1 + A_2 + A_3 - A_1 A_2 - A_2 A_3 - A_1 A_3 + A_1 A_2 A_3 = 0.842954324
\end{aligned} \quad (8)$$

and the following equations show the calculation for the second arrangement:

$$\begin{aligned}
A_1 &= p_{1,1} p_{2,1} p_{3,1} = 0.9 \cdot 0.8 \cdot 0.8 = 0.576 \\
A_2 &= p_{4,1} p_{5,1} p_{6,1} = 0.9 \cdot 0.8 \cdot 0.8 = 0.576 \\
A_3 &= p_{7,1} p_{8,1} p_{9,1} = 0.9 \cdot 0.7 \cdot 0.9 = 0.567 \\
A_4 &= p_{10,1} p_{11,1} p_{12,1} = 0.8 \cdot 0.9 \cdot 0.7 = 0.504 \\
A &= A_1 + A_2 + A_3 + A_4 - A_1 A_2 - A_1 A_3 - A_1 A_4 - A_2 A_3 - A_2 A_4 - A_3 A_4 \\
  &\quad + A_1 A_2 A_3 + A_1 A_2 A_4 + A_1 A_3 A_4 + A_2 A_3 A_4 - A_1 A_2 A_3 A_4 = 0.96138987
\end{aligned} \quad (9)$$

Obviously, in such a simple example, it is clear that the second arrangement is better even without the calculation of availabilities. Nevertheless, the results are useful in quantifying the difference and illustrating the basic principles of probabilistic analysis. Unfortunately, in many situations, the system topology is much more complicated than the above-presented examples. Sometimes it is not even possible to describe the system using a reliability block diagram at all. Therefore, in such situations, it is necessary to use some other efficient structure function representations. One such representation is the decision diagram, which we focus on in the rest of the paper.
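The same numbers can also be reproduced without expanding the inclusion–exclusion sum, by multiplying string unavailabilities. The following C++ sketch reproduces the values of (8) and (9) using the turbine probabilities from Table 1; the function names are illustrative only and the code is not taken from the TeDDy library.

#include <cstdio>
#include <vector>

// Availability of one string: turbines in series, so probabilities multiply.
double string_availability(std::vector<double> const& turbine_ps)
{
    double a = 1.0;
    for (double p : turbine_ps)
    {
        a *= p;
    }
    return a;
}

// Availability of the wind farm: strings in parallel, so the farm fails only
// if every string fails (equivalent to the inclusion-exclusion sums above).
double farm_availability(std::vector<std::vector<double>> const& strings)
{
    double unavailability = 1.0;
    for (auto const& s : strings)
    {
        unavailability *= 1.0 - string_availability(s);
    }
    return 1.0 - unavailability;
}

int main()
{
    // First arrangement: three strings of four turbines, cf. (8).
    std::printf("%f\n", farm_availability({{0.9, 0.8, 0.8, 0.9},
                                           {0.8, 0.8, 0.9, 0.7},
                                           {0.9, 0.8, 0.9, 0.7}}));  // ~0.842954
    // Second arrangement: four strings of three turbines, cf. (9).
    std::printf("%f\n", farm_availability({{0.9, 0.8, 0.8},
                                           {0.9, 0.8, 0.8},
                                           {0.9, 0.7, 0.9},
                                           {0.8, 0.9, 0.7}}));       // ~0.961390
}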

3 Decision Diagrams

A decision diagram is a graph structure that can efficiently represent all types of discrete functions that we use to define a structure function. Historically, the first and simplest type of decision diagram is the Binary Decision Diagram (BDD) [6], which represents a Boolean function. Its generalization is the Multi-valued Decision Diagram (MDD) [7], which represents an MVL function or an integer function. In the following description of decision diagrams, we will focus on the description of the most general MDD representing an integer function [7], since it also includes specific cases of BDD and MDD representing an MVL function. In Fig. 2 we can see all three types of diagrams mentioned. MDD consists of two types of nodes. Internal nodes (depicted as circles in the figures) are associated with variables. Each internal node is associated with exactly one variable. Terminal nodes contain the value of the function. Nodes are stored in levels in MDD. A special case is the last level, which always contains all terminal nodes. Another special case is the first level, which always contains exactly one node—the root of the diagram.

Fig. 2 BDD on the left representing Boolean function f (x) = x1 (x2 ∨ x3 ), MDD representing three-valued MVL function g(x) = max(x1 , x2 , x3 ) in the middle, and MDD representing integer function h(x) = max(x1 , x2 , x3 ) where x1 , x2 ∈ {0, 1} and x3 ∈ {0, 1, 2} on the right

When describing MDDs in the literature, the authors usually describe MDDs that have the property of being reduced and ordered. Therefore, by their full name, the authors refer to Reduced Ordered Multi-valued Decision Diagrams (ROMDD). Since this type of MDD is the most widely used, we refer to diagrams only by the abbreviation MDD and we assume that all MDDs described in this paper are reduced and ordered. The reduced property means that every node in MDD is unique, that is, there are no two isomorphic subgraphs in the MDD. The ordered property is based on the arrangement of nodes in levels. We can say that in an ordered MDD, the variables associated with internal nodes are visited in the same order on each path from the root to a terminal node. An MDD which has these two properties is a canonical representation of the discrete function [7], which is one of the reasons for its popularity. Another reason is the size of the diagram with respect to the number of variables n. Though in the worst case the number of nodes in the diagram depends exponentially on n, in many practical situations the dependency is much better.

A path in an MDD is one of the key concepts for probabilistic analysis. If we want to find the value of the function represented by MDD for a given state vector, we start at the root of the diagram. If the root is also a terminal node, then the diagram represents a constant function and we simply read the value of the function directly from the root. In most cases, however, the root is an internal node. In this case, the node is associated with a variable xi and has mi outgoing edges marked with the numbers 0, 1, ..., mi − 1. According to the value of the variable xi, we select one of the edges, thereby moving to the next node. We repeat this procedure until we end up at a terminal node, from which we read the value of the function. The sequence of nodes and edges that brought us to the terminal node is called a path in MDD [15]. In Fig. 3 we can see a path marked in grey. Note that this path does not contain an internal node associated with variable x2, therefore it corresponds to two different state vectors, namely (1, 0, 2) and (1, 1, 2).

Fig. 3 MDD shows a single path corresponding to two different state vectors

3.1 Node Traversal Probability

If an MDD represents a structure function, we can assign a probability to each path from the root to a terminal node. The probability agrees with the sum of probabilities of state vectors associated with the path [15]. Let us consider the local decision that we make in each internal node when creating a path. Since the variable xi associated with the node describes one of the components of the analyzed system, the probability of choosing the edge labeled by state s agrees with the component state probability pi,s. Therefore, we can assign this probability to each edge starting from an internal node. In Fig. 4 we can see an MDD with these probabilities. We call such a diagram a probabilistic MDD. One way to obtain the probability of a path is to simply multiply the probabilities of all edges in the path. Considering the reliability analysis, however, we are more interested in the probability of visiting terminal nodes, considering all state vectors. In general, we call the probability of visiting a node while traversing the diagram the Node Traversal Probability (NTP) [15]. If the MDD represents a structure function, then the NTP of a terminal node agrees with the probability of the system state described by the value in the given terminal node.

The easiest way to obtain the NTP for any node is to calculate the sum of probabilities of all paths ending in that node [15]. Although this procedure is simple, in practice it is difficult to apply for larger MDDs. Even though one path in MDD often corresponds to multiple state vectors, their number is still proportional to the number of all state vectors. This number, unfortunately, is exponential with respect to the number of variables n. Therefore, we need to use more sophisticated algorithms. The key feature of these algorithms should be that they traverse the diagram efficiently—in such a way that they visit each node at most once. Such algorithms should use classical approaches from graph theory for the traversal, namely breadth-first search and depth-first search. To describe these algorithms, we will use the node structure, whose definition can be seen in Fig. 5. In our description, we assume that it is possible to store the probability directly in the

Fig. 4 Probabilistic MDD

Fig. 5 Pseudocode of a structure representing a node of an MDD

nodes of the MDD. If a particular library for the manipulation of MDDs does not allow this, it is always possible to use a map from a pointer to a node to probability. However, this approach adds one level of indirection to the algorithm.
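Because the node structure itself is only shown as a figure, the following C++ sketch gives one possible minimal form of it, together with the path-following evaluation described in Sect. 3. The field names mirror the prob and isMarked properties used in the text, but the concrete layout is our assumption and not the actual TeDDy data structure.

#include <vector>

// Minimal MDD node: an internal node stores the index of its variable and one
// outgoing edge (son) per component state; a terminal node stores the function
// value. prob and isMarked are the auxiliary fields used by the NTP algorithms
// described in Sects. 3.2 and 3.3.
struct Node
{
    bool isTerminal = false;
    int value = 0;            // value of the function (terminal nodes only)
    int varIndex = -1;        // 0-based index of the associated variable (internal nodes only)
    int level = 0;            // level of the node in the diagram
    std::vector<Node*> sons;  // sons[s] is the target of the edge labeled s
    double prob = 0.0;        // NTP computed by the algorithms
    bool isMarked = false;    // visited flag used by the traversals
};

// Evaluation of the function for a state vector x: follow the path from the
// root by always taking the edge labeled with the state of the associated
// variable, until a terminal node is reached.
int evaluate(Node const* root, std::vector<int> const& x)
{
    Node const* node = root;
    while (!node->isTerminal)
    {
        node = node->sons[x[node->varIndex]];
    }
    return node->value;
}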

3.2 Depth-First Search

The algorithm for calculating NTP using depth-first search (DFS) has been presented in [3, 16]. It is based on a universal recursive diagram traversal introduced by Bryant under the name Traverse [6]. In addition to the NTP calculation, we can use it, for example, to calculate the number of state vectors for which the function takes on a certain value [6] or to calculate the Fussell–Vesely importance [17]. Since, unlike trees, diagrams share subgraphs, we need to maintain information about the nodes we have already processed. In our pseudocode, we use the isMarked property of the node for this purpose. If a specific library does not have such a feature, we can use a set of pointers to already processed nodes to store this information. However, this step adds a layer of indirection to the algorithm and increases its memory requirements.

In Fig. 6 we can see the pseudocode of the algorithm implemented in the ntp_dfs function, with the dfs function implementing the actual algorithm using recursion. The output of the algorithm is the sum of NTPs in the selected terminal nodes. The input is the root of the diagram (root), a list of values identifying selected terminal nodes (selected), and a list of lists (or two-dimensional array) of component state probabilities (ps). The ntp_dfs function initializes the prob property of all nodes to the value 0.0 and in selected terminal nodes to the value 1.0. It then calls the recursive step of the algorithm (dfs) on the root of the diagram. Within the recursive step, we first process all sons of a node and only then update the probability in the node. When updating the probability, we add the product of the probability on the edge (which is stored in a separate list to avoid duplication) and the probability in the son (to which the edge leads and which has already been processed) for each output edge of the node. The last node in which we update the probability is, therefore, the root of the diagram. Hence, the root contains the result of the algorithm, which is the sum of the NTPs of the selected terminal nodes. In Fig. 7 we can see the values of the prob properties of the nodes of the MDD after execution of the DFS algorithm. In this

Fig. 6 Pseudocode of an NTP calculation algorithm utilizing DFS traversal

case, we specified that we wanted to calculate the sum of NTPs in the terminal nodes representing values 1 and 2, which agrees with system availability with respect to state 1. In the end, we read the result of the algorithm from the prob property of the root node (0.994). The key property of the DFS algorithm is that it calculates the sum of the NTPs in terminal nodes. On the one hand, this can be considered an advantage since the sum agrees with the system availability (6). On the other hand, this can also be a disadvantage in a situation when we need to calculate the NTPs of individual terminal nodes, which agree with the system state probabilities (5). In general, if a function takes one of m values, we need to use the basic version of the algorithm m − 1 times to calculate the NTP in each terminal node.
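For readers without access to Fig. 6, the following C++ sketch shows one possible implementation of the described DFS calculation, reusing the Node structure sketched after Sect. 3.1. The helper list of all nodes used for the initialization step is assumed to exist; the code is an illustration under these assumptions, not the TeDDy implementation.

#include <algorithm>
#include <vector>

// Recursive step: process all sons first, then accumulate the probability of
// this node from the edge probabilities and the sons' (already final) values.
void dfs(Node* node, std::vector<std::vector<double>> const& ps)
{
    node->isMarked = true;
    if (node->isTerminal)
    {
        return;
    }
    for (std::size_t s = 0; s < node->sons.size(); ++s)
    {
        Node* son = node->sons[s];
        if (!son->isMarked)
        {
            dfs(son, ps);
        }
        node->prob += ps[node->varIndex][s] * son->prob;
    }
}

// Returns the sum of NTPs of the terminal nodes whose values are listed in
// selected, i.e., system availability when selected = {j, j+1, ..., m-1}.
double ntp_dfs(Node* root, std::vector<Node*> const& allNodes,
               std::vector<int> const& selected,
               std::vector<std::vector<double>> const& ps)
{
    for (Node* node : allNodes)
    {
        node->isMarked = false;
        bool const isSelected = node->isTerminal &&
            std::find(selected.begin(), selected.end(), node->value) != selected.end();
        node->prob = isSelected ? 1.0 : 0.0;
    }
    dfs(root, ps);
    return root->prob;  // sum of NTPs of the selected terminal nodes
}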

3.3 Breadth-First Search

The second algorithm we are considering in this paper is based on the definition of NTP presented in [15]. According to the definition, the NTP of a node is a sum of NTPs in source nodes multiplied by the probability on the edge for each incoming edge. Therefore, we need to traverse the diagram by levels from the first level (containing the root) downwards. When we work with trees, we call such a traversal the level-order traversal. In general, when we work with graphs, we call such a traversal a breadth-first search (BFS) traversal. In cases of both trees and

Fig. 7 Probabilistic MDD after the application of the DFS algorithm from [3, 16] on the probabilistic MDD shown in Fig. 4

graphs, we utilize an auxiliary structure—a First-In-First-Out (FIFO) queue [18]. However, in the case of MDD, which is a graph and not a tree, we cannot apply a classic BFS traversal using the FIFO queue, as it would not lead to a diagram traversal by its levels. Let us consider an MDD in which the 0-labeled edge of the root leads to a node at the last internal level and the other edges lead to the nodes at levels above the last internal level. In this case, using the standard BFS traversal, the node at the last internal level would be processed second (after the root), so that it would not have yet received information from the above levels and it would forward its—yet incomplete NTP—to its sons. Fortunately, we can solve this problem by replacing the FIFO queue with a priority queue [18], where the level of the node serves as a priority, with a smaller number being considered a higher priority and the root having the smallest level (typically 0). In Fig. 8 we can see the pseudocode of the algorithm implemented in the ntp_bfs function. Unlike the DFS algorithm, the output of the BFS algorithm is the NTP for each node in the diagram. In the above pseudocode, the output is a list of NTPs in the terminal nodes. The input of the algorithm is the root of the diagram (root) and the lists of lists (or two-dimensional array) of the component state probabilities (ps). The algorithm first initializes the prob property of all nodes to the value 0.0 and the prob property of the root to the value 1.0. The control of the algorithm is further ensured through a priority queue, in which we insert the root of the diagram. In addition to classic priority queue implementations (e.g., heap [19]), we can also use one of the monotonic priority queue implementations (e.g., bucket queue [18]) in this case since priorities (levels) are non-decreasing. Until the queue is empty, we keep removing nodes with the highest priority. For each output edge of the node, we calculate the product of the NTP at the node and the probability at the edge. We add this value to the target node’s NTP. If the target node is not terminal and has not yet been processed, we add it to the queue. The algorithm ends if there is no node in the

Fig. 8 Pseudocode of an NTP calculation algorithm utilizing modified BFS traversal

priority queue. The value of the prob property in each node corresponds to the NTP of the node. If we want to calculate, for example, the availability (4) of the system with respect to state j, we add the NTPs in the terminal nodes representing the values j, j + 1, ..., m − 1. In Fig. 9 we can see the values of the prob properties for all nodes after running the BFS algorithm. In order to evaluate, e.g., system availability with respect to state 1, we need to add the values of the prob properties in the terminal nodes representing values 1 and 2. In the picture, we can see that 0.5 + 0.494 = 0.994, which agrees with the result obtained by the DFS algorithm presented in Fig. 7. The advantage of this algorithm is that after one call to the algorithm, the terminal nodes contain the system state probabilities (5) and we can calculate the availability with respect to all system states in a constant time using (6). On the other hand, the disadvantage is that the algorithm needs an auxiliary structure (which in the case of the recursive DFS algorithm is implicitly also present as a call stack). The algorithm processes each node of the diagram exactly once, and the complexity of inserting and removing from the priority queue for each node needs to be considered in its computational complexity.
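Analogously, the following C++ sketch shows one possible implementation of the described BFS calculation with a level-ordered priority queue, again reusing the Node structure sketched earlier; it is an illustration under the same assumptions, not the TeDDy implementation.

#include <queue>
#include <vector>

// Fills the prob field of every node with its NTP and returns the NTPs of the
// terminal nodes indexed by their values (entry j is then the system state
// probability P=j when the diagram represents a structure function with m
// system states).
std::vector<double> ntp_bfs(Node* root, std::vector<Node*> const& allNodes,
                            std::vector<std::vector<double>> const& ps, int m)
{
    for (Node* node : allNodes)
    {
        node->prob = 0.0;
        node->isMarked = false;
    }
    root->prob = 1.0;

    // Priority queue ordered by level (smaller level = higher priority), so
    // that a node is processed only after all contributions from the levels
    // above it have been added.
    auto cmp = [](Node* l, Node* r) { return l->level > r->level; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    pq.push(root);
    root->isMarked = true;

    while (!pq.empty())
    {
        Node* node = pq.top();
        pq.pop();
        for (std::size_t s = 0; s < node->sons.size(); ++s)
        {
            Node* son = node->sons[s];
            // forward the NTP of this node along the edge labeled s
            son->prob += node->prob * ps[node->varIndex][s];
            if (!son->isTerminal && !son->isMarked)
            {
                son->isMarked = true;
                pq.push(son);
            }
        }
    }

    std::vector<double> terminalNtps(m, 0.0);
    for (Node* node : allNodes)
    {
        if (node->isTerminal)
        {
            terminalNtps[node->value] = node->prob;
        }
    }
    return terminalNtps;
}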

Fig. 9 Probabilistic MDD after the application of the BFS algorithm from [15] on the probabilistic MDD shown in Fig. 4

4 Experimental Comparison

We implemented the algorithms explained above as a part of the TeDDy library. Their theoretical description suggests that the DFS algorithm should be more advantageous in cases where we need to calculate the sum of NTPs in several terminal nodes. This use case is quite common since it agrees with the calculation of system availability. On the other hand, the BFS algorithm should be more advantageous in a situation when we need to calculate the NTP for each terminal node separately. We compared the algorithms first in the calculation of all system state probabilities and then in the calculation of system availability with respect to state 1 of the system for three-state and five-state systems. Note that for the calculation of system state probabilities the DFS algorithm needed to be run twice for a three-state system and four times for a five-state system. We chose two types of structure functions. The first type was a structure function of a random series–parallel system. The MDDs representing such functions are very compact [20]—there are relatively few nodes at the individual levels of the diagram. This allowed us to examine functions with thousands of variables. The second type was a structure function defined by a random Sum of Products expression. We represented logical products in the expression using the min function and logical sums using the max function. MDDs representing such functions are, in contrast to series–parallel functions, relatively complicated—the number of nodes increases considerably even for functions with tens of variables. In each experiment, we generated 1,000 random functions for each type of function with a given number of variables (n) and the number of system and component states (m). In Table 2 we can see the average number of nodes in the generated diagrams.

Table 2 The average number of nodes in 1,000 randomly generated MDDs for different function types with different parameters

Function type      n        m   The average number of nodes
Series–parallel    50,000   3   280,365
Series–parallel    10,000   5   347,204
Sum of products    40       3   13,528,106
Sum of products    20       5   2,220,734

Since we can implement the BFS algorithm in several ways—using different implementations of the priority queue, we have included three versions of this algorithm in the comparison. The first version uses a heap, which is a typical implementation of a priority queue found in the standard libraries of most programming languages. The other two versions use the bucket queue, which is an implementation of the monotone priority queue. We implement the bucket queue as a list of lists, where the index in the primary list agrees with the priority of the element, and the list at the given index contains all elements with the given priority. We implement the primary list using arrays, and we implement the secondary list in the first version using arrays (array-bucket) and in the second version using a singly-linked list (linked-bucket). The TeDDy library is written in C++ language and was compiled using the g++ 11.3.1 compiler using the O3 optimization flag. The experiments were performed on a server with Intel Xeon Gold 5218 CPU (2.30 GHz), 128 GiB DDR4 RAM running AlmaLinux 9.0. In Figs. 10, 11, and 12 we can see the results of the experimental comparison. Bars in each figure represent the average NTP calculation time obtained from 1,000 randomly generated systems with a given type of structure function and parameters m and n. The three orange bars (leftmost) represent the results for the three versions of the BFS algorithm. The blue bar (rightmost) represents the result of the DFS algorithm. Therefore, to compare the BFS and DFS algorithm, we compare the smallest of the orange bars and the blue bar.
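As a side note to the implementation detail above, the array-bucket monotone priority queue can be sketched in C++ as follows; the interface and names are illustrative and do not correspond to the TeDDy internals. The linked-bucket variant differs only in using a singly-linked list (e.g., std::forward_list) for the secondary containers.

#include <cstddef>
#include <vector>

// A sketch of the "array-bucket" monotone priority queue: bucket i holds all
// elements with priority (level) i. Because the priorities of extracted and
// newly inserted elements never fall below the current minimum in the BFS
// algorithm, a single cursor sweeping the buckets from left to right suffices.
template<class T>
class BucketQueue
{
public:
    explicit BucketQueue(std::size_t maxPriority)
        : buckets_(maxPriority + 1), cursor_(0), size_(0)
    {
    }

    void push(T item, std::size_t priority)
    {
        buckets_[priority].push_back(item);
        ++size_;
    }

    // Removes and returns an element with the smallest priority
    // (must not be called on an empty queue).
    T pop()
    {
        while (buckets_[cursor_].empty())
        {
            ++cursor_;
        }
        T item = buckets_[cursor_].back();
        buckets_[cursor_].pop_back();
        --size_;
        return item;
    }

    bool empty() const
    {
        return size_ == 0;
    }

private:
    std::vector<std::vector<T>> buckets_;
    std::size_t cursor_;
    std::size_t size_;
};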

Fig. 10 Comparison of DFS and BFS algorithms on random sum of products structure functions in the computation of NTPs for each terminal node

Fig. 11 Comparison of DFS and BFS algorithms on random series–parallel structure functions in the computation of NTPs for each terminal node

Fig. 12 Comparison of DFS and BFS algorithms on random series–parallel structure functions and random sum of products structure functions in the computation of system availability with respect to state 1

Figure 10 presents a comparison of algorithms for three-state and five-state systems described by a structure function in the form of SoP when calculating all NTPs for all terminal nodes. In the case of three-state systems, the DFS algorithm is comparable, even slightly faster, than the BFS algorithm, even though we have to run it twice. However, in the case of a five-state system, the advantage of the BFS algorithm, which only needs to be run once, becomes apparent compared to the DFS algorithm, which in this case must be run four times. In both cases, we can see that the fastest version of the BFS algorithm is the one that uses a bucket queue in which the lists are implemented by arrays. In Fig. 11 we can see a similar comparison of algorithms for three-state and fivestate systems, but in this case, it concerns series–parallel systems with a significantly larger number of components. The comparison of the DFS and BFS algorithm, in this case, gives almost the same results i.e., that the DFS algorithm is faster for a three-state system, but lags behind the BFS algorithm for a higher number of states. It is however interesting to compare the different versions of the BFS algorithm. In contrast to the previous experiment, we can see that for m = 3 the difference in implementations is quite small and for m = 5, bucket queues are faster, though the difference in the implementation of secondary lists seems to be negligible. In the two above-presented experiments, we compared the algorithms in the calculation of NTPs for each terminal node. However, when calculating availability, we

are interested only in the sum of the NTPs. Therefore, in Fig. 12 we can see a comparison of the algorithms for the calculation of the availability of a five-state system with respect to system state 1. The results for both types of structure functions clearly show that the DFS algorithm is significantly faster in this case. Obviously, according to the results of previous experiments, we can expect that the difference would be smaller for a smaller number of states and higher for a higher number of states.

The results of the experiments confirmed the theoretical assumptions based on the nature of the algorithms: the BFS algorithm is more suitable for calculating NTPs for all terminal nodes and the DFS algorithm is more suitable for calculating the sum of NTPs in multiple terminal nodes—typically when calculating system availability (4). These results indicate that it is beneficial for reliability analysis tools that use decision diagrams to implement both algorithms for different use cases. Another interesting result is that the BFS algorithm is an example use case for a monotone priority queue. The results of all versions of our experiments showed that the use of a simple bucket queue can speed up the algorithm compared to the classical priority queue implemented by the heap.

5 Conclusion

Reliability analysis of various systems often requires considering component state probabilities. In conjunction with the structure function, these probabilities allow us to perform a probabilistic analysis of the system. The details of working with probabilities in calculations depend on the representation of the structure function. A decision diagram is an effective tool for the representation of the structure function and for the implementation of probabilistic calculations. The key task of probabilistic analysis is the calculation of the Node Traversal Probabilities of the terminal nodes of the diagram. We subsequently use the probabilities for the calculation of various probabilistic characteristics of the system, such as system availability (4) or system state probabilities (5). In the paper, we focused on two basic approaches to probability calculation, which are based on the depth-first search (DFS) and breadth-first search (BFS) traversals of the diagram. Both approaches were implemented in the TeDDy library and compared experimentally. The conducted experiments showed that the DFS algorithm is more advantageous for the calculation of system availability and the BFS algorithm is more advantageous for the calculation of system state probabilities. For reliability analysis tools using decision diagrams, it is, therefore, advantageous to implement both algorithms. An interesting result is also the comparison of different implementations of the priority queue in the BFS algorithm. We have shown that we can speed up the algorithm by using a bucket queue.

Acknowledgements This research was supported by the Slovak Research and Development Agency under grant “New methods development for reliability analysis of complex system” (reg. no. APVV-18-0027) and by Grant System of University of Zilina No. 3178/2022.


References
1. Rausand, M., Høyland, A.: System Reliability Theory, 2nd edn. Wiley, Hoboken, NJ (2004)
2. Natvig, B.: Multistate Systems Reliability Theory with Applications. Wiley, Chichester, UK (2011)
3. Zang, X., Wang, D., Sun, H., Trivedi, K.S.: A BDD-based algorithm for analysis of multistate systems with multistate components. IEEE Trans. Comput. 52, 1608–1618 (2003). https://doi.org/10.1109/TC.2003.1252856
4. Mrena, M.: TeDDy—decision diagram library. https://github.com/MichalMrena/DecisionDiagrams
5. Yanushkevich, S., Miller, D., Shmerko, V., Stankovic, R.: Decision Diagram Techniques for Micro- and Nanoelectronic Design Handbook. CRC Press, Boca Raton, FL (2006)
6. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comput. 35, 677–691 (1986). https://doi.org/10.1109/TC.1986.1676819
7. Srinivasan, A., Ham, T., Malik, S., Brayton, R.K.: Algorithms for discrete function manipulation. In: 1990 IEEE International Conference on Computer-Aided Design. Digest of Technical Papers, pp. 92–95 (1990)
8. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65(4), 1710–1723 (2016). https://doi.org/10.1109/TR.2016.2578948
9. Zaitseva, E., Levashenko, V.: Reliability analysis of multi-state system with application of multiple-valued logic. Int. J. Qual. Reliab. Manag. 34(6), 862–878 (2017). https://doi.org/10.1108/IJQRM-06-2016-0081
10. Kvassay, M., Zaitseva, E.: Topological analysis of multi-state systems based on direct partial logic derivatives. Springer Ser. Reliab. Eng., 265–281 (2018). https://doi.org/10.1007/978-3-319-63423-4_14
11. Griffith, W.S.: Multistate reliability models. J. Appl. Probab. 17, 735–744 (1980). https://doi.org/10.2307/3212967
12. Abouei Ardakan, M., Zeinal Hamadani, A.: Reliability optimization of series–parallel systems with mixed redundancy strategy in subsystems. Reliab. Eng. Syst. Saf. 130, 132–139 (2014). https://doi.org/10.1016/J.RESS.2014.06.001
13. Levitin, G.: The Universal Generating Function in Reliability Analysis and Optimization. Springer (2005)
14. Huang, Q., Wang, X., Qian, T., Yao, L., Wang, Y.: Reliability assessment of DC series–parallel offshore wind farms. J. Eng., 1457–1461 (2019). https://doi.org/10.1049/joe.2018.8474
15. Nagayama, S., Mishchenko, A., Sasao, T., Butler, J.: Exact and heuristic minimization of the average path length in decision diagrams. J. Mult. Valued Log. Soft Comput. 11 (2005)
16. Xing, L., Dai, Y.S.: A new decision-diagram-based method for efficient analysis on multistate systems. IEEE Trans. Dependable Secure Comput. 6, 161–174 (2009). https://doi.org/10.1109/TDSC.2007.70244
17. Kvassay, M., Rusnak, P., Zaitseva, E., Stanković, R.S.: Multi-valued decision diagrams in importance analysis based on minimal cut vectors. In: 2020 IEEE 50th International Symposium on Multiple-Valued Logic (ISMVL), pp. 265–270 (2020)
18. Skiena, S.S.: The Algorithm Design Manual. Springer Publishing Company, Incorporated (2008)
19. Williams, J.W.J.: Algorithm 232: heapsort. Commun. ACM 7, 347–348 (1964)
20. Mrena, M., Kvassay, M., Czapp, S.: Single and series of multi-valued decision diagrams in representation of structure function. In: Lecture Notes in Networks and Systems, pp. 176–185 (2022)

Reliability Analysis of Data Storage Using Survival Signature and Logic Differential Calculus

Patrik Rusnak, Peter Sedlacek, and Stanislaw Czapp

Abstract A data storage system is an important part of any information system. All the necessary data that must be available for the successful operation of the information system are stored there. Therefore, it is advisable to think about the reliability of such a data storage system. As part of reliability engineering, it is possible to perform a reliability analysis of any system, and so the data storage system can be analyzed as well. As part of the reliability analysis, it is necessary to select a mathematical representation of the analyzed system. One such form is the structure function. A structure function is a mathematical representation of the analyzed system that maps the state of the system based on the states of its components. Its main advantage is that it can be used to describe a system of any complexity. However, if there are components of the same type in the system, the survival signature may be used as well. The structure function as well as the survival signature permit the use of multiple mathematical approaches, such as logic differential calculus. Logic differential calculus can be used to detect situations where a change in the number of working components affects a change in the state of the system. This is useful in importance analysis, which is a part of reliability analysis. In this paper, a reliability analysis will be performed for a data storage in which multiple types of hard disk drives can be used as well as multiple methods of storing data on multiple disks using a redundant array of independent disks.

Keywords Data storage system · Importance analysis · Logic differential calculus · Reliability analysis

P. Rusnak (B) · P. Sedlacek Faculty of Management Science and Informatics, University of Zilina, Zilina, Slovakia e-mail: [email protected] P. Sedlacek e-mail: [email protected] S. Czapp Faculty of Electrical and Control Engineering, Gdansk University of Technology, Gdansk, Poland e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_2


1 Introduction

Reliability analysis is an important aspect of system analysis [1, 2]. Performing a reliability analysis can be a challenging task, especially when analyzing systems composed of many components [3, 4]. An important phase of reliability analysis is the quantification of the influence of individual system components on the operation of the entire system. This quantification is done using a set of indices known as importance measures (IMs) [5, 6]. However, before the quantification can be performed, it is necessary to define the number of performance levels of the system model and to choose a mathematical representation of the system. Therefore, the first step of reliability analysis is to define the number of system performance levels for the reliability analysis [1, 4]. This abstraction of the real system is necessary to reduce the computational complexity by a reasonable amount. Two main models are used in reliability analysis, namely the binary-state system (BSS) and the multi-state system (MSS) [1, 3]. In a BSS, there are only two states for the components and for the system, whereas in multi-state systems this number can be larger.
The next step in reliability analysis is the selection of a mathematical representation of the system model [1, 7]. One such representation is known as the structure function. It describes how the performance levels or states of the components affect the system performance level or state [1]. The structure function is defined as follows:

φ(x1, x2, ..., xn) = φ(x): {0, 1}^n → {0, 1},    (1)

where xi is a variable that represents the state of component i and x = (x1, x2, ..., xn) is a state vector that holds the values of the states of all system components. The main advantage is that it can represent a system of arbitrary structural complexity and corresponds to the definition of a logic or Boolean function (when MSS or BSS is used) [2, 7]. The structure function is a mathematically elegant and clear description of the structure of the system, but it has certain limitations. One such limitation is that its use is not ideal for comparing system structures of large and complex systems [8], because such a comparison is costly from the computational point of view. Such systems include artificial intelligence systems [9], neural networks [9, 10], healthcare systems [11, 12], etc. However, there is a mathematical approach called the survival signature that is applicable to comparing even large systems and is especially useful for systems that contain multiple components of the same type [8]. It is applicable both for reliability analysis where time is not considered and for analysis where time is considered. The survival signature is defined as the probability that a system with n components and K different types of system components is in a functional state if exactly lk system components of type k are functional for k = 1, 2, ..., K. Its definition is as follows [8]:

φ(l1, ..., lK) = [ ∏_{k=1}^{K} C(nk, lk)^(−1) ] ∑_{x ∈ S_{l1,...,lK}} φ(x),    (2)

where S_{l1,...,lK} is the set of all state vectors x with exactly lk working system components of type k for k = 1, 2, ..., K, C(nk, lk) denotes the binomial coefficient, x is a state vector of the structure function φ(x) and (l1, ..., lK) is a vector that holds the number of working components of type k for k = 1, 2, ..., K.
The third step is the quantitative and qualitative analysis of the given model. Here it is possible to focus on the investigated system as a whole or on its individual components [2, 6]. It is possible to examine only the topological part of the system, where the quality of the topology or the influence of the placement of components in the topology is determined. It is also possible to focus on the reliability point of view and thus include the reliability, i.e., the probability that the system or components will function at a given time or during a specified period. From a qualitative point of view, it is possible to examine possible scenarios of system failure based on events that may occur, or to describe the system based on set parameters [1, 4]. In quantitative analysis, IMs can be used. IMs express the impact of a component or a group of components on the system or on other components [5, 6]. There are several types of IMs based on what exactly needs to be expressed and what type of analysis needs to be focused on. We know three basic types, namely structure, reliability and time-dependent IMs [5]. In structure IMs, the emphasis is on topology. An example is the structure IM (SI) [5], which is defined as the relative number of state vectors in which a component failure will cause the system to fail. It can be calculated as follows:

SIi = ( ∑_{(·i, x)} (φ(1i, x) − φ(0i, x)) ) / 2^(n−1),    (3)

where {(·i, x)} represents the set of state vectors without the i-th element, i.e., (·i, x) = (x1, x2, ..., x(i−1), x(i+1), ..., xn), and the state vector (ai, x) = (x1, x2, ..., x(i−1), a, x(i+1), ..., xn) for a ∈ {0, 1}.
After performing all the necessary calculations and analyses, the last step takes place. This step is the evaluation of the system from the reliability point of view, which summarizes the results of the previous steps [1, 2]. It is also possible to locate the problematic part of the system in terms of reliability and to improve it. In this article, the main emphasis will be on the use of IMs for the survival signature in topological analysis. A data storage system is chosen as the analyzed system, in which it is necessary to determine how important each type of hard disk is from the topological point of view. We will show different types of topologies and we will try to point out how important each type of HDD is for the system from the topological point of view.
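Returning to the structure IM (3), it can be evaluated by brute force over all state vectors. The following Python sketch is an illustration only, applied to a small hypothetical structure function (the example system and all names are ours, not part of the case study):

from itertools import product

def structural_importance(phi, n, i):
    """Structure IM SI_i of component i (0-based) for a Boolean structure
    function phi(x) of n components, following Eq. (3)."""
    total = 0
    for rest in product((0, 1), repeat=n - 1):
        x1 = list(rest[:i]) + [1] + list(rest[i:])   # state vector (1_i, x)
        x0 = list(rest[:i]) + [0] + list(rest[i:])   # state vector (0_i, x)
        total += phi(x1) - phi(x0)
    return total / 2 ** (n - 1)

# Example: a 3-component series-parallel system x1 AND (x2 OR x3)
phi = lambda x: x[0] & (x[1] | x[2])
print(structural_importance(phi, 3, 0))   # 0.75 for the series component
print(structural_importance(phi, 3, 1))   # 0.25 for a parallel component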


2 Redundant Array of Independent Disks

A redundant array of independent disks, originally called a redundant array of inexpensive disks (RAID), is a method of data storage virtualization that ensures better performance, data redundancy, or both [13]. In most RAIDs, all disks should have the same characteristics. There are several types of RAID, namely standard, nested and non-standard [13]. In this article, we will focus on the standard and nested types of RAID.
Standard RAIDs are standardized methods of data storage virtualization. They are labeled with numbers from 0 to 6 [13]. These RAIDs differ in their principles of data storage and, for most of them, redundancy. In this paper, we will cover only RAID 0 and RAID 1 [13]. RAID 0 is a method of storing data on an array of disks in which data are divided evenly across all disks and data redundancy is not guaranteed. This means that the size of this RAID is proportional to the sum of the sizes of the individual disks, but if one disk fails, the entire storage fails. RAID 1 uses mirroring, where the data are stored on each disk. This means that the size of this RAID is proportional to the size of one disk, but the data are available as long as at least one disk in the RAID works. Both RAIDs with 5 HDDs can be seen in Fig. 1, in which RAID 0 is on the left side and RAID 1 is on the right side.
Nested or hybrid RAID is a combination of standard RAIDs. In their names, the constituent RAID levels are joined using the character '+', or this character is omitted. In this paper, we will describe RAID 0 + 1 and RAID 1 + 0 [13]. RAID 0 + 1 uses RAID 1 whose elements are RAID 0. This results in a balance of storage size and redundancy. Since RAID 0 is at the lowest level, an entire part of the storage may be lost, so RAID 1 + 0, alternatively RAID 10, is used as a better option. Here the order of the RAIDs is reversed, which means that RAID 0 is used, the elements of which are RAID 1. Both RAIDs with 8 HDDs can be seen in Fig. 2, in which RAID 0 + 1 is on the top and RAID 1 + 0 is on the bottom.

Fig. 1 RAID 0 (left) and RAID 1 (right) with 5 HDDs


Fig. 2 RAID 0 + 1 (top) and RAID 1 + 0 (bottom) with 8 HDDs

3 Mathematical Background

Since the definition of the structure function (1) corresponds to the definition of a logic function, it is also possible to use the mathematical apparatus of logic differential calculus (LDC) in reliability analysis [6, 14]. LDC can be used to analyze how the state of a component, specifically its change, affects the overall state of the analyzed system. A part of the LDC known as the direct partial logic derivative (DPLD) can be used to determine how a change in the state of a component affects the system state [15]. The DPLD is defined as follows:

∂φ(j → k)/∂xi(r → s) = { 1, if φ(ri, x) = j ∧ φ(si, x) = k; 0, otherwise },    (4)


where ∧ denotes the Boolean operation AND and − denotes the negation of the argument interpreted as a Boolean function. The DPLD can be used to calculate several IMs, one of which is the structure index (SI). The SI using the DPLD can be expressed as follows [3]:

SIi = TD( ∂φ(1 → 0)/∂xi(1 → 0) ),    (5)

where TD represents the truth density of the argument interpreted as a Boolean function. The truth density value corresponds to the relative number of state vectors for which the argument has a non-zero value [3].
DPLDs are defined not only for the structure function, but also for the survival signature. There are 3 basic DPLDs that can be used [16]. The first DPLD for the survival signature is used to identify the possibility of a decrease in system survivability when one component of a given type fails. Its definition is as follows [16]:

∂φ(l1, ..., lK)↓/∂lk(a → a − 1) = { 1, if φ(l1, ..., ak, ..., lK) > φ(l1, ..., ak − 1, ..., lK); 0, otherwise },    (6)

where a ∈ {1, 2, ..., nk} denotes the number of working components of type k. Based on this DPLD, an SI can be calculated as follows:

SI↓k,a = TD( ∂φ(l1, ..., lK)↓/∂lk(a → a − 1) ).    (7)

This SI expresses the relative number of situations in which the failure of a component of type k, when a components of this type are working, has an impact on system survivability.
The second DPLD for the survival signature finds its use in detecting the possibility of a decrease in system survivability if the number of functional components of the given type decreases. Its definition is as follows [16]:

∂φ(l1, ..., lK)↓/∂lk↓ = { 1, if φ(l1, ..., lk, ..., lK) > φ(l1, ..., l̃k, ..., lK); 0, otherwise },    (8)

where lk ∈ {1, 2, ..., nk} and l̃k = lk − 1. Even with this derivative, it is possible to calculate an SI, which has the following form:

SI↓k = TD( ∂φ(l1, ..., lK)↓/∂lk↓ ).    (9)

This SI expresses the relative number of situations in which a decline in the number of working components of a given type has an impact on the system survivability.


The third DPLD focuses on quantifying how much the system survivability decreases, assuming that a component of a given type fails, and its definition is as follows [16]:

∂φ(l1, ..., lK)⇓/∂lk↓ = { ε, if φ(l1, ..., lk, ..., lK) > φ(l1, ..., l̃k, ..., lK); 0, otherwise },    (10)

where ε = φ(l1, ..., lk, ..., lK) − φ(l1, ..., l̃k, ..., lK) for lk = 1, 2, ..., nk and l̃k = lk − 1. The SI for this derivative represents the average decrease in the survival signature value, taking into account the decrease in the number of functional components of a given type. It can be expressed as follows:

SI⇓k = ( ∑_{l ∈ Sk} ∂φ(l)⇓ ) / ( nk · ∏_{i ∈ Mk} (ni + 1) ),    (11)

where l = (l1, ..., lK) is a vector of variables that represent the number of working components of each type, Sk is the set of all vectors l in which lk ∈ {1, 2, ..., nk} and li ∈ {0, 1, 2, ..., ni} for i = 1, ..., k − 1, k + 1, ..., K, and Mk is the set {1, ..., k − 1, k + 1, ..., K}.
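As an illustration of how the SIs (7), (9) and (11) can be evaluated from a tabulated survival signature, the following Python sketch can be used (the function and variable names are our own and this is not the tool used by the authors):

from itertools import product

def survival_sis(phi, n, k):
    """SI (7) for every a, SI (9) and SI (11) for component type k (0-based).

    phi -- dict mapping (l_1, ..., l_K) to the survival signature value
    n   -- tuple with the numbers of components of each type (n_1, ..., n_K)
    """
    other_ranges = [range(n[j] + 1) for j in range(len(n)) if j != k]
    m_other = 1
    for j in range(len(n)):
        if j != k:
            m_other *= n[j] + 1

    def value(l_other, lk):              # survival signature at the given counts
        l = list(l_other)
        l.insert(k, lk)
        return phi[tuple(l)]

    si_down_a = {}                        # Eq. (7), one value for each a
    for a in range(1, n[k] + 1):
        hits = sum(value(lo, a) > value(lo, a - 1)
                   for lo in product(*other_ranges))
        si_down_a[a] = hits / m_other

    drops = 0                             # Eq. (9): count of detected decreases
    eps_sum = 0.0                         # Eq. (11): sum of decrease magnitudes
    for lk in range(1, n[k] + 1):
        for lo in product(*other_ranges):
            diff = value(lo, lk) - value(lo, lk - 1)
            if diff > 0:
                drops += 1
                eps_sum += diff
    denom = n[k] * m_other
    return si_down_a, drops / denom, eps_sum / denom

For the survival signatures of Table 1, this reproduces, for example, SI↓1 = 0.3 and SI⇓1 ≈ 0.133 for placement 1.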

4 Case Study

As a usage example, the data storage system described in [17] will be used. This system is a storage system with eight hard disks with a capacity of 12 terabytes, where half of the HDDs are of type 1 and the other half of type 2. As part of the reliability analysis, it is necessary to analyze how different placements of the HDDs in the data storage system affect the importance of each type of HDD, namely HGST HUH721212ALN604 and Seagate ST12000NM0007. To cover all the main possible topologies, four different RAIDs will be used, which can be seen in Fig. 3. RAID 1 will work if at least one hard drive is working, and RAID 0 will only work if all hard drives are working. As for RAID 0 + 1 and RAID 1 + 0, they combine RAID 0 and RAID 1 to obtain good attributes of both. For RAID 0 + 1 and RAID 1 + 0, we considered two different topological disk arrangements. For the purposes of the reliability analysis, this system is understood as a BSS, which means that the individual disks are either in a functional state or in a failure state, and the overall data storage system is either in a functional or a failure state. For the mathematical expression of the data storage system, based on the operation of the individual RAIDs and the placement of the hard disks in them, we determined the structure functions as follows:


Fig. 3 RBD for each RAID

φR0(x1, x2, x3, x4, x5, x6, x7, x8) = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x6 ∧ x7 ∧ x8,
φR01(x1, x2, x3, x4, x5, x6, x7, x8) = (x1 ∧ x2 ∧ x3 ∧ x4) ∨ (x5 ∧ x6 ∧ x7 ∧ x8),
φR01_2(x1, x2, x3, x4, x5, x6, x7, x8) = (x1 ∧ x2) ∨ (x3 ∧ x4) ∨ (x5 ∧ x6) ∨ (x7 ∧ x8),
φR1(x1, x2, x3, x4, x5, x6, x7, x8) = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x5 ∨ x6 ∨ x7 ∨ x8,
φR10(x1, x2, x3, x4, x5, x6, x7, x8) = (x1 ∨ x2 ∨ x3 ∨ x4) ∧ (x5 ∨ x6 ∨ x7 ∨ x8),
φR10_2(x1, x2, x3, x4, x5, x6, x7, x8) = (x1 ∨ x2) ∧ (x3 ∨ x4) ∧ (x5 ∨ x6) ∧ (x7 ∨ x8).    (12)

Since the data storage system consists of two types of HDDs, it is possible to focus on the use of the survival signature. However, before we focus on its evaluation, it is necessary to point out that there are 70 different possibilities for the HDD arrangement in each RAID. For RAID 0 and RAID 1, this is irrelevant for the purposes of our reliability analysis, and for the other RAIDs there are three placements that can represent all possibilities. Based on these facts, in this article we will focus in more detail on RAID 1 + 0 type 2. For the other RAIDs, we will only state the resulting values. The survival signatures for the three different placements of the HDDs shown in Fig. 4 are given in Table 1, where p1 is the survival signature for placement 1, p2 is the survival signature for placement 2, and p3 is the survival signature for placement 3. Table 1 lists only the non-zero values of the survival signatures in order to reduce the size of the table.
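The survival signatures in Table 1 can be obtained directly from the structure functions (12) by enumerating state vectors, as in the following Python sketch (an illustration only; the type assignment shown assumes that every mirrored pair combines one HDD of each type, which corresponds to the values reported for placement 3):

from itertools import combinations, product

def survival_signature(phi, type_of, n_types):
    """Survival signature (2) of a BSS computed by enumerating state vectors.

    phi     -- structure function taking a list of 0/1 component states
    type_of -- list assigning a type index (0 .. n_types-1) to each component
    """
    n = len(type_of)
    groups = [[i for i in range(n) if type_of[i] == t] for t in range(n_types)]
    signature = {}
    for ls in product(*[range(len(g) + 1) for g in groups]):
        total = working = 0
        # all ways of selecting exactly ls[t] working components of each type
        for picks in product(*[combinations(groups[t], ls[t])
                               for t in range(n_types)]):
            x = [0] * n
            for pick in picks:
                for i in pick:
                    x[i] = 1
            total += 1
            working += phi(x)
        signature[ls] = working / total
    return signature

# RAID 1 + 0 type 2, see Eq. (12)
phi_r10_2 = lambda x: (x[0] | x[1]) & (x[2] | x[3]) & (x[4] | x[5]) & (x[6] | x[7])
mixed_pairs = [0, 1, 0, 1, 0, 1, 0, 1]         # one type-1 and one type-2 HDD per pair
sig = survival_signature(phi_r10_2, mixed_pairs, 2)
print(round(sig[(2, 2)], 3))                    # 0.167, cf. Table 1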


From the values in Table 1 it can be seen that the most interesting placement from the survivability point of view is placement 3, because it has the most non-zero values of the survival signature and, for the most part, its survivability is better than in the other placements.
After calculating the survival signature, it is possible to focus on the calculation of the DPLDs as well as the corresponding SIs. Due to the symmetry of the HDDs in each placement, we focus the computation only on type 1, since for type 2 the values of the DPLDs and SIs are the same. The first DPLDs and SIs, calculated according to (6) and (7) for all placements shown in Fig. 4, are listed in Tables 2, 3 and 4. Based on the calculated values, in placement 1 there are noticeable changes in system survivability when there are three or two working HDDs and one HDD fails.

Fig. 4 Three different placements for RAID 1 + 0 type 2, placement 1 is on top, placement 2 is in the middle and placement 3 is at the bottom

Table 1 Survival signature for each placement for RAID 1 + 0 type 2 of the data storage system

l1 | l2 | φp1(l1, l2) | φp2(l1, l2) | φp3(l1, l2)
0  | 4  | –     | –     | 1
1  | 3  | –     | 0.25  | 0.25
1  | 4  | –     | 0.5   | 1
2  | 2  | 0.444 | 0.222 | 0.167
2  | 3  | 0.667 | 0.583 | 0.5
2  | 4  | 0.667 | 0.833 | 1
3  | 1  | –     | 0.25  | 0.25
3  | 2  | 0.667 | 0.583 | 0.5
3  | 3  | 1     | 0.875 | 0.75
3  | 4  | 1     | 1     | 1
4  | 0  | –     | –     | 1
4  | 1  | –     | 0.5   | 1
4  | 2  | 0.667 | 0.833 | 1
4  | 3  | 1     | 1     | 1
4  | 4  | 1     | 1     | 1


Table 2 DPLD 1 and SI for placement 1 and component of type 1 of the data storage system

l1 | l2 | ∂φp1(l1,l2)↓/∂l1(4→3) | ∂φp1(l1,l2)↓/∂l1(3→2) | ∂φp1(l1,l2)↓/∂l1(2→1) | ∂φp1(l1,l2)↓/∂l1(1→0)
1 | 0 | – | – | – | 0
1 | 1 | – | – | – | 0
1 | 2 | – | – | – | 0
1 | 3 | – | – | – | 0
1 | 4 | – | – | – | 0
2 | 0 | – | – | 0 | –
2 | 1 | – | – | 0 | –
2 | 2 | – | – | 1 | –
2 | 3 | – | – | 1 | –
2 | 4 | – | – | 1 | –
3 | 0 | – | 0 | – | –
3 | 1 | – | 0 | – | –
3 | 2 | – | 1 | – | –
3 | 3 | – | 1 | – | –
3 | 4 | – | 1 | – | –
4 | 0 | 0 | – | – | –
4 | 1 | 0 | – | – | –
4 | 2 | 0 | – | – | –
4 | 3 | 0 | – | – | –
4 | 4 | 0 | – | – | –
SI↓1,a |  | 0 | 0.6 | 0.6 | 0

Table 3 DPLD 1 and SI for placement 2 and component of type 1 of the data storage system

l1 | l2 | ∂φp2(l1,l2)↓/∂l1(4→3) | ∂φp2(l1,l2)↓/∂l1(3→2) | ∂φp2(l1,l2)↓/∂l1(2→1) | ∂φp2(l1,l2)↓/∂l1(1→0)
1 | 0 | – | – | – | 0
1 | 1 | – | – | – | 0
1 | 2 | – | – | – | 0
1 | 3 | – | – | – | 1
1 | 4 | – | – | – | 1
2 | 0 | – | – | 0 | –
2 | 1 | – | – | 0 | –
2 | 2 | – | – | 1 | –
2 | 3 | – | – | 1 | –
2 | 4 | – | – | 1 | –
3 | 0 | – | 0 | – | –
3 | 1 | – | 1 | – | –
3 | 2 | – | 1 | – | –
3 | 3 | – | 1 | – | –
3 | 4 | – | 1 | – | –
4 | 0 | 0 | – | – | –
4 | 1 | 1 | – | – | –
4 | 2 | 1 | – | – | –
4 | 3 | 1 | – | – | –
4 | 4 | 0 | – | – | –
SI↓1,a |  | 0.6 | 0.8 | 0.6 | 0.4

Table 4 DPLD 1 and SI for placement 3 and component of type 1 of the data storage system

l1 | l2 | ∂φp3(l1,l2)↓/∂l1(4→3) | ∂φp3(l1,l2)↓/∂l1(3→2) | ∂φp3(l1,l2)↓/∂l1(2→1) | ∂φp3(l1,l2)↓/∂l1(1→0)
1 | 0 | – | – | – | 0
1 | 1 | – | – | – | 0
1 | 2 | – | – | – | 0
1 | 3 | – | – | – | 1
1 | 4 | – | – | – | 0
2 | 0 | – | – | 0 | –
2 | 1 | – | – | 0 | –
2 | 2 | – | – | 1 | –
2 | 3 | – | – | 1 | –
2 | 4 | – | – | 0 | –
3 | 0 | – | 0 | – | –
3 | 1 | – | 1 | – | –
3 | 2 | – | 1 | – | –
3 | 3 | – | 1 | – | –
3 | 4 | – | 0 | – | –
4 | 0 | 1 | – | – | –
4 | 1 | 1 | – | – | –
4 | 2 | 1 | – | – | –
4 | 3 | 1 | – | – | –
4 | 4 | 0 | – | – | –
SI↓1,a |  | 0.8 | 0.6 | 0.4 | 0.2


Looking at placement 2, the most problematic HDD failure occurs when three HDDs of the given type are working; however, even the failure of the last HDD is still relevant for system survivability. As for the last placement, the most problematic failure is already the failure of an HDD when 4 HDDs are working, the importance of an HDD failure decreases with a decreasing number of working HDDs, and the failure of the last HDD still affects the survivability of the system.
The first DPLDs represent a view of the system for each decrease in the number of working HDDs. For a holistic view in the reliability analysis, it is necessary to focus on the second and third DPLDs as well. The second DPLDs and SIs, calculated according to (8) and (9) for all placements shown in Fig. 4, are listed in Table 5. Based on the calculated values, when a component of a given type fails, its importance is the lowest when placement 1 is considered, and for placement 2 the importance for the system is the greatest.
The second DPLDs focus on indicating a decline in the survivability of the data storage system, but they do not consider the degree of degradation of this survivability. Therefore, it is advisable to calculate the third DPLDs as well. The third DPLDs for all placements shown in Fig. 4, calculated by formula (10), are shown in Table 6.

Table 5 DPLD 2 and SI for all placements and component of type 1 of the data storage system

l1 | l2 | ∂φp1(l1,l2)↓/∂l1↓ | ∂φp2(l1,l2)↓/∂l1↓ | ∂φp3(l1,l2)↓/∂l1↓
1 | 0 | 0 | 0 | 0
1 | 1 | 0 | 0 | 0
1 | 2 | 0 | 0 | 0
1 | 3 | 0 | 1 | 1
1 | 4 | 0 | 1 | 0
2 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 0
2 | 2 | 1 | 1 | 1
2 | 3 | 1 | 1 | 1
2 | 4 | 1 | 1 | 0
3 | 0 | 0 | 0 | 0
3 | 1 | 0 | 1 | 1
3 | 2 | 1 | 1 | 1
3 | 3 | 1 | 1 | 1
3 | 4 | 1 | 1 | 0
4 | 0 | 0 | 0 | 1
4 | 1 | 0 | 1 | 1
4 | 2 | 0 | 1 | 1
4 | 3 | 0 | 1 | 1
4 | 4 | 0 | 0 | 0
SI↓1 |  | 0.3 | 0.6 | 0.5


Based on the calculated values, it is possible to see a change compared to the previous DPLDs: in the case of this importance measure, there is a noticeable rise for placement 3. However, when we focus on the individual DPLD values, placement 2 has the highest loss rate of all three placements when only one hard drive of the given type is working, while for the other values the loss of survivability is not as drastic as it is, for example, in placement 1 when there are 2 functional HDDs.
As for the other RAIDs and their placements, Table 7 lists all SIs for all DPLDs. From their values, we can see that the importance of a type can be the same for the second and third DPLDs, while for the first DPLDs we can see a difference. For example, in RAID 0 + 1 and RAID 1 + 0 placement 3, the values of SI↓1 are 0.55 for both, as is 0.138 in the case of SI⇓1. This is different for SI↓1,1, SI↓1,2, SI↓1,3 and SI↓1,4, where we can see that their values are reversed. This means that in RAID 0 + 1 with placement 3 it is more important for system survivability when there are 3 working HDDs of the given type and one fails than when there are 2 working HDDs. In RAID 1 + 0 with placement 3 it is more important for system survivability when there are 2 working HDDs of the given type and one fails than when there are 3 working HDDs.

Table 6 DPLD 3 and SI for all placements and component of type 1 of the data storage system

l1 | l2 | ∂φp1(l1,l2)⇓/∂l1↓ | ∂φp2(l1,l2)⇓/∂l1↓ | ∂φp3(l1,l2)⇓/∂l1↓
1 | 0 | 0 | 0 | 0
1 | 1 | 0 | 0 | 0
1 | 2 | 0 | 0 | 0
1 | 3 | 0 | 0.25 | 0.25
1 | 4 | 0 | 0.5 | 0
2 | 0 | 0 | 0 | 0
2 | 1 | 0 | 0 | 0
2 | 2 | 0.444 | 0.222 | 0.167
2 | 3 | 0.667 | 0.333 | 0.25
2 | 4 | 0.667 | 0.333 | 0
3 | 0 | 0 | 0 | 0
3 | 1 | 0 | 0.25 | 0.25
3 | 2 | 0.222 | 0.361 | 0.333
3 | 3 | 0.333 | 0.292 | 0.25
3 | 4 | 0.333 | 0.167 | 0
4 | 0 | 0 | 0 | 1
4 | 1 | 0 | 0.25 | 0.75
4 | 2 | 0 | 0.25 | 0.5
4 | 3 | 0 | 0.125 | 0.25
4 | 4 | 0 | 0 | 0
SI⇓1 |  | 0.133 | 0.167 | 0.2


Table 7 SI for all placements of all RAIDs for component of type 1 of the data storage system

RAID and placement | [SI↓1,1; SI↓1,2; SI↓1,3; SI↓1,4] | SI↓1 | SI⇓1
RAID 0 | [0; 0; 0; 0.2] | 0.05 | 0.05
RAID 1 | [0.2; 0; 0; 0] | 0.05 | 0.05
RAID 0 + 1 placement 1 | [0; 0; 0; 0.8] | 0.2 | 0.2
RAID 0 + 1 placement 2 | [0; 0.6; 0.6; 0.4] | 0.4 | 0.117
RAID 0 + 1 placement 3 | [0.4; 0.4; 0.8; 0.6] | 0.55 | 0.138
RAID 0 + 1 type 2 placement 1 | [0; 0.6; 0.6; 0] | 0.3 | 0.133
RAID 0 + 1 type 2 placement 2 | [0.6; 0.8; 0.6; 0.4] | 0.6 | 0.167
RAID 0 + 1 type 2 placement 3 | [0.8; 0.6; 0.4; 0.2] | 0.5 | 0.2
RAID 1 + 0 placement 1 | [0.8; 0; 0; 0] | 0.2 | 0.2
RAID 1 + 0 placement 2 | [0.4; 0.6; 0.6; 0] | 0.4 | 0.117
RAID 1 + 0 placement 3 | [0.6; 0.8; 0.4; 0.4] | 0.55 | 0.138
RAID 1 + 0 type 2 placement 1 | [0; 0.6; 0.6; 0] | 0.3 | 0.133
RAID 1 + 0 type 2 placement 2 | [0.4; 0.6; 0.8; 0.6] | 0.6 | 0.167
RAID 1 + 0 type 2 placement 3 | [0.2; 0.4; 0.6; 0.8] | 0.5 | 0.2

5 Conclusion

In this article, we wanted to show the usability of DPLDs and their SIs defined for survival signatures in topological reliability analysis. We showed how the main message of the SIs can be understood in the case of DPLDs defined for the survival signature. We demonstrated their usability on the topological analysis of a data storage system that had eight hard drives of two different types, with four drives of type 1 and four drives of type 2. Of all the options investigated, the most interesting option for our analysis was RAID 1 + 0 type 2, thanks to the results in [17]. From the importance analysis, the importance of a type from the topological point of view is the lowest for placement 1. As an approach, the SIs can be seen as useful values that can identify a problematic type of system component or summarize the information of the given DPLDs. As for future development, we would like to focus on the development of a DPLD that would broaden the view of changing the number of working components of each component type based on the size of the change and not just on an indication of a decrease in survivability. Then, an SI for such a DPLD should be developed as a view from a different angle for topological reliability analysis.

Acknowledgements This work is co-financed by the Polish National Agency for Academic Exchange and by the Slovak Research and Development Agency under the grant “Application of MSS Reliability Analysis for Electrical Low-Voltage Systems” (AMRA, reg. no. SK-PL-21-0003).


References
1. Larrucea, X., Belmonte, F., Welc, A., Xie, T.: Reliability engineering. IEEE Softw. (2017). https://doi.org/10.1109/MS.2017.89
2. Papadopoulos, V., Giovanis, D.G.: Reliability analysis. Math. Eng. (2018)
3. Kvassay, M., Zaitseva, E.: Topological analysis of multi-state systems based on direct partial logic derivatives. In: Springer Series in Reliability Engineering, pp. 265–281 (2018)
4. Zio, E.: Reliability engineering: old problems and new challenges. Reliab. Eng. Syst. Saf. 94(2), 125–141 (2009). https://doi.org/10.1016/J.RESS.2008.06.002
5. Kuo, W., Zhu, X.: Importance Measures in Reliability, Risk, and Optimization: Principles and Applications. John Wiley and Sons (2012)
6. Zaitseva, E.N., Levashenko, V.G.: Importance analysis by logical differential calculus. Autom. Remote Control 74(2), 171–182 (2013). https://doi.org/10.1134/S000511791302001X
7. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65(4), 1710–1723 (2016). https://doi.org/10.1109/TR.2016.2578948
8. Coolen, F.P.A., Coolen-Maturi, T.: Generalizing the signature to systems with multiple types of components. Adv. Intell. Soft Comput. 170, 115–130 (2012). https://doi.org/10.1007/978-3-642-30662-4_8
9. Kundu, S., et al.: Special session: reliability analysis for AI/ML hardware. In: 2021 IEEE 39th VLSI Test Symposium (VTS), pp. 1–10 (2021). https://doi.org/10.1109/VTS50974.2021.9441050
10. Zhang, C., Shafieezadeh, A.: Simulation-free reliability analysis with active learning and physics-informed neural network. Reliab. Eng. Syst. Saf. 226, 108716 (2022). https://doi.org/10.1016/J.RESS.2022.108716
11. Rusnak, P., Sedlacek, P., Forgac, A., Illiashenko, O., Kharchenko, V.: Structure function based methods in evaluation of availability of healthcare system. In: Conference Proceedings of 2019 10th International Conference on Dependable Systems, Services and Technologies, DESSERT 2019, pp. 13–18 (2019). https://doi.org/10.1109/DESSERT.2019.8770009
12. Sujan, M.A., Embrey, D., Huang, H.: On the application of human reliability analysis in healthcare: opportunities and challenges. Reliab. Eng. Syst. Saf. 194, 106189 (2020). https://doi.org/10.1016/J.RESS.2018.06.017
13. Silberschatz, A., Galvin, P.B., Gagne, G.: Operating System Concepts, 10th edn. John Wiley & Sons, Inc. (2018)
14. Zaitseva, E., Levashenko, V.: Reliability analysis of multi-state system with application of multiple-valued logic. Int. J. Qual. Reliab. Manag. 34(6), 862–878 (2017). https://doi.org/10.1108/IJQRM-06-2016-0081
15. Yanushkevich, S.N., Michael Miller, D., Shmerko, V.P., Stanković, R.S.: Decision Diagram Techniques for Micro- and Nanoelectronic Design: Handbook. CRC Press (2005)
16. Rusnak, P., Zaitseva, E., Coolen, F., Kvassay, M., Levashenko, V.: Logic differential calculus for reliability analysis based on survival signature. IEEE Trans. Dependable Secur. Comput. (2022). https://doi.org/10.1109/TDSC.2022.3159126
17. Rusnak, P., Mrena, M.: Time Dependent Reliability Analysis of the Data Storage System Based on the Structure Function and Logic Differential Calculus, vol. 976 (2021)

Digital Techniques for Reliability Engineering and Computational Intelligence

Software Tests Quality Evaluation Using Code Mutants

Peter Sedlacek, Patrik Rusnak, and Terezia Vrabkova

Abstract Software plays a significant role in nearly every system these days. Therefore, a large emphasis has to be placed on its quality to minimize failures caused by programmers. Nearly every software development process includes testing. However, low-quality software tests can lead to a false impression of software quality. In this paper, an approach to evaluate the quality of software tests is presented. This approach is based on mutation testing. The main idea of this approach is to generate mutations of the original source code using specified criteria. Each mutation is tested by the tests created for the original software, and the number of mutants detected by the tests is evaluated. This approach is demonstrated on a simple program written in the C# programming language.

Keywords Code mutants · Software testing · Reliability analysis

1 Introduction Development of software, also known as software development life cycle usually consists of these parts: planning, analysis, design, implementation, testing, integration and maintenance. Software testing plays significant role in this process, because it detects difficulties in created software [1], that are caused by programmer mistake, misunderstanding of software requirements, etc. Software testing consists of This work is supported by Grant System of University of Zilina No. KOR/3181/2022. This work develops results of the project “Development of methods of healthcare system risk and reliability evaluation under coronavirus outbreak” which has been supported by the Slovak Research and Development Agency under Grant no. PP COVID-20-0013. P. Sedlacek (B) · P. Rusnak · T. Vrabkova University of Zilina, Zilina, Slovakia e-mail: [email protected] P. Rusnak e-mail: [email protected] T. Vrabkova e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_3


four stages: unit testing, integration testing, system testing and acceptance testing. In unit testing, tests are performed on the small parts of software called units. Units are isolated from the rest of system and it is evaluated that whether this small piece of code produce correct outputs using specific set of inputs [2]. Integration testing connects units from previous stage into many groups that are tested again. There is verified communication between created components from units. Following stage, system testing, is monitoring whether completed and integrated system fulfill system requirements. The last testing stage is focused on system acceptability, where real data from customers are used for validation and verification of software [2]. However, bad quality of tests may result in created software with many errors, even if it looks like reliable software. There are several methods for assessing software tests quality, that evaluates ability of test to uncover faults. One of them includes use of code coverage that takes structural aspect of the software into the consideration. This means that it describes parts of the source code that were executed during software run, typically for individual lines, statements or branches [3]. This measurement is easy to implement a quick to calculate. Another approach, also connected with coverage is use case coverage or test coverage. Here, different functional scenarios are identified and number of various test cases for each scenario is computed [4]. The previously mentioned methods such as methods aimed to code coverage are not sufficient to evaluate quality of code. For that reason we decide to focus on mutation testing. This method can be used to evaluate quality of code and not only the fact a code is covered by tests. The main focus of this paper is to present method for software tests quality evaluation using code mutants. For this purposes we also implemented a software tool that allow us to read project written in C# programming language, define mutation operations and then generate mutants of this code and evaluate them using metrics presented in Sect. 2.1. This method is presented in Sect. 4 on a simple program.

2 Mutation Testing Mutation testing is software testing technique where set of mutants are generated from source code [5]. Mutant is created by locating mutation operator on the certain place in source code. For example, mutation operator can be replacing one arithmetic operator (e.g. ‘*’) with another arithmetic operator (e.g. ‘.+’). These replacements can be done also on different statements of source code such as assignment, comparison, equality, logical and other operators. Mutation operator can also remove parts from source code, typically jump statements or function’s body, that is, in fact, executed by replacing function’s body with empty body [5]. According to dependency between original source code and results of performed tests, states of formed mutant can be defined: 1. killed—at least one test failed, 2. survived—all tests passed,


3. not covered—mutation operator cannot be placed in source code, 4. invalid—there was not possible to run test due to compilation error or test run never ends because of infinite loops, etc. For test quality evaluation we use first two states. The first state represents situation where tests are able to detect a mutant. The second state indicates that additional tests should be added or existing tests should be modified or expanded, because current tests did not reveal a defect. The last two states can be ignored, because they do not lead into software error that is not repaired before system deployment [6]. Not covered and invalid mutants are defined mainly to cover all possible states of the original source code and the result of the mutation operator, but are ignored in practical evaluation.

2.1 Mutation Testing Metrics Mutants with their states can be used for evaluation of test suit quality. For this purpose several metrics were proposed according to varied criteria. Metrics can be concentrated on provided test suite or generated mutants. For the first category, number of test cases or time required for testing is estimated. Second category includes estimation of executed mutants count, time required for mutant compilation or their generation [7]. On the basis of enumerated state of mutations, mutation testing metrics from second category are defined in our case. We use these five types of metrics: 1. number of killed mutants—total number of mutants that are in state killed, 2. number of survived mutants—total number of mutants that are in state survived, 3. number of all mutants—total number of mutants that are in state killed, survived, not covered or invalid, 4. number of valid mutants—total number of mutants that are in state killed or survived, 5. number of non-valid mutants—total number of mutants that are in state not covered or invalid. One more metric that is frequently used is number of equivalent mutants. Equivalent mutants are mutants that makes identical output like tested system independent from input [5, 7]. However looking for equivalent mutants and their removal is difficult and for that reason is often not implemented in software tools for mutation testing.

2.2 Algorithm of Mutation Testing Software tools for mutation testing use different testing algorithms, but in general all these algorithms consist of two steps - mutants creation and their running against test


Fig. 1 Mutation testing algorithm

A consequence of this process is the huge quantity of mutants that need to be tested [5, 8]. To deal with this problem, mutant reduction techniques are used. Mutant reduction can be done by selecting random mutation operators or mutation operators that are more suitable for the present system, which is known as selective mutation [9]. In another method, similar mutants are clustered and a small number of mutants from each cluster is used for testing. This approach is called mutation clustering [8, 10].
Our algorithm of mutation testing, which is used in the created application, can also be divided into two parts, as Fig. 1 shows. In the first part, code mutants are created. For this purpose, we use a syntax tree, which describes the structure of the program syntax [11]. A syntax tree of the original source code is created for each class that is not currently ignored. In the next step, every syntax tree is traversed and specific nodes of the tree are replaced or removed according to the mutation operators defined in the configuration file. A new syntax tree is created after any of these changes and its source text representation (the mutant) is saved on the hard drive. The second part starts with mutant compilation. When the code is successfully compiled, all tests written for the mutated project are performed. Then a state is assigned to the mutant according to the evaluation of the test results.
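The same two-part idea can be illustrated independently of the authors' C#/Roslyn-based tool. The following sketch uses Python's standard ast module purely as an illustration; the example function, mutation operator and test are ours:

import ast

class AddToSub(ast.NodeTransformer):
    """Mutation operator: replace every '+' with '-' in the syntax tree."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)                       # part 1: build the syntax tree
mutant_tree = AddToSub().visit(tree)           # ... and apply the operator
ast.fix_missing_locations(mutant_tree)
print(ast.unparse(mutant_tree))                # Python 3.9+: prints the mutated source

# part 2: compile the mutant and run the original test against it
namespace = {}
exec(compile(mutant_tree, "<mutant>", "exec"), namespace)
state = "killed" if namespace["add"](2, 1) != 3 else "survived"
print(state)                                   # this test kills the mutant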

3 Software Tests Quality Evaluation

The quality of software tests is evaluated by applying the mutation score. The mutation score reveals the effectiveness of a test suite in detecting mutants [8] and can be calculated from the metrics introduced in Sect. 2 as follows:

mutation score = K / V,    (1)

where K stands for the number of killed mutants and V for the number of valid mutants. This means that the mutation score depends on the number of valid mutants that we can create. The mutation score acquires values from 0 to 1, where the value 1 signifies a high-quality test suite.


Let us consider the following example. For our purposes, let us say that in the mutation testing process 3 valid mutants were generated and all 3 mutants were killed. When we put these numbers into Eq. (1), we get a mutation score equal to 1. In mutation testing applied to another system, 1000 valid mutants were created and 955 of them were killed. Now the mutation score equals 0.955, even though many more mutants were revealed. This short example shows that it is hard to evaluate tests without further information. Another problem with the mutation score resides in the absence of equivalent mutant evaluation, which can cause a decrease of the mutation score even if the detection of errors with the used test suite increased [12]. In practice, the mutation score can be calculated with equivalent mutants taken into account as follows [8]:

mutation score = K / (M − E),    (2)

where K represents the number of killed mutants, M is the total number of mutants and E stands for the number of equivalent mutants.
Along with the mutation score, coverage metrics are used that measure the completeness of the testing suite and therefore can be used for verification of the results obtained from the mutation score. These metrics can be evaluated separately for sequential and concurrent programs, because concurrent programs can produce different results depending on the process order given by the scheduler. For that reason, all potential executions need to be covered by mutants [13].
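The metrics from Sect. 2.1 and the mutation score (1) can be computed directly from the list of mutant states, as in the following Python sketch (our own illustration, not a part of the described tool); it reproduces the two examples discussed above:

def mutation_metrics(states):
    """Metrics from Sect. 2.1 and the mutation score (1) for a list of mutant
    states, each one of: 'killed', 'survived', 'not covered', 'invalid'."""
    killed = states.count("killed")
    survived = states.count("survived")
    valid = killed + survived
    metrics = {
        "killed": killed,
        "survived": survived,
        "all": len(states),
        "valid": valid,
        "non-valid": len(states) - valid,
    }
    score = killed / valid if valid else 0.0          # Eq. (1)
    return metrics, score

print(mutation_metrics(["killed"] * 3)[1])                        # 1.0
print(mutation_metrics(["killed"] * 955 + ["survived"] * 45)[1])  # 0.955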

4 Model Example

In this section, we demonstrate the whole mutation testing process that was described in Sect. 2.2, beginning with the creation of a syntax tree for each class and finishing with the evaluation of the test results together with the determination of the mutation testing metrics defined in Sect. 2.1 and the mutation score according to Eq. (1). In this short example, we ignore the formation of equivalent mutants and the program is executed sequentially.
Let us take the source code from Fig. 2. It is a program written in the C# language that represents a simple calculator.

Fig. 2 Source code of a simple program


Fig. 3 Part of the syntax tree for the simple program from Fig. 2

Please note that for mutation testing it is not important what programming paradigm or programming language is used. Mutation operators can be applied to procedural code as well as object-oriented code. However, each paradigm can have its own specifics; for example, in object-oriented code, mutation operators on classes can be applied, etc. In our example, we decided to use the C# language. This language contains many useful features that can be used for mutation testing, for example for syntax tree creation and manipulation.
There are two methods declared in the example. The first method has two parameters and returns their sum. The second method also accepts two parameters, but it returns the difference of these two numbers. This class enters the process of syntax tree creation and will then be mutated.
Syntax trees are created with the .NET compiler platform Roslyn. Roslyn provides an open-source set of compilers and analyzers of source code written in the C# or Visual Basic programming language [14], which enables a complete analysis of our class. Figure 3 shows the syntax tree for the body of the method that calculates the addition of two numbers. The syntax tree of the whole class is not presented due to its size. A syntax tree consists of syntax nodes (rectangles with a blue background) and syntax tokens (other rectangles). For our case, only syntax nodes are necessary, because they represent statements, declarations, expressions, etc. Figure 3 shows that the method block consists of four syntax nodes: a return statement, an add expression and two identifier names. When mutants are generated, only these four nodes are traversed and each of them can be replaced or removed. In this example, we define only one mutation operator that replaces the arithmetic operator '+' with the arithmetic operator '−', which is realized as the replacement of the add expression with a subtract expression. A new syntax tree created from this substitution can be seen in Fig. 4.
A Roslyn syntax tree can be easily transformed to its string representation, as can be seen in Fig. 5. This figure shows the mutant generated from the introduced source code. The Calculator class still comprises two methods, but now both methods return the same output when the inputs are identical. After saving this string on the hard drive and compiling it, the mutant is prepared for testing. For illustration, we use one unit test class with two unit test methods, as shown in Fig. 6.


Fig. 4 Part of the syntax tree for the mutant generated from the source code in Fig. 2
Fig. 5 Mutant of the simple program from Fig. 2

Fig. 6 Example of tests for the simple program from Fig. 2

The first unit test method tests the addition of numbers, specifically 2 and 0, and expects that the result will equal 2. The second test method tests the subtraction of numbers, namely 3 minus 1, and expects the number 2 as a result. Both tests pass when they are run against the original source code. From the test run against the mutant, we expect that at least one test fails; however, both tests pass again, which means that the mutant was not killed. The reason resides in the poor quality of the test method, because it uses inputs for which the tested method returns the same result after both addition and subtraction. The solution to this problem is to change these inputs or to add one more test method with different inputs. The metrics for this example can be seen in Table 1. The mutation score equals 0. After the test suite improvement according to the previous suggestions, the mutation score will equal 1 and the mutation testing metrics will change as can be seen in Table 2.


Table 1 List of metrics for proposed example

Name of metric | Value of metric
Number of killed mutants | 0
Number of survived mutants | 1
Number of all mutants | 1
Number of valid mutants | 1
Number of non-valid mutants | 0

Table 2 List of metrics for proposed example after test suite improvement

Name of metric | Value of metric
Number of killed mutants | 1
Number of survived mutants | 0
Number of all mutants | 1
Number of valid mutants | 1
Number of non-valid mutants | 0

There was only one mutation operator applied in the mutation testing process; however, we can suggest more mutation operators for this example. For instance, the arithmetic operator '+' can be replaced by another arithmetic operator, or the arithmetic operator '−', which is used in the second method, can be substituted. This modification results in a higher number of created mutants, which also affects the mutation testing metrics and the mutation score. It is allowed to replace one operator with more than one different operator to form various mutants, but this has an impact on the time that is spent on the verification of mutant states through test suite runs.
Figure 7 shows the previous model example in the created application. On the left side, the settings are shown. The first two buttons are used for selecting the solution of a .NET project and a test project that uses the NUnit, XUnit or MSTest unit testing framework. The mutation operators that will be used can be easily loaded from a configuration file in JSON format or can be set directly in the application. We can see that only Add Expression is checked. On the right side, the evaluated results can be seen. The application sequentially shows the mutation testing metrics for each file from the solution and then also calculates the mutation score. All these results can be exported into a file in multiple formats.

5 Conclusion

Quantification of software tests quality became important for revealing errors hidden in created software. There are two major ways of evaluating it: test coverage and mutation testing. The first method determines the proportion of the source code that is covered by software tests.


Fig. 7 Mutation testing process in created tool

However, it does not assess the influence of the diverse mistakes that can be made by a programmer. For that reason, test coverage should be combined with mutation testing, whose algorithm and metrics are presented in this paper. The general idea of mutation testing is to make small changes in the given source code and test this changed source code against the original test suite. In our application, a syntax tree is used for that purpose. For each class, one syntax tree is generated using the .NET compiler platform Roslyn. Every syntax tree is traversed and new mutants are created by replacing or removing nodes of the syntax tree according to the active mutation operators in the configuration file. It is easy to disable active mutants in the application and also to track the progress of this process. New mutants are saved on the hard drive and compiled, and the original tests are run. The next part of the process consists of evaluating the mutation testing metrics that depend on the software test results. These results determine the states of the particular mutants that enter into the calculation of the mutation score. The mutation score reveals the effectiveness of the test suite. The results for each syntax tree are shown in the application and can be saved in multiple file formats.
This paper also contains a demonstration of the mutation testing process on a simple calculator. It includes syntax tree creation and mutant creation from the given trees. The mutant states, metrics and the final mutation score are also evaluated.
Currently, the tool for software tests evaluation is still in the development process. New mutation operators are being added that can be used in the mutation testing process. They include simple operators that can be removed, but also operators that work with methods and especially LINQ extension methods that call standard query operators. In our future work, we will focus on problems in the mutation testing process and also in the created application. As can be seen in this paper, the mutation score value can incorrectly represent the quality of software tests without other information. This problem can be solved by providing this information or by improving the quantities that enter its calculation.


A remaining problem of the created application resides in its efficiency, both in the generation of mutants and in their testing. These problems can be solved by using parallelism in these processes and by reducing the number of created mutants using mutant reduction techniques. One more way of developing the proposed method of software reliability analysis by generating mutants and testing them is based on the use of another mathematical model, the multi-state system (MSS). It is a mathematical model that allows us to consider not only the total failure of the system but also the degradation of both the system and its components [15, 16].

References
1. Khan, M.E., Khan, F.: Importance of software testing in software development life cycle. Int. J. Comput. Sci. Issues (IJCSI) 11(2), 120 (2014)
2. Everett, G.D., McLeod, R., Jr.: Software Testing: Testing Across the Entire Software Development Life Cycle (2007)
3. Gopinath, R., Jensen, C., Groce, A.: Code coverage for suite evaluation by developers. In: Proceedings of the 36th International Conference on Software Engineering, pp. 72–82 (2014)
4. Zhu, H., Hall, P.A.V., May, J.H.R.: Software unit test coverage and adequacy. ACM Comput. Surv. (CSUR) 29(4), 366–427 (1997)
5. Bluemke, I., Kulesza, K.: Reductions of operators in Java mutation testing. In: Proceedings of the Ninth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX, June 30–July 4, 2014, Brunów, Poland, pp. 93–102. Springer (2014)
6. Papadakis, M., Kintis, M., Zhang, J., Jia, Y., Traon, Y.L., Harman, M.: Mutation testing advances: an analysis and survey. In: Advances in Computers, vol. 112, pp. 275–378. Elsevier (2019)
7. Pizzoleto, A.V., Ferrari, F.C., Offutt, J., Fernandes, L., Ribeiro, M.: A systematic literature review of techniques and metrics to reduce the cost of mutation testing. J. Syst. Softw. 157, 110388 (2019)
8. Domínguez-Jiménez, J.J., Estero-Botaro, A., García-Domínguez, A., Medina-Bulo, I.: Evolutionary mutation testing. Inf. Softw. Technol. 53(10), 1108–1123 (2011)
9. Usaola, M.P., Mateo, P.R.: Mutation testing cost reduction techniques: a survey. IEEE Softw. 27(3), 80–86 (2010)
10. Hussain, S.: Mutation clustering. Ms. Th., King's College London, Strand, London, p. 9 (2008)
11. Wang, W., Li, G., Ma, B., Xia, X., Jin, Z.: Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271. IEEE (2020)
12. Holling, D., Banescu, S., Probst, M., Petrovska, A., Pretschner, A.: Nequivack: assessing mutation score confidence. In: 2016 IEEE Ninth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 152–161. IEEE (2016)
13. Sen, A., Abadir, M.S.: Coverage metrics for verification of concurrent SystemC designs using mutation testing. In: 2010 IEEE International High Level Design Validation and Test Workshop (HLDVT), pp. 75–81. IEEE (2010)
14. Saadatmand, M.: Towards automating integration testing of .NET applications using Roslyn. In: 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), pp. 573–574. IEEE (2017)
15. Zaitseva, E., Levashenko, V.: Reliability analysis of multi-state system with application of multiple-valued logic. Int. J. Qual. Reliab. Manag. 34(6), 862–878 (2017)
16. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65(4), 1710–1723 (2016)

Hacking DCNs

Martin Lukac and Kamila Abdiyeva

Abstract In this paper we study the problem of the security of Deep Convolutional Networks (DCN). DCNs are gaining more and more integration into daily life and technology even though they cannot be directly understood or interpreted: while the process of learning from data is relatively well known, the process by which a DCN learns is still not explained. This state of using DCNs without full understanding creates several non-trivial security gaps and deficiencies that have already been addressed and exploited. Some of them are adversarial attacks, Trojan insertion, catastrophic content loss, etc. We study the problem of the DCN's sensitivity to noise attack insertion at the level of its components. We show that it is possible to determine how to inject Gaussian noise into the network so that a reliable predictor can be trained as to which images will be misclassified. In addition, we study the dependency of image misclassification as a function of the inserted noise magnitude and the average accuracy of the DCN. The results of this paper show that it is possible to relate the magnitude and type of inserted noise to the type of samples that are the most likely to be misclassified and therefore to predict the failure of the network. The methodology presented in this paper can thus be seen as a type of unstructured Trojan insertion with expected misclassification and minimal change of the default accuracy.

Keywords DCN · Noise injection · Targeted hacking

Martin Lukac and Kamila Abdiyeva contributed equally to this work. M. Lukac (B) · K. Abdiyeva Department of Computer Science, Nazarbayev University, Kabanbay Batyr 53, Astana 010000, Kazakhstan e-mail: [email protected] K. Abdiyeva e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_4


1 Introduction

The success of Deep Convolutional Neural Networks (DCN) in solving a wide range of real-world problems has resulted in a strongly increased usage of machine learning approaches, from simple image or sound processing, to recognition and segmentation, up to the generation of synthetic data [1]. One of the most important advantages that DCNs provide over other types of machine-learning-based approaches is an end-to-end learning methodology and a set of learnable features resulting directly from the learning process. Each feature in a DCN is obtained by a convolution of a filter with the input. Several layers of such convolutional filters, applied sequentially, can be used to obtain a robust, relatively noise-resistant but in general fairly narrow set of features. Each filter is learned during the learning process using a dataset whose size is proportional to the number of layers in the DCN [2]. DCNs are very effective in processing real-world data as well as in learning from very large datasets. However, the end-to-end learning and the learning-from-big-data approach create models that are not explainable and cannot be analyzed for causal relations between the data and the processing elements of the DCN. As a result, several serious vulnerabilities based on data manipulation and on unexpected associations hidden within the training dataset have been identified. A prime example is generative adversarial attacks [1]. For instance, the overparameterization [1, 3] of DCNs can be exploited: by simple input image manipulation a misclassification of the result can be obtained. More advanced attacks are represented by the injection of a Trojan-horse-like element into the DCN. Such a trained (or post-training) infected model will trigger a specific action on a set of selected inputs [3, 4].

The current state of the art in the security evaluation and security hacking of DCNs includes the following directions: adversarial attacks, poisoned training and post-training Trojan insertion [5]. The methodology of adversarial attacks is based on Generative Adversarial Network [1] training, where a target DCN is subjected to a game-like evaluation approach that allows the attacker to determine how an input image should be modified to trigger a misclassification [1]. Poisoned dataset training [6] uses a specially prepared dataset such that, when input images with specific properties are presented, a trigger embedded in the DCN results in a false classification. Finally, post-training Trojan insertion requires inserting into a trained DCN a specifically crafted set of trigger filters that react to a predetermined input image and result in a desired misclassification [7, 8].

While the Trojan attack can be used to target specific images, it requires either a poisoned dataset or the insertion of additional filters into the DCN [5]. As such it can be countered either by not allowing the attacker access to the training material or by visually evaluating the structure of the DCN. However, there has not yet been a study that uses a pre-trained DCN and evaluates how a single image or a set of images can be forced to be misclassified. That is, given a trained CNN and a new input sample, determine if and how to statistically alter the CNN's filters in order to maximize misclassification.


Therefore, in this work we consider a different type of attack setup. Assume a CNN is used as a part of a security system and as such undergoes regular checks for accuracy degradation by being evaluated on a test set, as well as architecture verification. An attacker gaining access to the model is interested in modifying the CNN in such a way that it preserves its overall accuracy but misclassifies a certain type of images. For this purpose we study the impact of a statistical Trojan insertion into the DCN with the target of maximizing the misclassification of a target sample while maintaining the overall accuracy as close to the original as possible. We use a pre-trained network and iteratively insert Gaussian noise into specifically selected individual filters or groups of filters. The filters are selected based on the results of [9], using Response Based Pruning (RBP). The impact of the noise insertion is measured both on the misclassification of a target input image and on the overall accuracy reduction. As a result, we determine criteria for inserting Gaussian noise that allow modifying the classification of a single input image while leaving the evaluation on the validation dataset intact. The difference between the proposed approach and previous Trojan horse approaches is that we do not insert any trigger but attempt to smoothly modify the overall network. Therefore, such an attack can be seen as a silent Trojan horse approach. In addition, we do not require a poisoned training dataset or the insertion of a trigger; rather, the inserted Trojan is crafted so as to become completely invisible to the testing.

2 Previous Work

One of the main reasons for trying to understand the decision-making process of a model is to be able to predict the effect of changing some features on the performance and reliability of the model. Merely looking at the important regions visually, or obtaining a textual description of the model's decision-making process, does not always help to adequately adjust the model's reliability. Therefore, an alternative line of work studies the effect of small changes to either the input or the model on the model's performance.

One line of work investigates the ability of a neural model to provide a decision explanation under perturbation of the input samples, by adding random perturbations to the inputs. In [10] the authors generate adversarial perturbations that produce perceptively indistinguishable inputs that are classified correctly by the model but have very different interpretations. The experiments show that systematic perturbations can lead to dramatically different interpretations without changing the label. To support these findings, the authors of [11] showed that by knowing what neurons respond to, one can reprogram a pre-trained network for synthetic image generation. The work in [12] observed that by determining neurons with vague explanations and manipulating their outputs one can achieve misclassification of class samples. In [13], the authors showed that determining more accurate neuron importance measures can improve the transferability of feature-level transfer-based


adversarial attacks. However, the main disadvantage of input-based adversarial attacks is that simply retraining the model with a dataset that contains adversarial examples improves the model's robustness dramatically [14]. Hence, in this work we study perturbations injected directly into the model, rather than perturbations introduced to the input. In this way, by using various neuron importance measures and the insertion of controllable perturbations, we want to understand what each neuron contributes. As a result we hope to come closer to an understanding of the neural network decision process. In this work we want to understand the model's decision-making process not through pure feature visualization or the generation of human-readable explanations, but rather in the context of studying the model's reliability under different types of model perturbations and spurious signals.

3 Background

Let a Deep Convolutional Neural Network (DCN) be described by a set of convolutional layers L = {l_1, ..., l_z}, each containing a set of filters l_j = {f_1, ..., f_p}. Each filter f_i ∈ R^{d1×d2×d3} is a tensor object, where d1 and d2 are the spatial dimensions of the filter and d3 is its depth. The convolution of an input I ∈ R^{x×y×z} with one layer of the DCN results in a new tensor object O ∈ R^{x'×y'×p}, where x' ≤ x and y' ≤ y are the spatial dimensions of the output tensor O, and p is the number of filters in layer l_j. A schematic of such a network is shown in Fig. 1.

The result of a CNN processing an input is a set of features F representing the input in a k-dimensional space. To augment the information used in the classifier, the output of the individual filters from each layer l_j can be used. Let the output of the

Fig. 1 Schematic representation of two layers of convolution in a CNN. Other layers such as dropout, batch normalization or pooling are not shown


convolution of a tensor object I with a single filter f_i be a feature map M_i ∈ R^{x'×y'} with M_i ∈ O. Then additional information can be obtained by averaging the filter's output M_i along the spatial dimensions into a single scalar r_i, later referred to as the accumulated response.

In this work we propose to insert noise into the least contributing filters. For this we use a procedure similar to the Response Based Pruning (RBP) of [9], with the difference that we determine the least contributing filters not on a class-by-class basis but on a sample-by-sample basis. Specifically, pruning a layer l_j is the process of removing one or more f_i from l_j using some predefined criterion, resulting in a new set of filters P(l_j) = l̂ = {f_1, ..., f_d̂}. Hence, a new output is Ô = l̂ ∗ I with Ô ∈ R^{x'×y'×d̂} and d̂ ≤ d. As the pruning criterion—the condition that determines which filters to remove from the network—we propose to threshold the filters' accumulated responses r_i. Let r = {r_1, r_2, ..., r_t} be the vector of accumulated responses (averaged outputs) of all filters f_i in the network for input I. For a given pruning ratio θ ∈ [0, 1], Response Based Pruning (RBP) refers to the process of removing the |r| ∗ θ filters f_i with the smallest r_i ∈ r, where |r| is the cardinality of r. For instance, for θ = 0.1, the 10% of filters with the lowest averaged responses r_i will be removed.
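As a minimal illustration of the two quantities just defined, the sketch below computes the accumulated responses by spatial averaging and then selects the θ fraction of filters with the smallest responses—the filters later targeted for noise insertion. The array shapes, function names and the NumPy dependency are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def accumulated_responses(feature_maps):
    """Average each filter's feature map over its spatial dimensions.

    feature_maps: array of shape (num_filters, H, W) holding the outputs
    M_i of every filter f_i for a single input I.
    Returns the vector r = (r_1, ..., r_t) of accumulated responses.
    """
    return feature_maps.mean(axis=(1, 2))

def select_least_contributing(r, theta):
    """Return the indices of the |r| * theta filters with the smallest
    accumulated responses (the filters RBP would prune and that the
    attack targets for noise insertion instead)."""
    k = int(len(r) * theta)
    return np.argsort(r)[:k]

# toy usage: 64 filters with random 8x8 feature maps, theta = 0.1
maps = np.random.rand(64, 8, 8)
r = accumulated_responses(maps)
target_filters = select_least_contributing(r, theta=0.1)  # 6 filter indices
```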

4 Methodology

Let f = {f_1, ..., f_N} be the set of all filters in a network G, with cardinality |f| = N. The network G is the initial network trained on the training dataset without any modification or noise insertion. Let θ ∈ [0, 1] denote the pruning ratio indicating the fraction of filters to be pruned. The Trojan insertion is then performed as follows. First, we train a baseline model G. Next, we compute the accumulated responses (averaged responses) r_i of each filter f_i for the image I. To identify the filters that will be affected by the Trojan insertion, for each I we determine f̄, the set containing the f_i with the smallest r_i, such that |f̄| = θ ∗ N. Once the filters are identified, we perform noise addition (instead of pruning) to the filters of the pre-trained model according to Eq. (1):

    f_i ← f_i + λ · X, with X ~ N(μ, σ²), if f_i ∈ f̄;   f_i otherwise.    (1)

where λ is a noise scaling factor and X ∈ R^{d1×d2×d3} (where d1 and d2 are the spatial dimensions of the filter f_i and d3 is its depth) is noise sampled from a normal distribution with mean μ = 0 and standard deviation σ = 1.
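A minimal sketch of Eq. (1), assuming the filter weights are stored as a single NumPy array and that f̄ is given as a list of indices; the function name and shapes are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np

def inject_noise(filters, target_idx, lam, rng=None):
    """Apply Eq. (1): add scaled Gaussian noise X ~ N(0, 1) to the selected
    filters and leave all other filters unchanged.

    filters: array of shape (num_filters, d1, d2, d3) with pre-trained weights;
    target_idx: indices of the filters in f-bar; lam: noise scaling factor.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = filters.copy()
    for i in target_idx:
        noise = rng.standard_normal(filters[i].shape)  # X ~ N(mu=0, sigma^2=1)
        noisy[i] = filters[i] + lam * noise
    return noisy
```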


4.1 Classification Evaluation

For the resulting modified model G^{θ,λ}, we compute the classification accuracy for the image I when using the modified set of filters f̂ and determine the effect of the Trojan noise insertion on the sample I. For images I that change their predicted label, we record the target and the changed-to labels. As a result, we construct an array T = {I_1 : [t, t̄], ..., I_k : [t, t̄]}, where t is the target label and t̄ is the predicted label after the noise insertion. The accuracy of the model is then computed by Eq. (2):

    acc(G^{θ,λ}) = (1/k) · Σ_{i=1}^{k} I(t = t̄)    (2)

where I(t = t̄) is the indicator function returning one if t and t̄ match, and G^{θ,λ} is the trained network model G for which the filters selected at pruning ratio θ received additive noise with scaling factor λ.
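The accuracy of Eq. (2) reduces to a label-matching average; a minimal sketch, with toy labels standing in for the recorded array T:

```python
import numpy as np

def accuracy(targets, predictions):
    """Eq. (2): the fraction of samples whose predicted label t-bar
    still matches the target label t after noise insertion."""
    targets = np.asarray(targets)
    predictions = np.asarray(predictions)
    return float(np.mean(targets == predictions))

# toy usage: 5 of 6 labels survive the noise insertion -> accuracy 0.833
print(accuracy([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 0]))
```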

4.2 Misclassification Label Prediction

In addition, we also train a classification model C on the training dataset to predict whether there will be a label change after noise insertion. The purpose of this experiment is to determine whether the misclassification of images resulting from the addition of random Gaussian noise can be predicted. Let I be an image from the dataset with a correctly predicted label t by G, whose label changes to t̄ at noise scale λ and threshold θ. Let W be the parameters of C; then the output of the model is r × W, where r is the vector of accumulated responses of the image I over the filters of the model G.
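A minimal sketch of such a predictor, assuming the accumulated responses are collected into a feature matrix and that a scikit-learn logistic regression stands in for the linear model C with parameters W; the shapes and the random placeholder data are assumptions used only to make the sketch self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# R: accumulated responses r of each training image under the clean model G
#    (one row per image); y: 1 if the image's label flipped after noise was
#    injected at a given (theta, lambda), 0 otherwise.
R = np.random.rand(200, 64)             # placeholder features
y = np.random.randint(0, 2, size=200)   # placeholder flip indicators

C = LogisticRegression(max_iter=1000).fit(R, y)   # linear model r x W
flip_probability = C.predict_proba(R[:5])[:, 1]   # predicted chance of misclassification
```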

5 Experimental Results

To evaluate the proposed approach we used MNIST [15], the German Traffic Sign Recognition Benchmark (GTSRB) [16] and a subset of 1100 images from the IMDB faces dataset [17]. For these datasets we trained the models G_m, G_g and G_i, respectively. The MNIST dataset contains ten categories and the input images are gray scale. The 1100-image IMDB subset contains gender categories (male, female) and age information (young if age < 30, old if age > 50). Therefore, for this subset we evaluate both the gender and the age labels. The inserted noise scale λ lies in the range [0.0027, 0.025] with increments of 0.0017. Both the noise scale interval and the noise scale increment were determined experimentally so that a clear decrease of the network accuracy can be observed.


5.1 Age and Gender Sensitivity

The first set of experiments aims to determine the classification accuracy as a function of the inserted noise. For this purpose, we use the Response Based Pruning (RBP) method to insert noise into the least and into the most contributing filters. The results obtained with G_i can be seen in Fig. 2. The y-axis shows the average classification accuracy, while the x-axis shows the magnitude of the inserted noise (inserted noise scale λ). To conduct the study we performed the classification of both gender and age on 1100 images from the IMDB dataset; both the gender and the age classification are binary classification tasks. The results show that in the case of gender classification, adding noise to the least active filters has a stronger effect on the misclassification of the data (Fig. 2a–b). In contrast, when classifying the same dataset using age labels, the maximally contributing filters have a much stronger effect on the misclassification of the data when the noise is added. Observe in Fig. 2d that the classification accuracy drops to 50% as soon as the smallest amount of noise is inserted, while when the least contributing filters are used for noise insertion (age classification, Fig. 2c), the loss of accuracy follows a trend similar to the classification of gender (Fig. 2a). Therefore, if

Fig. 2 Noise insertion into G_i trained on the IMDB dataset for the least (Min Pruning) and most (Max Pruning) significantly active filters. Each line shows the average accuracy at the specified ratio θ of filters affected by noise insertion


one is to insert a Trojan into such a network, the neurons to be used for noise insertion must be carefully selected; otherwise the performance will degrade too fast.

5.2 Prediction of Label Changes

We evaluated the predictive capacity of the proposed noise insertion model. For this purpose we trained two classification models C_i and C_m to predict whether there will be a label change after noise insertion. In addition, for the IMDB dataset, we trained two different models C_{i,a} and C_{i,g}, classifying the IMDB dataset for age and for gender, respectively. The purpose of this experiment is to determine whether the misclassification of images resulting from the addition of random Gaussian noise can be predicted. We evaluated four different dataset/label combinations to determine whether the results of our experiments generalize.

Figure 3 shows the results of predicting the misclassification of images as a function of the inserted noise. Each figure contains two sets of lines: with crosses and with

Fig. 3 Prediction of misclassification as a function of the inserted noise scale for a 0.05 (blue), b 0.1 (red), c 0.15 (yellow) and d 0.25 (purple) pruning ratio θ. The left y-axis shows the size of the training data set, the right axis shows the average accuracy of classification and the x-axis shows the scale of the inserted noise λ. Lines with dots represent the average accuracy of predicting misclassification and lines with crosses show the size of the training dataset. The legend has been omitted due to lack of space


circles. The lines with crosses represent the size of the training dataset for this task, and the lines with circles represent the average accuracy of predicting the misclassification at each pruning (noise insertion) threshold θ. The first observation is that for both C_{i,g} and C_m (Gender and MNIST) (Fig. 3a and d) the size of the training dataset increases linearly with the amount of inserted noise. However, in the case of the C_{i,a} (Age) and C_g (GTSRB) classification (Fig. 3b and c) the number of samples shows an unexpected behavior: the number of samples decreases after some threshold of added noise. This is due to the fact that for larger amounts of inserted noise most of the samples become misclassified, and therefore there are not enough correctly classified samples to construct a large training dataset.

Finally, we trained a regressor to predict the amount of noise necessary to misclassify a given image; in other words, we trained a regression model to predict the minimal noise scale λ required to misclassify the sample image I. The training was performed on the training set with experimentally determined noise scales λ for misclassification. The results are shown in a case-by-case manner in Fig. 4. Each figure shows a number of selected samples (indexed on the x-axis) for which the experimental (blue) and the predicted (red) noise is drawn. The results of the regression analysis show that the average error for the prediction of the required noise scale is 0.0019 and 0.0017 for age and gender classification, respectively. For the gender classification the noise prediction is close to the ground-truth values. Although for age classification the average accuracy of the noise prediction is similar to that of the gender noise prediction, the variance is much higher and therefore the predicted noise scale is in general larger than the noise increment. A possible reason is that the labels for age classification were based on a clustering of age into young and old groups. This, however, resulted in possibly fuzzy labels, because old and young cannot be reliably assessed from appearance only.

Fig. 4 Prediction of the noise resulting in misclassification using a regressor. The blue line shows the minimal noise experimentally determined for misclassification of a given sample and the red line shows the predicted noise value


6 Discussion

The experiments have shown that it is possible to insert randomly generated noise into DCN filters and thereby modify the performance of the DCN. In particular, one can observe that, for instance for GTSRB, the average accuracy remains preserved when the smallest-magnitude noise (0.05) is inserted into up to 1% of the DCN filters. This implies that the DCN can be successfully attacked without the noise injection being detected (Fig. 3c). For the other datasets, MNIST and IMDB (gender and age classification), the results are not so clear-cut. For instance, the injection of noise at increasing magnitude also results in increasing accuracy when only a very small number of neurons are affected in the gender classification task (Fig. 3a). Interestingly, for the age and MNIST classification the accuracy even increases with added noise (Fig. 3b and d). Therefore the proposed approach can be considered a viable attack, although the noise and its insertion should be more structured in order to determine exact criteria for noise injection.

We can also look in detail at the prediction of the noise amount for a targeted misclassification. While the predictions for gender are more accurate than for age, this result is also related to the fact that the age classification is based on clustered data: age was grouped into categories, which might be too rough an estimate. The important observation is that in general, for a single-image misclassification, the experimental noise magnitude is very low, ≈0.005 for gender classification and ≈0.003 for age classification. This observation, correlated with the results from Fig. 3, indicates that in general a very small amount of noise is necessary to obtain a targeted image misclassification, and it can be hidden in the neural network classification noise. For instance, inserting noise at a magnitude of 0.1 into less than 0.5% of the neurons could result in the desired effect while the accuracy would be hardly affected, i.e. indistinguishable from the original accuracy.

7 Conclusion

In this paper we evaluated the effect of normal Gaussian noise inserted into a CNN with the purpose of obtaining controlled misclassification. We showed that, while not with absolute certainty, inserting Gaussian noise on specifically selected neurons can be used to change the classification results, preserve a relatively high original network accuracy, and even predict how much noise one would need to insert into the network to have a specific image misclassified. As future work we plan to expand this approach into a deeper study of which filters can be individually targeted for Trojan insertion.

Acknowledgements This paper is partially supported by the Program of Targeted Funding "Economy of the Future" #0054/PCF-HC-19 and by the Erasmus+ CBHE reg. no. 598003-EPP-1-2018-1-SK-EPPKA2-CBHE-JP.


References
1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS, vol. 27 (2014). https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). http://www.deeplearningbook.org
3. Carlini, N., Wagner, D.: Towards Evaluating the Robustness of Neural Networks (2016). arXiv:1608.04644
4. Brown, T.B., Mane, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial Patch (2017)
5. Liu, Y., Mondal, A., Chakraborty, A., Zuzak, M., Jacobsen, N., Xing, D., Srivastava, A.: A survey on neural trojans. In: ISQED, pp. 33–39 (2020). https://doi.org/10.1109/ISQED48828.2020.9137011
6. Liu, Y., Ma, S., Aafer, Y., Lee, W., Zhai, J., Wang, W., Zhang, X.: Trojaning attack on neural networks. In: NDSS (2018). http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/2018/02/ndss2018_03A-5_Liu_paper.pdf
7. Costales, R., Mao, C., Norwitz, R., Kim, B., Yang, J.: Live trojan attacks on deep neural networks (2020). arXiv:2004.11370
8. Tang, R., Du, M., Liu, N., Yang, F., Hu, X.: An embarrassingly simple approach for trojan attack in deep neural networks. CoRR (2020). arXiv:2006.08131
9. Abdiyeva, K., Lukac, M., Ahuja, N.: Remove to improve? In: Pattern Recognition. ICPR International Workshops and Challenges, pp. 146–161. Springer, Cham (2021)
10. Ghorbani, A., Abid, A., Zou, J.: Interpretation of neural networks is fragile. Proc. AAAI Conf. Artif. Intell. 33(01), 3681–3688 (2019)
11. Bau, D., Liu, S., Wang, T., Zhu, J.-Y., Torralba, A.: Rewriting a deep generative model. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision, ECCV 2020, pp. 351–369. Springer, Cham (2020)
12. Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., Andreas, J.: Natural language descriptions of deep visual features. In: International Conference on Learning Representations (2022). arXiv:2201.11114
13. Zhang, J., Wu, W., Huang, J.-T., Huang, Y., Wang, W., Su, Y., Lyu, M.R.: Improving adversarial transferability via neuron attribution-based attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14993–15002 (2022)
14. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and Harnessing Adversarial Examples (2014)
15. Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
16. Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The German traffic sign recognition benchmark: a multi-class classification competition. In: IEEE International Joint Conference on Neural Networks, pp. 1453–1460 (2011)
17. Wang, F., Chen, L., Li, C., Huang, S., Chen, Y., Qian, C., Loy, C.C.: The devil of face recognition is in the noise (2018). arXiv:1807.11649

Markov Model of PLC Availability Considering Cyber-Attacks in Industrial IoT
Maryna Kolisnyk, Axel Jantsch, Tanja Zseby, and Vyacheslav Kharchenko

Abstract Programmable Logic Controllers (PLCs) are important subsystems in Industrial Internet of Things (IIoT) systems. Recently, sophisticated and targeted cyber-attacks against owners and operators of industrial control systems in IIoT systems have become more frequent. In this work we present a Markov model for assessing the dependability of PLCs in Industrial Internet of Things (IIoT) systems and its main availability indicator—the stationary coefficient of availability (AC). We evaluate the AC under consideration of cyber-attacks, with a focus on DoS attacks. The model provides a basis for assessing how the dependability of a PLC is affected by cyber-attacks. We describe the states and the potential state transitions of the PLC in an IIoT environment with a Markov model. We simulated different use cases based on the model and show with exemplary parameter settings how the probabilities of the states of the PLC's subsystems can be calculated and how the AC can be derived from the model. The results of the simulation are used to analyze the influence of cyber-attack rates on the PLC availability, and it is shown how DoS-attacks impact the PLC's dependability.
Keywords Dependability · Availability · Cyber-attacks · Programmable-logic controller (PLC)

M. Kolisnyk (B) · V. Kharchenko Department of Computer Systems, Networks and Cyber-Security, National Aerospace University “KhAI”, Kharkiv, Ukraine e-mail: [email protected] V. Kharchenko e-mail: [email protected] A. Jantsch · T. Zseby TU Wien, Vienna, Austria e-mail: [email protected] T. Zseby e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_5


1 Introduction

The development of information transmission technologies and telecommunication networks has led to the creation of new systems such as the Industrial Internet of Things (IIoT). An implementation of the IIoT is possible using wired (optical or copper cable) or wireless technologies. The reliability and dependability (including cyber-security aspects) of the entire IIoT system as a whole depend on the functioning of its subsystems. In some cases the safety and life of maintenance personnel also depend on the dependability of the system components. Some of the most important components of an IIoT system are large modular programmable logic controllers (PLCs). They collect information about the state of the system by polling sensors, and control the operation of various mechanisms and subsystems in the IIoT network. Sophisticated and targeted cyber-attacks against owners and operators of industrial control systems in IIoT systems have become more frequent. A PLC is a complex hardware (HW) and software (SW) system that connects to the Internet using wireless or wired data transmission technologies. All connected devices in a computer network can be attacked by hackers. Therefore, PLCs can be affected by all types of cyber-attacks known from conventional computer networks.

IIoT structures usually include a variety of components: sensors (light, motion, humidity, smoke, door environments, etc.), cameras, robots, computer-controlled machines, IoT gateways, firewalls, PLCs, switches, routers, and special software (OS, special SW for managing SCADA and PLC systems). An analysis of the vulnerabilities of modern large modular PLCs [1] showed that almost all of their structural components (both HW and SW) and communication protocols contain vulnerabilities to cyber-attacks such as DoS-attacks (Denial-of-Service attacks). Such attacks can lead to a failure of individual components included in the PLC as well as of the entire PLC as a whole. A failure of the PLC leads to a partial failure of the IIoT system. Therefore, it is advisable to assess the dependability of the PLC under the influence of cyber-attacks such as DoS-attacks. In order to assess how the dependability of a PLC is affected by the impact of DoS-attacks, a Markov model of the PLC availability is proposed, taking into account the impact of DoS-attacks on its subsystems.

The question of cyber-attacks on industrial systems has been addressed in many scientific papers. The authors of paper [1] present a redundant control system of a PLC and obtain reliability criteria from a Markov model. Paper [2] proposes reliability semi-Markov modeling for a PLC and its application in industry. Paper [3] presents a novel approach to the control design of PLC-based Networked Control Systems (PLC-based NCS) using formal discrete modeling and simulation for modeling and control of NCS, using Colored Petri Nets (CPNs), Hidden Markov Models (DMM, HMM) and a new concept of Mutual Markov modeling. In paper [4] mathematical and dynamic simulation methods for reliability evaluation are described: Dynamic Fault Tree Analysis (DFTA), Markov Chains and Reliability Block Diagrams (RBD), Failure Mode and Effect Analysis (FMEA)


and Monte Carlo simulation of common industry-based control system architectures such as PLCs. Paper [5] presents a two-step approach for modeling forward and backward network delays in NCS, using a colored Petri net (CPN) structural model for the simulation of Ethernet-based networked control systems and a Markov chain delay model (FSMD). A reliability analysis of the safety of HW components and of the embedded SW of critical PLC systems is presented in [6], using a novel probabilistic hybrid relation model (HRM) in the form of a Bayesian network (BN) that captures the execution logic of the embedded SW. In [7] the authors develop a method for assessing the reliability of a SCADA system based on analytical and stochastic modeling with the support of cloud resources, taking into account the influence of various negative factors; the method can be used to justify the general functional requirements for a SCADA system. The article [8] presents the results of an analysis of the reliability and security of SCADA-based industrial systems using F(I)MEA (failure (intrusion) mode and effects analysis) to determine the weakest links in the system and to choose means to prevent failures, detect failures and ensure fault tolerance. Paper [9] considers the features of cyber-terrorism, vulnerabilities in SCADA network systems and the concept of cyber resilience to combat cyber-attacks. A method to detect different malicious intrusions and cyber-attacks based on automatic learning is proposed in [10]. A design in which all data that the PLC sends over the network is encrypted, using machine learning, was proposed in [11]. An analysis of different types of PLC architectures, SRCS (Safety Related Control Systems) and the difference between a PLC and a cyber-secure PLC is presented in [12]. A method against cyber-attacks on PLCs, based on an external hash authentication device embedding asymmetric encryption and a remote security operation and maintenance method, is proposed in [13]. The classical theory of reliability and the main indicators we are using are presented in [14]. Table 1 summarizes the results and mathematical methods presented in the related work.

The novelty of our research is the assessment of PLC dependability based on a Markov model for the availability of a large modular PLC considering cyber-attacks on its components. For this we use the method of system analysis of the PLC's HW and SW subsystems. The goal of our research is to develop and investigate an evaluation model for PLC dependability considering cyber-attacks on its subsystems. The main contribution of the research is the development of a Markov model for PLC dependability including failure states of the HW and SW subsystems of the PLC and the possible impact of cyber-attacks (DoS-attacks). The cyber-attacks can lead to a failure of one of the PLC's subsystems and of the whole PLC. The proposed model can be used by manufacturers and service teams, using real statistical data, to obtain the availability of the PLC.


Table 1 Analysis of related work

| Paper | Mathematical approach | New method, model, or analysis | Experiments, calculations, or simulation |
| 1 | Markov model | Redundant control system of PLC | Simulation of Markov model and reliability criteria assessment |
| 2 | Semi-Markov model | Reliability semi-Markov modeling for PLC | Simulation of semi-Markov model |
| 3 | Colored Petri Nets (CPNs), Hidden Markov Models (DMM, HMM), Mutual Markov modeling | Formal discrete modeling and simulation, a new concept of Mutual Markov modeling | Mutual Markov model and formal discrete model simulation |
| 4 | Dynamic Fault Tree Analysis (DFTA), Markov Chains and Reliability Block Diagrams (RBD), Failure Mode and Effect Analysis (FMEA), Monte Carlo | Mathematical and dynamic simulation methods for reliability evaluation: DFTA, Markov Chains and RBD, FMEA, Monte Carlo simulation of PLCs | Monte Carlo simulation of PLCs |
| 5 | Colored Petri net (CPN), Markov chain delay model (FSMD) | Two-step approach for modeling forward and backward network delays in NCS for the simulation of Ethernet-based networked control systems, Markov chain delay model (FSMD) | Simulation of FSMD and CPN |
| 6 | Analysis and Bayesian network | Reliability analysis of safety of HW components and the embedded SW of critical PLC systems using a novel probabilistic hybrid relation model (HRM) | Simulation of HRM |
| 7 | Analytical and stochastic modeling | Method of detection of intrusions and for assessing SCADA system reliability | Modeling with the use of cloud resources considering negative factors |
| 8 | F(I)MEA | Analysis of the reliability and cyber-security in SCADA systems | Choice of prevention and detection means of failures, and ensuring fault tolerance |

(continued)


Table 1 (continued)

| Paper | Mathematical approach | New method, model, or analysis | Experiments, calculations, or simulation |
| 9 | Analysis | Analysis of the features of cyber-terrorism, vulnerabilities in SCADA network systems and the concept of cyber resilience to combat cyber-attacks | Comparison by theory, no experiments |
| 10 | Automatic learning | A method to detect different malicious intrusions and cyber-attacks | Experiments of intrusion detection using automatic learning |
| 11 | Machine learning | A project of a PLC with open program code, which has a modification for encryption of all data sent to the network | Experiment of an intrusion detection system with encryption of data, based on machine learning algorithms |
| 12 | Method of structure analysis and designing | Analysis of PLC architecture types, SRCS, differences of PLC and safety PLC | Experiments of using the safe-SADT method for the design of distributed control systems |
| 13 | Embedding asymmetric encryption | A method of dealing with mode-switching attacks by an external hash authentication device embedding asymmetric encryption; the protection method can break the communication channel between the attacking computer and the PLC to effectively protect against malicious cyber-attacks | Presented the theory, no experiments |

2 PLC Architecture

We consider the components of a PLC architecture using the model of a large modular PLC (Fig. 1) based on [15]. The large modular PLC has a complex HW and SW architecture: middleware, I/O components (e.g., PROFINET controller, PROFINET device, Axioline F master (local bus)), service components (OPC UA server, Proficloud gateway, Web management, PC Worx Engineer), HMI (HTML5 Web Visualization), OS Access, DCP,


Fig. 1 Structure of the large modular PLC, presented in [15]

SFTP, VPN, SSH, NTP, Trace Controller, system components (System Manager and PLC Manager, two ESMs (Execution and Synchronization Manager, where each processor uses its own ESM), User Manager, eCLR), firewall and communication protocols (MQTT, XMPP, COAP, HTTP, HTTPS, AMQP), and GDS (Global Data Space) [15]. Each of the listed components can be subject to a cyber-attack.

The PLC control unit includes two processors with two cores, two ESMs, RAM and ROM, and an interface system. The large modular PLC has a firewall that is based on the internal Linux mechanisms. The firewall settings have to be taken into account, because if the firewall is configured incorrectly or is defective, the PLC is at risk.


3 Evaluation of the Dependability of the PLC Considering DoS-Attacks on Its Components

Availability is an important property of PLC dependability. According to the standard ISO 8402, this property of a PLC can be obtained with different mathematical models [14, 16]. The main steps to assess the PLC's dependability are presented in Fig. 2.

Fig. 2 Main steps for the large modular PLC's dependability evaluation

We first select the appropriate model based on the assumptions (see Fig. 2). As discussed in the related work, Petri nets, Markov or semi-Markov models can be used for assessing the dependability. The first step of the research is therefore choosing a mathematical model for the evaluation of the dependability of the PLC. In our research we select the mathematical apparatus of Markov models. Then, for establishing the Markov model and calculating the AC, we use the following steps:
(a) defining the states based on the components from the architecture;
(b) setting the transition rates for failures, recoveries, and attacks;
(c) calculating the probabilities of each state by simulating the transitions for infinite time steps;


(d) calculating the AC based on all states that do not correspond to a failure of the PLC system.

The PLC can operate as a standalone system or with an Internet connection. In our case we assume that the PLC is connected to the Internet and therefore can be a target of cyber-attacks. In our model we consider that a cyber-attack (DoS-attack) can begin in the case of a firewall failure. However, the model can be adjusted and extended to other attack types. The subsystems inside the PLC are assumed to be static subsystems (they are located in stationary places and work in stationary mode). All transition rates remain the same at all time steps. Communication protocols, as well as the SW components which are necessary for the operation of the PLC, are also considered static subsystems that work in a mode of constant availability [15]. Therefore, to assess the availability of the PLC, we can use the stationary coefficient of availability (AC). The AC depends only on the values of the transition rates from one state to another, and does not depend on time. The proposed Markov model of PLC availability considering DoS-attacks allows determining the most critical state (the PLC subsystem most vulnerable to a DoS-attack) for a given set of input data in the case of cyber-attacks. The structure of the PLC includes redundancy of the processor and two synchronization systems (ESMs). The time of the transition to the secondary ESM in the event of a failure of one of them is considered minimal.

The assumptions for the developed model are the following [1, 16]:
– the monitoring and diagnostic tools work correctly and can determine the technical state of the PLC with a high degree of authenticity;
– the flow of failures of the PLC HW and its constituent components is the simplest flow (it has the properties of ordinariness, stationarity and absence of aftereffect), is random, and obeys the Poisson distribution law;
– the flow of PLC SW failures (both of the OS and of the special control and monitoring programs) is the simplest flow (it has the properties of ordinariness, stationarity and absence of aftereffect), is random, and obeys the Poisson distribution law. Failures caused by SW design faults of PLC subsystems obey the Poisson distribution; according to the results of monitoring, diagnostics and testing, a secondary error was fixed (as a result of the accumulation of the consequences of primary errors and defects, SW backdoors). We assume that no patch is applied for SW malfunctions, failures or vulnerabilities. The number of DoS and DDoS attacks, as well as the number of primary defects in the SW, is constant;
– the process occurring in the system is a process without aftereffect: at every time, the future behavior of the system depends only on the state of the system at that time and does not depend on how the system arrived at that state. Therefore, the failure flows of the PLC subsystems have the Markov property.


The AC was selected as the indicator of PLC dependability [14, 17]. To calculate the AC value, we compose a system of Kolmogorov–Chapman linear differential equations and solve it under the normalization condition. The PLC is not a dynamic system, so the process of controller operation can be described by a Markov model with discrete states and continuous time. We synthesized the PLC availability model taking into account cyber-attacks (DoS-attacks). In our example the rates for the cyber-attacks are derived from statistical data [14, 16, 17] and from databases of vulnerabilities, taking into account their severity and the frequency of attacks on the vulnerabilities of PLC components. We here show a simulation with example values for a firewall failure and a DoS-attack. The model and the values can be adjusted to other attack types or other attack rates if more accurate values are available. Using the results of the model simulation, the influence of cyber-attack rates on PLC availability can be analyzed, and recommendations to minimize the risk of failures caused by cyber-attacks can be suggested.

The graph of the Markov model of PLC availability in the case of a firewall failure, considering DoS cyber-attacks on its subsystems, is presented in Fig. 3. The states of the Markov model represent the states of the PLC subsystems, together with the transition rates from one state to another. We define one state as the fully functioning PLC and then, for each of the components, we define one failure state that is entered if the component fails due to a classical failure or due to an attack. The failure rates (transition rates from one state of the model to another) shown in the graph are denoted as λi,j and can be derived
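For reference, the Kolmogorov–Chapman system mentioned above has the following general form for this model; writing q_{i,j} for the rate of the transition from state i to state j (a failure rate λ, a recovery rate μ or an attack rate α), this is the standard formulation of such a system rather than a formula reproduced from the chapter itself:

```latex
\frac{dP_i(t)}{dt} \;=\; \sum_{j \neq i} q_{j,i}\,P_j(t) \;-\; P_i(t)\sum_{j \neq i} q_{i,j},
\qquad i = 0,\dots,23, \qquad \sum_{i=0}^{23} P_i(t) = 1 .
```

The stationary probabilities P_i used for the AC follow by setting dP_i(t)/dt = 0 and solving the resulting linear system together with the normalization condition.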

Fig. 3 Markov model of PLC availability, considering DoS cyber-attacks on its subsystems


from statistical data from the literature [14, 16]. The recovery rates after a failure are denoted as μi,j and are obtained with time-schedule tools for recovery maintenance after the failure, as presented in [17]. The authors of this paper have an article accepted at the IEEE conference DESSERT 2022 [18], which describes the Markov model of the PLC's availability without the impact of cyber-attacks and without state number 23 (firewall failure). Apart from introducing state 23, the difference to the paper [18] is that here we also consider the impact of DoS-attacks on the PLC's subsystems and therefore add attack rates to the model. The initial data of the transition rates λi,j, μi,j were presented in [18]. The attack rates are denoted as αi,j and are derived from the number of attacks expected per time interval; they can be obtained experimentally for a particular PLC (as the intensity of requests to the PLC's subsystems), or taken from average statistics in the literature. Orange transition lines describe the possible transitions due to DoS-attacks on components of the PLC. The system availability and the system state probabilities can each be calculated from the other, so the key procedure in the probabilistic reliability assessment of such a system is an efficient calculation of either of the two. The structure of the PLC includes redundancy of the processor and of the synchronization system (ESM). The considered PLC has a second ESM, and the time of the transition to the reserve ESM in the event of a failure of one of the ESMs is considered minimal.

For the Markov model of PLC availability considering cyber-attacks (Fig. 3) we define the following states:
0—correct functioning of the PLC;
1—the failure state of Execution and Synchronization Manager (ESM) 1;
2—the failure state of the restart button;
3—the failure state of ESM 2;
4—the failure state of the Random Access Memory (RAM)—DDR3 SDRAM;
5—the failure state of the Read-Only Memory (ROM)—NVRAM and Flash memory;
6—the failure state of the SD-card Flash memory;
7—the failure state of Processor Core 1;
8—the failure state of Processor Core 2;
9—the failure state of Processor 1 and Processor 2;
10—the failure state of the PROFINET controller;
11—the failure state of the Ethernet adapters;
12—the failure state of the AXIOLINE F master;
13—the failure state of the Global Data Space (GDS) module;
14—the failure state of the OPC-UA server;
15—the failure state of the PROFICLOUD.io SW;
16—the failure state of the Operating System (OS) Linux;
17—the failure state of the PLC special SW PC Worx Engineer;
18—the failure state of the Web server;
19—the failure state of the Internet User component;
20—the failure state of the Wi-Fi adapter;


21—the partial failure state of the PLC;
22—the failure state of the PLC;
23—the failure state of the firewall.

In the model we do not consider a power system failure, because we assume that such a system has a secondary power supply and will be working in all cases. The rates of DoS-attacks are denoted as αi,j: α23,4—attack rate on the RAM; α23,7—attack rate on processor core 1; α23,8—attack rate on processor core 2; α23,16—attack rate on the OS Linux; α23,17—attack rate on the special SW; α23,18—attack rate on the Web server; α23,19—attack rate on the Internet User component.

We use the stationary availability coefficient (AC) as the main indicator of the PLC's availability and its dependability. For the AC we need to take as input all states in which the PLC does not fail. Of the states above we of course need to include state 0 (fully functioning). In addition, we include states 1, 3, 7 and 8, because in the case of a failure of one ESM or one processor core the secondary ESM or processor core takes over. We also include state 2, because a failure of the restart button has no implications for the functioning of the PLC system. Furthermore, the system state probabilities can also be expressed in terms of availability. To calculate the AC we sum up the probabilities of all states that refer to a functioning PLC:

    AC = P0 + P1 + P2 + P3 + P7 + P8,

with the initial condition P0(0) = 1 and the normalization condition

    Σ_{i=0}^{23} Pi = 1.
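A minimal computational sketch of steps (b)–(d) and of the AC formula above, assuming the transition rates have been collected into a generator matrix; the toy state set, rate values and helper names are illustrative placeholders and do not reproduce the 24-state model or the rates of Tables 2–4.

```python
import numpy as np

def stationary_distribution(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a continuous-time Markov chain.

    Q is the generator matrix: Q[i, j] (i != j) is the transition rate from
    state i to state j, and each diagonal entry is minus the row sum.
    """
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])    # transposed balance equations + normalization
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def availability_coefficient(pi, operational_states=(0, 1, 2, 3, 7, 8)):
    """AC = sum of the stationary probabilities of the operational states."""
    return float(pi[list(operational_states)].sum())

# toy 3-state example (0 = working, 1 = degraded, 2 = failed), rates in 1/h
lam01, lam12, mu10, mu20 = 1e-4, 1e-3, 0.33, 0.042
Q = np.array([[-lam01,            lam01,  0.0  ],
              [ mu10, -(mu10 + lam12),    lam12],
              [ mu20,              0.0,  -mu20 ]])
pi = stationary_distribution(Q)
print(availability_coefficient(pi, operational_states=(0, 1)))
```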

4 Simulation of the Markov Model of PLC Availability

Based on the model, we now want to investigate the effects of different failure or attack rates on the AC. For this, we change individual rates and show how the AC evolves. For the model simulation we use average transition rates from statistical data [15, 16]. The initial data can be changed if more concrete values are available, e.g. if statistical data is collected in a smart factory. The initial data assumed for our simulation are presented in Tables 2, 3 and 4 and are the same as used in [18]. The results of the simulation of the Markov model of PLC availability considering cyber-attacks are presented in Figs. 4, 5, 6 and 7. We show some examples of how the AC depends on specific failure rates and attack rates. In all plots the transition rate λ (1/h) or the attack rate α is depicted on the x-axis, and the AC on the y-axis. In Figs. 4, 5 and 7 we see how increasing a failure rate decreases the AC. Note that here the whole system, including the attack rates, is considered when calculating the AC. Changing the attack rates also affects the AC of the system: at first the AC of the subsystem under attack increases, but then it decreases.
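The rate sweeps behind Figs. 4–7 can be reproduced with a loop of the following shape, reusing the stationary_distribution and availability_coefficient helpers sketched above; build_generator, the reduced state set and the swept range are assumptions standing in for the full 24-state model with the rates of Tables 2–4.

```python
import numpy as np
# uses stationary_distribution / availability_coefficient from the previous sketch

def build_generator(rates, n_states):
    """Assemble an n_states x n_states generator matrix from a dict of
    transition rates {(i, j): rate in 1/h}; the diagonal is set to minus
    the off-diagonal row sums so that every row sums to zero."""
    Q = np.zeros((n_states, n_states))
    for (i, j), rate in rates.items():
        Q[i, j] = rate
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# sweep one failure rate in a reduced 3-state stand-in for the full model
# (0 = working, 1 = partial failure, 2 = failure of the PLC)
base = {(0, 2): 1e-6, (1, 2): 1e-5, (1, 0): 0.33, (2, 0): 0.33}
for lam01 in np.linspace(0.0, 3e-4, 7):
    rates = dict(base)
    rates[(0, 1)] = lam01
    pi = stationary_distribution(build_generator(rates, n_states=3))
    print(lam01, availability_coefficient(pi, operational_states=(0, 1)))
```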

Table 2 Initial data of transition rates for the model simulation (here we use the same values as in [18]). The table lists the rates λ0,1–λ0,20, λ0,23, λ1,7, λ1,21, λ2,21, λ3,8, λ3,21, λ4,22, λ5,22, λ6,21, λ8,9, λ10,21, λ11,21, λ12,21, λ13,21, λ14,21, λ15,22, λ16,22, λ17,22, λ18,22, λ19,22, λ20,21 and λ21,22, with values between 1 and 100 in units of 10^-7 1/h (the last column group, containing λ17,22–λ21,22 and λ0,23, in units of 10^-6 1/h); in addition, λ9,22 = 0.1 1/h and λ23,21 = 0.01 1/h.


Table 3 Initial data of cyber-attack rates for the model simulation [15]

| αi,j, 1/h | α23,4 | α23,7 | α23,8 | α23,10 | α23,11 | α23,16 | α23,17 | α23,18 | α23,19 |
| × 10^7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Table 4 Initial data of recovery rates for the model simulation (here we use the same values as in [18])

| μi,j | h^(−1) | μi,j | h^(−1) | μi,j | h^(−1) | μi,j | h^(−1) |
| μ1,0 | 0.33 | μ7,0 | 0.33 | μ14,0 | 0.33 | μ20,0 | 0.33 |
| μ2,0 | 0.33 | μ8,0 | 0.33 | μ15,0 | 0.33 | μ21,0 | 0.33 |
| μ3,0 | 0.33 | μ10,0 | 0.33 | μ16,0 | 0.33 | μ22,0 | 0.33 |
| μ4,0 | 0.33 | μ11,0 | 0.33 | μ17,0 | 0.33 | μ23,0 | 0.042 |
| μ5,0 | 0.33 | μ12,0 | 0.33 | μ18,0 | 0.33 | | |
| μ6,0 | 0.33 | μ13,0 | 0.33 | μ19,0 | 0.33 | | |

Fig. 4 Graphical dependence of the AC on λ0,1 based on the model with DoS-attacks

Fig. 5 Graphical dependence of the AC on λ0,4 based on the model with DoS-attacks


Fig. 6 Graphical dependence of the AC on the attack rate α23,4

Fig. 7 Graphical dependence of the AC on λ0,16 based on the model with DoS-attacks

Injecting cyber-attacks on the PLC changes the failure rate of the subsystem under the impact of the cyber-attack, and thereby the AC value. The analysis of the graphical dependence of the AC on changes of α23,4 (attack rate on the RAM) showed that the AC drops sharply from 0.9996 to 0.99938 at α23,4 = 900 1/h (Fig. 6). The graphical dependence of the AC on the rate λ0,4 (failure of the RAM) under the impact of attacks shows that the AC changes from 0.999968 to 0.99913 as the rate increases from 0 to 0.00003 1/h; with a smaller value of λ the AC is larger, and the PLC is therefore more dependable. With an increase of λ0,1 (failure of ESM 1) in the range 0…0.0003 1/h, the AC changes from 0.999 to 0.99905. When λ0,16 (failure of the OS Linux, Fig. 7) is increased to 0.0014 1/h, the AC changes from 0.9997 to 0.9955.

For modeling the real processes in a PLC system, and for the simulation of the proposed model, the provisional data can be replaced by real measured data. When statistical data are collected, processed and analyzed, for example from log files of a PLC installed in an enterprise, these statistical data can be used as input for the simulation of this model. The results of the simulation show how the PLC dependability parameter—the AC—changes under the influence of cyber-attacks of different intensity and when the values of the transition rates in the model are changed.


In this way, the injection of cyber-attacks affects the value of the AC and the dependability of the PLC as a whole. It is necessary to defend the main components of the PLC against cyber-attacks at the various stages of the life cycle. The results of obtaining the state probabilities for the PLC are presented in Table 5. We can see that cyber-attacks (in our case DoS-attacks) increase the failure rate of the RAM subsystem of the PLC by 100 times, and noticeably decrease the probability of the PLC's correct-functioning state.

5 Conclusion

Large modular PLCs are often connected not only to enterprise networks, but also to the global Internet. Therefore, they can be attacked with the attack mechanisms used in conventional computer networks. In this paper a Markov model of PLC availability is developed that considers the impact of cyber-attacks, failures of the main PLC subsystems, and their recovery after a failure. Our research shows that the impact of cyber-attacks (for example, DoS-attacks) on the dependability of the PLC is significant. The developed model can be used to obtain the probabilities of the failure states of the PLC's subsystems and its AC value. This can be useful for service teams and manufacturers of PLCs.

To ensure dependability, it is necessary to implement measures aimed at maintaining the operable technical condition of the PLC. The technical condition is affected by HW and SW failures, which can be observed when analyzing the results of the Markov model. Failures of the ROM and RAM and of the processor affect the value of the PLC availability coefficient (AC) the most. The technical condition of the PLC can also change under the influence of cyber-attacks. Our research considered the intensity of DoS-attacks, which lead to malfunctions of both the HW and the SW subsystems of the PLC. The conducted study of the Markov model of PLC availability shows that its AC is affected by DoS-attacks, reducing the value of the AC by 100 times.

The recommendations of organizations that provide guidance for the protection of PLCs against cyber-attacks (CERT, NIST, CISA, CSA, DOE, NSA, FBI, Phoenix Contact Cyber-Security Team, CISCO PSIRT, ICS-CERT, Siemens, Mitsubishi, Iconics) can overlap; the most complete option is provided by NIST. When giving recommendations, we suggest taking into account the structural scheme of the PLC and which vulnerabilities of its components could be successfully attacked by DoS-attacks, which can lead to the failure of the PLC's main components. This allows not only assessing the dependability of the PLC, but also providing correct recommendations regarding protection against cyber-attacks, their prevention and their elimination. It is necessary to configure security policies not only for the entire device, but also to take measures to protect the PLC against attacks on the vulnerabilities of each major component of the PLC. All the well-known security measures, like updating security patches and using strong passwords, should be applied.

Table 5 The values of the PLC's subsystems states probabilities

| P | Without attack | With attack |
| P0 | 0.99996 | 0.9939 |
| P1, × 10^-7 | 3.01 | 3.03 |
| P2, × 10^-6 | 6.02 | 6.06 |
| P3, × 10^-8 | 3.01 | 3.03 |
| P4, × 10^-7 | 3.01 | 3.03 |
| P5, × 10^-8 | 3.01 | 3.03 |
| P6, × 10^-7 | 3.01 | 3.03 |
| P7, × 10^-7 | 3.01 | 3.03 |
| P8, × 10^-7 | 3.01 | 3.03 |
| P9, × 10^-7 | 1.20 | 1.21 |
| P10, × 10^-7 | 6.02 | 6.06 |
| P11, × 10^-6 | 1.50 | 1.51 |
| P12, × 10^-7 | 3.01 | 3.03 |
| P13, × 10^-7 | 3.01 | 3.03 |
| P14, × 10^-8 | 3.01 | 3.03 |
| P15, × 10^-7 | 3.01 | 3.03 |
| P16, × 10^-7 | 3.01 | 3.03 |
| P17, × 10^-7 | 3.01 | 3.03 |
| P18, × 10^-7 | 3.01 | 3.03 |
| P19, × 10^-6 | 3.01 | 3.03 |
| P20, × 10^-6 | 3.01 | 3.03 |
| P21 | 5.94 × 10^-10 | 3.03 × 10^-2 |
| P22, × 10^-8 | 1.82 | 834 |


According to the results of the simulation of the Markov model of PLC availability, it is necessary to protect the large modular PLC, as an important subsystem of an IIoT system, against cyber-attacks (especially DoS-attacks). We can see that the most vulnerable PLC subsystems are the RAM, the ROM, the processor, and the communication protocols. It is necessary to use methods of protection against and prevention of cyber-attacks for these subsystems throughout the lifecycle and operation of PLCs. Vulnerabilities are present in all main components of the PLC, so they can be attacked by hackers. Therefore, it is necessary to provide cyber-security measures for all PLC components. Cyber-attacks can decrease the availability of the PLC by 100 times (for the chosen initial data) and hence its dependability (AC). A properly configured firewall of the PLC, access policies, and authorization, authentication and accounting services will prevent possible attacks on the large modular PLC and therefore reduce the risk of successful cyber-attacks. It is also necessary to train the personnel in the basics of PLC cyber-security.

Acknowledgements This research was supported by the Austrian Academy of Sciences' Joint Excellence in Science and Humanities (JESH) for Ukrainian scientists under the grant "Models and method for assessing the dependability of IIoT subsystems" (2022).

References 1. Kolisnyk, M., Kharchenko, V., Piskachova, I., Bardis, N.: Reliability and security issues for IoT-based smart business center: architecture and Markov model. The World Conference IEEE: MCSI, Greece, Chania, pp. 313–318 (2016) 2. Syed, R., Ramachandran, K.: Reliability modeling strategy of an industrial system. Proceedings First International Conference on Availability, Reliability and Security, ARES2006, pp. 625– 630 (2006). https://doi.org/10.1109/ARES.2006.107 3. Kumar, S., Gaur, N., Kumar, A.: Developing a secure cyber ecosystem for SCADA architecture. 2018 Second International Conference on Computing Methodologies and Communication (ICCMC), pp. 559–562 (2018). https://doi.org/10.1109/ICCMC.2018.8487713 4. Kabiri, P., Chavoshi, M.: Destructive attacks detection and response system for physical devices in cyber-physical systems. 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), pp. 1–6 (2019). https://doi.org/10.1109/CyberSecPODS. 2019.8884999 5. Lin, C.-T., Wu, S.-L., Lee, M.-L.: Cyber-attack and defense on industry control systems. 2017 IEEE Conference on Dependable and Secure Computing, pp. 524–526 (2017). https://doi.org/ 10.1109/DESEC.2017.8073874 6. Malchow, J.-O., Marzin, D., Klick, J., Kovacs, R., Roth, V.: PLC guard: a practical defense against attacks on cyber-physical systems. 2015 IEEE Conference on Communications and Network Security (CNS), pp. 326–334 (2015). https://doi.org/10.1109/CNS.2015.7346843 7. Ivanchenko, O., Kharchenko, V., Moroz, B., Ponochovny, Y. and Degtyareva, L.: Availability assessment of a cloud server system: comparing Markov and Semi-Markov models. 2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), pp. 1–6, https://doi.org/10.1109/IDAACS 53288.2021.9661052(2021)

78

M. Kolisnyk et al.

8. Babeshko, E., Kharchenko, V., Gorbenko, A.: Applying F(I)MEA-technique for SCADA-based industrial control systems dependability assessment and ensuring. 2008 Third International Conference on Dependability of Computer Systems DepCoS-RELCOMEX, pp. 309–315, https://doi.org/10.1109/DepCoS-RELCOMEX.2008.23 (2008) 9. Hassan, M., Gumaei, A., Huda, S., Almogren, A.: Increasing the trustworthiness in the industrial IoT networks through a reliable cyberattack detection model. IEEE Trans. Indus. Inf. 16(9), 6154–6162 (2020). https://doi.org/10.1109/TII.2020.2970074 10. Kharchenko, V., Ponochovnyi, Y., Ivanchenko, O., Fesenko, H., Illiashenko, O.: Combining Markov and Semi-Markov modelling for assessing availability and cybersecurity of cloud and IoT systems. Cryptography 6(3), 44 p. (2022) https://doi.org/10.3390/cryptography6030044 11. Alves, T., Das R., Morris, T.: Embedding encryption and machine learning intrusion prevention systems on programmable logic controllers. IEEE Embedded Syst. Lett., 1 (2018). https://doi. org/10.1109/LES.2018.2823906 12. Cuninka, P., Závacký, P., Strémy, M.: Influence of architecture on reliability and safety of the SRCS with safety PLC. 2015 Second International Conference on Mathematics and Computers in Sciences and in Industry (MCSI), pp. 225–230 (2015). https://doi.org/10.1109/MCSI.201 5.38. (2015) 13. Gao, J. et al.: An effective defense method based on hash authentication against mode-switching attack of Ormon PLC. 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), pp. 976–979 (2022). https://doi.org/10.1109/ICSP54964.2022.9778843 14. Gnedenko, B., Pavlov, I., Ushakov I.: Statistical Reliability Engineering. Wiley, Technology & Engineering, 528 p (1999) 15. PHOENIX CONTACT PLC Next. Manual, 245 p. (2019). https://www.phoenixcontact.com/ en-in/products/controller-axc-f-2152-2404267 16. Kolisnyk, M., Kharchenko, V.: Investigation of the smart business center for IoT Systems availability considering attacks on the router. Dependable IoT for Human and Industry. Modeling, Architecting, Implementation. Kharchenko, V., Ah Lian Kor, Rucinski, A. (edits). River Publishers Series in Information Science and Technology. 622 p. (2018) 17. Kolisnyk, M.: R&D report, development of methods for determining the scope of work on the operation of telecommunications equipment, Part 1, Part 2. Printing. UkrDAZT, reg. No 0111U007919. 94 p. Kolisnyk M.O., Prikhodko S.I., Lysechko V.P., Zhuchenko O.S., Volkov O.S., Shtompel M.A. (2012) 18. Kolisnyk, M., Jantsch, A., Piskachova, I.: Markov model for availability assessment of PLC in Industrial IoT considering subsystems failures. 12th IEEE International Conference on Dependable Systems, Services and Technologies, DESSERT’2022, 9–11 December, Athens, Greece, pp. 1–4 (2022)

Advanced Networking and Cybersecurity Approaches Andriy Luntovskyy

Abstract The presented work is devoted to the examination of secure networking and advanced cybersecurity approaches as follows: foundations studies, segmenting of structured networks, new generations of networked firewalls with intrusion detection systems, intrusion prevention systems, and collaborative intrusion detection networks, honeypotting for vulnerability researching, blockchain technology vs. public-key-infrastructure, peculiarities for IoT. The work contains an actual survey of the approaches mentioned above and some illustrating examples. Keywords Firewalls · Intrusions and attacks · IDS · IPS · CIDN · Blockchain · Honeypotting · Segmenting · Hackers · Insiders

1 Motivation Up-to-date combined networks (LAN, robotics, IoT) and cyber-systems are becoming the victims of hacker attacks more and more frequently. Compared to the conventional ones, advanced networking, and cybersecurity approaches must be used to counteract this. The given work represents the impact of advanced cybersecurity in modern combined networks (LAN, Robotics, IoT), which provide peer-2-peer (P2P) and machine-to-machine (M2M) communication styles. These communication styles are growing in the networks nowadays, with further service decentralization and load balancing between clouds and IoT devices [1–5]. Section 1 includes motivation for this work. The remainder of this work has the following structure:

A. Luntovskyy (B) BA Dresden University of Coop. Education, Saxon State Study Academy Dresden, Dresden, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_6

79

80

A. Luntovskyy

• Section 2: State-of-the-Art: firewalls, blockchain (BC), collaborative intrusion detection networks (CIDN), • Section 3: Network planning with segmenting and conventional security approaches, • Section 4: Foundations for advanced cybersecurity like OWASP [6] and MITRE [7–9], SIEM [10], • Section 5: Honeypotting for Advanced Security with collecting of knowledge about dangerous events [7–9, 11], • Section 6: Conclusions. The work contains the following necessary abbreviations and acronyms: DR— data rate, FW—firewall, PF—packet filter, CR—circuit relay, AG—application gateway, WAF—web application firewall, IDS—intrusion detection system, IPS— intrusion prevention system, SIF—stateful inspection firewall, NG-FW—new generation firewall, CIDN—collaborative intrusion detection networks, DMZ—demilitarized zone, HP—honeypotting, VM—virtual machine, BC—blockchain, P2P— peer-2-peer, M2M—machine-to-machine, SC—smart contracting, SIEM—Security Information and Event Management, OWASP—Open Web Application Security Project, PKI—Public Key Infrastructure.

2 State-Of-The-Art Cybersecurity means based on [1, 2] protection of a data and telecommunication system from theft or damage to the hardware, the software, and the information on them, as well as from disruption or misdirection of the services that have to be provided.

2.1 Firewall Techniques A classification of firewall generations based on Check Point [3] is given below (Table 1). The functionality of FW is growing each decade, including IPS, IDS, Antibot, CIDN, and other concepts. The traffic between the interacting network nodes can be encrypted and authenticated via cryptographic protocols like VPN, IPsec, TLS/SSL, and HTTPS.

Advanced Networking and Cybersecurity Approaches

81

Table 1 Firewall generations by check point [3] Generation

Gen I

Gen II

Gen III

Gen IV

Gen V

Period

1990

2000

2005

2010

2022–2023

Risks and threats









CIDN









Honeypotting







Anti-phishing

Antibot





IPS

WAF

Anti-ransomware



NG-FW

IDS

Sandbox, quarantine

*

PF, CR, AG

SIF

DMZ

*

*

Antivirus

Anti-malware

*

*

*

Legend:

DR—data rate, FW—firewall, PF—packet filter, CR—circuit relay, AG—application gateway, WAF—web application firewall, IDS—intrusion detection system, IPS—intrusion prevention system, SIF—stateful inspection firewall, NG-FW—new generation firewall, CIDN—collaborative intrusion detection networks, DMZ—demilitarized zone, HP—honeypotting, VM—virtual machine, BC—blockchain, P2P—peer-2-peer, M2M—machine-to-machine, SC—smart contracting, SIEM—Security Information and Event Management, OWASP—Open Web Application Security Project, PKI—Public Key Infrastructure

2.2 Blockchain Techniques Nowadays, the use of BC (blockchain) instead of PKI (Public Key Infrastructure) provides more security in modern combined networks (LAN, Robotics, IoT). The following criteria are considered [2]: • No more 1:1 like in web-on-trust communication (it means conventional communication model like one-to-one, or Alice-to-Bob), • No more only 1:1 communication with the use of trusted 3rd-parties (CA with PKI and TLS deployment), • Dominating n:n (or P2P, or peer-to-peer) communication scenarios by offering multilateral compulsoriness for the parties (n × n), • Growing M2M (machine-to-machine) communication style with significantly asymmetric links; upload functionality is implemented with uplink DR, which is essentially more than downlink one’s for plain commands only, • Further service decentralization of data processing between clouds and IoT/ Robotics with choreography as a composition method before orchestration [1, 2]. Blockchain (BC) is, per definition, a decentralized network that consists of interconnected nodes in which data and information are redundantly stored. Behind every node, a unique participant of the network is usually situated. Transactions between participants are always directly carried out, like in a P2P network; an intermediary, e.g., Certification Authority with PKI for 1:1 communication (Alice-Bob), is no more

82

A. Luntovskyy

necessary [1, 2]. The concept is a further development of the conventional RSA/PKI method. Each block contains a collection of processed transactions and descriptive metadata (headers). As soon as a block #n has reached its maximum number of transactions, all its content is used to generate a hash code (based on a not-reverse function). This hash is then used in the metainformation of the next block #(n + 1). It leads to chaining for all blocks within a specific crypto-network (Fig. 1). These blocks with the contained transactions are secured and compulsory for executing due to calculated signatures for the elaborated hash codes. The workflow is fully transparent and better controlled; there is no way to roll back. All activities are binding and understandable for the participants (peers). With a blockchain, different kinds of security goals can be guaranteed [1, 2]: • Authentication: Signature and encryption can be used, as mentioned before, • Access Control: In a private blockchain, granular access policies can be defined for different participants, • Confidentiality: Data is only accessible from within the network, • Integrity: Transactions that are stored in a block cannot be deleted and are nearly temper-proof, • Transparency: Every participant has access to the complete blockchain.

Fig. 1 A common structure of a blockchain and BC-based smart contracting

Advanced Networking and Cybersecurity Approaches

83

The following few prominent examples of blockchain techniques can be mentioned herewith: Bitcoin, Ethereum, Hyperledger Fabric, Multichain Quorum [1, 2]. Via BC, the basis for so-called Smart Contracting (SC) is provided (refer to Fig. 1). A smart contract, based on BC, offers the following advantages in comparison to conventional contracting [1, 2]: • • • • •

Decentralized processing of the agreements, Mapping of the contracts as executable source code, Compulsoriness and trustworthiness through transparency, “Open execution” instead of just “open source”, Legal security without an intermediary (jurist).

2.3 CIDN Deployment The widespread modern intrusion detection systems (IDS) evaluate and prohibit the potential hackers’ attacks that are directed against a computer system or a network. IDS increase data security significantly in the opportunity to the classical firewalls, which lonely deployment is not satisfying. Intrusion prevention systems (IPS) are the enhanced IDS which provide the additional functionality aimed in discovering and avoiding of the potential attacks [1–3]. Nevertheless, as a rule, the classical IDS/IPS are operated standalone or autonomously. They are not able to detect advanced hackers’ threats which become year by year more sophisticated and complex. Those dangerous threats can serve to disorder the operation of data centers and robotic clusters round-the-clock in 24/7 mode. Therefore, the cooperation and collaboration of the IDS’ in a network are of significant meaning [1–3]. A collaborative intrusion detection network (CIDN) is an optimized concept for so-called collaborative IDS/IPS’. They operate no more autonomously. This network contains IDS as the nodes (peers) and is intended to optimize the disadvantages of standalone defense solutions confronting to unknown dangerous attacks (Fig. 2). A CIDN allows the participating IDS nodes as the collaborating (cooperating) network peers to share the detected knowledge, experiences and best practices oriented against the hackers’ threats [1, 2, 4]. “A Collaborative Intrusion Detection Network (CIDN) consists of a set of cooperating IDS’s which use collective knowledge and experience to achieve improved intrusion detection accuracy (Carol Fung [4]). The nodes (refer to Fig. 2) can be marked by color intensity: good reputation peers (Alice, Bob, Charlie, etc.), compromised peers (Buddy), and malicious peers (Trudy, Mallory). The executed attacks A1-A8 as well as A9 as hybrid on the given peers in the collaborative IDS can be sorted into classic (conventional) attacks, which can be fixed via the standalone solutions (e.g., Man-in-the-Middle) as well as into advanced

84

A. Luntovskyy

Fig. 2 CIDN and advanced attacks (based on [4])

insider attacks, which can be only prevented via a CIDN (e.g., Betrayal Attack) from compromised or, even, malicious peers [1–4]. The main requirements for the construction of a CIDN and the support of such functionality are as follows: efficient communication at short up to a middle distance, the robustness of the peers (IDS) and links, scalability, and mutual compatibility of individual participating peers (single IDS). The typical interoperable appropriate networks are as follows: LAN, 4-5G mobile, Wi-Fi, BT, and NFC [1, 2]. Collaborative intrusion detection networks (CIDN) can consist [4] of multiple IDS solutions under the use of multiple “intelligent things”, robots, gadgets, PC, end radio devices, installed firewalls as well of groups of users, which are divided into clusters—peers (titled as users Alice, Bob, Charlie, Dave, etc.). The coupling between the groups is loose or tight. However, the reputation of the users (peers) is quietly different (cp. Fig. 2): good, compromised (refer Buddy), malicious (refer Trudy and Mallory). Additionally, insider attacks on CIDNs are possible (e.g., by Emmy with a temporary “good” reputation). The CIDN can efficiently prevent such multiple attacks A1-9 (cp. Fig. 2) by providing peer-to-peer cooperation. This type of networking improves the overall accuracy of the threat assessment. As it has been shown, the cooperation among the participating single peers (IDScollaborators) became more efficient within a CIDN.

Advanced Networking and Cybersecurity Approaches

85

Table 2 CIDN functionality [4] Certain CIDN examples

Detection and prevention of the advanced attacks A1-A9

Topology type

Focus

Further risks

Indra

+

Distributed

Local

SPAM

Domino

+

Decentralized

Global

Worms

Abdias

+

Centralized

Hybrid

Trojans

Crim

+

Centralized

Hybrid

Social Engineering, WAF

Unfortunately, the CIDN can become a target of attacks and malicious software sometimes. Some malicious insiders within the CIDN may compromise the interoperability and efficiency of the intrusion detection networks internally (refer to Table 2). Therefore, the following tasks must be solved: selection of peers (collaborators), resource and trust management, and collaborative decision-making. As prominent examples for HP the next can be cited: Indra, Domino, Abdias, Crim [1–4].

3 Network Planning with Segmenting Within a Campus LAN The next topic of securing is accurate network planning with internal structure: segmenting in the campus LAN/ intranet. Segmenting brings many advantages compared to an unsegmented network (refer to the next §).

3.1 Unsegmented Networks An example of an unsegmented network is given in Fig. 3. Herewith, a flat network is considered. NAS (network-attached storage), virtualized servers, clients, and printers are coupled behind a firewall with rigorous filtering rules [1, 2, 5]. A flat network operates as a so-called broadcast domain with easy management. The authorization proceeded via Access Control Lists (ACL) or Capabilities usually. The system enables controlled routing (Layer 3) with the deployment of subnetting (IP subnets) and a firewall. In spite of easy management, the solution has a lot of disadvantages: security and robustness against multiple intrusion risks. Additionally, a flat network is less scalable [1–5].

86

A. Luntovskyy

Fig. 3 Unsegmented network

3.2 Segmenting Best Practices As an optimized solution VLAN (Virtual LAN) use is recommended. VLAN segmenting means fast broadcast structure separation (OSI Layer 2) and dynamic routing (OSI Layer 3). As a disadvantage must be mentioned: redundancy and inessential performance reduction. As the leading techniques and protocols for LAN segmenting, the following can be mentioned (Fig. 4): • VRF (Virtual Routing Forwarding), • MPLS with OSPF for the best route (Spanning Tree), • BGP (Border Gateway Protocol).

Fig. 4 Segmenting with external and internal firewalls

Advanced Networking and Cybersecurity Approaches

87

Segmenting with external and internal firewalls is depicted herewith (refer to Fig. 4). As a rule, between both external and internal firewalls, a zone is situated, which provides free access to public-offered services like Web, Mail, DS, Cloud, etc. Such zoning is called DMZ. The following advantages are provided: • • • •

Deployment of VLAN per IP or per MAC addressing Deployment of DMZ (demilitarized zone) with public-offered services Class formation with Subnetting for IPv4 or CIDR alternatively Migration to IPv6 with multiple pros in performance, QoS, and security, but very laborious.

3.3 Conventional Cybersecurity Approaches Together with structuring via segmenting, further best practices for conventional security can be provided. These include [1, 2, 5] the following conventional approaches (refer to the previous sections, Fig. 4 and Table 1): 1. Stateful firewalls/NG-FW with functionality like: • • • • • • • •

Combined Filtering Rules Policies, Security zones, Identity awareness, User and group reference, Loose coupling, Heuristics, Antibot, Access Lists Violation Recognition.

2. Intrusion Detection Systems (IDS) with the functionality like: • • • • •

2-stage firewall strategy (refer to Fig. 4), DNS Traps, DMZ: Zoning for publicly available services, App and URL filters, Access and Application control.

3. Intrusion Prevention Systems (IPS) with functionality like: • • • •

Examination by samples from KDB (knowledge databases), KDB for vulnerabilities, Logging, Automatic containment.

4. Anti-Malware with functionality like: • • • •

Permission or rejection, Antispam, anti-phishing, anti-ransomware, Quarantine providing, WAF and outsourcing of web security [1–3].

88

A. Luntovskyy

Fig. 5 Advanced security: foundations and best practices [3–12]

The above-mentioned approaches offer only conventional cybersecurity. In contrast, blockchain techniques with SC, new firewall generations (see Table 1), and CIDN enable so-called advanced security (see Sect. 2) and are usually offered in combination with the conventional approaches. In addition, honeypotting (HP) targets up-to-date network applications and can be recommended for such advanced security too. Per definition, HP operates [6–10] with fake good reachable instances (like hosts, servers, VM, software, databases) which act as decoys to distract potential intruders and insider enemies. HP techniques are based on the following advanced security foundations (see Sect. 4). These organizations provide wide support in the elaboration of efficient honeypotting solutions. Slightly finalizing temporary results, the following demarcation between so-called advanced security to the conventional security concepts can be given in Fig. 5.

4 Foundations for Advanced Cybersecurity 4.1 Open Web Application Security Project OWASP is a non-profit organization in the USA (founded in 2001) and aims to improve the security of applications and services on the WWW. By creating transparency, end-users, and organizations should be able to make well-founded decisions

Advanced Networking and Cybersecurity Approaches

89

about real security risks in software. OWASP Top 10 is the ten most critical security risk categories in web applications. Nowadays, OWASP Top 10/ 2021 must be considered [6].

4.2 MITRE Corporation The MITRE Corporation is an organization for the operation of research institutes on behalf of the USA, which was created by splitting off from MIT (Massachusetts, USA). Nowadays, MITRE is a research institute in Bedford, MA, USA, founded in 1958, with ca. 8000 researchers [7]. MITRE provides public access to the so-called MITRE ATT&CK, a globallyaccessible knowledge database of adversary tactics and techniques based on realworld observations (since 1958). The ATT&CK KDB is used as a foundation for the development of specific threat models and methodologies in the private sector, government, and cybersecurity products and service communities. ATT&CK is open and available to any person or organization for use at no charge [8]. With the creation of ATT&CK, MITRE is fulfilling its mission to solve problems for safer networking by bringing communities and enterprises together for the development of more effective cybersecurity. The following use cases can be considered below [7–9]: Example 1 (Enterprise Matrix). Enterprise Matrix from MITRE ATT&CK [9] contains the tactics and techniques for the following platforms: Windows, macOS, Linux, PRE, Azure AD, Office 365, Google Workspace, SaaS, IaaS, Network, Containers; refer to the URL: https://attack.mitre.org/matrices/enterprise/. Example 2 (ICS Matrix) ICS Matrix from MITRE ATT&CK [9] contains the tactics and techniques for ICS, i.e., Information and Communication Systems; refer to the URL: https://attack.mitre.org/matrices/ics/.

4.3 SIEM Market Security Information and Event Management (SIEM) is a combined concept and, simultaneously, a multi-component software product for real-time analysis of security alarms from both sources: applications and networking [10]. SIEM provides the cybersecurity of an enterprise or organization as a centrally installed instance or as a cloud service.

90

A. Luntovskyy

According to Gartner Consulting [12], SIEM vendors (e.g., IBM Tivoli SIEM, Splunk) are defined as supporting a wide spectrum of use cases, including threat detection, compliance, real-time telemetry, event analysis, and incident investigation.

5 Honeypotting for Advanced Security Based on MITRE and OWASP [6–9], honeypotting techniques can be deployed as an efficient addition to the discussed firewalls, IDS/ IPS, CIDN, PKI with TLS, as well as BC with SC. Honeypotting is oriented to [6–9, 11, 13–15]: • Intruder distraction by apparent goals (decoys), • Collection and analysis of multi-dimensional information about attacks and attackers are available, • Insider detection, • Vulnerability monitoring, • Functionality outsourcing on demand. The taxonomies of HP techniques provided by Zach Martin [11] are given in Table 3. Some important categories like a communication model, objectives, interaction intensity, augmentation, and outsourcing potentials are considered. An example to distinguish between so-called Server-Side HP and Client-Side HP, as well as passive or active modus operandi by the deployed communication model, is given in Fig. 6. The main principles for both HP are declared in the following sections. Table 3 Taxonomies of honeypotting Category

Honeypotting

C-S communication model

1. Server-side HP 2. Client-side HP

Objective

3. Production HP 4. Research HP

Interaction intensity

5. High-interaction HP 6. Low-interaction HP

Augmented

7. Honeypots with DMZ and insider 8. HP with CIDN

Outsourcing

9. Deception HP with outsourcing 10. Honeypots for malware researchers

Advanced Networking and Cybersecurity Approaches

91

Fig. 6 Honeypotting: server-side HP versus client-side HP

5.1 Honeypotting with Gateways and Firewalls Honeypotting cooperates widely with networking gateways and the discussed firewall techniques [6–9, 11], both under the use of the knowledge of OWASP, MITRE ATT&CK foundations as a rule. Furthermore, its functionality lies in acquiring and analyzing multi-dimensional information about attacks and attackers (hackers, intruders, insiders), as well as gaining in-depth knowledge of which procedures and tools attackers are currently using (Fig. 7). A honeypot with a PF provides a mostly simple cooperation type. LAN and Wi-Fi deployment is relevant for the use of gateways with data flows based on “communication gateways-2-routers”, “gateways-2-complex devices” connectivity, e.g., PC, surveillance cameras, tablets and notebooks, multi-function printer (MFP), NAS, Smart TVs, smart radios, etc. The Wi-Fi IEEE 802.11

Fig. 7 Honeypotting with PF

92

A. Luntovskyy

ac, ax mesh networks use as a rule 3D-roaming protocol IEEE 802.11 s. The presented building with multiple floors provides a “smart” Wi-Fi mesh network with 3D-roaming for the listed networking devices [1, 2, 16, 17]. Example 3 (IoT and Smart Home) The following standards can be mostly used for the mobile IoT boards and Smart Home in the intranet for small and simple devices connectivity: ZigBee (e.g., for Alexa Gateways for LED-lamps, subwoofers, mini-cameras, digital cinemas, etc.) and its multiple clones like, e.g., ConBee II, Bluetooth (BT) as well as established on the market Z-Wave, EnOcean, etc. [1, 2, 16, 17]. A gateway (GW) supports the following protocols: Z-Wave, Wi-Fi, ZigBee, and Bluetooth. The gateways should be compatible with many other vendors. The voice control can be provided via Amazon Alexa or Google Home. A Philips Hue Bridge enables simple control by voice via Alexa too. The Smart Home, based on further protocols, i.e., BT, controls HVAC (Heating, Ventilation, Air-Conditioning) as well as lighting, is shown in Fig. 8. The HP designers frequently face the following practical problems [6–9, 11, 13– 15]:

Fig. 8 Honeypotting with an IoT gateway for smart home deployment [16, 17]

Advanced Networking and Cybersecurity Approaches

93

• System interoperability, • Confusion and obtrusiveness of different user hardware and software interfaces, • Lack of data security and privacy due to proprietary software or external data storage effects based on the clouds (theft of data, outflow of data to companies), • Requirement of IoT connections for reliable control.

5.2 Honeypotting and Vulnerability Monitoring Below, the HP functionality and vulnerability monitoring is described in more detail. Up-to-date combined networks (LAN, Robotics, IoT) and cyber-systems are becoming the victims of hacker attacks more and more frequently. To counteract this, honeypots are used as a measure to supplement NG-FW, IDS, IPS as well as CIDN [16, 17]. As it has been mentioned above, honeypot [6–9, 11, 13–15] represents a physical host, VM, IoT device or software that is intended to distract an attacker (intruder, insider enemy) from the actual target (offering real services) into a fake area that would otherwise not have interested him, i.e., in the form of a decoy. In the case of an attack, there is a high probability that the HP will be attacked first since it is not protected by the firewalls like productivity servers in an internal structured network (refer to Figs. 6 and 7). During an attack, the monitoring application sends out alerts, records the hacker activities, and, e.g., blocks their IP address (Fig. 9), as well as collects further data in the form of a KDB to share it with the communities and foundations and immunize the system in the future.

Fig. 9 Honeypotting with monitoring tools

94

A. Luntovskyy

Fig. 10 Honeypotting in a scenario with DMZ and insiders

HP can be set up in the following manner. An OS without updates and with default settings can be installed. This physical host or VM must be equipped only with data that can be deleted or destroyed without risk to the internal combined network (LAN, intranet, Robotics, IoT). Furthermore, in addition to the HP an application has to be installed that records the attacker’s activities. Depending on the objective, honeypots can be divided into production and research honeypots (refer to Table 3). The simple Production HP has limited information collection and is primarily used for company security; the advanced Research HP is more complex and is used in research, administrative and military institutions (refer to Figs. 9 and 10). Depending on interaction intensity, High-interaction HP and Low-interaction HP are differentiated. The advanced forms are presented in Table 3: • Augmenting HP with DMZ andCIDN faced to insiders are considered, • Outsourcing HP can lead to so-called Deception HP or Honeypots for Malware Researchers [6–9, 11].

5.3 Production Honeypots As a rule, the Production HP can be divided into (refer to Fig. 6): 1. Server-Side Honeypotting: • The Server-Side honeypot simulates a server application (a web server). The attackers are attracted to this isolated area. As soon as an intrusion is attempted, the honeypot records the activities, then reports about a critical event, and then

Advanced Networking and Cybersecurity Approaches

95

initiates specific countermeasures. The knowledge is collected to a KDB to secure the computer system more effectively. 2. Client-Side Honeypotting: • The Client-Side honeypot is aimed at the imitation of application software like a web browser that accesses unsecured/ dangerous websites. The processed attacks on browsers are logged. The knowledge is collected to a KDB to improve the software and reduce security risks (refer to Fig. 10).

5.4 Research Honeypots So-called Research Honeypots collect and evaluate multi-dimensional data about the registered critical events and enter it into the KDBs. The organizations like OWASP, MITRE [6–9, 11], the honeypot network “Project Honey Pot”, etc., use them to collect data (e.g., dangerous IP addresses) aimed at increasing knowledge of the procedures and tools which attackers are currently using. The research findings are published to share with the general public or communities. In spite of evident advantages, there are some cons. The HP pros and cons are presented in Table 4. Table 4 HP pros and cons Advantages

Disadvantages

1. Collection, acquiring, logging, and analysis 1. Security systems can be overloaded by high of data data traffic, which leads to the loss of data 2. Additional control functionality for the IDS packets; then, if a honeypot is not found, it (intrusion detection system), which the misses sense attacks register can, and the firewalls 2. The next question is: whether the attack by 3. HP can be used to monitor the effectiveness the intruders will follow the secured of the security systems and its independent services directly or maybe not or only quality assurance indirectly. In this case, HP looks into the 4. With proper monitoring of honeypots, there void are no false reports 3. The danger must be considered that a break-in on a honeypot will be successful, and this will be used for further attacks against the network’s internal infrastructure 4. There are, unfortunately, mainly no legal regulations; the use of honeypots could constitute a criminal offense depending on the application too (qualified as an aid in burglary)

96

A. Luntovskyy

5.5 Practical Honeypotting As prominent examples of HP the following can be cited: Dionaea Venus Flytrap, Honeyd, Project Honey Pot [11, 13–15]. The following further examples are considered below. Example 4 (Honeypotting for Deutsche Telekom) The Company Deutsche Telekom (headquartered in Bonn, North Rhine-Westphalia) set up around 3,000 honeypot traps in April 2022. These have already recorded 30–45 million attacks per day at their peak [13]. “Everyone and everything is networked and needs cyber security. But nobody can do this alone. We need the army of the good guys. To this aim, we share our knowledge to immunize society against cyber-attacks.“ (by Dirk Ofen, Head of Telekom Security) [13]. Example 5 (HoneySAP) SAP is a leading software concern in Germany (headquartered in Waldorf, Baden-Wuerttemberg) and also provides software monitoring for HP for acquiring and collecting knowledge about networking vulnerabilities for better immunization of modern combined and structured networks. HoneySAP is a lowinteraction research-focused HP which develops specific SAP services. The HP is aimed at learning the techniques and motivations based on the already registered attacks against SAP software [14]. The Generation of Honeypot Data is a USA Patent Application [15]. The collected fake data is used to initialize an HP, and then the intrusion attempts data can be monitored and logged. Related systems and techniques are described. Example 6 (Deception Technology) This example distinguishes from the above-mentioned both by the deployment of a trendy advanced “deception technology” [11], which is maintained by an outsourcing partner company. Deception technology deploys a complex system of decoy servers and VMs (a holistic honeypot system) within a target enterprise network and provides all of the capabilities of analysis and insight for the target partner enterprise [11].

6 Conclusion The given work represents the impact of advanced cybersecurity in modern combined networks (LAN, Robotics, IoT), which provide Peer-2-Peer and Machineto-Machine communication styles:

Advanced Networking and Cybersecurity Approaches

97

• Secure network planning must include at least segmenting and firewall defense with DMZ. • Advanced firewall techniques are used, too, like IDS, IPS, and CIDN, for securing against multiple sophisticated intruders and insider attacks. • Blockchain and SC based on BC provide the compulsoriness in decentralized communication scenarios like P2P and M2M. • Advanced security approaches are widely supported via international foundations like OWASP, MITRE ATT&CK, SIEM, and widespread defense outsourcing (socalled “deception technology”). • Honeypots provide the decoys for detracting multiple intruders and insiders from real attack targets. The diversity of HP with collecting knowledge about dangerous events plays a steadily growing role in secure networking. Acknowledgements The author’s great acknowledgments belong to the colleagues from the BA Dresden and the University of Žilina both, especially to A. Haensel, F. Schweitzer, E. Zaitseva, M. Kvassay, as well as C. van Gulijk, T. Zobjack, E. Herrmann, O. Graetsch for useful support, inspiration, and challenges by fulfilling this work.

References 1. Luntovskyy, A., Guetter, D.: Highly-Distributed Systems: IoT, Robotics, Mobile Apps, Energy Efficiency, Security, Springer Nature Switzerland, Cham, monograph, March 2022, ISBN: 978–3–030–92828–5, 1st ed. 2022, XXXII, 321 pages, 189 color figures (Foreword: A. Schill) 2. Luntovskyy, A., Spillner, J.: Architectural Transformations in Network Services and Distributed Systems: Current technologies, standards and research results in advanced (mobile) networks, Springer Vieweg Wiesbaden, 2017, 344 p., ISBN: 9783658148409. https://www.springer.com/ gp/book/9783658148409#otherversion=9783658148423 3. Check Point (online): https://www.checkpoint.com/. 4. Fung, C., Boutaba, R.: Intrusion Detection Networks: A Key to Collaborative Security (ISBN: 978–1466564121), 2013, 261 p. 5. Natzschka, J.: Netzwerkplannung und Segmentierung im Campus LAN, IBH Messe, 10.5.2022, IBH IT Services Dresden (in German, online): https://www.ibh.de/ 6. OWASP Top 10 (online): https://owasp.org/ 7. MITRE (online): https://www.mitre.org/ 8. MITRE ATT&CK Knowledge Base (online): https://attack.mitre.org/ 9. MITRE ATT&CK Matrices (online): https://attack.mitre.org/matrices/ 10. Miller, D.R., Harris, S. et al.: Security Information and Event Management (SIEM) Implementation, McGraw Hill Professional, 2010, 496 p. (ISBN 978-0-071-70108-2) 11. Martin, Z.: A Honeypot Guide: Why Researchers Use Honeypots for Malware Analysis, 29 Aug. 2018 (online): https://www.intego.com/mac-security-blog/a-honeypot-guide-why-resear chers-use-honeypots-for-malware-analysis/ 12. Gartner Consulting (online): https://www.gartner.com/ 13. Deutsche Telekom Security (online): https://github.security.telekom.com/honeypot.html/ 14. Honeypotting for SAP (online): https://honeysap.readthedocs.io/ 15. Honeypotting for SAP (online): https://www.freepatentsonline.com/y2020/0186567.html/ 16. Luntovskyy, A., Beshley, M., Klymash, M. (Eds.): Future Intent-Based Networking: on the QoS Robust and Energy Efficient Heterogenous Software-Defined Networks, by Springer

98

A. Luntovskyy

LNEE 831, 2022, 28 chapters, monograph, XXI + 530 pages, Springer International Publishing (ISBN: 978-3-030-92433-1) 17. Luntovskyy, A., Beshley, M., Melnyk, I., Klymash, M., Schill, A. (Eds). Emerging Networking in the Digital Transformation Age: Approaches, Protocols, Platforms, Best Practices, and Energy Efficiency, by Springer LNEE 965, 2023, 37 chapters, monograph, XXXVI + 670 pages, Springer Nature Cham (ISBN: 978-3-031-24962-4)

Use Cases for Reliability Engineering and Computational Intelligence

Application of Machine Learning Techniques to Solve the Problem of Skin Diseases Diagnosis Eduard Kinshakov and Yuliia Parfenenko

Abstract Solving the problem of remote diagnosis of diseases, including the use of telemedicine, is an urgent task. When diagnosing dermatological diseases, the input information is an image of a skin area with a certain skin lesion. Currently, machine learning is widely used in medicine, and, as a rule, machine learning methods solve the task of recognition and classification depending on the subject area. The problem of the research is the selection of the optimal algorithm or, as it is also called, the filter, which will increase the quality of the image so that the neural network can clearly understand the disease area and recognize it. In this study, Sobel methods, the method of principal components, and brightness normalization are used to improve image quality. After each processing, the data is fed to a convolutional neural network based on the TensorFlow framework. The developed neural network is used for skin diseases classification. Keywords Machine learning algorithm · Neural network · Image processing · Sobel filter · TensorFlow · Image classification

1 Introduction Today, there is a large number of diseases that have negative consequences for the body and the future health of a person. In particular, skin diseases can spread quickly in unfavorable epidemiological conditions, so they require rapid diagnosis. A large number of settlements in Ukraine have remote ambulance stations or different types of dispensaries, which can provide at least minimal medical care, or do not have qualified medical workers of various profiles at all [1]. As of the end of 2022, since the start of a full-scale war due to stress and adverse living conditions, residents of E. Kinshakov (B) · Y. Parfenenko Faculty of Electronics and Information Technologies, Sumy State University, Sumy, Ukraine e-mail: [email protected] Y. Parfenenko e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_7

101

102

E. Kinshakov and Y. Parfenenko

the central and eastern regions have a 20% increase in skin diseases, as reported by the dermatovenereological service of Ukraine. Every third person consults a doctor too late after the appearance of the first symptoms, bringing the disease to a critical state and putting others at risk in case of skin manifestations of an infectious disease. At the same time, up to 80% of the population of all ages have a smartphone, and the number of users is increasing daily, moreover, a large number of people use the Internet and have the opportunity to communicate with a doctor online. In many countries of the world, telemedicine is actively developing and is a relevant research area that allows remote consultation with a doctor and providing first aid emergencies or planning further treatment [2, 3]. In Ukraine, this direction is only being developed and is available only in private clinics, which is quite expensive. Therefore, it is necessary to develop an information system that will help people to quickly react to various skin manifestations, thus determine the degree of severity in advance and not bring it to a critical state, which will reduce the percentage of severe cases of skin diseases [4]. Such a system will make it possible to determine a skin disease from a photo of a part of the body and will help in the initial diagnosis by a family physician in the case when access to a dermatologist is difficult. This study is devoted to the application of machine learning methods for diagnosing skin diseases. It was decided to develop the classification model based on neural networks implemented in information system, which will classify and recognize diseases through a smartphone, using the Telegram messenger, which can work with low Internet traffic. To apply machine learning methods in the diagnosis of skin diseases, it is necessary to have a large number of images with various skin diseases, which will serve as an input data set for training a neural network. In addition to using data sets available on the Internet, it is necessary to add dermatoscope images of skin diseases to the data set based on cooperation with doctors. The effectiveness of using machine learning methods for diagnosing skin diseases depends on the selected type of neural network and the adjusted parameters that allow for high classification accuracy. The accuracy of diagnosis is also affected by the quality of the input image, which in turn affects the accuracy of neural network classification. Therefore, this paper considers methods of image quality improvement, for example, brightness normalization.

2 Theoretical Background A large number of research papers are devoted to the problems of the technical side of developing software for diagnosing diseases using neural networks and increasing the accuracy of diagnosis. Most authors use convolutional neural networks because it is this approach that allows performing the task of recognition or classification. It should be emphasized that the authors are starting from the problems of their countries and from the peculiarities of diagnosing those diseases that prevail in their

Application of Machine Learning Techniques to Solve the Problem …

103

regions. Usually, they have different diagnostic accuracy results because objectively they train neural networks on data sets of different quality and size. The authors [4] used CAD system technologies to implement the diagnosis model. For the most part, the authors focus their attention on image segmentation, as well as data processing. Developers use a small amount of data for neural network training, and most importantly, photos of low quality. In addition, research results have shown that modern CNN models can outperform models generated by previous studies with proper data preprocessing, self-supervised learning, in terms of prediction accuracy. In addition, through accurate segmentation, knowledge of disease localization was obtained, which is useful for preprocessing the data used in classification, as it allows the CNN model to focus on the region of the image with the disease. In the paper [5] the research problem is focused on the detection of skin diseases in Saudi Arabia, where the weather is very hot and there is a large number of deaths from skin diseases. In this study a detection method using a pre-trained convolutional neural network (AlexNet) and SVM was developed. In this work the author has developed a model for predicting skin diseases using deep learning algorithms. Ensemble functions and deep learning have been found to achieve higher levels of accuracy and predict far more diseases than any other model. According to the results of the experiments, other models made in this field of application were able to classify a maximum of six skin diseases with a maximum level of accuracy of 50%. Using a deep learning algorithm, up to 20 diseases can be predicted with a higher level of accuracy [6]. This proves that deep learning algorithms have enormous potential in the real diagnosis of skin diseases. By using machine learning techniques with high-performance hardware on a very large data set, the classification accuracy can be greatly improved, and the diagnosis model can be used for clinical experiments. In the paper [7] the authors proposed the automatic diagnosis of five common skin diseases using a smartphone based on a deep learning technique using the clinical images and clinical information about the patient. The results show that the developed system provides diagnosis of five skin diseases with high accuracy. The developed diagnostic system can potentially be used as a decision support system by dermatologists, general practitioners, rural therapists, and patients in the diagnosis of skin diseases. This system has only five classes, and the images for the neural network training are very high quality, which automatically simplifies the process of the neural network learning [8]. A description of machine learning methods and the use of deep learning for the detection of skin diseases are presented in [9]. To solve the problem of diagnosis, three different algorithms of machine learning and deep learning networks were used for comparison. It has been observed that both classification approaches can be used to detect skin diseases with high accuracy. The main principles of image classification with the help of the TensorFlow module and SVM algorithm and disease prediction every time an image is uploaded to the testing system were described in [10].

104

E. Kinshakov and Y. Parfenenko

In [11] the authors proposed an information system capable of detecting skin diseases by combining computer vision and machine learning techniques. The developed application can be used on computers with low system characteristics. It also has a simple user interface. Methods for improving image quality and building machine learning models for diagnosing skin diseases were implemented in [12]. The researchers emphasize in their work that the machine learning data set was small, but the system was able to identify diseases with minimal error. Image processing and deep learning algorithms have been successfully implemented. Diseases for which the information system was developed to diagnose are nevus and ringworm. Sufficient testing was done using many test images. Summarizing the analysis of the previous study in the field of the skin disease diagnosing, it can be concluded that there are enough studies in the direction of recognizing skin diseases, and currently, the research continues. But after getting acquainted with the research results, it should be noted that the authors focused on one specific area of work, the holistic approach to the development of the diagnostic information system is not holistic. Some of the works use outdated methods of classification, which are currently not relevant and may have a large error. The vast majority of works use high-quality images while testing research results on lowerquality images that can be obtained from a smartphone has not been conducted. In addition, the considered information systems are trained to diagnose several diseases and do not foresee scaling in case of supplementing the input data set with new diseases. Another improvement would be to increase the number of images in the dataset to better train the prediction model, as the efficiency of the deep learning algorithm increases with larger datasets. Therefore, machine learning methods for diagnosing skin diseases and preprocessing skin images to improve their quality need further improvement [13].

3 Research Methods 3.1 Input The input data were collected from the DermNet dermatology resource [14]. This is one of the largest resources in the world, which allows you to get pictures of the disease, description of symptoms and treatment, as well as discussions of doctors on the method of treatment of this or that disease. Currently, images in *.jpg format are used to diagnose skin diseases. Other formats can be used in the future, but the format will not affect the training or image analysis in any way. A data set containing 22 classes (diseases) with different number of images was formed. The images have both infectious and non-infectious disease types, which increases the area of diagnosis.

Application of Machine Learning Techniques to Solve the Problem …

105

As a result, the developed model will be used in an intelligent system, which will primarily solve the problem of recognition, as well as the problem of classification of skin diseases. That is, the developed machine learning model should recognize the disease and assign it to the class of a certain disease, as well as to the subclass of infectiousness of the disease. For neural network training images of sufficiently good quality from the input data set are used, but in real conditions during diagnosis, the quality of the images may be much lower. Therefore, first of all, it is necessary to pre-process the image in order to increase its quality. After that, the neural network is trained on the improved data. Image preprocessing is used so that the neural network can recognize the disease with high accuracy. Methods for improving image quality are presented below.

3.2 Pre-processing with Sobel The Sobel algorithm was chosen as one of the image preprocessing methods, in this case it will help determine the edges of the image, as well as emphasize them. This method is used to extract image contours. Sobel masks are designed in such a way that they provide maximum efficiency with horizontal or vertical edge orientation. Both masks have the same coefficients because they are rotated 90° to each other. Sometimes it is desirable to get only one orientation of the gradient—horizontal or vertical [15]. ⎡

+1 0 −1





+1 +2 +1



⎢ ⎥ ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ G x ⎢ +2 0 −2 ⎥ ∗ I ; G y = ⎢ 0 0 0 ⎥ ∗ I ; ⎣ ⎦ ⎣ ⎦ +1 0 −1 −1 −2 −1

(1)

The Sobel filter works with a simple 3 × 3 convolution, so it is efficient for both CPU and GPU calculations. Sobel kernels are split, which is an additional optimization option [16]. The horizontal Sobel-Feldman gradient with a splitting convolution is represented by: ⎡

⎡ ⎤ 1 ⎢ ⎥ ⎢ ⎥ [ ] ⎢ ⎥ ⎢ ⎥ G y = ⎢ +2 0 −2 ⎥ = ⎢ 2 ⎥ ∗ 1 0 −1 ; ⎣ ⎦ ⎣ ⎦ +1 0 −1 1 +1 0 −1



(2)

The vertical Sobel-Feldman gradient with splitting convolution is represented by:

106

E. Kinshakov and Y. Parfenenko



+1 +2 +1





⎡ 1

⎥ [ ⎥ ⎢ ⎢ ] ⎥ ⎥ ⎢ ⎢ Gy = ⎢ 0 0 0 ⎥ = ⎢ 0 ⎥ ∗ 1 2 1 ; ⎦ ⎦ ⎣ ⎣ −1 −1 −2 −1

(3)

Each pixel of the image is processed by each kernel to obtain the final gradient value using Eq. (2). On the other hand, to increase productivity, it is permissible to use the sum of the absolute values of the directions (3). √ G = Gx 2 + Gy 2 or G = |Gx| + |Gy| (4) After calculating the vertical and horizontal gradients, the orientation of the edge can be given as follows: θ = atan(

Gy ) Gx

(5)

Usually, edge detection creates thick outlines of the object. In many cases, this is not very useful for object recognition, and therefore additional processing must be applied. To highlight the thin and sharp edges of the image, the magnitude and orientation of the gradient was used. These parameters are sufficient for edge thinning by non-maximum suppression [17].

3.3 Brightness Normalization A general image operator is a function that takes one or more input images and produces an output image. Image transformations can be considered as point operators or zone operators [18]. Pixel transformations have the feature of transforming the image value of each output pixel, which depends only on the corresponding input pixel value [19]. Examples of such operators include brightness and contrast adjustments, and color correction and conversion. Brightness and contrast adjustment two widely used point processes are multiplication and addition with a constant [20]: g(x) = α f (x) + β

(6)

The parameters α > 0 and β are often called gain and offset parameters, and are also used to control contrast and brightness, respectively [21]. It can be considered, that f (x)—pixels of the input image, g(x)—pixels of the result image. Then, it is more convenient to write the expression [22]: g(i, j ) = α · f (i, j) + β

(7)

Application of Machine Learning Techniques to Solve the Problem …

107

where i and j indicate, that the pixel is located in i-th row and j-th column respectively.

3.4 Pre-processing with PCA In order to reduce the linear dimension of image matrices, the PCA algorithm was used. It transforms a set of correlated variables ( p) in smaller number k(k < p), while keeping as much variability in the output as possible [23]. To reduce the dimensionality of the data from n to k, k ≤ n, it is necessary to choose a top −k axes of such an ellipsoid, sorted in descending order of dispersion along the axes [24]. First of all, it is necessary to calculate the variances and means of the initial signs. This is done simply using the covariance matrix. According to the definition of covariance, for two features X i and X j their covariance will be represented as: ) [ )] [ ] ( ( cov X i , X j = E (X i − μi ) X j − μ j = E X i X j − μi

(8)

where μi —the mathematical expectation of the i-th feature. At the same time, it should be noted that the covariance is symmetric and the covariance of the vector with itself will be equal to its dispersion [25]. Thus, the covariance matrix is a symmetric matrix, where the dispersions of the corresponding features lie on the diagonal, and the covariances of the corresponding pairs of features lie off the diagonal [26]. In matrix form, where X is the matrix of observations, the covariance matrix can be presented as follows: ∑

= E[(X − E[X ])(X − E[X ])T ]

(9)

Matrices as linear operators have such properties as eigenvalues and eigenvector [24]. When the matrix is applied to the corresponding linear space, the eigenvectors remain in place and are only multiplied by the corresponding eigenvalues. That is, they define a subspace that, when acted upon by this matrix as a linear operator, remains in the same place [27].

3.5 CNN for Image Detection and Classification

The processed images are sent to the input of the neural network to determine the type of skin disease. The neural network is a mathematical model with adjustable coefficients; experiments on this model are performed to select the coefficients so that the percentage of correct answers returned to the user increases and maximum classification accuracy is achieved. After studying a number of methods and approaches for training neural networks, it was decided to use convolutional neural networks (CNN). Neural networks of this type can provide the expected result due to their architecture and learning approach, as shown in Fig. 1.

Fig. 1 CNN neural network architecture presented in [28]

CNNs provide partial resistance to scale changes, shifts, rotations, perspective changes, and other image distortions. A CNN combines three architectural principles to ensure invariance to scaling, rotation, translation, and spatial distortions:
– local receptive fields (ensure local two-dimensional connectivity of neurons);
– shared synaptic coefficients (ensure the detection of a feature in any place of the image and reduce the total number of weight coefficients);
– hierarchical organization with spatial subsampling.
Currently, the CNN and its modifications are considered the best algorithms for finding objects in a scene in terms of accuracy and speed. Since 2012, CNNs have taken first place at the well-known international image recognition competition ImageNet [29]. The input layer considers the two-dimensional topology of images and consists of several maps (matrices); there is a single map if the image is presented in shades of gray, otherwise there are three, where each map corresponds to one image channel (red, green, and blue), which is the main case in this problem [30]. The value of each input pixel is normalized to the range from 0 to 1 by the following formula:

f(p, min, max) = (p − min) / (max − min),    (10)

where
f – the normalization function;
p – the value of a certain pixel color, from 0 to 255;
min – the minimum pixel value;
max – the maximum pixel value (255).

3.5.1 Convolutional Layer

The convolutional layer is a set of maps, and each map has a synaptic kernel. The number of maps is determined by the requirements of the task: taking a large number of maps increases the recognition quality, but also increases the computational complexity [31]. Based on the analysis of scientific articles, in most cases a ratio of one to two is suggested, that is, each map of the previous layer is associated with two maps of the convolutional layer. The size of all maps of the convolutional layer is the same and is calculated according to the formula:

(w, h) = (mW − kW + 1, mH − kH + 1),    (11)

where
(w, h) – the calculated size of the convolutional map;
mW – the width of the previous map;
mH – the height of the previous map;
kW – the width of the kernel;
kH – the height of the kernel.

The kernel is a filter or window that slides over the entire area of the previous map and finds certain features of the objects. For example, if the network was trained on a set of diseases, one of the kernels could produce the largest signal for one of the classes during the training process, while another kernel could reveal other features. The kernel size is usually taken in the range from 3 × 3 to 7 × 7. If the kernel is too small, it will not be able to distinguish any features; if it is too large, the number of connections between neurons increases. The kernel size is also chosen so that the size of the maps of the convolutional layer is even, which avoids losing information when reducing the dimension in the subsampling layer described below. The kernel is a system of shared weights, or synapses, and this is one of the main features of a convolutional neural network. In a multilayer network there are many connections between neurons, that is, synapses, which greatly slows down the detection process. In a convolutional network, on the contrary, the shared weights reduce the number of connections and allow the same feature to be found throughout the image area.

3.5.2 CNN Pooling Layer

To convert a set of selected images into a trainable set of 3D filters, a convolutional layer is used. A pooling layer is used to perform down-sampling in order to reduce the size of the feature map and eliminate redundant details. These layers minimize the spatial complexity of the parameters used and also address the problem of overfitting. They are used in such a way that they can adaptively learn more discriminative and optimal features. A group of filters used in convolutional layers processes local parts of the input data. These filters produce strong responses for certain local regions while suppressing others, yielding important local structures. After these multiple filters in the convolutional layer, down-sampling is performed on the filtered results in the max pooling layer to make the representation robust to positional variance. A normalization layer is used between them to improve the learning process and reduce the network's dependency on initialization; it normalizes the gradient values passing through the network [31].

3.5.3 CNN Classification

One type of layer is a regular multilayer perceptron layer. The layer's goal is to classify and model a complex nonlinear function, and optimizing it improves recognition quality. Neurons of each feature map of the previous subsampling layer are connected to one neuron of the hidden layer. Thus, the number of neurons in the hidden layer is equal to the number of feature maps in the subsampling layer, but the connections may not necessarily be one-to-one. For example, only some neurons of a feature map may be connected to the first neuron of the hidden layer, while the remaining neurons are connected to the second neuron, or all neurons of the first feature map may be connected to neurons 1 and 2 of the hidden layer [29]. The calculation of the neuron values can be described by the formula:

x_j^l = f(x_j^(l−1) ∗ w_(i,j)^(l−1) + b_j^(l−1)),    (12)

where
x_j^l – the output of the j-th feature map of layer l;
f(·) – the activation function;
b_j^(l−1) – the layer bias coefficient.
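To make the described architecture concrete, the following TensorFlow/Keras sketch stacks the layers discussed in Sects. 3.5.1–3.5.3 (convolution, normalization, pooling, and a perceptron classifier). The filter counts, kernel sizes, and the number of classes are illustrative assumptions; the chapter does not list its exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # assumed number of skin-disease classes

def build_cnn(input_shape=(470, 720, 3), num_classes=NUM_CLASSES):
    model = models.Sequential([
        layers.Rescaling(1.0 / 255, input_shape=input_shape),  # pixel scaling as in formula (10)
        layers.Conv2D(16, (3, 3), activation="relu"),          # shared-weight kernels (Sect. 3.5.1)
        layers.BatchNormalization(),                           # normalization between layers
        layers.MaxPooling2D((2, 2)),                           # subsampling/pooling (Sect. 3.5.2)
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),                   # perceptron layer (Sect. 3.5.3)
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```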

4 Results

4.1 Input Data Preprocessing

Images with dimensions (470, 720) are fed to the input of the neural network. An example of a skin disease image from the dataset is shown in Fig. 2. Beforehand, each image goes through a number of methods that improve its quality and highlight the localization of the disease, so that the network does not waste time processing the non-informative part of the image during training.


Fig. 2 The general view of the input image

Fig. 3 The result of brightness normalization

4.1.1 Skin Image Brightness Normalisation

Brightness normalization was used in the preprocessing stage to change the range of pixel intensity values. The input dataset includes photos with poor contrast, for example, due to glare. Normalization in this case is used as contrast stretching, which is shown in Fig. 3. This method is calculated according to formulas (6) and (7).

4.1.2 Using the Sobel Filter

The Sobel filter is the primary way to obtain an edge/gradient magnitude image. It was applied by computing the image intensity gradient for each pixel within the image (4–8). The direction of the greatest increase from light to dark and the rate of change in this direction were also determined (Fig. 4).


Fig. 4 The result of the Sobel filter usage

4.1.3 Using the PCA Algorithm

The results of principal component analysis are shown in Fig. 5. This is a linear dimensionality reduction technique that was used to extract information from a high-dimensional space by projecting it into a lower-dimensional subspace (formulas (8) and (9)).

Fig. 5 The result of the PCA algorithm


Fig. 6 The data for neural network training

4.1.4 Skin Diseases Classification with CNN

Once we have the processed data, it is used to train the neural network. The training follows the classical approach, dividing the data into 80 percent for the training sample and 20 percent for the testing sample. The size of the training sample can be seen in Fig. 6. In Fig. 7, two graphs are shown. The graph on the left shows the quality of the model based on the accuracy metric along the vertical axis, with the number of epochs indicated in the parameters along the horizontal axis. The graph on the right shows the loss function, with the actual loss value along the vertical axis and the number of epochs along the horizontal axis. The main task of classification using the CNN is to achieve a model accuracy of 0.8; this is the result we aim for during model training. Before applying filters and algorithms for image preprocessing, the accuracy was 0.4, which was unsatisfactory. After applying the algorithms and filters, the accuracy improved to 0.6–0.7, which is satisfactory for this research since the training was done without a graphics processing unit.
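A hedged sketch of the training step described above, reusing the build_cnn sketch from Sect. 3.5 and assuming that images and labels are NumPy arrays of the preprocessed data; the epoch count is an arbitrary placeholder.

```python
from sklearn.model_selection import train_test_split

# images, labels: hypothetical preprocessed arrays (not defined in the chapter's text)
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42, stratify=labels)  # 80/20 split

model = build_cnn()
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_test, y_test))
test_loss, test_acc = model.evaluate(X_test, y_test)  # compared against the 0.8 target
```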


Fig. 7 The results of neural network training

5 Conclusions

As a result of this research, a neural network for diagnosing skin diseases was developed. The paper examines image preprocessing methods to improve neural network recognition. The Sobel filter, brightness normalization, and the PCA algorithm were used. The task was performed on a created and pre-labeled dataset of skin diseases. A convolutional neural network based on the TensorFlow framework was used. The recognition quality on the test sample of data is about 0.7 for the Sobel filter, 0.5 for the PCA algorithm, and 0.4 for brightness normalization. It was established that the Sobel method is the most effective among the considered methods when used with the neural network. Further research will focus on hyperparameter tuning and neural network optimization. Preprocessing in this study is used to minimize false positives. It ensures that the neural network receives clear images with a precise localization of the disease. As this study is in the medical field, it is important to minimize false positives of the model.


References 1. Bihunyak, T.V., Bihuniak, K.O., Redko, O.S.: Clinical Polymorphysm and differential diagnostics of Lyme illness. Bulletin Scientific Res., 1 (2018) 2. Volosovets, O.P., Bolbot, Y.K., Beketova, G.V., Berezenko, V.S., Umanets, T.R., Rechkina, O.O. et al.: Allergic and non-allergic skin diseases in children of Ukraine: a retrospective study of the prevalence and incidence over the past 24 years. Medicni Perspektivi. 26(3) (2021) 3. Nittari, G., Khuman, R., Baldoni, S., Pallotta, G., Battineni, G., Sirignano, A., et al.: Telemedicine practice: review of the current ethical and legal challenges. Telemed. e-Health. 26, 1427–1437 (2020). https://doi.org/10.1089/tmj.2019.0158 4. Son, H.M., Jeon, W., Kim, J., Heo, C.Y., Yoon, H.J., Park, J.U., et al.: AI-based localization and classification of skin disease with erythema. Sci. Rep. 11, 5350 (2021). https://doi.org/10. 1038/s41598-021-84593-z 5. Alkolifi Alenezi, N.S.: A method of skin disease detection using image processing and machine learning. Procedia Comput. Sci. 163, 85–92 (2019). https://doi.org/10.1016/j.procs.2019.12.09 6. Patnaik, S.K., Sidhu, M.S., Gehlot, Y., Sharma, B., Muthu, P.: Automated skin disease identification using deep learning algorithm. Biomed. Pharmacology J. 11(3) (2018). https://doi.org/ 10.13005/bpj/1507 7. Muhaba, K.A., Dese, K., Aga, T.M., Zewdu, F.T., Simegn, G.L.: Automatic skin disease diagnosis using deep learning from clinical image and patient information. Skin Health Disease 2(1) (2022). https://doi.org/10.1002/ski2.81 8. Bandyopadhyay, S.K., Bose, P., Bhaumik, A., Poddar, S.: Machine learning and deep learning integration for skin diseases prediction. Int. J. Eng. Trends Technol. 70(3), 13–21 (2022). https://doi.org/10.14445/22315381/IJETT-V70I2P202 9. Rajasekaran, G., Aiswarya, N., Keerthana, R.: Skin disease identification using image processing and machine learning techniques. Int. Res. J. Eng. Technol. 7(3), 1368–1371 (2020) 10. Rathod, J., Wazhmode, V., Sodha, A., Bhavathankar, P.: Diagnosis of skin diseases using convolutional neural networks. In: Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (2018) 11. Malliga, S., Sherly Infanta, G., Sindoora, S., Yogarasi, S.: Skin disease detection and classification using deep learning algorithms. Int. J. Adv. Sci. Technol. 29, 255–260 (2020) 12. Wang, Z., Wang, K., Yang, F., Pan, S., Han, Y.: Image segmentation of overlapping leaves based on Chan–Vese model and Sobel operator. Inf. Process. Agric. 5(1) (2018). https://doi. org/10.1016/j.inpa.2017.09.005 13. Hao, F., Xu, D., Chen, D., Hu, Y., Zhu, C.: Sobel operator enhancement based on eightdirectional convolution and entropy. Int. J. Inf. Technol. 13(5) (2021). https://doi.org/10.1007/ s41870-021-00770-3 14. DermNet: DermNet Image Library, https://dermnetnz.org/image-library 15. Jiang, J., Jin, Z., Wang, B., Ma, L., Cui, Y.: A Sobel operator combined with patch statistics algorithm for fabric defect detection. KSII Trans. Internet Inf. Syst. 14(2), 681–701 (2020). https://doi.org/10.3837/tiis.2020.02.012 16. Wang, Q., Du, W., Ma, C., Gu, Z.: Gradient color leaf image segmentation algorithm based on Meanshift and K means. In: IEEE Advanced Information Technology, Electronic and Automation Control Conference (2021) 17. Iliukhin, S., Chernov, T., Polevoy, D., Fedorenko, F.: A method for spatially weighted image brightness normalization for face verification. In: Eleventh International Conference on Machine Vision (2019) 18. 
Kociołek, M., Strzelecki, M., Obuchowicz, R.: Does image normalization and intensity resolution impact texture classification? Comput. Med. Imaging Graphics, 81 (2020). https://doi. org/10.1016/j.compmedimag.2020.101716 19. Pal, M.K., Porwal, A.: A Local Brightness Normalization (LBN) algorithm for destriping Hyperion images. Int. J. Remote Sens. 36(10), 2674–2696 (2015). https://doi.org/10.1080/014 31161.2015.1043761


20. Huizinga, W., Poot, D.H.J., Guyader, J.M., Klaassen, R., Coolen, B.F., van Kranenburg, M. et al.: PCA-based groupwise image registration for quantitative MRI. Med. Image Anal., 65–78 (2016). https://doi.org/10.1016/j.media.2015.12.004 21. Bashir, R., Junejo, R., Qadri, N.N., Fleury, M., Qadri, M.Y.: SWT and PCA image fusion methods for multi-modal imagery. Multimed Tools Appl. 78(2), 1235–1263 (2019). https:// doi.org/10.1007/s11042-018-6229-5 22. Potapov, P.: On the loss of information in PCA of spectrum-images. Ultramicroscopy 182, 191–194 (2017). https://doi.org/10.1016/j.ultramic.2017.06.023 23. Ma, J., Yuan, Y.: Dimension reduction of image deep feature using PCA. J. Vis. Commun. Image Representation, 63 (2019). https://doi.org/10.1016/j.jvcir.2019.102578 24. Zhou, C., Wang, L., Zhang, Q., Wei, X.: Face recognition based on PCA image reconstruction and LDA. Optik (Stuttg). 124(22), 5599–5603 (2013). https://doi.org/10.1016/j.ijleo.2013. 04.108 25. Murali Mohan Babu, Y.: PCA based image denoising. Signal Image Process 3(2), 236–244 (2012). https://doi.org/10.5121/sipij.2012.3218 26. Alnagdawi, M.A., Shamsuddin, S.M.H.J., Hashim, S.Z.M., Aburumman, A.: Improve image registration jeffrey’s divergence method for insufficient overlap area using kmeans++ in remote sensed images. J. Theor. Appl. Inf. Technol. 97(5), 1571–1580 (2019) 27. Talpur, S., Khoso, N.: Advanced ambulatory operating stretcher learned by means of Convulational neural network (CNN). J. Biomed. Eng. Med. Imaging 5(3) (2018). https://doi.org/10. 14738/jbemi.53.4660 28. Kurniawan, K., Sedayu, B.B., Hakim, A.R., Erawan, I.M.S.: Classification of Rastrelligerkanagurta and Rastrelligerbrachysoma using Convulational Neural Network (CNN). In: IOP Conference Series: Earth and Environmental Science (2022) 29. Yadav, S., Rathod, R., Pawar, S.R, Pawar, V.S., More, S.: Application of deep convulational neural network in medical image classification. In: 2021 International Conference on Emerging Smart Computing and Informatics, ESCI 2021 (2021) 30. Park, J., Chen, J., Cho, Y.K., Kang, D.Y., Son, B.J.: CNN-based person detection using infrared images for night-time intrusion warning systems. Sensors 20(1) (2020). https://doi.org/10.3390/ s20010034 31. Chauhan, R., Ghanshala, K.K., Joshi, R.C.: Convolutional Neural Network (CNN) for image detection and recognition. In: ICSCCC 2018—1st International Conference on Secure Cyber Computing and Communications (2018)

Analyzing Biomedical Data by Using Classification Techniques

J. Kostolny, J. Rabcan, T. Kiskova, and A. Leskanicova

Abstract Brain tumors are among the deadliest cancers and are responsible for significant mortality; the most malignant form of brain cancer is glioblastoma multiforme (GBM). GBM is characterized by a poor prognosis and a low survival rate, while therapy is considerably limited nowadays. Due to the scarcity of adequate treatment for this type of illness, early diagnosis associated with accurate tumor classification is critical. For classification, we can essentially use data evaluation based on software processing. In this article, we focus on presenting the classification of such data types and their statistical evaluation using decision trees, which would make it possible to identify this disease in the early stages.

Keywords Brain tumor · Classification techniques · Metabolites · Decision trees · PLS-DA

J. Kostolny (B) · J. Rabcan
Faculty of Management Science and Informatics, University of Zilina, Zilina, Slovakia
e-mail: [email protected]
J. Rabcan
e-mail: [email protected]
T. Kiskova · A. Leskanicova
Institute of Biology and Ecology, Faculty of Sciences, Pavol Jozef Safarik University, Kosice, Slovakia

1 Introduction

Brain tumors are a severe health problem that can significantly affect a patient's quality of life. These tumors are usually divided into two main categories: benign and malignant [1]. Benign tumors are typically slow-growing and rarely spread to other parts of the body, while malignant tumors are fast-growing and aggressive and can spread rapidly to other parts of the brain and body. Diagnosis of brain tumors currently relies on various techniques such as MRI (Magnetic Resonance Imaging), CT (Computed Tomography), and PET (Positron Emission Tomography) scans [7]. These techniques can be beneficial in identifying tumors and determining their size and location. In addition, different types of classifications of brain tumors are used to categorize tumors based on their morphological, molecular, and genetic characteristics [2].

One of the most common classifications of brain tumors is the classification according to histological class. The histological grade is usually determined by visual observation of the tumor cells under a microscope and is classified based on similarity to normal brain tissues [3]. This classification can generally be divided into gliomas, medulloblastomas, and neuronal and embryonal tumors. Another category of brain tumor classification is according to molecular characteristics. This classification categorizes tumors based on their molecular factors and aims to identify specific molecular markers present in tumors. The molecular type is used to identify different subtypes of tumors and can help decide the best treatment for the patient [4]. In addition, there is also a classification of brain tumors according to the genetic changes that occur in the tumors. This classification focuses on identifying the specific genes and genetic changes in tumors and helps determine prognosis and response to treatment. Overall, the classification of brain tumors can be advantageous in determining a patient's optimal treatment and prognosis, as different categories of tumors may have different characteristics and require different therapeutic approaches.

The treatment of brain tumors is usually very complex, as it depends on many factors such as the type and location of the tumor and the patient's condition and age. Treatment may include surgery to remove the cancer, radiation, and chemotherapy. New opportunities for using machine learning algorithms and artificial intelligence have emerged in brain tumor diagnosis in recent years. After training, these algorithms can identify tumors based on MRI and CT scans, increasing the diagnosis success rate and enabling faster and more accurate treatment [5]. Machine learning algorithms are, in addition, used to analyze genetic data from tumors and help decide on the best treatment for a particular patient [6]. Given the severity of the problem of brain tumors and the need for fast and accurate diagnosis and treatment, the development of machine learning algorithms and artificial intelligence in this field is significant and promising (Fig. 1). The application of data mining methods in analyzing metabolomic data can further help in faster diagnosis and suggest new treatment methods.


Fig. 1 Brain tumor tissue location

2 Metabolomics

Metabolomics is a field of research that focuses on studying metabolites: small molecules produced in the metabolic processes of cells. Metabolites include various molecules, including amino acids, carbohydrates, lipids, and nucleotides [8]. Metabolites can be measured and analyzed using multiple methods such as mass spectrometry, nuclear magnetic resonance (NMR), or chromatography [9]. These methods allow accurate quantification of metabolites and identification of their structures. One of the aims of metabolomic analysis is to identify biomarkers that can serve as indicators of disease or of the body's condition. In the case of brain tumors, metabolomic analysis aims to identify metabolites that tumor cells produce or that are related to their growth and development. Metabolomic analysis can be helpful in diagnosis, in monitoring the efficacy of treatment, and in identifying tumor recurrence. In addition, metabolomic analysis can be used to identify new therapeutic targets and develop new drugs. Overall, metabolomics is an up-and-coming area of research that can provide new insights into diseases and enable the development of new diagnostic and therapeutic approaches [10–12].

The occurrence of brain tumors can affect metabolic processes in the body, which are reflected in the composition of the blood. Therefore, metabolic changes may be detectable through the analysis of blood samples. Metabolomics, the study of the metabolites produced in metabolic processes, can be used to identify changes in the metabolic profiles of patients that may indicate the presence of cancer. Metabolomic analyses can provide information about metabolic changes in brain tumors and identify biomarkers that could be used for tumor diagnosis and monitoring. For example, tumor cells may produce some metabolites related to their growth and development [13]. The advantage of metabolomics in diagnosis is that it can be a non-invasive method using a simple blood test. In addition, metabolomic analysis can be used to monitor treatment efficacy and identify tumor recurrence. Therefore, the potential


of metabolomics in the diagnosis and treatment of brain tumors is currently intensively investigated in the research field, and many studies are showing promising results. However, further research and validation of its efficacy are required before metabolomics can be used in clinical practice.

2.1 Analyzing of Metabolomic Data

Metabolomic analysis studies small molecules in biological systems and provides information on metabolic pathways and their regulation. In brain tumor diagnosis and treatment, metabolomic analyses are becoming increasingly important tools for identifying new biomarkers and targets for therapy. However, metabolomic data are often very complex and contain much information that needs to be analyzed and interpreted [14]. For this purpose, we use data mining and bioinformatics techniques to identify patterns and relationships in the data which may not be immediately visible. Using data mining and bioinformatics techniques in metabolomic analysis allows the identification of novel biomarkers associated with brain tumors and enables the diagnosis and monitoring of disease progression. In addition, these techniques can identify new targets for treatment and aid in developing new therapeutic strategies. Therefore, metabolomic analysis combined with data mining and bioinformatics techniques is becoming increasingly important in the diagnosis and treatment of brain tumors.

Metabolomic data are often analyzed using various data mining and bioinformatics techniques [15]. Data mining involves the identification of patterns, trends, and relationships in the data using statistical and machine learning methods. Techniques such as correlation analysis, principal component analysis (PCA), discriminant analysis, and clustering are often used in metabolomics analysis. PCA is used to simplify complex datasets and visualize relationships between metabolites. Discriminant analysis can be used to identify metabolites responsible for differences between groups of samples, for example, between a group of brain tumor patients and healthy individuals [16]. Clustering is a method that allows metabolites to be grouped according to their similarity and enables patterns to be identified that might not initially be clear. In addition, bioinformatics tools are often used in metabolomics analysis to identify metabolites and interpret metabolomic data. These tools include metabolite databases, software tools to search for similarities with existing metabolites, and tools for the prediction of metabolic pathways [17]. The results of metabolomic analysis and data interpretation can be extremely useful in identifying new therapeutic targets and developing new drugs for brain tumor patients.


3 Datamining Techniques

We can use data mining techniques such as partial least squares discriminant analysis (PLS-DA), variable importance in projection (VIP), or metabolite impact determination (MID) calculation to analyze metabolomic data obtained from brain tumors. These techniques allow the identification of metabolic patterns and biomarkers characteristic of brain tumors and enable their diagnosis and the monitoring of disease progression [18]. Partial least squares discriminant analysis (PLS-DA) is a statistical method that identifies differences in metabolic patterns between groups of samples, for example, between healthy and diseased individuals. PLS-DA allows the identification of combinations of metabolites that best discriminate between groups and allow disease diagnosis. Variable importance in projection (VIP) is another method that allows the identification of the metabolites that are most important for discriminating between groups of samples. VIP considers not only changes in metabolite concentrations between groups but also their interactions and correlations [18]. Metabolite impact determination (MID) identifies the metabolites with the most significant impact on differences in metabolic patterns between groups. MID considers correlations between metabolites and their effect on the total variance in the data [19]. Using these data mining and bioinformatics techniques in metabolomic analysis allows for identifying novel biomarkers and metabolic patterns associated with brain tumors, enabling their diagnosis and the tracking of disease progression.

Decision trees are widely used tools in data analysis using data mining and decision modeling. In metabolome analysis, decision trees can be used to identify metabolic pathways and predict metabolic functions. Initially, it may be necessary to transform the data obtained from metabolomic analysis into forms that can be used to train decision trees. This may include normalization, standardization, or other adjustments. One of the applications of decision trees in metabolomic analysis can be the identification of metabolic pathways [20]. This approach can be used to classify which metabolites are involved in particular pathways and which are not. Decision trees can help identify critical metabolites and compounds in these pathways and predict their impact on metabolic processes. Another use of decision trees in metabolomic analysis can be the prediction of metabolic functions. This approach can be used to identify metabolites that are important for a particular metabolic process and to predict how these metabolites affect that function. Again, decision trees can be used to identify critical metabolites and predict their impact on metabolic functions. The results obtained using decision trees can be further used in the development of new drugs or new therapeutic targets. Decision trees can be used to diagnose various metabolic disorders and to predict treatment outcomes.
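As a rough illustration of how the PLS-DA and VIP measures described above could be computed outside MetaboAnalyst, the sketch below fits a PLS regression on a binary class vector with scikit-learn and derives the standard VIP formula from its weights; the number of latent components is an assumption.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def plsda_vip(X, y, n_components=2):
    """PLS-DA (PLS regression on 0/1 class labels) with VIP scores per feature."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(X, y)
    t = pls.x_scores_                      # latent scores (n_samples x A)
    w = pls.x_weights_                     # X weights     (n_features x A)
    q = pls.y_loadings_                    # Y loadings    (1 x A)
    p = X.shape[1]
    ssy = np.sum(t ** 2, axis=0) * (q ** 2).ravel()   # y-variance explained per component
    w_norm = (w / np.linalg.norm(w, axis=0)) ** 2
    vip = np.sqrt(p * (w_norm @ ssy) / ssy.sum())
    return pls, vip                        # large VIP values mark influential metabolites
```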


3.1 Tools for Analyze Metabolomics Data

MetaboAnalyst is an online metabolomics data analysis tool that enables various analyses, including statistical analysis and data visualization. This tool provides users with a simple and intuitive interface for metabolic data analysis without the need for advanced knowledge in programming and bioinformatics [21]. This application provides a wide range of features that enable different types of analyses, including principal component analysis, discriminant composition analysis, correlational analysis, and more. In addition, it also provides data visualization capabilities using a variety of graphs and charts that allow users to quickly identify significant differences between groups of samples. MetaboAnalyst allows users to evaluate metabolic patterns and biomarkers associated with brain tumors and identify potential target molecules for therapeutic interventions. This tool can also be used to compare metabolic data between different sample groups, which can aid in identifying biomarkers for various diseases. This tool thus has the potential to be a valuable tool for the diagnosis and treatment of brain tumors and can be used in combination with other analytical tools and data mining techniques to obtain even more accurate and reliable results (Fig. 2).

Fig. 2 MetaboAnalyst tool dashboard


3.2 Glioblastoma Multiforme Data Analysis

Glioblastoma multiforme (GBM) is a very aggressive type of brain tumor, often treated with surgery, radiotherapy, and chemotherapy [22]. Researchers and doctors are trying to identify new ways to treat this tumor and improve patient prognosis. In the case of glioblastoma multiforme, metabolomic analyses could help identify changes in the metabolic processes in the tumor cells. These changes help identify new therapeutic targets or indicators that could be used to monitor treatment efficacy. Some studies have looked at metabolomic analyses of glioblastoma multiforme and have shown changes in the concentration of specific metabolites such as glucose, lactate, aspartate, and glutamate [23]. These changes may be linked to the process of glycolysis that occurs in tumor cells and allows them to grow and divide rapidly. However, research into the metabolomics of glioblastoma multiforme is still at an early stage. Further studies are needed to determine more precisely the metabolic changes in tumor cells and their impact on the growth and survival of patients. The data used here to demonstrate the data mining techniques come from the MetIQ software package and are provided by BIOCRATES Life Sciences AG. First, the PLS-DA technique was used for the analysis (Fig. 3). In the result, two groups can be identified: data from healthy individuals and data from individuals with GBM. This technique can identify significant differences in metabolite concentrations.

Fig. 3 PLS-DA analysis of metabolites


Next, we used the VIP technique for a more detailed identification, identifying the five most important metabolites (Fig. 4).

4 Decision Tree Induction

The next part of the analysis was based on decision trees [24]. Among the main advantages of decision trees is the good interpretability of results. Therefore, DTs are a popular algorithm for tasks where decisions based on a black box are not admissible. For example, decisions for particular medical tasks require good understanding. A decision tree (DT) is a supervised learning algorithm which is utilized for both classification and regression tasks. It has a hierarchical tree structure, which consists of a root node, branches, internal nodes, and leaf nodes. Each internal node is associated with one input attribute [25]. The root is the top node of the tree, where each classification process starts. Internal nodes are associated with one of the input attributes. Each internal node has outgoing edges, and each of these edges corresponds to one of the associated attribute's values. Internal nodes make a test according to the attribute which is associated with them, and according to the test result, a classified instance shifts down the tree until a leaf is reached. The leaves assign a classified instance to one of the predefined classes. There are different algorithms of decision tree induction; among the most popular are ID3, C4.5, CART, and CHAID. One of the first algorithms, ID3, uses information gain to select association attributes, which is based on the Shannon entropy measurement. The Shannon entropy of a set S of examples is defined as:

H(S) = −Σ(j=1…C) (kj/|S|) · log2(kj/|S|),    (1)

where |S| is the cardinality of the set S and kj is the number of instances belonging to the j-th class. The information gain tells how much the uncertainty of the set S is reduced after splitting it by attribute A. Information gain G(S, A) is the difference between the entropy before and after splitting the set S by attribute A, and it is defined as:

G(S, A) = H(S) − H(S|A) = H(S) − Σ(v∈vals(A)) (|Sv|/|S|) · H(Sv),    (2)


where Sv is the subset of S in which all instances have the value of attribute A equal to v. The information gain is equal to the total entropy for an attribute if, for each of the attribute values, a unique classification can be made for the output attribute. In this case, the relative entropies H(S|A) subtracted from the total entropy H(S) are 0. Information gain tends to prefer attributes with a large number of distinct values; therefore, it is less suitable for tasks where the input attributes are not binary, and it has been improved to the information gain ratio. The information gain ratio is the ratio between the information gain and the split information value:

Gr(S, A) = G(S, A) / ( −Σ(v∈vals(A)) (|Sv|/|S|) · log2(|Sv|/|S|) )    (3)

To increase the classification performance of decision trees, a technique called tree pruning is used. Pruning reduces the size of decision trees by removing sections of the tree that are non-critical and redundant for classifying instances. Pruning reduces the complexity of the decision trees and hence improves predictive accuracy by reducing overfitting. Pruning processes can be divided into two types, pre-pruning and post-pruning. Post-pruning techniques are applied after the tree is induced. Pre-pruning techniques stop tree growth in some branch during tree induction (e.g., a maximum tree depth, or requiring the information gain of an attribute to exceed minGain). The algorithm used in this paper works with pre-pruning according to two thresholds (a possible mapping to library parameters is sketched below):
• Minimal leaf size. If the number of instances of a node in its subset S is less than the minimal leaf size, this node will be a leaf.
• Minimal gain. If the information gain ratio in some node is less than the minimal gain, this node will be a leaf.
A resulting decision tree is obtained by running the induction algorithm several times with different threshold values. After these runs, we select the tree with the best classification performance.
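A minimal sketch of how the two pre-pruning thresholds could be mirrored with scikit-learn; min_samples_leaf plays the role of the minimal leaf size, while min_impurity_decrease is only a rough analogue of the minimal gain, since scikit-learn splits on entropy or Gini rather than the gain ratio used here. The grid values are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "min_samples_leaf": [2, 5, 10],              # analogue of "minimal leaf size"
    "min_impurity_decrease": [0.0, 0.01, 0.05],  # rough analogue of "minimal gain"
}
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
search = GridSearchCV(tree, param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)   # metabolite matrix and Control/Tumor labels (assumed names)
# best_tree = search.best_estimator_
```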

4.1 Experimental Settings

In this section, we used a decision tree for brain tumor prediction. For this purpose, we used the demonstration data from the section on glioblastoma multiforme data analysis. The data have 78 instances described by 186 attributes. The classification is focused on the separation of instances with a brain tumor from healthy individuals. Therefore, the output attribute defines two classes: Control and Tumor. Class Control covers instances where a brain tumor is not present. Class Tumor contains instances with a brain tumor. The dataset was randomly divided into training and testing sets in the ratio 70:30. Instances from the training set are used only for training the DT, and the testing set is used for evaluating the performance of the classifier. This was repeated several times, and the results were averaged. We estimated the suitable values of the threshold


Fig. 4 VIP score plot of analyzed data

Table 1 Table of classification results

Classifier   Sensitivity   Specificity   Accuracy
DT           0.926         0.891         0.910
SVM          0.902         0.891         0.897
kNN          0.902         0.864         0.884
NB           0.878         0.918         0.897
NN (MLP)     0.951         0.864         0.910

parameters: Minimal leaf size and Minimal gain. Experiments were performed with different values of the thresholds. If the model error was high, then the parameters of the algorithm that was used to create the model were changed and a new model was created. The resulting decision tree is shown in Fig. 5. To verify that the decision tree is a suitable classifier for the selected data, we made a comparison with other machine learning algorithms for this task. This comparison involves naïve Bayes, a neural network, a support vector machine, and k-nearest neighbors. Each of these algorithms has some input parameters. We evaluated the values of these input parameters in a similar way as for the decision tree, and the best results are shown in Table 1. The classification was evaluated by sensitivity, specificity, and accuracy. Table 1 shows that the decision tree is a very useful classifier. It shares the best value of classification accuracy with the multi-layer perceptron (MLP) neural network, but in comparison with the neural network, decision trees are better interpretable. Especially in medical tasks, the explanation of decisions is very important.
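The three reported measures can be computed from a binary confusion matrix as in the following sketch; it assumes string labels Control and Tumor and is not the exact evaluation script used by the authors.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate_binary(y_true, y_pred, positive="Tumor"):
    """Sensitivity, specificity and accuracy as reported in Table 1."""
    negatives = [lab for lab in np.unique(y_true) if lab != positive]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=negatives + [positive]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```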


Fig. 5 Diagram of resulting decision tree

5 Conclusion

In this paper, we have presented a method for identifying tumors from metabolomic data. Classification of such data is complicated, and the proposed approaches, such as PCA or PLS-DA analysis, lay a good foundation for the studies. We can extend and refine the research using decision tree techniques, which provide additional validation of the classification according to specific metabolites and, in the future, provide the possibility of creating a tool that can determine with high confidence what type of tissue is involved based on the values of the identified metabolomic data. In further analyses, we want to deal with the application of reliability analysis based on uncertain data [26], which can be helpful in the study of metabolomic data. These data may be incomplete or inaccurate due to various factors, such as substandard sampling, improper sample storage, or interference in the analysis process. Creating a structural reliability function could help remove these uncertainties and allow a reliable comparison of metabolic pathways between different samples. For example, it would be possible to identify metabolic pathways that differ significantly between healthy and diseased individuals or other treatment groups using a structural reliability function. The reliability structure function would also allow the reliability of metabolic pathway analysis results to be evaluated and provide information on the quality of the measurements [27]. Overall, extending the application to metabolomics data could give helpful information for metabolomics researchers and practitioners and help to determine metabolic pathways and their interrelationships more accurately.


Acknowledgements This publication has been produced with the support of the Integrated Infrastructure Operational Program for the project: Creation of a Digital Biobank to Support the Systemic Public Research Infrastructure, ITMS: 313011AFG4.

References 1. Ostrom, Q.T., Patil, N., Cioffi, G., Waite, K., Kruchko, C., Barnholtz-Sloan, J.S.: CBTRUS statistical report: primary brain and other central nervous system tumors diagnosed in the United States in 2013–2017. Neuro Oncol 22(12 Suppl 2), IV1–IV96 (2020), https://doi.org/ 10.1093/NEUONC/NOAA200 2. Siegel, R.L., Miller, K.D., Fuchs, H.E., Jemal, A.: Cancer statistics, 2021. CA Cancer J. Clin. 71(1), 7–33 (2021). https://doi.org/10.3322/CAAC.21654 3. Kumar, R., Srivastava, R., Srivastava, S.: Detection and classification of cancer from microscopic biopsy images using clinically significant and biologically interpretable features. J. Med. Eng. 2015, 1–14 (2015). https://doi.org/10.1155/2015/457906 4. Mamatjan, Y., et al.: Molecular signatures for Tumor classification: an analysis of the cancer genome atlas data. J. Mol. Diagn. 19(6), 881–891 (2017). https://doi.org/10.1016/J.JMOLDX. 2017.07.008 5. Zhang, J., Li, Y., Zhao, Y., Qiao, J.: CT and MRI of superficial solid tumors. Quant Imaging Med. Surg. 8(2), 232 (2018). https://doi.org/10.21037/QIMS.2018.03.03 6. Hajjo, R., Sabbah, D.A., Bardaweel, S.K., Tropsha, A.: Identification of tumor-specific MRI biomarkers using machine learning (ML). Diagnostics 11(5) (2021), https://doi.org/10.3390/ DIAGNOSTICS11050742 7. Treglia, G., et al.: Diagnostic performance and prognostic value of PET/CT with different tracers for Brain Tumors: a systematic review of published meta-analyses. Int. J. Mol. Sci. 20(19), 4669 (2019). https://doi.org/10.3390/IJMS20194669 8. Li, S., Gao, D., Jiang, Y.: Function, detection and alteration of Acylcarnitine Metabolism in Hepatocellular Carcinoma. Metabolites 9(2) (2019), https://doi.org/10.3390/METABO902 0036 9. Dona, A.C., et al.: A guide to the identification of metabolites in NMR-based metabonomics/ metabolomics experiments. Comput. Struct. Biotechnol. J. 14, 135–153 (2016). https://doi.org/ 10.1016/J.CSBJ.2016.02.005 10. Gaca-Tabaszewska, M., Bogusiewicz, J., Bojko, B.: Metabolomic and Lipidomic profiling of gliomas—a new direction in personalized Therapies. Cancers (Basel) 14(20) (2022), https:// doi.org/10.3390/CANCERS14205041 11. Alfaifi, A. et al.: Metabolomics: A New Era in the Diagnosis or Prognosis of B-Cell NonHodgkin’s Lymphoma. Diagnostics 2023 13, 861, 13(5), 861 (2023), https://doi.org/ 10.3390/DIAGNOSTICS13050861 12. Chen, Z., Li, Z., Li, H., Jiang, Y.: Metabolomics: a promising diagnostic and therapeutic implement for breast cancer. Onco. Targets Ther. 12, 6797 (2019). https://doi.org/10.2147/ OTT.S215628 13. Elia, I., Haigis, M.C.: Metabolites and the tumour microenvironment: from cellular mechanisms to systemic metabolism. Nature Metabolism 2021 3:1 3(1), 21–32 (2021), https://doi.org/10. 1038/s42255-020-00317-z 14. Millington, D.S., Stevens, R.D.: Acylcarnitines: analysis in plasma and whole blood using tandem mass spectrometry. Methods Mol. Biol. 708, 55–72 (2011). https://doi.org/10.1007/ 978-1-61737-985-7_3/COVER 15. Chovancova, O., Stafurikova, A., MacEkova, D., Kiskova, T., Rabcan, J., Kostolny, J.: Impact of Metabolomics on depression using data mining techniques. Proceedings of the 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems:

Technology and Applications, IDAACS 2019, vol. 2, pp. 651–655 (2019), https://doi.org/10.1109/IDAACS.2019.8924245 16. Zhang, Z., Castelló, A.: Principal components analysis in clinical studies. Ann. Transl. Med. 5(17) (2017), https://doi.org/10.21037/ATM.2017.07.12 17. Karp, P.D., Latendresse, M., Caspi, R.: The pathway tools pathway prediction algorithm. Stand. Genomic. Sci. 5(3), 424 (2011). https://doi.org/10.4056/SIGS.1794338 18. Chovancova, O., MacEkova, D., Kostolny, J., Stafurikova, A., Kiskova, T.: Quantitative metabolomics analysis of depression based on PLS-DA model. 2019 42nd International Conference on Telecommunications and Signal Processing, TSP 2019, pp. 298–301 (2019), https://doi.org/10.1109/TSP.2019.8769066 19. Banimustafa, A.H., Hardy, N.W.: A strategy for selecting data mining techniques in metabolomics. Methods Mol. Biol. 860, 317–333 (2012). https://doi.org/10.1007/978-1-61779-594-7_18 20. Hummel, J., Strehmel, N., Selbig, J., Walthe, D., Kopka, J.: Decision tree supported substructure prediction of metabolites from GC-MS profiles. Metabolomics 6(2), 322–333. https://doi.org/10.1007/s11306-010-0198-7. Epub 2010 Feb 16. PMID: 20526350; PMCID: PMC2874469 21. Chong, J., Wishart, D.S., Xia, J.: Using MetaboAnalyst 4.0 for comprehensive and integrative metabolomics data analysis. Curr. Protoc. Bioinformatics 68(1) (2019), https://doi.org/10.1002/CPBI.86 22. Silantyev, A.S. et al.: Current and future trends on diagnosis and prognosis of glioblastoma: from molecular biology to proteomics. Cells 8(8) (2019), https://doi.org/10.3390/CELLS8080863 23. Johnson, B.E., et al.: Mutational analysis reveals the origin and therapy-driven evolution of recurrent glioma. Science 343(6167), 189–193 (2014). https://doi.org/10.1126/SCIENCE.1239947 24. Levashenko, V.G., Zaitseva, E.N.: Usage of new information estimations for induction of fuzzy decision trees. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2412, 493–499 (2002). https://doi.org/10.1007/3-540-45675-9_74/COVER 25. Rabcan, J., Rusnak, P., Kostolny, J., Stankovic, R.S.: Comparison of algorithms for fuzzy decision tree induction. ICETA 2020—18th IEEE International Conference on Emerging eLearning Technologies and Applications, Proceedings, pp. 544–551 (2020), https://doi.org/10.1109/ICETA51985.2020.9379189 26. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65(4), 1710–1723 (2016). https://doi.org/10.1109/TR.2016.2578948 27. Zaitseva, E., Levashenko, V.: Reliability analysis of multi-state system and multiple-valued logic. Int. J. Qual. Reliab. Manag. 34(6), 862–878 (2017)

Wildfire Risk Assessment Using Earth Observation Data: A Case Study of the Eastern Carpathians at the Slovak-Ukrainian Frontier

Sergey Stankevich, Elena Zaitseva, Anna Kozlova, and Artem Andreiev

Abstract Wildfires, as a global phenomenon, are one of the primary sources of environmental and social disturbances. To prevent and mitigate the damaging effects of fires in Europe, a great concern is focused on fire risk assessment at the Pan-European scale. The research proposes a conceptual approach for fire risk assessment using multiscale multi-temporal Earth observation data. The elaborated methodology integrates time series of components associated with the fire process, and risk analysis is performed according to hazard functions. The fuel load was restored by multivariate regression, with the 1 km FirEUrisk fuel types map and leaf area index obtained from 10 m Sentinel-2. Meteorological conditions were restored from soil moisture modelling based on 10 m radar Sentinel-1, optical Sentinel-2, and ALOS DEM data, as well as 1 km MODIS land surface temperature. The proposed approach was tested on natural and semi-natural mountain landscapes of the Eastern Carpathians at the Slovak-Ukrainian frontier. The resulting map of fire risk was referred to four gradations: high, existential, potential, and no risk. The result was validated using the FIRMS NASA fire data archive.

Keywords Risk analysis · Earth observation data · Time series · Fire disturbance · The Eastern Carpathians

S. Stankevich (B) · A. Kozlova · A. Andreiev
Scientific Centre for Aerospace Research of the Earth, NAS of Ukraine, Kyiv, Ukraine
e-mail: [email protected]
A. Andreiev
e-mail: [email protected]
E. Zaitseva
Department of Informatics, University of Žilina, Žilina, Slovakia
e-mail: [email protected]


1 Introduction

Wildfires, as a global phenomenon, are one of the primary sources of environmental and social disturbances [1, 2]. A wildfire is considered as any unplanned, unwanted fire that burns in and consumes natural fuels: forest, brush, and grass [3]. According to the fire environment concept [4], weather, topography and fuel are defined as the key elements influencing fire initiation, propagation and effects. Earth observation data and derived data products are widely applied to obtain temperature, relative humidity, fine fuel [5] and soil moisture [6], slope and elevation [7], and fuel types and productivity [8] for assessing fire initiation drivers. To prevent and mitigate the damaging effects of fires in Europe, a great concern is focused on fire risk assessment at the Pan-European scale [9]. The complexity of the problem of wildfire risk assessment is caused by the simultaneous influence of a large number of drivers of different physical nature. Direct awareness, adjustment and mutual coupling of these drivers are impossible. The problem is further exacerbated by the reliance on a joint time series analysis of the various wildfire drivers. Thus, it is necessary to (i) develop an approach to unifying and fusing the impact of individual drivers of different physical nature; (ii) consolidate the separate time series of individual drivers into a single time series of joint risk assessment; (iii) define a rule for the current risk assessment, taking into account the historical data.

2 Risk Assessment Methodology

2.1 Approach Concept

From the general definition of risk [10, 11], it follows that its key component is the occurrence likelihood of the incident under analysis. Therefore, it is convenient and correct to come down from physical risk drivers to a probabilistic description of them. A distinction of time series analysis for risk assessment is the dependence on the relative rather than the absolute value of the risk driver. Different kinds of normalization can be used to equalize the risk driver values [12]. However, the discrete ratio of time samples ri = xi/xi−1 is most consistent with this insight [13]. At the same time, the profound asymmetry of the ri ratio leads to the need for linearization. An effective tool for this purpose is logarithmation [14]:

li = ln ri = ln xi − ln xi−1    (1)

The probability Pi of a negative change within the time series and, accordingly, an increase in risk will be:

Pi ≅ Φ(li/σl), if increasing the value of the driver x negatively affects the situation;
Pi ≅ 1 − Φ(li/σl), otherwise,    (2)

where Φ(·) is the cumulative distribution function of the standard normal distribution, and σl is the standard deviation of the detrended time series (1) [15].
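A small sketch of how Eqs. (1) and (2) could be evaluated for one driver time series; mean removal is used here as a simple stand-in for the detrending step, which is an assumption.

```python
import numpy as np
from scipy.stats import norm

def driver_risk_probabilities(x, negative_when_increasing=True):
    """Per-step probabilities of a risk increase for one driver series x, Eqs. (1)-(2)."""
    x = np.asarray(x, dtype=float)
    l = np.diff(np.log(x))                     # l_i = ln x_i - ln x_{i-1}, Eq. (1)
    sigma_l = np.std(l - l.mean(), ddof=1)     # std of the (mean-detrended) series
    p = norm.cdf(l / sigma_l)                  # Phi(l_i / sigma_l)
    return p if negative_when_increasing else 1.0 - p

# Example: for soil moisture a *decrease* raises the risk, so the second branch applies
# p_ssm = driver_risk_probabilities(ssm_series, negative_when_increasing=False)
```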

2.2 Applying Earth Observation Data

The analysis shows that the main drivers affecting the risk of wildfire are the available volume of natural fuels (NFV), surface soil moisture (SSM) and land surface temperature (LST) [16–18]. Technically, remote sensing data products already exist for all of the above-named key variables [19, 20]. The main drawback of the available data products is the coarse spatial resolution, usually 1 km or worse, which is clearly insufficient for local-scale studies. A unique feature of our approach in the current research is a special technique for spatial resolution enhancement based on regression approximation over a deliberately selected robust sample of low-resolution data [21]. For this purpose, the 10 m resolution Sentinel-1 GRD dual-polarization radar product was designated as the primary source of high-resolution remotely sensed data. In addition to good correlation with vegetation structure [22] and SSM [23], Sentinel-1 radar data are independent of cloud cover, i.e., they do not disrupt the continuity of the observation time series. For example, the approximating dependence of the NFV variable F on the leaf area index (LAI) value v derived from Sentinel-2 [24] is shown in Fig. 1. Inputs for the wildfire simulations required maps of surface and canopy fuels. The most commonly used fuel classification systems are the Northern Forest Fire Laboratory (NFFL) system [25], the Fire Behaviour Fuel Models (FBFM) [26], and the Fuel Characteristic Classification System (FCCS) [27], all created for the United States; the Canadian Fire Behaviour Prediction System [28]; the FirEUrisk hierarchical fuel classification system [8]; and the Mediterranean-European Prometheus system [29], created for the European Union. Fuel maps derived from these classifications are of regional scale.
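The chapter does not state the analytic form of the NFV-versus-LAI approximation shown in Fig. 1, so the sketch below simply fits an assumed low-order polynomial to the robust low-resolution sample as an illustration of the resolution-enhancement regression; all variable names are hypothetical.

```python
import numpy as np

def fit_nfv_from_lai(lai_samples, nfv_samples, degree=2):
    """Fit an assumed polynomial F(v) approximating NFV from LAI."""
    coeffs = np.polyfit(lai_samples, nfv_samples, deg=degree)
    return np.poly1d(coeffs)

# model = fit_nfv_from_lai(lai_coarse, nfv_coarse)   # robust low-resolution sample (assumed names)
# nfv_10m = model(lai_sentinel2_10m)                 # apply at 10 m Sentinel-2 resolution
```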

2.3 Risk Evaluation

Having the partial probabilities Pi^(j) (2) for each j-th risk driver, j = 1, …, m, where m is the number of risk drivers, at each i-th time slice i = 1, …, n, it is possible to fuse these probabilities into a single joint probability Pi. In this case, we applied Bayesian renormalization for the probabilities' fusion:


Fig. 1 NFV versus LAI approximation with Sentinel-2 product

Pi = ∏(j=1…m) Pi^(j) / ( ∏(j=1…m) Pi^(j) + ∏(j=1…m) (1 − Pi^(j)) )    (3)

(4)

The second term in (4) describes the time series dynamics [30]. Quite strong smoothing was applied to the probability time series to reduce the outliers’ effects [31].

Wildfire Risk Assessment Using Earth Observation Data: A Case Study …

135

3 A Case Study of the Eastern Carpathians at the Slovak-Ukrainian Frontier 3.1 Study Area The study area is located at the border between northeastern Slovakia and western Ukraine and belongs to the Eastern Carpathians (Fig. 2, left panel). Comprising two protected areas, Poloniny National Park in Slovakia and Uzhansky National Nature Park in Ukraine (Fig. 2, right panel), this area is of particular concern for wildfire risk assessment as a recognized biodiversity hotspot within the European temperate zone [32, 33]. Selected areas of the parks are included in Primaeval Beech Forests of the Carpathians UNESCO World Heritage Site [34]. Forests cover about 80% of the area. Alpine meadows called poloniny, and valley and foothill grasslands represent other wildlands of the area. They are the most vulnerable areas to fire ignition, especially in dry periods of early spring and late fall [35].

3.2 Fuels Data Fuel loads for particular fuel types specific to the study area were obtained from the European fuel map crosswalked to the FBFM (Fire Behavior Fuel Models). This is a raster layer representing the first-level fuel types of the FirEUrisk fuel classification system for the continental scale at 1 km [8]. FBFM standard fuel models were chosen, as this system is widely used and flexible [26]. Fuel types selected for the study area are shown in Table 1. Parameters of the FBFM used for the crosswalk to the first-level FirEUrisk selected fuel types are shown in Table 2.

Fig. 2 Location of the study area and the distribution of fire hotspots from the Fire Information for Resource Management System (FIRMS), 01/01/2022–30/06/2022


Table 1 Suggested attribution of the first-level FirEUrisk fuel types to the FBFM standard fuel models in Europe [8]

FirEUrisk fuel type | 1st level fuel type
1121 | Broadleaf deciduous open (15–70%) forest
1122 | Broadleaf deciduous closed (70–100%) forest
1212 | Needleleaf evergreen closed (70–100%) forest
1301 | Mixed open (15–70%) forest
1302 | Mixed closed (70–100%) forest
31 | Low (0–0.3 m) Grassland
32 | Medium (0.3–0.7 m) Grassland
41 | Herbaceous cropland
61 | Continuous fabric: urban fabric (≥80%)
62 | Discontinuous fabric: vegetation and urban fabric (15–80%)
7 | Nonfuel

Table 2 Fuel loads by the standard fuel models of FBFM used for the crosswalk to the first-level FirEUrisk fuel types

FirEUrisk fuel type | Dead fuel load, t ha−1 (1 h) | Dead fuel load, t ha−1 (10 h) | Dead fuel load, t ha−1 (100 h) | Live fuel load, t ha−1 (Herb) | Live fuel load, t ha−1 (Woody) | Fine fuel load, t ha−1 (FirEUrisk)
1121 | 10.09 | 5.49 | 0.00 | 3.47 | 15.69 | 29.25
1122 | 2.47 | 0.34 | 0.56 | 1.46 | 2.47 | 6.4
1212 | 2.13 | 4.04 | 2.80 | 0.00 | 0.45 | 2.58
1301 | 4.60 | 7.62 | 1.91 | 0.00 | 9.75 | 14.35
1302 | 1.12 | 4.93 | 6.28 | 0.00 | 0.00 | 1.12
31 | 0.22 | 0.00 | 0.00 | 7.62 | 0.00 | 7.84
32 | 1.12 | 2.24 | 0.00 | 16.36 | 0.00 | 17.48
41 | 0.22 | 0.00 | 0.00 | 7.62 | 0.00 | 7.84
61 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
62 | 1.01 | 6.73 | 0.00 | 0.00 | 13.90 | 14.91
7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00

3.3 Earth Observation Data Time Series According to the task, the required parameters are the LST product, LAI for NFV, slope values and Sentinel-1 data for SSM. Land surface temperature. MODIS products were selected to obtain land surface temperature values over the study area. To obtain a cloud-free composite, a set of more than one image is necessary.


Thus, the decision was made to use the 8-day per-pixel mosaic of Land Surface Temperature and Emissivity with a 1 km spatial resolution [36]. Five time periods were used, namely 2022-03-07 to 2022-03-14, 2022-03-15 to 2022-03-22, 2022-03-31 to 2022-04-07, 2022-05-10 to 2022-05-17, and 2022-06-11 to 2022-06-18. LAI. The LAI values are available from Sentinel-2 multispectral instrument (MSI) imagery. The preprocessing of this imagery involves two steps. Firstly, the atmospheric correction is performed. This procedure transforms top-of-atmosphere reflectance (TOA, Level 1C) into bottom-of-atmosphere reflectance (BOA, Level 2A) using the Sen2Cor tool (https://step.esa.int/main/snap-supportedplugins/). Secondly, all bands are resampled to 10 m resolution by the nearest neighbour method. After the preprocessing, the LAI layer is calculated with the SNAP software [37]. To fulfil the task requirements, the imagery was downloaded and processed for the following five dates: 2022-03-12, 2022-03-22, 2022-03-27, 2022-05-16, and 2022-06-18. Sentinel-1. The Sentinel-1 imagery is required to obtain local incidence angle values. This product is preprocessed with the SNAP software [38]. Imagery for the following dates was downloaded: 2022-03-11, 2022-03-23, 2022-04-04, 2022-05-10, 2022-05-22, and 2022-06-15. Slope. Slope values are obtained from the Shuttle Radar Topography Mission (SRTM) digital elevation data [39]. This service provides imagery with a spatial resolution of 30 m. However, since the data are available only for the year 2000, one image is used for the whole dataset.

3.4 Risk Map The consistent implementation of the proposed methodology results in the map of wildfire risk (Fig. 3). The nominal risk values were assigned taking into account both the latest value of the fused probability of wildfire (low, medium, high) and the derivative of this probability (positive or negative) averaged over the entire observation period [40]. Each pixel of this map presents one of the six gradations of wildfire risk assessed by fusing the partial probabilities of the risk drivers into a single joint probability. The results acquired from FIRMS (https://firms.modaps.eosdis.nasa.gov) for the period from 1st January 2022 till 30th June 2022 were visualized [41] and superimposed over the wildfire risk map (Fig. 4) to validate the proposed risk assessment methodology by relating the assessed risk to the real fire events (Table 3).
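The mapping from the fused probability and its trend to the six risk gradations could, in a simplified form, look like the sketch below. The probability thresholds and most of the gradation names are assumptions; only "Critical", "Significant" and "Moderate" appear in Table 3.

```python
# Simplified mapping from the fused probability time series to one of six
# wildfire risk gradations: the level of the latest value (low/medium/high)
# combined with the sign of the averaged derivative.
import numpy as np

GRADATIONS = {
    ("high", True): "Critical",      ("high", False): "High",
    ("medium", True): "Significant", ("medium", False): "Moderate",
    ("low", True): "Low",            ("low", False): "No risk",
}

def risk_gradation(prob_series, low=0.33, high=0.66):
    p_last = prob_series[-1]
    rising = np.mean(np.diff(prob_series)) > 0.0        # averaged derivative sign
    level = "high" if p_last >= high else ("medium" if p_last >= low else "low")
    return GRADATIONS[(level, rising)]

print(risk_gradation(np.array([0.35, 0.42, 0.55, 0.61, 0.70])))   # -> "Critical"
```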


Fig. 3 Map of fire risk within the study area. The risk of wildfire is divided into six gradations, from critical to no risk

4 Discussion In general, the obtained maps show a fairly high match rate with the actual forest fires observed by FIRMS (Fig. 4). Predicted risk levels for all detected fires do not fall below “Moderate”. Moreover, this comparison did not take into account the probability of wildfire detection by FIRMS, a system which is itself not perfect [42]. In addition, the satellite data from the low-resolution MODIS and VIIRS sensors used in this system make it possible to reliably detect only large fires, several hundred meters in size [43]. Although there are known developments for subpixel detection of wildfires from VIIRS data [44], many modern studies focus on wildfire detection using medium-resolution sensors, such as Landsat 8 OLI/TIRS and Sentinel 2 MSI [45, 46]. Nevertheless, any wildfire risk assessment is intrinsically prone to considerable uncertainty. The main sources of such uncertainty are errors in determining the amount of natural fuel within the land cover classes (Table 2), inaccuracies in the land cover classification (Table 1) and a significant variation in the NFV approximation by satellite measurements (Fig. 1). In addition, significant uncertainties are induced by errors in the satellite data and derived products used in the analysis of the previous history of ground conditions, as well as in the extraction of trends from the probabilistic parameters of wildfire risk (4).


Fig. 4 Fire spots from FIRMS (01/01/2022–30/06/2022) over the resulting wildfire risk map

Table 3 Correspondence of the assessed risk to the real fire events

Fire spot (real event) | Date | Time | Gradation of risk (assessed)
1 | 13.04.2022 | 11:48 | Significant
2 | 14.04.2022 | 10:41 | Critical
3 | 14.04.2022 | 10:41 | Critical
4 | 14.04.2022 | 10:41 | Critical
5 | 14.04.2022 | 12:21 | Moderate
6 | 14.04.2022 | 12:21 | Moderate
7 | 14.04.2022 | 12:21 | Moderate
8 | 14.04.2022 | 12:21 | Moderate

To enhance the reliability of the output wildfire risk map, the following measures are possible. (1) A more accurate, ground-verified natural fuel map of Ukraine, similar to the FirEUrisk one, must be developed and implemented; at present no such map exists. (2) To assess the ground preconditions for wildfire occurrence, it is necessary to take into account the current physical parameters not only of the land surface but also of the atmosphere, i.e. the weather conditions within the observed territory [47]. Such studies have been conducted since the 1970s.


In particular, the Forest Fire Danger Rating [48], the Fire Weather Index (FWI) [49] and many other varieties of weather indices [50] are widely used for this purpose around the world. Numerous attempts have been made to improve the predictive power of weather indicators, primarily the FWI [51, 52]. (3) Socio-economic prerequisites, such as proximity to cities and culturally significant objects, transport accessibility, recreational potential, protected-area restrictions, etc., must be taken into account as drivers of fire statistics, primarily of the probability of a fire [53]. (4) An important tool for improving the confidence of wildfire risk mapping is multi-cause geospatial modelling based on complex, comprehensive spatio-temporal scenarios [54].

5 Conclusions Thus, the proposed methodology for wildfire risk assessment based on time series of multisource and multiscale EO data has demonstrated promising results. It can be recommended to analysts and decision-makers in forest management and conservation. A particular advantage of our approach is its flexibility and versatility. The transformation of heterogeneous physical drivers into governing probabilities allows the formation and correct fusion of time series of remote, ground-based and census observations. In this way, it is possible to set aside the insignificant interrelations of wildfire drivers and to focus precisely on risk forecasting. Acknowledgements This work was supported by the Slovak Research and Development Agency, Slovakia under grant “Risk assessment of environmental disturbance using Earth observation data” (reg.no. SK-UA-21-0037).

References 1. Bowman, D.M.J.S., Balch, J.K., Artaxo, P., Bond, W.J., Carlson, J.M., Cochrane, M.A., D’Antonio, C.M., DeFries, R., Doyle, J., Harrison, S.P., Johnston, F.H., Keeley, J.E., Krawchuk, M.A., Kull, C.A., Marston, J.B., Moritz, M.A., Prentice, I.C., Roos, C.I., Scott, A.C., Swetman, T.W., Van der Werf, Pyne, S.G.: Fire in the earth system. Science 324(5926), 481–484 (2009). https://doi.org/10.1126/science.1163886 2. Thonicke, K., Venevsky, S., Sitch, S., Cramer, W.: The role of fire disturbance for global vegetation dynamics: coupling fire into a dynamic global vegetation model. Glob. Ecol. Biogeogr. 10(6), 661–677 (2008). https://doi.org/10.1046/j.1466-822x.2001.00175.x 3. CIFFC. Canadian Wildland Fire Glossary. In Canadian Interagency Forest Fire Centre (2022). https://ciffc.ca/sites/default/files/2022-03/CWFM_glossary_EN.pdf 4. Countryman, C.M.: The fire environment concept, USDA Forest Service, Pacific Southwest Range and Experiment Station, Berkeley, California, USA (1972). https://www.frames.gov/doc uments/behaveplus/publications/Countryman_1972_TheFireEnvironmentConcept_ocr.pdf 5. Kussul, N., Fedorov, O., Yailymov, B., Pidgorodetska, L., Kolos, L., Yailymova, H., Shelestov, A.: Fire danger assessment using moderate-spatial resolution satellite data. Fire 6(72), 1–13 (2023). https://doi.org/10.3390/fire6020072


6. Svideniuk, M.: Methodology for determining the physical parameters of ground plane by the results of the optical and radar data fusion. Ukrainian J. Remote Sensing 8(3), 4–26 (2021). https://doi.org/10.36023/ujrs.2021.8.3.197 7. Adaktylou, N., Stratoulias, D., Landenberger, R.: Wildfire risk assessment based on geospatial open data: application on Chios. Greece. ISPRS Int. J. Geo-Inf. 9, 516 (2020). https://doi.org/ 10.3390/ijgi9090516 8. Aragoneses, E., García, M., Salis, M., Ribeiro, L., Chuvieco, E.: Classification and mapping of European fuels using a hierarchical-multipurpose fuel classification system. Earth Syst. Sci. Data (2022). https://doi.org/10.5194/essd-2022-184 9. Oom, D., de Rigo, D., Pfeiffer, H., Branco, A., Ferrari, D., Grecchi, R., Artés-Vivancos, T., Houston Durrant, T., Boca, R., Maianti, P., Libertá, G., San-Miguel-Ayanz, J., et al.: PanEuropean wildfire risk assessment, EUR 31160 EN, Publications Office of the European Union, Luxembourg, 2022, ISBN: 978-92-76-55137-9 (2022). https://doi.org/10.2760/9429 10. Renn, O., Ortleb, J., Benighaus, L., Benighaus, C.: Risks. In: Pechan, P., Renn, O., Watt, A., Pongratz, I. (Eds.) Safe or Not Safe, pp. 1–40. Springer, New York (2011). https://doi.org/10. 1007/978-1-4419-7868-4 11. Šoti´c, A., Raji´c, R.: The review of the definition of risk. Online J. Appl. Knowl. Manag. 3(3), 17– 26 (2011). http://www.iiakm.org/ojakm/articles/2015/volume3_3/OJAKM_Volume3_3pp1726.pdf 12. Prado, V., Wender, B.A., Seager, T.P.: Interpretation of comparative LCAs: external normalization and a method of mutual differences. Int. J. Life Cycle Assess. 22(12), 2018–2029 (2017). https://doi.org/10.1007/s11367-017-1281-3 13. Dan, J., Shi, W., Dong, F., Hirota, K.: Piecewise trend approximation: a ratio-based time series representation. Abstract Appl. Anal., 603629 (2013). https://doi.org/10.1155/2013/603629 14. Mills, I., Morfey, C.: On logarithmic ratio quantities and their units. Metrologia 42(4), 246–252 (2005). https://doi.org/10.1155/2013/603629 15. Vamo¸s, C., Cr˘aciun, M.:vSerial correlation of detrended time series. Phys. Rev. E 78(3), 036707 (2008). https://doi.org/10.1103/PhysRevE.78.036707 16. Molaudzi, O.D., Adelabu, S.A.: Review of the use of remote sensing for monitoring wildfire risk conditions to support fire risk assessment in protected areas. South African J. Geomatics 7(3), 222–242 (2018). https://doi.org/10.4314/sajg.v7i3.3 17. Arellano-Pérez, S., Castedo-Dorado, F., López-Sánchez, C.A., González-Ferreiro, E., Yang, Z., Díaz-Varela, R.A., Álvarez-González, J.G., Vega, J.A., Ruiz-González, A.D.: Potential of Sentinel-2A data to model surface and canopy fuel characteristics in relation to crown fire hazard. Remote Sensing 10(10), 1645 (2018). https://doi.org/10.3390/rs10101645 18. Maffei, C., Lindenbergh, R., Menenti, M.: Combining multi-spectral and thermal remote sensing to predict forest fire characteristics. ISPRS J. Photogramm. Remote. Sens. 181, 400–412 (2021). https://doi.org/10.1016/j.isprsjprs.2021.09.016 19. Ryan, K.C., Opperman, T.S.: LANDFIRE—a national vegetation/fuels data base for use in fuels treatment, restoration, and suppression planning. For. Ecol. Manage. 294, 208–216 (2012). https://doi.org/10.1016/j.foreco.2012.11.003 20. Yu, P., Zhao, T., Shi, J., Ran, Y., Jia, L., Ji, D., Xue, H.: Global spatiotemporally continuous MODIS land surface temperature dataset. Scientific Data 623(9), 143 (2022). https://doi.org/ 10.1038/s41597-022-01214-8 21. 
Zaitseva., E., Stankevich, S., Kozlova, A., Piestova, I., Levashenko, V., Rusnak, P.: Assessment of the risk of disturbance impact on primeval and managed forests based on Earth observation data using the example of Slovak Eastern Carpathians. IEEE Access 9, 162847–162856 (2021). https://doi.org/10.1109/ACCESS.2021.3134375 22. Kötz, B., Schaepman, M., Morsdorf, F., Bowyer, P., Itten, K., Allgöwer, B.: Radiative transfer modeling within a heterogeneous canopy for estimation of forest fire fuel properties. Remote Sens. Environ. 92(3), 332–344 (2004). https://doi.org/10.1016/j.rse.2004.05.015 23. Garkusha, I.N., Hnatushenko, V.V., Vasyliev, V.V.: Using Sentinel-1 data for monitoring of soil moisture. Proceedings of the 37th International Geoscience and Remote Sensing Symposium (IGARSS 2017). Fort Worth: IEEE, pp. 1656–1659 (2017). https://doi.org/10.1109/IGARSS. 2017.8127291


24. Weiss, M., Baret, F., Jay, S.: S2ToolBox Level 2 products: LAI, FAPAR, FCOVER. Version 2.1 (2020). https://step.esa.int/docs/extra/ATBD_S2ToolBox_V2.1.pdf 25. Anderson, H.: Aids to determining fuel models for estimating fire behavior, US Department of Agriculture, Forest Service, Intermountain Forest and Range Experiment Station, Washington, DC, USA (1982) 26. Scott, J.H., Burgan, R.E.: Standard fire behavior fuel models: a comprehensive set for use with Rothermel’s surface fire spread model. Gen. Tech. Rep. RMRS-GTR-153. Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Stationn, 72 (2005). https://doi.org/10.2737/rmrs-gtr-153 27. Ottmar, R.D., Sandberg, D.V., Riccardi, C.L., Prichard, S.J.: An overview of the fuel characteristic classification system—quantifying, classifying, and creating fuelbeds for resource planning. Can. J. For. Res. 37(12), 2383–2393 (2007) 28. Forestry Canada Fire Danger Group: Development and structure of the Canadian fire behaviour prediction system, Forestry Canada, Inf. Repor, Ottawa, 63 (1992) 29. European Commission: Prometheus, S.V. Project. Management Techniques for Optimisation of Suppression and Minimization of Wildfire Effect, European Commission Contract Number ENV4-CT98–0716 (1999) 30. Rabcan, J., Rusnak, P., Zaitseva, E., Macekova, D., Kvassay M., Sotakova, I.: Analysis of data reliability based on importance analysis. Proceedings of International Conference on Information and Digital Technologies (IDT 2019). Žilina: IEEE, pp. 402–408 (2019). https://doi. org/10.1109/DT.2019.8813668 31. Zhang, A., Song, S., Wang, J., Yu, P.: Time series data cleaning: from anomaly detection to anomaly repairing. Proc. VLDB Endowment 10(10), 1046–1057 (2017). https://doi.org/10. 14778/3115404.3115410 32. Jovanovi´c, I., Dragiši´c, A., Ostoji´c, D., Krsteski, D.: Beech forests as world heritage in aspect to the next extension of the ancient and primeval beech forests of the Carpathians and other regions of Europe world heritage site. Zastita Prirode. 69(1–2), 5–32 (2019). https://doi.org/ 10.5937/zaspri1901015j 33. Rehush, N., Waser, L.T.: Assessing the structure of primeval and managed beech forests in the Ukrainian Carpathians using remote sensing. Can. J. For. Res. 47(1), 63–72 (2017). https://doi. org/10.1139/cjfr-2016-0253 34. IUCN World Heritage Outlook: 2020 Conservation Outlook Assessment. Ancient and Primeval Beech Forests of the Carpathians and Other Regions of Europe, Albania, Austria, Belgium, Bulgaria, Croatia, Italy, Germany, Romania, Slovenia, Slovakia, Spain, Ukraine (2020). https:// worldheritageoutlook.iucn.org 35. Zibtsev, S.V., Soshenskyi, O.M., Myroniuk, V.V., Gumeniuk, V.V.: Wildfire in Ukraine: an overview of fires and fire management system. Ukrainian J. Forest Wood Sci. 11(2), 15–31 (2020). https://doi.org/10.31548/forest2020.02.015 36. Wan, Z., Hook, S., Hulley, G.: MODIS/terra land surface temperature/emissivity 8-Day L3 global 1 km SIN grid V061. NASA EOSDIS Land Process. DAAC (2021). https://doi.org/10. 5067/MODIS/MOD11A2.061 37. Sentinel 2 Toolbox // ESA Science toolbox exploration platform Available at: http://step.esa. int/main/toolboxes/sentinel-2-toolbox/ 38. Sentinel 1 Toolbox // ESA Science toolbox exploration platform Available at: http://step.esa. int/main/toolboxes/sentinel-1-toolbox/ (accessed April 8, 2018) 39. 
Farr, T.G., Rosen, P.A., Caro, E., Crippen, R., Duren, R., Hensley, S., Kobrick, M., Paller, M., Rodriguez, E., Roth, L., Seal, D., Shaffer, S., Shimada, J., Umland, J., Werner, M., Oskin, M., Burbank, D., Alsdorf, D.E.: The shuttle radar topography mission: Rev. Geophy. 45(2), RG2004 (2007). https://doi.org/10.1029/2005RG000183 40. Messenger, R., Mandell, L.: A modal search technique for predictive nominal scale multivariate analysis. J. Amer. Stat. Assoc. 67(340) (1972). https://doi.org/10.1080/01621459.1972.104 81290 41. Arsirii, O.O., Krachunov, H.A., Smyk, S.Y., Troianovska, Y.L.: Methods of analysis and visualization of active fires and burnt areas of geospatial data. Herald Adv. Inf. Technol. 5(1), 62–73 (2022). https://doi.org/10.15276/hait.05.2022.6


42. Ying, L., Shen, Z., Yang, M., Piao, S.: Wildfire detection probability of MODIS fire products under the constraint of environmental factors: a study based on confirmed ground wildfire records. Remote Sensing 11(24), 3031 (2019). https://doi.org/10.3390/rs11243031 43. Hantson, S., Padilla, M., Corti, D., Chuvieco, E.: Strengths and weaknesses of MODIS hotspots to characterize global fire occurrence. Remote Sens. Environ. 13(1), 152–159 (2013). https:// doi.org/10.1016/j.rse.2012.12.004 44. Elvidge, C.D., Zhizhin, M., Hsu, F.C., Sparks, T., Ghosh, T.: Subpixel analysis of primary and secondary infrared emitters with nighttime VIIRS data. Fire 4(4), 83 (2021). https://doi.org/ 10.3390/fire4040083 45. Gülc˙I, S., Yüksel, K., Gümü¸s, S., W˙Ing, M. Mapping wildfires using Sentinel 2 MSI and Landsat 8 imagery: spatial data generation for forestry. Europ. J. Forest Eng. 7(2), 57–66 (2021). https:// doi.org/10.33904/ejfe.1031090 46. Hu, X., Ban, Y., Nascetti, A.: Sentinel-2 MSI data for active fire detection in major fire-prone biomes: a multi-criteria approach. Int. J. Appl. Earth Observ. Geoinformation 101, 102347 (2021). https://doi.org/10.1016/j.jag.2021.102347 47. Todorova, E., Zhiyanski, M.K., Todorov, L.: Using high precision climate data for wildfire risk assessment. Silva Balcanica. 24(1), 5–16 (2023). https://doi.org/10.3897/silvabalcanica. 24.e101192 48. Wotton, B.M.: Interpreting and using outputs from the Canadian forest fire danger rating system in research applications. Environ. Ecol. Stat. 16(2), 107–131 (2009). https://doi.org/10.1007/ s10651-007-0084-2 49. Varela, V., Sfetsos, A., Vlachogiannis, D., Gounaris, N.: Fire Weather Index (FWI) classification for fire danger assessment applied in Greece. Tethys J. Mediterranean Meteorol. Climatol. 15, 31–40 (2018). https://doi.org/10.3369/tethys.2018.15.03 50. Zacharakis, I., Tsihrintzis, V.A.: Environmental forest fire danger rating systems and indices around the globe: a review. Land 12(1), 194 (2023). https://doi.org/10.3390/land12010194 51. Júnior, J.S.S., Pãulo, J., Mendes, J., Alves, D., Ribeiro, L.M.: Automatic calibration of forest fire weather index for independent customizable regions based on historical records. Proceedings of Third International Conference on Artificial Intelligence and Knowledge Engineering (AIKE 2020). Laguna Hills: IEEE, 1–8 (2020). https://doi.org/10.1109/AIKE48582.2020.00011 52. Pinto, M.M., DaCamara, C.C., Hurduc, A., Trigo, R.M., Trigo, I.F.: Enhancing the fire weather index with atmospheric instability information. Environ. Res. Lett. 15(9), 0940b7 (2020). https://doi.org/10.1088/1748-9326/ab9e22 53. Vigna, I., Besana, A., Comino, E., Pezzoli, A.: Application of the socio-ecological system framework to forest fire risk management: a systematic literature review. Sustainability 13(4), 2121 (2021). https://doi.org/10.3390/su13042121 54. Castel-Clavera, J., Pimont, F., Opitz, T., Ruffault, J., Rivière, M., Dupuy, J.-L.: Disentangling the factors of spatio-temporal patterns of wildfire activity in south-eastern France. Int. J. Wildland Fire 32(1), 15–28 (2023). https://doi.org/10.1071/WF22086

Digital Safety Delivery: How a Safety Management System Looks Different from a Data Perspective Paul Singh and Coen van Gulijk

Abstract The BowTie diagram is a graphical visualisation tool. Its purpose is to describe the structure of the safety management system in place to prevent threats from realising a top event and to mitigate the consequences if a top event were to become reality. The barriers on a BowTie diagram are traditionally shown as simple barriers (detect, diagnose, act) but in reality are configured as complex systems. Time-series data from processing environments, together with appropriate analytical tools, can be used to monitor and report the health of barriers and to reflect upon the whole safety management system. Process Safety Performance Indicators can then be reported on a real or near-real time basis. These indicators would become more transparent from operational personnel up to senior management levels, thereby increasing the understanding of the health of the safety management system. Due to the real or near-real time reporting with time-series data analytical tools, management can be informed of the process status, which could lead to timely decisions and actions. Process health can be monitored and reported on active dashboards with greater reliability and accuracy. Keywords BowTie · Barriers · Complex barriers · Time-series data · Big data · Process safety performance indicators · Safety management systems

1 Introduction Big Data can lead organisations into a new frontier for digitalisation and analysing the performance of safety management systems on process sites. This chapter discusses the background and importance of Big Data and its application to process safety P. Singh (B) School of Applied Sciences, University of Huddersfield, Huddersfield, UK e-mail: [email protected] C. van Gulijk The Netherlands Organization, Leiden, Netherlands e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_10


which leads to how safety management systems are currently represented on site and how those representations may be updated with the use of time-series data. The barriers depicted on site, in the traditional sense, are not as simple as they seem. In reality barriers are complex and that reality can be monitored with the use of Big Data and reflected on an updated form of an existing tool, the BowTie diagram. “Data is the new oil” [1]. The World Economic Forum also declared data as an asset [2] which leads to Big Data being vital in organisational competitiveness [3]. Currently, there is no agreed upon definition for Big Data [3]. Many sources state that Big Data includes data with the following characteristics:

1. Volume
2. Variety
3. Veracity
4. Value
5. Velocity.

Process data analytics are in their early stages within industry [4]. Some software tools, such as Microsoft Excel and Power BI [3, 5], have pitfalls in handling Big Data, or require the data to be manipulated before the time-series data generated from the process can be handled. The amount of data generated by many processing organisations is massive and many “process historian databases are massive”. Data manipulation with some software tools may be difficult because time-series data contain a variety of data formats, such as numerical, Boolean, string and array. Despite concerted efforts by many organisations, safety incidents, abnormal occurrences and accidents still occur in industry. Perhaps with the use of Big Data, these negative safety situations can be avoided going forward. On March 23, 2005, the BP Texas City Refinery incident killed 15 workers and injured many more [6]. The site had a history of issues which were not investigated by BP [6, 7]. Recommendations from the 2007 Baker report [8] concerned elements of process safety management, especially process safety performance indicators (PSPIs) [9, 10]. The recommendations from the Baker report directly point to the development of a process safety management system with leading and lagging indicators. The use of Big Data can support the reporting of leading and lagging indicators and demonstrate the health and viability of a process safety management system. Certain lagging indicators, such as indicators which report the number of incidents that have occurred, may be converted to leading indicators. Examples of new leading indicators enabled by Big Data could relate to time-related indicators. There was an overfill of the petrol storage tank, Tank 912, at the Buncefield oil storage depot on December 11, 2005. As identified by the Health and Safety Executive (HSE) in the United Kingdom [11], failures in design and lack of maintenance of the overfill protection systems and mitigation measures were the technical causes of the explosion. Additional root causes were identified as “broader management failings”.


Following Buncefield, the Director of the Hazardous Installations Directorate, Gordon MacDonald, issued the following challenge to industry:
1. Do we understand what can go wrong?
2. Do we know what our systems are to prevent this happening?
3. Do we have information to assure us they are working effectively?
Could Big Data have been used to identify successful tests on the high-level switch or gauge? Yes, if the data was readily available and in a usable format. Could Big Data have been used to identify when maintenance was last carried out, in the form of representative key performance indicators (KPIs) or PSPIs? Yes, if the maintenance reports and their results were also readily available. Would a dashboard identifying key measurements and flowrates have supported staff in noticing if or when the tank would overfill? Once again, yes, if that information was easily accessible and presented. Could Big Data help industry answer those questions posed by the HSE in a clear and transparent manner? Yes, only if the systems were set up to show the information distinctly. In December 2007, another tragic explosion occurred, this time at T2 Laboratories, Inc. From the United States Chemical Safety and Hazard Investigation Board (CSB) investigation, the root cause was determined to be the failure to recognize the hazards of runaway reactions [12, 13]. In addition to the lack of knowledge and understanding of runaway reactions, the process did not contain an adequate cooling system and the pressure relief system was not capable of relieving excess pressure from the runaway reaction. Were there any early indicators that could have shown the loss of control of the reaction rate? A detailed analysis could have identified the existing barriers visualised on a BowTie diagram and Big Data could have been utilized to showcase the health of those barriers, as described in a recent publication by Singh et al. [14]. The Macondo, Deepwater Horizon drilling rig in the Gulf of Mexico experienced an explosion and fire on April 20, 2010 [15, 16]. The CSB investigation noted findings surrounding issues with safety culture, risk management, as well as leading and lagging indicators, to name a few. BP’s focus for the safety performance level relied upon injury rates as an indicator, which did not allow for additional measurement and monitoring of the safety performance [16]. The performance indicators should also provide a measure of the performance of the safety management system employed by the organization [15]. Following the analyses performed by Skogdalen et al. [16] and additional commentary provided by Manuele [15], data could have been utilized to support the monitoring of barrier health along with the potential of providing information around precursor incidents. How do we get safety value from safety management systems? That question was recognized a while ago [17]. How do we deal with the data? The data produced from site is time-series data, that is, a set of observations, each recorded at a specific time [18].


2 Data Analysis and BowTies 2.1 Time-Series Data Analysis Time-series data, as discrete sets from processing environments, can, with the use of appropriate tools, be connected in order to monitor and view how the data trend over time. Typically, processing environments utilise distributed control systems (DCS), which allow users to view the data over time; that data is subsequently stored in a data-lake archiving system. DCS systems do not possess the functionality to allow the user to manipulate the data as it is generated. Additional users, such as site engineers, would extract data from the data-lake and manipulate it for process performance indication using software tools such as spreadsheet applications. One specific software tool, Seeq, allows consolidation of data sources such that the time series are matched to allow a singular view of the datasets over time [19]. Seeq utilises an advanced calculation engine, running in the periphery, which facilitates trending, comparisons, and calculations using timestamps. It should be noted that this software does not transform the data at source. The time-series data can now be aligned and analysed with various analytical tools within applications. For the analyses, tools can be used to cleanse data and perform various calculations. For example, for a particular analysis, time-series data is retrieved for a specific measurement tag. To perform the analysis, the data may need to be narrowed down to ensure that the analysis reflects a specific part of the process, i.e., a specific stage in a process. Additional data sources would then be required to aid in the identification of a stage in the process. Conditions can then be created to allow the user to compare the specific measurement tag to a specified stage in the process. Now, further analyses, such as calculations to represent that data as a performance indicator, can be performed and the results displayed accordingly. If a pre-existing calculation tool is not present, then additional user-function features can be used to build in additional applications to aid in the analysis. Results from analyses can then be presented using a graphical user interface (GUI). For example, Seeq possesses a GUI named the Seeq Workbench facility [20]. With the use of a tool that allows for time-series data analyses, the health of the safety management system can be determined. However, what visualisation and analysis tools are available to identify and represent the structure of the safety management system itself?
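As a generic illustration of the analysis pattern just described (aligning a measurement tag with a process-stage signal and summarising it as an indicator), the following pandas sketch uses invented tag and stage names; it does not reproduce Seeq's interface, only the idea.

```python
# Hedged sketch: restrict a measurement tag to one process stage via a
# condition, then summarise it as a simple indicator. Tag names, the stage
# label and the indicator are illustrative assumptions.
import pandas as pd
import numpy as np

idx = pd.date_range("2023-01-01", periods=12, freq="h")
data = pd.DataFrame({
    "TI-100": np.r_[np.linspace(20, 85, 6), np.linspace(85, 30, 6)],  # temperature tag
    "stage":  ["charge"] * 4 + ["reaction"] * 5 + ["cooldown"] * 3,   # stage signal
}, index=idx)

# Condition: samples belonging to the "reaction" stage only.
reaction = data.loc[data["stage"] == "reaction", "TI-100"]

# Indicator: fraction of the stage spent above a (hypothetical) 80 degC limit.
indicator = (reaction > 80.0).mean()
print(f"Time above limit during reaction stage: {indicator:.0%}")
```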

2.2 BowTies BowTies are graphical analysis tools [21]. These graphical tools allow the designer to create a model of the hazards and top events that arise from specific threats which lead to certain consequences. On the diagram, preventative and mitigating barriers


Fig. 1 Sample BowTie diagram [14]

are identified [22]. In 2018, Hughes et al. demonstrated that BowTies could be augmented with unstructured Big Data from management systems using a natural language approach based on ontologies [23, 24]. The major hazards and top-level events are noted centrally on the diagram. A sample BowTie is shown in Fig. 1. Barriers identified on the BowTie diagram can then be classified further.

2.3 Complex Barriers There are passive, active, symbolic or human barriers [25]. Passive barriers are permanent and always in use; active barriers, as defined by the Center for Chemical Process Safety (CCPS), are activated when conditions are met based on three criteria (detect, decide, act); symbolic barriers are warning signs; and human barriers pertain to the operator. To be classified as a barrier, according to the CCPS, three functions must exist: detect (the threat), decide (logic decision) and act (intervene) [21]. For example, a thermocouple is connected to a controller with pre-programmed logic, which is then connected to a final element, typically a valve. Are barriers as simple as they are defined? Short answer, no. From the analysis performed on a process, the traditional definition of a barrier as stated by the CCPS was considered.


Fig. 2 Actual barrier arrangement [14]

However, investigation into the barrier setup, as well as the data available, showed that a more complex arrangement exists. Many processes utilise redundant equipment to provide greater assurance of the working nature of the process. One additional layer of complexity, shown in Fig. 2 below, is the redundancy of an additional thermocouple and controller arrangement in case the first setup were to fail. From the redundancy arrangement of a barrier, the number of relevant data channels that must be extracted is increased. This introduces rules about the survivability of the barrier. Due to the redundancy and the data available, organisations may not deem it necessary to determine the reliability of the barrier, as data is available to show that the barrier is functional. Are biannual or annual functional tests still required? Possibly not. As shown in the complex barrier setup with redundancy, additional BowTies and barriers may be interlinked, as components may be part of other barriers, either as primary control or as an independent safety layer. The equipment may not be physically connected or within each other's proximity; however, if their connectivity serves multiple purposes, they are inadvertently connected. This interconnectivity does not follow the rigorous structure of a true independent layer of protection as proposed in many standards across a multitude of industries. That being said, is it a bad thing that the standard is not followed so strictly? Does this mean that the barrier components are in use and being monitored continuously? Would that result in a physically more robust system under constant observation and, hence, one that is inherently safer? So, when barriers and their components are monitored through effective and efficient data systems, the detection measurement devices may also be connected to other barriers, as shown in Fig. 3.


Fig. 3 Barrier complexity on-site [14]

There is no reason to stop at one barrier or one group of barriers. This can be applied to multiple barriers for one BowTie and then, subsequently, across multiple BowTies for a single process. Extending this thought further, multiple BowTies for one process can be applied to multiple processes. Software tools possessing the ability to link together assets using asset trees would be beneficial, as similar equipment used across the site could then be linked across multiple processes. With this information, even more data can be linked to the type of equipment utilised, which would allow for additional confidence in the statistical analyses of equipment reliability. To reflect upon the safety management system, are BowTies enough to showcase how the system is set up? Does the BowTie diagram need to be modified to reflect the true complexity of the barriers utilised, or is another tool required? As mentioned earlier, the standard definition of a barrier represented on a BowTie diagram is that the barrier has 3 distinct components which detect, diagnose and finally act. From the analyses conducted on an existing process, barriers are actually complex in nature, with built-in redundancy and possibly dual functionality (control and safety, for example). How could this barrier be identified on a BowTie diagram? By revisiting the BowTie shown in Fig. 1, how would it look with the complex barrier setup as shown in Fig. 2, with the redundancy, or Fig. 3, with the added complexity? A mock-up is shown in Fig. 4. Organisations can now represent the redundancy and complexity on a BowTie diagram and, with the use of data, determine the health of the safety management system. How would that information be displayed from the operator to senior management? Creation of a dashboard allows for the monitoring of the health of the safety management system and prevention of the occurrence of a loss of control event. Inevitably, this could lead to the creation of appropriate PSPIs or KPIs.
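As one possible illustration of how the redundant detect layer of Fig. 2 could be evaluated from its data channels, the sketch below applies a simple k-out-of-n survivability rule; the channel names, validity checks and status colours are assumptions, not the method of [14].

```python
# Hedged sketch of a survivability rule for a redundant detect layer: the
# barrier's detect function is considered healthy if at least one of the two
# thermocouple/controller channels is reporting a plausible value.
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    reading: float      # latest value from the data historian
    comms_ok: bool      # channel is reporting

    def healthy(self, lo=-50.0, hi=400.0):
        """A channel counts as healthy if it reports and reads within range."""
        return self.comms_ok and lo <= self.reading <= hi

def detect_layer_status(channels, k=1):
    """k-out-of-n survivability: here 1-out-of-2 for the redundant pair."""
    ok = sum(ch.healthy() for ch in channels)
    if ok >= len(channels):
        return "GREEN"                  # full redundancy available
    return "AMBER" if ok >= k else "RED"

pair = [Channel("T-100", 212.4, True), Channel("T-200", -999.0, True)]
print(detect_layer_status(pair))   # -> "AMBER": one channel lost, barrier survives
```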


Fig. 4 Potential BowTie representation with redundancy

3 Online Process Safety Performance Indicators and Safety Management Systems Using Big Data 3.1 Online PSPIs Detailed analysis of minor incidents could provide insight for creation of leading indicators to prevent or minimise occurrence of abnormal situations [Conklin, 2012 as cited in 26]. “A first step when developing indicators is therefore often to define what deviations are, and thereafter define how they should be monitored.” [16]. To follow up from that comment, what are leading and lagging PSPIs for an organisation? As described by [10], leading indicators are “precursors of harm” or could also be seen as input indicators [Hopkins, 2009 as cited in 27]. Results of process operations can be referred to as lagging indicators [27–29]. Essentially, the process steps had been completed, therefore the result displayed is for an instance in the process that occurred in the past. As per Erikson [2009, as cited in 27], many organisations showcase process performance with the use of lagging indicators. To demonstrate safety performance and possibly compliance with external bodies, both leading and lagging indicators are required. The PSPIs, regardless of leading or lagging, should provide meaningful information regarding the state of the process to allow management to subsequently make decisions which lead to actions [Mogford, 2005, as cited in 10, 30].


Do BowTies and the identification of barriers within safety management systems lead to indicators that can be reported in real time as PSPIs? If so, is this a new safety management system?

3.2 Safety Management Systems with Big Data Ultimately, the goal is to use Big Data to determine the health of a safety management system. With the use of a tool that allows analyses of time-series data, we can begin with a dashboard that provides an overall picture of the process. Figure 5 below shows a sample of a possible dashboard that has already zoomed into Status Area 1. This begins the exploration phase of the process. On that figure, the overall status of safety critical equipment is shown for Area 1. The operator of the management system can also see other status areas or view alternative analysis windows, possibly focus upon particular process steps and not on specific equipment in an area. With this example, the status of one piece of equipment is red. The operator can now drill down to find more detailed information about that piece of equipment. Figure 6 below shows where that equipment lies on a specific barrier and specific BowTie. The operator would now understand the relevance of that piece of equipment and its importance in the process. That particular piece of equipment is identified as part of a barrier related to a particular threat which could lead to a top event. As the

Fig. 5 Status area 1 dashboard


Fig. 6 Equipment drill down to BowTie

Fig. 7 Drill down to barrier scorecard or PSPI

barrier is shown to be in a “red” state, the operator can now drill down even further, as shown in Fig. 7 to understand why the barrier is showing as red and reveal that barrier’s scorecard, or possible PSPI.


Fig. 8 Documentation and live feed

Going one step further, the tool could also be linked to live data related to that barrier, or to relevant documentation such as a maintenance or test record. This is shown in Fig. 8. With the current software tools available, this safety management system is a possibility. The entire software tool does not need to be built from scratch; available software solutions could potentially be linked. For example, one solution capable of handling the time-series data generated from site could be used to create the algorithms that compute the relevant probabilities of failure on demand. Those results could then be reported into another software tool, which may be used to develop the relevant diagrams to aid in the visualisation of the safety management system. This visualisation is now capable of offering real or near-real time reporting of the health of the process. Organisations would possess the ability to visually inspect their processes, their site or even all of their sites to see their health. From this information, management should be able to analyse the situation and come up with a decision which could lead to a positive action. These positive actions would limit or minimise the potential for an incident, abnormal occurrence or accident on site. Further development of such tools could also lead to a simplification for audit and reporting purposes, either internally or externally (i.e., to a governing body). Simplification of reporting results in savings. Additional savings would occur in the form of minimisation of incidents, abnormal occurrences and accidents. Industry and government could see a monumental shift in reporting of the health of a process or site with these tools. The reporting mechanism, as showcased, would be of greater accuracy and reliability than current reporting means and methods.
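A hedged sketch of one way such a linkage could compute a barrier scorecard is shown below: proof-test records are turned into an estimated probability of failure on demand (PFD) and mapped to a traffic-light status. The record format, the Laplace smoothing and the thresholds are assumptions, not the authors' algorithm.

```python
# Hedged sketch of turning test records into a barrier scorecard: estimate a
# probability of failure on demand (PFD) and map it to a dashboard colour.
def estimate_pfd(test_results):
    """test_results: list of booleans, True = barrier acted on the test demand."""
    demands = len(test_results)
    failures = sum(1 for ok in test_results if not ok)
    # Add-one (Laplace) smoothing so a short failure-free history is not read as PFD = 0.
    return (failures + 1) / (demands + 2)

def status_colour(pfd, amber=1e-1, red=2e-1):
    if pfd >= red:
        return "RED"
    return "AMBER" if pfd >= amber else "GREEN"

tests = [True, True, True, False, True, True, True, True, True, True]
pfd = estimate_pfd(tests)
print(pfd, status_colour(pfd))   # ~0.17 -> "AMBER"
```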


4 Conclusion The Director of the Hazardous Installations Directorate, Gordon MacDonald, issued the following challenge to industry: Do we understand what can go wrong? Do we know what our systems are to prevent this happening? Do we have information to assure us they are working effectively? For Buncefield, with the appropriate time-series analytical tools, algorithms could have been put into place to identify successful tests and notify operational personnel of unsuccessful or incomplete tests. With complex barriers, redundancy and the consistent use of system components would allow for increased reliability of the system, which could also be monitored and reported on a real or near-real time basis. The threat of the runaway reaction which led to the explosion at T2 Laboratories, Inc. could have been represented on a BowTie diagram with the barrier health monitored and reported using Big Data. The algorithms used to compute PSPIs, and the reporting of those PSPIs, can all be completed with time-series analytical tools and dashboards, once again on a real or near-real time basis. Big Data can be used to answer those questions from governing organisations in a clear, effective and efficient manner, beneficial to all when applied to safety management systems. Existing BowTie diagrams need to be updated to reflect the reality of the complex nature of the barriers used on processes. In addition to an updated BowTie diagram, a digital safety management system would benefit the organisation, as the health of the safety management system would be on display for the operator, all the way up to senior management, to view, analyse and make decisions which lead to positive actions. Acknowledgements This research was supported in collaboration by the University of Huddersfield and by Syngenta Huddersfield Manufacturing Centre.

References 1. Cannataci, J., et al.: Legal challenges of big data. In: Joe, C., Oreste, P., Valeria, F. (eds.). Edward Elgar Publishing Limited, Cheltenham, UK (2020) 2. WEF: Big data, big impact: New possibilities for international development. https://www3. weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf (2012). Accessed 16 March 2022 3. Qin, S.J., Chiang, L.H.: Advances and opportunities in machine learning for process data analytics. Comput. Chem. Eng. 126, 465–473 (2019) 4. Bubbico, R., et al.: Dynamic assessment of safety barriers preventing escalation in offshore Oil&Gas. Saf. Sci. 121, 319–330 (2020) 5. Colegrove, L.F., Seasholtz, M.B., Khare, C.: Getting started on the journey. 112, 41 6. Holmstrom, D., et al.: CSB investigation of the explosions and fire at the BP texas city refinery on March 23, 2005. Proc. Saf. Prog. 25(4), 345–349 (2006)


7. Stemn, E., et al.: Failure to learn from safety incidents: status, challenges and opportunities. Saf. Sci. 101, 313–325 (2018) 8. Baker, J.A., et al.: The report of the BP U.S. refineries independent safety review panel, p. 374. (2007) 9. Allars, K.: BP Texas City incident Baker review p. 2. 2007 10. Hopkins, A.: Thinking about process safety indicators. Saf. Sci. 47(4), 460–465 (2009) 11. HSE: Buncefield: Why did it happen? HSE, p. 36. (2011). https://www.hse.gov.uk/comah/bun cefield/buncefield-report.pdf 12. Theis, A.E.: Case study: T2 laboratories explosion. J. Loss Prev. Process Ind. 30, 296–300 (2014) 13. CSB: T2 Laboratories, Inc. runaway reaction. CSB (2009) 14. Singh, P., Sunderland, N., van Gulijk, C.: Determination of the health of a barrier with timeseries data how a safety barrier looks different from a data perspective. J. Loss Prev. Process Ind. 80, 104889 (2022) 15. Manuele, F.A.: Highly unusual: CSB’s comments signal long-term effects on the practice of safety. Prof. Saf. 62(4), 26–33 (2017) 16. Skogdalen, J.E., Utne, I.B., Vinnem, J.E.: Developing safety indicators for preventing offshore oil and gas deep water drilling blowouts. Saf. Sci. 49(8), 1187–1199 (2011) 17. Van Gulijk, C., et al.: Big data risk analysis for rail safety? In: Podofillini, L., et al. (eds). Proceedings of ESREL 2015. CRC/Balkema (2015) 18. Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods, 2nd edn. Springer New York, New York, NY (1991) 19. Seeq.: Seeq About Us. Retrieved from https://www.seeq.com/about, 25 February 2023 20. Seeq.: Seeq Workbench. Seeq Workbench 2023. Retrieved from https://www.seeq.com/pro duct/workbench, 25 February 2023 21. CCPS.: Bow ties in risk management: a concept book for process safety, p. 224. Wiley (2018) 22. Khakzad, N., Khan, F., Amyotte, P.: Risk-based design of process systems using discrete-time Bayesian networks. Reliab. Eng. Syst. Saf. 109, 5–17 (2013) 23. Hughes, P., et al.: From free-text to structured safety management: introduction of a semiautomated classification method of railway hazard reports to elements on a bow-tie diagram. Saf. Sci. 110, 11–19 (2018) 24. Hughes, P., et al.: Extracting safety information from multi-lingual accident reports using an ontology-based approach. Saf. Sci. 118, 288–297 (2019) 25. de Dianous, V., Fiévez, C.: ARAMIS project: a more explicit demonstration of risk control through the use of bow–tie diagrams and the evaluation of safety barrier performance. J. Hazard. Mater. 130(3), 220–233 (2006) 26. Busch, C., et al.: Serious injuries & fatalities. Prof. Saf. 66(1), 26–31 (2021) 27. Sultana, S., Andersen, B.S., Haugen, S.: Identifying safety indicators for safety performance measurement using a system engineering approach. Process Saf. Environ. Prot. 128, 107–120 (2019) 28. Louvar, J.: Guidance for safety performance indicators. Process Saf. Prog. 29(4), 387–388 (2010) 29. Selvik, J.T., Bansal, S., Abrahamsen, E.B.: On the use of criteria based on the SMART acronym to assess quality of performance indicators for safety management in process industries. J. Loss Prev. Process Ind. 70, 104392 (2021) 30. Pasman, H., Rogers, W.: How can we use the information provided by process safety performance indicators? Possibilities and limitations. J. Loss Prev. Process Ind. 30, 197–206 (2014)

Reliability Optimization of New Generation Nuclear Power Plants Using Artificial Intelligence Jorge E. Núñez Mc Leod and Selva S. Rivera

Abstract The objective of the project was to develop an Artificial Intelligence module so that the exploration and exploitation incorporated into an evolutionary algorithm would make it possible to tackle the optimization of the reliability of the conceptual design of New Generation Nuclear Power Plants. As particular objectives, it was expected to develop an adaptive evolutionary algorithm with efficient Artificial Intelligence for the proposed problem, to develop a tool for optimizing the reliability design in projects of new generation Nuclear Power Plants and to train human resources in Artificial Intelligence. To evaluate the results of the new algorithm, the percentage improvement in the time needed to obtain the same quasi-optimal solution, with respect to the Evolutionary Algorithm without Artificial Intelligence, was measured. The times improved by 32% compared to the previous ones, which clearly demonstrates the improvement incorporated by the development carried out. Keywords Reliability · Optimization · Nuclear power plant · Artificial intelligence · Risk

J. E. N. Mc Leod (B) · S. S. Rivera Faculty of Engineering, Universidad Nacional de Cuyo, Mendoza, Argentina e-mail: [email protected] S. S. Rivera e-mail: [email protected] J. E. N. Mc Leod CONICET, Mendoza, Argentina © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_11


1 Introduction 1.1 New Generation Nuclear Power Plants The historic accidents of Three Mile Island Unit 2 (TMI-2) in the United States in March 1979 and of Reactor 4 of the Chernobyl Nuclear Power Plant in the Ukrainian SSR, Soviet Union, in April 1986 led to a movement to redesign nuclear power plants (NPPs) so that they could face various threats by changing the traditional designs. New generations of NPPs should demonstrate that they are able to reduce the frequency of damage to the core and decrease the frequency of large releases of radiation after an accident. The subsequent Fukushima Daiichi Nuclear Power Plant accident in Japan in March 2011 reignited concerns about achieving minimal-risk designs.

1.2 Risk The area of nuclear power plants is very rigorously regulated. In Argentina, regulation is based on the concept of risk. This concept can be understood as the relationship between the consequences of an accident and the frequency of its occurrence; mathematically, it is understood as the product of the two. To obtain a license to build a Nuclear Power Plant, the manufacturer must demonstrate that the design meets the risk requirements for the public [1] expressed in Fig. 1. Figure 1 is an X–Y chart, which shows the frequency [1/year] versus individual dose [mSv] for the public. In this graph, each type of accident possible for the plant is modelled to define both parameters. If these points are located below the limit, the plant is acceptable from the point of view of risk (other considerations are taken into account for the acceptance of the plant). In all cases, the risk is related to accidents in the plant. The determination of these points involves deep and complex studies of the plant. These studies, which cover different disciplines, are known as Probabilistic Safety Analysis (PSA) or Probabilistic Risk Analysis (PRA).
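A minimal sketch of the acceptance check implied by Fig. 1 is given below: each accident scenario contributes a (dose, frequency) point, which is compared against an interpolated limit curve. The limit values used here are placeholders, not the Argentine regulatory limits.

```python
# Hedged sketch of the acceptance check against a frequency-dose criterion
# curve like the one in Fig. 1. The curve values below are placeholders.
import numpy as np

# Placeholder limit curve: allowed frequency [1/year] at given doses [mSv].
LIMIT_DOSE = np.array([1.0, 10.0, 100.0, 1000.0])
LIMIT_FREQ = np.array([1e-2, 1e-3, 1e-5, 1e-7])

def acceptable(dose_msv, freq_per_year):
    """A scenario point is acceptable if it lies below the interpolated limit."""
    limit = np.interp(np.log10(dose_msv), np.log10(LIMIT_DOSE), np.log10(LIMIT_FREQ))
    return np.log10(freq_per_year) < limit

# Each accident sequence from the PSA contributes one (dose, frequency) point.
scenarios = [(5.0, 1e-4), (200.0, 1e-4)]
print([acceptable(d, f) for d, f in scenarios])   # -> [True, False]
```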

1.3 Probabilistic Risk Analysis The PRA [2] for the case that interests us could be summarized in a set of stages, some sequential and others not, which are:


Fig. 1 Mandatory limits for the Argentine nuclear regulatory authority

1. Selection of Initiating Events. Events that are the initiators of accidental chains within the installation. These can be of internal or external origin; the external ones include earthquakes, floods, fires, aircraft crashes, etc.
2. Modeling of systems using the fault tree technique. This methodology models the interaction of component or human failures that can lead to the loss of subsystems within the facility.
3. Modeling of accidental sequences using the event tree technique. With this methodology, the possible routes of continuation of an accident are modeled taking into account the postulated initiating events and the possible subsystems involved.
4. Determination of plant states and grouping of sequences leading to similar states.
5. Determination of the release of radionuclides within the containment of the facility.
6. Determination of the failures of the containment of the installation.
7. Modeling of the spread of radionuclides in the environment, with special interest in the most affected group of people, and calculation of the dose received by the public.
This simplified list shows in a very concise way the steps necessary to complete a PRA and aims to give the reader a frame of reference for the further explanations. With this process completed, one has the necessary information for filling in the graph that contains the criterion curve. With this graphing of the scenarios studied, the risk study is completed.
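As a toy illustration of stage 2 of this list, the following sketch combines basic-event probabilities through AND/OR gates to obtain a top-event probability; the miniature system and all probability values are hypothetical.

```python
# Minimal fault-tree sketch: combine basic-event probabilities through
# AND/OR gates to get a top-event probability (independence assumed).
def and_gate(*p):
    """All inputs must fail."""
    out = 1.0
    for q in p:
        out *= q
    return out

def or_gate(*p):
    """At least one input fails."""
    out = 1.0
    for q in p:
        out *= (1.0 - q)
    return 1.0 - out

# Hypothetical basic events: two pumps, a common power supply, operator error.
P_PUMP_A, P_PUMP_B, P_POWER, P_HUMAN = 1e-2, 1e-2, 1e-3, 5e-3

# Top event "no coolant injection": both pumps fail, or support systems fail.
both_pumps = and_gate(P_PUMP_A, P_PUMP_B)
top_event = or_gate(both_pumps, P_POWER, P_HUMAN)
print(f"Top event probability: {top_event:.2e}")
```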


1.4 Artificial Intelligence. Evolutionary Algorithms Artificial intelligence comprises an endless set of techniques for processing data with the intention of selecting an alternative that responds to the proposed objectives, such as minimizing the risk subject to a given engineering design. In this sense, we have decided to work with Evolutionary Algorithms [3], adapting them to the need to respect the design of the NPP, although with the ability to modify it intelligently. Evolutionary Algorithms (EA) are an alternative approach to addressing complex search and learning problems through computational models of evolutionary processes. The generic purpose of EAs is to guide a stochastic search by evolving a set of structures and iteratively selecting the most appropriate ones. EAs are also part of a set of problem-solving methodologies that mimic natural processes, such as Neural Networks or Simulated Annealing, with greater or lesser accuracy [4]. All of them are included under the term Natural Computation. It must be clear at all times that a simulation of natural processes is not sought, but rather an emulation of these processes. Therefore, an EA will be more suitable the better it solves the problem posed, regardless of its fidelity to biology. In fact, most of the algorithms that derive from this approach are exaggeratedly simplistic from a biological point of view, but complex enough to provide robust and powerful search mechanisms. In general, an EA is any stochastic search procedure based on the principle of evolution. This principle has a double aspect: the ultimate goal is the “survival of the fittest”; the way to achieve it is “adaptation to the environment”. More graphically, the fittest have a better chance of survival and, as a result, more opportunities to pass on their characteristics to subsequent generations. More specifically, when executing an EA, a population of individuals, representing a set of candidate solutions to a problem, is subjected to a series of transformations (crossover and mutation) with which the search is updated, and then to a selection process that favors the best individuals, see Fig. 2. Each transformation + selection cycle constitutes one generation. It is expected of the EA that, after a certain number of generations, the best individual will be reasonably close to the solution sought (Fig. 3). In other words, an EA tries to develop stochastic search mechanisms with which to improve on classical deterministic search techniques when these are not good or do not even exist. For improvement to be effective, such mechanisms must be “targeted”, hence the need to introduce a selection procedure. In short, in order to sufficiently emulate the evolution process, an EA must have:
1. A population of possible solutions duly represented through individuals.
2. A selection procedure based on the aptitude of individuals.
3. A transformation procedure (crossover and mutation), that is, the construction of new solutions from those currently available.
4. A procedure for evaluating the adaptation of each individual to the boundary conditions.
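A minimal sketch of the transformation-plus-selection cycle described above is given below; the toy bit-string representation, the fitness function and the parameter values are illustrative only and are unrelated to the NPP application.

```python
# Minimal evolutionary-algorithm sketch of the cycle described above:
# selection, crossover, mutation and evaluation over many generations.
import random

GENES, POP, GENERATIONS, P_MUT = 20, 30, 60, 0.02

def fitness(ind):                      # toy objective: maximise the number of 1s
    return sum(ind)

def tournament(pop, k=3):              # selection favouring fitter individuals
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):                   # one-point crossover
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(ind):                       # bit-flip mutation
    return [1 - g if random.random() < P_MUT else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(tournament(population), tournament(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)
```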


Fig. 2 A simplified evolutionary process. The figure shows the crossover process and the mutation in an individual with a red mark

Fig. 3 Coding of EA individuals


1.5 Contributions from Different Perspectives For many years there has been interest in optimizing the reliability and availability of systems in nuclear power plants. Researchers such as Harunuzzaman and Aldemir proposed the use of dynamic programming to minimize the maintenance cost [5]; this was an important contribution, although the optimization of maintenance schemes was restricted to solving each system independently of the others. Vaurio presented a work [6] in which the models were time-dependent, regulatory constraints were allowed, and the whole set was evaluated economically; however, it was still applied to each system independently. Several works [7–11] advanced on the joint application of Genetic Algorithms (GA) and some PRA tools, in particular the fault tree technique [12]; however, they limited their studies to the optimization of systems on an individual basis. Years later, the work in [13] presented an optimization with GA over the set of components identified as priorities by the PRA, treating them with the Reliability Centered Maintenance approach [14]; however, it allows neither the change of components nor the change of the design, nor does it incorporate the evaluation of human action on risk. To appreciate the complexity of the work done, it is useful to briefly summarize the works consulted to substantiate the models that were built. The risk approach in the analysis of facility designs became a relevant aspect of all studies in the 1990s [15–18]. The redundancy of components in a system has been a widely debated topic, and a series of works allowed progress in the determination of the optima of redundant systems; works such as [19–23] helped us understand the complexities of dealing with equipment redundancies. On the other hand, it is necessary to keep in mind the scheduling of maintenance and testing tasks: without modeling them, it is impossible to build a system that can find the global optimum of the problem. The work in [24] served as a reference for incorporating the scheduling of these tasks. Finally, the works [17, 25–27] were consulted, in which an important aspect is the modeling of the behavior of components as they age.

1.6 The Focus on This Work The previous points help us to understand the context in which this work was developed. In summary, a series of reliability and risk studies of NPPs are carried out to demonstrate that they comply with strict risk-based regulations. This ends up generating oversized systems in order to meet the reliability objectives, because each system is designed independently and is never redesigned based on the final engineering of the plant. It is the typical problem of trying to construct a global optimum from local optima. A comprehensive approach to the problem is needed.


On the other hand, regulations in the nuclear area in Argentina are based on limiting risk. It is necessary to know the limits of the acceptable region, since the optimization of the system (decrease in cost) will increase the total risk of the plant, although it will remain within the acceptable region. In this way, there will be no oversizing of the safety systems. In this work, an evolutionary algorithm was developed that can modify the detailed engineering of the NPP to comply with the risk regulations while optimizing the overall cost of the installation. To achieve this, we worked with the engineering transformed into fault trees, which determine the reliability of each system. We also worked with the event trees to see the impact that each modification has on the risk of the NPP (e.g., increase or decrease of redundancies, introduction of diversity in systems, change in the quality of components, change in the times between tests of components and systems, change of supervision levels in human tasks, etc.).
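As an illustration of how engineering "transformed into fault trees" can be evaluated programmatically, the sketch below computes the top-event probability of a small fault tree from basic-event probabilities, assuming independent events. The tree structure and the numbers are invented for the example and are not taken from the NPP model of this chapter.

```python
# Hedged sketch: evaluating a fault tree with independent basic events.
# AND gates multiply failure probabilities; OR gates combine them as
# 1 - prod(1 - p). The tree below is purely illustrative.

def evaluate(gate):
    kind, children = gate
    probs = [c if isinstance(c, float) else evaluate(c) for c in children]
    if kind == "AND":
        p = 1.0
        for q in probs:
            p *= q
        return p
    if kind == "OR":
        p = 1.0
        for q in probs:
            p *= (1.0 - q)
        return 1.0 - p
    raise ValueError(kind)

# Top event: system fails if both redundant pumps fail OR the common valve fails
pump_train = ("AND", [1e-3, 1e-3])           # two redundant pumps
system_failure = ("OR", [pump_train, 5e-5])  # valve as single point of failure

print(f"System unavailability: {evaluate(system_failure):.2e}")
```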

2 Definition of the Objective Function and Its Constraints According to [28], the probability of failure of a component at time t (standby time) that is repaired at constant intervals T_m can be represented by the following equation:

q_s = 1 − e^{−N(T_m/θ)^μ} · e^{−[(t − N·T_m)/θ]^μ}   (1)

where N is the number of maintenance activities that were completed up to time t, μ is the aging factor and θ is the characteristic life of the component. We will assume that t, T_m and N are related as follows:

t = N · T_m   (2)

Substituting, Eq. (1) becomes:

q_s = 1 − e^{−N(t/(N·θ))^μ}   (3)

For common cause failures (CCF) [29], apart from the dependent failures that are explicitly modeled through fault trees, human event trees, and event trees, it is assumed that the root causes of failure are found in human activities. These human activities are associated in the model mainly with three aspects: the design, maintenance and testing of systems. The human error associated with the design of components in nuclear activities is 2 to 3 orders of magnitude lower than the rest of the contributors and therefore has not been taken into account in the model. On the other hand, the errors associated with the maintenance and testing of each component are significantly relevant. Finally, the probability of failure of each component is calculated as:

q = q_d + q_s + q_ccf   (4)


The terms are the probabilities of failure on demand, failure of the system in standby, and common cause failure. The latter is interpreted as the sum of the probabilities of failure due to the maintenance and testing activities of the components. Therefore, q is:

q = q_d + q_s + (q_M^k + q_T^k)_ccf   (5)

We will group the components according to the level of complexity of the maintenance or test task performed, the relationship between components, or their relevance to a system when a maintenance or test activity is performed; this grouping is directly related to the value of the assigned failure probability. Therefore, when calculating the probability of failure of a component, the group to which it was assigned must be taken into account. These groups also allow each component to be associated with the maintenance and test frequencies that complete the definition of its unavailability. Thus, the equation for determining the probability of failure of a component is:

q = q_d + [1 − e^{−N(t/(N·θ))^μ}] + (q_M^k + q_T^k)_ccf   (6)
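Equation (6) translates directly into a small function, which makes the role of each term explicit. The sketch below is only illustrative: the group-dependent CCF contributions for maintenance and testing are passed in as plain numbers, and all example values are assumptions rather than data from the study.

```python
import math

def component_unavailability(q_d, t, N, theta, mu, q_m_ccf, q_t_ccf):
    """Eq. (6): demand failure + maintained-standby failure + CCF contribution.

    q_d      -- probability of failure on demand
    t        -- standby time considered
    N        -- number of maintenance activities completed up to t
    theta    -- characteristic life of the component
    mu       -- aging factor
    q_m_ccf, q_t_ccf -- group-dependent CCF terms for maintenance and testing
    """
    q_s = 1.0 - math.exp(-N * (t / (N * theta)) ** mu)
    return q_d + q_s + (q_m_ccf + q_t_ccf)

# Illustrative values only
print(component_unavailability(q_d=1e-3, t=8760.0, N=12, theta=5e4, mu=1.5,
                               q_m_ccf=2e-4, q_t_ccf=1e-4))
```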

With the model for calculating the unavailability of components thus constituted, the complete model of the system to be optimized is:

min Σ_{i=1}^{m} Σ_{j=1}^{n} c_ij · x_ij   (7)

subject to

∏_{h=1}^{m} ∏_{j=1}^{n} ( 1 − [ q_d + (1 − e^{−N(t/(N·θ))^μ}) + (q_M^k + q_P^k)_ccf ]_j^{x_j+1} )^{a_gh} · f_eg < f_dg

b_inf ≤ x_j ≤ b_sup
Tep_inf ≤ t_j ≤ Tep_sup
N_inf ≤ N_j ≤ N_sup   (8)

where x_j is the level of redundancy of a component, t_j is the time between tests of a component, and N_j is the number of maintenance activities performed on the component. Each of these variables has a lower and an upper limit given by the regulations of the area, best practices, company policies, etc.


The matrix a_gh is an assignment matrix that identifies which systems are included in each sequence of the event trees. On the other hand, initially only the value of each component was taken into account, but this must be complemented with other costs. Thus, all the costs involved must be taken into account, such as the cost of the equipment itself (Ce), the cost of maintenance (Cm) and the cost of testing (Cp), and to these the Social Cost (Cs) is added. Cs is evaluated on the benefited population group, i.e., the town that receives electricity from the Nuclear Power Plant. The cost associated with the dose is one of the optimization parameters, and the results will be sensitive to this value. According to the recommendations of the International Commission on Radiological Protection [30] and the regulations of the Department of Energy of the United States [31], the penalty for each 0.6 Sv has been set at $1,000,000 for this study. The calculation of Cs is:

C_s = Σ_{h=1}^{rl} (P_D · D)_h · (P · G)   (9)

where the sum is performed over the entire set of Release Categories (rl in total), P_D is the mean probability of the dose, D is the average dose, P is the number of individuals in the population, and G is the penalty. The total cost is then:

C = C_e + C_m + C_p + C_s   (10)
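Equations (9) and (10) can be combined in a few lines. In the sketch below, the $1,000,000 penalty per 0.6 Sv mentioned in the text is converted into a rate per sievert; the release-category values, population size and cost figures are invented for illustration.

```python
# Hedged sketch of Eqs. (9)-(10). Release-category values are illustrative only.
PENALTY_PER_SV = 1_000_000 / 0.6   # $ per sievert, from the penalty per 0.6 Sv

def social_cost(release_categories, population):
    """Eq. (9): sum over release categories of (dose probability * dose) * (P * G)."""
    return sum(p_d * dose for p_d, dose in release_categories) * population * PENALTY_PER_SV

def total_cost(c_equipment, c_maintenance, c_testing, c_social):
    """Eq. (10): C = Ce + Cm + Cp + Cs."""
    return c_equipment + c_maintenance + c_testing + c_social

releases = [(1e-6, 0.05), (1e-7, 0.5)]   # (mean dose probability, average dose in Sv)
cs = social_cost(releases, population=50_000)
print(total_cost(2.0e6, 4.0e5, 1.5e5, cs))
```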

Finally, the final system to be optimized is:

min { Σ_{i=1}^{m} Σ_{j=1}^{n} (C_e + C_m + C_p)_ij · x_ij + Σ_{h=1}^{rl} (P_D · D)_h · (P · G) }

subject to

∏_{h=1}^{m} ∏_{j=1}^{n} ( 1 − [ q_d + (1 − e^{−N(t/(N·θ))^μ}) + (q_M^k + q_P^k)_ccf ]_j^{x_j+1} )^{a_gh} · f_eg < f_dg

b_inf ≤ x_j ≤ b_sup
Tep_inf ≤ t_j ≤ Tep_sup
N_inf ≤ N_j ≤ N_sup   (11)
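To give an idea of how an EA individual can be scored against the model of Eqs. (7)-(11), the sketch below evaluates a candidate design: it sums the costs and heavily penalizes designs whose event-tree sequence frequencies exceed their limits. The data structures, and the simplified way the sequence frequency is computed from component unavailabilities and redundancy levels, are assumptions for illustration and not the authors' implementation.

```python
# Hedged sketch: penalized fitness for one candidate design (one EA individual).
# 'design' holds, per component j: redundancy x, test interval, maintenance count
# and unit costs. 'sequences' holds, per event-tree sequence g: the initiating
# frequency f_eg, the limit f_dg and the components involved.

def fitness(design, sequences, unavailability, social_cost, penalty=1e12):
    cost = sum((c["Ce"] + c["Cm"] + c["Cp"]) * c["x"] for c in design.values())
    cost += social_cost
    for seq in sequences:
        freq = seq["f_eg"]
        for j in seq["components"]:            # a_gh selects the systems in sequence g
            q = unavailability(design[j])      # Eq. (6) for component/group j
            freq *= q ** (design[j]["x"] + 1)  # x_j + 1 redundant trains must fail
        if freq >= seq["f_dg"]:                # risk constraint of Eq. (8)/(11)
            cost += penalty
    return cost                                # lower is better
```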


3 Definition of Artificial Intelligence It was decided not to use any standard or predefined EA, but to design a specific paradigm for the resolution of the problem described in the previous section. The decision is based on the fact that generic algorithms do not incorporate information from the problem and treat it as a black box. It should be borne in mind that, for the design and implementation of the EA, it is necessary to define unambiguously the following methods and criteria that will make up the core of the EA:

1. Coding criterion: refers to the way in which the variables that define the problem will be encoded and grouped in a chain of information similar to the genes of a chromosome. From the point of view of the problem, each of these chromosomes will define an installation, with its maintenance and testing schemes, its components and redundancies, and its management levels. From the point of view of the EA, each of these chromosomes will define an individual with a better or worse adaptation to the environmental conditions, which are the restrictions we impose in the search for the solution of the problem. For our case, it was decided to encode the individuals in such a way that each gene represents one of the variables that define the problem as an integer value. The adopted coding can be seen in Fig. 3. Additionally, each gene is associated with information that defines its variability; in our case, the lower and upper limit values of each variable. This approach incorporates the uncertainty information of the variables into the chromosomal composition. Working with integer values (the algorithm also supports real values if necessary) instead of binary strings ensures that the crossover process will always generate viable individuals, which cannot be guaranteed with the standard binary string approach of EAs.

2. Treatment criteria for non-feasible individuals: depending on the design of the individuals and the genetic operations of crossover and mutation performed with them, descendants can be created that are biologically unviable; from the mathematical point of view, they would be non-feasible solutions to our problem. Therefore, special care must be taken when designing the EA, since the production of non-feasible individuals may entail the need to create special routines for their treatment. The treatment may simply be the elimination of these individuals from the population or, if their number is large, a genetic correction treatment, which is not always easy depending on the unviability of the individual. The proposed coding and the design of the genetic operators ensure that it is not necessary to treat non-viable individuals; that is, in no case will non-viable individuals be generated. This is a strong restriction that we impose on the design, but one that we consider possible to meet, with a consequent efficiency gain.

3. Initialization criterion: the creation of the initial population from scratch is a relevant point in the whole process.


In general, this population is created with randomly generated genetic information; that is, each variable of the problem is assigned random values. Again, the problem of unviable individuals arises, since, if these represent a significant proportion of the initial population, the population finally represents only a small, biased subset of the search space of the problem. In our case, the initial population is generated randomly and, taking into account that the genetic information of the chromosomes incorporates the variability of the genes, the population generated in this way is 100% viable.

4. Stop criterion: the evolution of the EA is stopped after a number of generations specified before execution.

5. Fitness assessment function: the assessment function is constructed to indicate the individual's level of adjustment or adaptation to the environmental conditions (i.e., the constraints of the problem). Individuals with better adaptation are more likely to procreate and/or pass on their genetic material to subsequent generations. Individuals with poor adaptation are not eliminated immediately but see their chances of crossover and mutation diminished, since parts of their genetic information may contain genes relevant to improving other individuals. Each individual that meets the constraints of the problem is evaluated in relation to the total cost involved.

6. Genetic operators: there is a wide range of genetic operators. Traditionally, two respond clearly to the biological sources on which EAs are inspired: the crossover operator and the mutation operator. Thanks to the genetic coding developed for this problem, the design of these operators becomes trivial and traditional operators can be implemented. The design of the crossover operator is trivial thanks to the coding of the individual and the treatment of genes by value: the process consists of generating a random number that identifies the cut-off site of the chromosome; the cut does not break the genes, since an integer value is stored in each gene. On the other hand, the design of the mutation operator requires more work, and we can define it mathematically as:

m_{p_m} : I^λ → I^λ

(12)

That is, the operator takes an individual from the population and returns an individual to the population. The application of this operator is:

a⃗′ = (a′_1, ..., a′_l) = m_{p_m}(a_1, ..., a_l) = m_{p_m}(a⃗)

where, for all i ∈ {1, ..., l}:

a′_i = a_i                      if χ_i > p_m
a′_i = χ_i(lr_i, up_i)          if χ_i ≤ p_m ∧ i < q
a′_i = int(χ_i(lr_i, up_i))     if χ_i ≤ p_m ∧ i ≥ q


where a_i represents the i-th gene and p_m is the probability of mutation. χ_i denotes a uniform random variable sampled anew for each gene, lr_i and up_i represent the lower and upper limits of the genes, respectively, and q is the integer value of the gene on the chromosome. With regard to the values of the probabilities or rates of crossover and mutation, a special scheme was implemented, consisting of the following (a sketch of this schedule is given at the end of this section):

• The evolution of the population begins with a high mutation rate, in our case 70%, while maintaining a low crossover rate, in our case of the order of 20%.
• The mutation rate varies from a high value (70%) to a low value (5%) in a progressive and linear way, while the crossover rate grows linearly to its maximum value (60%).
• The time of decrease or growth of both rates is regulated according to the genetic diversity produced and the response in the exploration of the search space.
• The process then remains stable until adequate convergence is achieved.

7. Selection criteria: the selection of individuals, both for crossover and mutation, is usually based on lottery sampling, in which an individual's probability of being selected grows linearly with its ability to adapt. Here, selection was approached with a technique called Stochastic Stratified Tournament Sampling (SSTS) [32]. This method obtains a sample that makes it possible to properly manage genetic diversity. In SSTS, the population is stratified (e.g., in groups of 5 individuals) and sampling is conducted in each group, looking for suitable pairs. Then a competition (tournament) is held, based on a weighted adjustment of both chromosomes. Selection can be conducted in such a way that the solution population remains stable or grows. With this, we conclude the definition of the EA that will be used to solve the problem.
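The mutation operator of Eq. (12) and the linear mutation/crossover-rate schedule described in the bullets above can be sketched as follows. This is an illustrative reading of the scheme, not the authors' code: all genes are treated as integer-coded (the case adopted in this chapter), and the number of generations over which the rates ramp is an assumed parameter.

```python
import random

def mutate(individual, bounds, p_m):
    """Eq. (12): each gene is resampled within its own limits with probability p_m."""
    child = individual[:]
    for i, (lo, hi) in enumerate(bounds):
        if random.random() <= p_m:
            child[i] = random.randint(lo, hi)   # integer-coded genes
    return child

def rates(generation, ramp_generations=100):
    """Linear schedule: mutation 70% -> 5%, crossover 20% -> 60%, then constant."""
    f = min(generation / ramp_generations, 1.0)
    p_m = 0.70 + f * (0.05 - 0.70)
    p_c = 0.20 + f * (0.60 - 0.20)
    return p_m, p_c

# Example: rates halfway through a 100-generation ramp
print(rates(50))   # -> (0.375, 0.4)
```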

4 Results The results obtained from several executions of the EA show that it consistently generates solutions that improve the design of the installation. The consequences of such a design improvement can be seen in Fig. 4. The decrease in the frequency of events with consequences for the installation in the upper part of Region 1 is due to design changes that do not significantly impact the costs of the installation (times between tests and times between maintenance) but have a very positive effect in reducing the risk of an important set of systems. On the other hand, the increase in frequency in the lower part of Region 1 is also due to design changes.


Fig. 4 Results. Region 1 before the optimization process. Region 2 after the optimization process

In this case, they are related to the levels of redundancy of the equipment or its quality, which is reflected in a significant decrease in the total cost of the installation. When observing Region 2, it can be seen that it has a smaller vertical extent than Region 1. In light of the changes produced by the optimization process, this is interpreted as the attainment of a better-balanced design. The execution times improved by 32% with respect to the previous algorithms, which clearly demonstrates the improvement provided by the development carried out.

5 Discussion and Conclusions These results allow us to affirm that the chromosome scheme defined for representing the information of the design of the installation has been adequate. In addition, the way of treating the genes has not shown problems in dealing with local optima. We can also affirm that the strategy of using an exploration and exploitation scheme with a dynamic profile has been positive. However, the need to control the evolution times of both parameters implies permanent monitoring of the evolution of the population during its first tens of generations, until the probability of mutation decreases to its minimum value. This points to the need for a control mechanism based on Artificial Intelligence, able to decide whether to keep the mutation rate high and what to do with the crossover rate.


The implemented methodology has made it possible to successfully deal with a search space with multiple local optima. The solutions found are better than those that could be found with traditional algorithms or gradient-based search techniques. Execution times, even for a large model like the one studied, have been low and bounded compared to traditional search methods. The development of a control system that implements Artificial Intelligence remains as future work. This system should be able to assess the need to dynamically modify the crossover and mutation rates to ensure proper exploration of the search space.

References 1. Nuclear Regulatory Authority: Norma Básica de Seguridad Radiológica, AR 10.1.1. Rev. 4. Argentina (2019) 2. Nuclear Regulatory Commission: PRA Procedures Guide. NUREG/CR-2300. NRC, Washington (1983) 3. Bäck, T.: Evolutionary Algorithms in Theory and Practice. Oxford University Press, New York (1996) 4. Michalewicz, Z., Fogel, D.: How to Solve It: Modern Heuristics. Springer-Verlag, Berlin (2002) 5. Harunuzzaman, M., Aldemir, T.: Optimization of standby safety system maintenance schedules in nuclear power plants. Nucl. Technol. 113, 354–367 (1996) 6. Vaurio, J.: On time-dependent availability and maintenance optimization of standby units under various maintenance policies. Reliab. Eng. Syst. Saf. 56(1), 79–89 (1997) 7. Lapa, C., Pereira, C., Mol, A.: Maximization of a nuclear system availability through maintenance scheduling optimization using a genetic algorithm. Nucl. Eng. Des. 196, 219–231 (2000) 8. Marseguerra, M., Zio, E.: Optimizing maintenance and repair policies via a combination of genetic algorithms and Monte Carlo simulation. Reliab. Eng. Syst. Saf. 68, 69–83 (2000) 9. Martorell, S., Carlos, S., Sánchez, A., Serradell, V.: Constrained optimization of test intervals using a steady-state genetic algorithm. Reliab. Eng. Syst. Saf. 67, 215–232 (2000) 10. Rocco, C., Miller, A., Moreno, J., Carrasquero, N., Medina, M.: Sensitivity and uncertainty analysis in optimization programs using an evolutionary approach: a maintenance application. Reliab. Eng. Syst. Saf. 67, 249–256 (2000) 11. Tsai, Y., Wang, T., Teng, H.: Optimizing preventive maintenance for mechanical components using genetic algorithms. Reliab. Eng. Syst. Saf. 74, 89–97 (2001) 12. Vesely, W., Goldberg, F., Roberts, N., Haasl, D.: Fault Tree Handbook. Nuclear Regulatory Commission, Washington (1981) 13. Jiejuan, T., Dingyuan, M., Dazhi, X.: A genetic algorithm solution for a nuclear power plant risk–cost maintenance model. Nucl. Eng. Des. 229, 81–89 (2004) 14. August, J.: Applied Reliability-Centered Maintenance. Penwell, Tulsa (1999) 15. Kafka, P.: Probabilistic safety assessment: quantitative process to balance design, manufacturing and operation for safety of plant structures and systems. Nucl. Eng. Des. 165, 333–350 (1996) 16. Rohrer, R., Nierode, C.: Simple method for risk assessment of nuclear power plant refueling outages. Nucl. Eng. Des. 167, 193–201 (1996) 17. Duthie, J., Robertson, M., Clayton, A., Lidbury, D.: Risk-based approaches to ageing and maintenance management. Nucl. Eng. Des. 184, 27–38 (1998) 18. Richner, M., Zimmermann, S.: Applications of simplified and of detailed PSA models. In: Probabilistic Safety Assessment and Management. Springer-Verlag, Berlin (1998)


19. Luus, R.: Optimization of system reliability by a new nonlinear integer programming procedure. IEEE Trans. Reliab. 24(1), 14–16 (1975) 20. Dhingra, A.: Optimal apportionment of reliability & redundancy in series systems under multiple objectives. IEEE Trans. Reliab. 41(4), 576–582 (1992) 21. Xu, Z., Kuo, W., Lin, H.: Optimization limits in improving system reliability. IEEE Trans. Reliab. 39(1), 51–60 (1990) 22. Coit, D., Smith, A.: Reliability optimization of series-parallel systems using a genetic algorithm. IEEE Trans. Reliab. 45(2), 254–260 (1996) 23. Cantoni, M., Marseguerra, M., Zio, E.: Genetic algorithms and Monte Carlo simulation for optimal plant design. Reliab. Eng. Syst. Saf. 68, 29–38 (2000) 24. Choi, Y., Feltus, M.: Application of reliability-centered maintenance to boiling water reactor emergency core cooling systems fault-tree analysis. Nucl. Technol. 111, 115–121 (1995) 25. Radulovich, R., Vesely, W., Aldemir, T.: Aging effects on time-dependent nuclear plant component unavailability: an investigation of variations from static calculations. Nucl. Technol. 112, 21–41 (1995) 26. Phillips, J., Roesener, W., Magleby, H., Geidl, V.: Incorporation of passive components aging into PRAs. Nucl. Eng. Des. 142, 167–177 (1993) 27. Crocker, J., Kumar, U.: Age-related maintenance versus reliability centred maintenance: a case study on aeroengines. Reliab. Eng. Syst. Saf. 67, 113–118 (2000) 28. Lewis, E.: Introduction to Reliability Engineering. Wiley, New York (1987) 29. International Atomic Energy Agency: Procedures for conducting common cause failure analysis in probabilistic safety assessment. In: IAEA-TECDOC-648. IAEA, Vienna (1992) 30. International Commission on Radiological Protection: ICRP Publication 60: 1990 Recommendations of the International Commission on Radiological Protection. Pergamon Press, Oxford (1990) 31. Department of Energy: Applying the ALARA process for Radiation Protection of the Public and Environmental. DOE, Washington (1997) 32. McLeod, J.E.N.: Sampling methods in evolutionary computation assuring genetic diversity and stochastic selections. In: Annicchiarico, W. (ed.) Evolutionary Algorithms and Intelligent Tools in Engineering Optimization. Wit Press, Southampton (2005)

Algorithmic Management and Occupational Safety: The End Does not Justify the Means Thijmen Zoomer, Dolf van der Beek, Coen van Gulijk, and Jan Harmen Kwantes

Abstract The fourth industrial revolution has greatly impacted the domain of occupational safety and health. Advanced data-driven techniques, such as reliability engineering and predictive based safety, can be of great help in reducing workplace accidents. Data-driven techniques in the workplace can also be used to automate management tasks, judge performance and dictate the workflow of employees. This practice is called algorithmic management and, when used to reduce workplace accidents, is indistinguishable from advanced, professional predictive based safety. Algorithmic management is often used to attain greater productivity and can lead to a level of work intensity that is dangerous for the safety and health of employees. Thus, algorithmic management can be used to reduce workplace accidents, but it also poses a risk to the overall health of employees. In this article, three guidelines are given that ensure algorithmic management helps worker safety instead of endangering it, namely: algorithmic management should only be used for the goal of worker safety; it should never be used to grade the performance of employees or to discipline them; and the system should be transparent and understandable for every employee to whom it is applied. These guidelines are illustrated by examples of algorithmic management and a discussion of the current laws and regulations. The article concludes with the implications for policy and further research. Keywords Algorithmic management · Worker safety · Occupational safety and health · Data collections and algorithms · Artificial intelligence · Computer vision

T. Zoomer (B) · D. van der Beek · C. van Gulijk · J. H. Kwantes TNO, Work Health Technology, Leiden, The Netherlands e-mail: [email protected] D. van der Beek e-mail: [email protected] C. van Gulijk e-mail: [email protected] J. H. Kwantes e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 C. van Gulijk et al. (eds.), Reliability Engineering and Computational Intelligence for Complex Systems, Studies in Systems, Decision and Control 496, https://doi.org/10.1007/978-3-031-40997-4_12


1 Introduction Algorithmic management consists of a broad set of data collection and software tools to optimise work processes. Platform companies, such as Uber and Deliveroo, derive their right to exist from algorithmic management, but the methods are also spreading to traditional employers. By means of algorithmic management, certain tasks can be automated that are usually carried out by managers, namely the monitoring, assessment, comparison and guidance of employees. Algorithmic management can serve many purposes, such as achieving higher productivity or speeding up service delivery. Managing employees by algorithms, rather than by managers of flesh and blood, is already an established practice in many organisations in the Netherlands. For example, algorithmic management is used in large logistics warehouses because they are more effective in determining the most efficient route or sequence of activities and work seamlessly with other IT systems such as the warehouse management system (WMS) and enterprise resource planning (ERP) system. In these smart warehouses, employees follow the instructions of the algorithm and are evaluated by the same system on their productivity and compared to other employees [1]. Algorithmic management is also effective in enhancing occupational safety, for example by checking whether workers are wearing their personal protective equipment [2], whether they are entering danger zones and whether they are exhibiting safe driving behaviour. Breakthroughs in Artificial Intelligence (AI) seem to have created a large toolbox that benefits the safety and health of workers. Yet there appears to be a significant downside to algorithmic management: it can lead to less autonomy, low job quality and poorer employee health [3]. Algorithmic management is thus a paradox in terms of occupational safety. It is both an asset and a threat to workers’ health.

1.1 Reliability Engineering, Predictive-Based Safety and Algorithmic Management Reliability engineering has been a trailblazer in the digitalisation and datafication of safety predictions. The developments in algorithmic management have been running parallel to reliability engineering, but are now converging in the subdomain of Predictive Based Safety (PBS). PBS is a collection of techniques used to prevent occupational accidents. PBS works as follows: first, an attempt is made to learn as much as possible from data concerning occupational safety in the past. Data sources are used such as accident reports, historical project data, smart-phone data or sensor data. All data that can be related to occupational safety are relevant. On this data, an algorithm is trained that learns what went wrong in the past (descriptive analysis) and why (discovery analysis). The next step is then to determine what the improvement points are for occupational safety and how employees can prevent danger in the workplace (prescriptive analysis). In fact, the effect of possible future decisions


is quantified in order to choose the most favourable option. Thus, you do not need a workplace accident to correct unsafe conditions, but can experience this learning moment before anything goes wrong. PBS is often a complement to Behaviour-Based Safety techniques [4]. Although PBS is a relatively new approach, there are strong indications that it is going to play a major role in occupational safety. Companies at the forefront of adopting such technologies claim to have observed prediction accuracy rates of up to 86% and a reduction in recorded incidents of up to 60% [4]. PBS is also successful from an efficiency point of view. It automates manual data collection, analysis and report writing. Less compensation has to be paid to employees, more effective manhours are available and it leads to lower insurance costs [5]. PBS has now attracted the interest of major insurers such as AXA. This creates a huge force field, because if insurers offer discounts when organisations adopt PBS, its adoption is financially stimulated and will spread quickly. Such a dynamic is already underway outside the Netherlands [6]. PBS technology is advancing by leaps and bounds and there is recognition from the market. This success is leading to a professionalisation of PBS. Whereas previously it was mainly used as a research technique, based on manually and incidentally collected data, it is now taking on a more automatic and comprehensive character. Modern PBS can convert real-time data into safety insights, allowing safety monitoring, and even correcting or directing employees, to be done by a smart algorithm instead of a manager. Thus, PBS seems to converge with algorithmic management. The same combination of data collection and automatic control can be used to increase labour productivity and manage workers in general.
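As a rough illustration of the descriptive-discovery-prescriptive chain described above, the sketch below trains a simple classifier on entirely fictitious historical job records labelled with whether an incident occurred, ranks the learned risk factors, and scores a planned job. scikit-learn is used only as one plausible choice of tooling; this is not a description of any commercial PBS product.

```python
# Hedged sketch of a minimal PBS-style pipeline on fictitious data.
from sklearn.ensemble import RandomForestClassifier

# Each row: [night_shift, hours_worked, ppe_missing, near_miss_reports]
X = [[1, 10, 1, 3], [0, 8, 0, 0], [1, 12, 0, 2], [0, 9, 1, 1],
     [0, 7, 0, 0], [1, 11, 1, 4], [0, 8, 0, 1], [1, 12, 1, 3]]
y = [1, 0, 1, 0, 0, 1, 0, 1]   # 1 = incident occurred (descriptive label)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Discovery: which conditions drove incidents in the past?
features = ["night_shift", "hours_worked", "ppe_missing", "near_miss_reports"]
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda p: -p[1]):
    print(f"{name}: {importance:.2f}")

# Prescriptive: score a planned job before anything goes wrong
print("risk of planned job:", model.predict_proba([[1, 12, 1, 2]])[0][1])
```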

1.2 Responsible Use of Algorithmic Management But with these sophisticated, data-driven systems, employers should be wary: even when they are used purely for occupational safety, they can also harm rather than protect workers' health. Research shows that they can be bad for workers' autonomy and job quality [7–9]. Algorithmic management can put such pressure on employees that they go beyond their own limits, resulting in a variety of physical complaints and a higher risk of work-related accidents [10]. The success of PBS therefore raises new questions for the occupational safety domain. Professional PBS is the same as algorithmic management, as it is also used to monitor and control workers. This article discusses the desirability of algorithmic management in the safety domain and proposes the following three rules of thumb. Applying algorithmic management to increase safety is justified if it:

1. Is purely deployed for safety purposes, and not for other business purposes;
2. Can never be used to judge or discipline employees individually; and


3. The system is clear, transparent and understandable to all employees to whom it is applied.

To support these rules of thumb, the article discusses the technique of algorithmic management and illustrates it with a number of practical examples. Next, the legislation and regulation of algorithmic management is discussed. Finally, we consider what will be necessary to manage algorithmic management in the future.

2 The Many Forms of Algorithmic Management Algorithmic management is based on data from the workplace, of which there is an increasing amount available. The machines and tools used in the workplace are still rarely purely analogue. In factories and other workplaces, the Internet of Things is booming, and all kinds of objects continuously collect data [11]. Data can also be collected by processing camera images. Computers register raw camera images as a matrix of numerical values, which are suitable for arithmetic transformations. Using AI techniques, these numerical values can be linked to certain outcomes. Thus, a computer with a trained algorithm not only stores the camera images, but also recognizes in real time what is happening within the camera images. This is called computer vision, a technique that, thanks to AI breakthroughs in deep learning, has become very effective and has great potential in the field of PBS and algorithmic management [12]. To give a first example: Amazon automatically alerted their warehouse workers when they violate the COVID-19 social distance rules. This is made possible by an algorithm that uses camera images to recognise whether employees are standing too close to each other [10].

2.1 Personal Protective Equipment and Computer Vision In construction, there are still many opportunities for new safety techniques. Construction is the sector with the most serious occupational accidents, also in the Netherlands [13], and where little digitisation has taken place [14, 15]. Many techniques are still in their infancy or are only just being introduced. These include sensors, drones, AI, 3D scanning and digital models [16]. Especially in construction, it is difficult to build a technical infrastructure due to rapidly changing and temporary jobs, but also because the use of sensors is rather limited. Costs play a role here. For example, Infrared sensors can be placed in smart helmets so that it is clear when the helmet is worn or not, but this is still relatively expensive. The use of cameras with computer vision is cheaper. This technology can check whether workers are wearing their helmet, whether they are wearing a reflective vest and even whether they are wearing goggles, although the technology is not yet perfect [17]. It is difficult to recognise individuals in cluttered backgrounds, as is often the


case in construction. Also, workers can be occluded from view and it is difficult to make the correct identification when the worker is far away from the camera [17]. Nevertheless, this technique is mature enough to be offered on the market as a commercial product [2].
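A helmet or vest check of the kind described above ultimately reduces to comparing the labels returned by a detector for each frame against a list of required equipment. The sketch below illustrates only that last step; the detector itself (for example a deep network as in [17]) is out of scope, and the class labels, confidences and threshold are invented placeholders.

```python
# Hedged sketch: turning raw detector output into a PPE check. 'frame_detections'
# is the kind of labelled, scored output such a model might return for one frame.

REQUIRED_PPE = {"helmet", "reflective_vest", "goggles"}

def ppe_violations(detections, threshold=0.5):
    """Return the required PPE classes not confidently detected in this frame."""
    seen = {label for label, confidence in detections if confidence >= threshold}
    return REQUIRED_PPE - seen

# Example output of a hypothetical detector for one worker in one frame
frame_detections = [("person", 0.97), ("helmet", 0.88), ("goggles", 0.31)]

missing = ppe_violations(frame_detections)
if missing:
    print("PPE alert, missing:", ", ".join(sorted(missing)))
```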

2.2 Control on the Shop Floor In more controlled environments than construction, such as factories or warehouses in logistics, there are many opportunities for using computer vision software to improve occupational safety. In addition to checking for the use of personal protective equipment, AI can also recognise whether workers are in an unsafe area, performing unsafe actions and whether they are following safety procedures properly [18]. Computer vision can be used to track workers on the shop floor. When they enter an area or perform an action that is marked as unsafe in the PBS system, action can be taken, such as making a note, issuing a warning or stopping a machine as a preventive measure. In this way, “digital fences” can be erected on the shop floor. This can be more effective than real fences, as access to the area can be set dynamically. The area can be set as accessible only for a certain time, to workers wearing certain protective equipment, or even only to certain people. Certain areas can also be monitored more closely and checked for proper behaviour. For example, on staircases, the system can monitor whether employees are keeping their hands on the bannisters, whether employees are walking along the designated walkways, whether they are lifting objects with the correct posture and even if a worker has tripped and fallen. All of these features are already advertised by commercial software packages from companies such as AICS and Intenseye [2, 19]. Simply authorising specific employees to enter certain areas requires the recognition of individual employees, for example by using facial recognition software. While automatic security checks may still be endorsed by many employees, the widespread use of facial recognition software is undoubtedly more controversial.
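The "digital fences" described above amount to a rule check on every tracked worker position. The sketch below illustrates such a dynamic zone rule combining a time window and required protective equipment; the data model, zone coordinates and rules are invented for the example, and real systems would combine this with the tracking and recognition components discussed above.

```python
# Hedged sketch of a dynamic "digital fence": a zone is only accessible during
# a time window and with the required protective equipment.
from dataclasses import dataclass, field

@dataclass
class Zone:
    name: str
    x_range: tuple                          # (x_min, x_max) on the shop-floor plan
    y_range: tuple
    open_hours: tuple = (7, 19)             # accessible between 07:00 and 19:00
    required_ppe: set = field(default_factory=set)

    def violation(self, x, y, hour, worker_ppe):
        inside = self.x_range[0] <= x <= self.x_range[1] and \
                 self.y_range[0] <= y <= self.y_range[1]
        if not inside:
            return None
        if not (self.open_hours[0] <= hour < self.open_hours[1]):
            return f"{self.name}: zone closed at this time"
        missing = self.required_ppe - worker_ppe
        if missing:
            return f"{self.name}: missing {', '.join(sorted(missing))}"
        return None

press_area = Zone("press area", (0, 10), (0, 5),
                  required_ppe={"helmet", "ear_protection"})
print(press_area.violation(x=3, y=2, hour=21, worker_ppe={"helmet"}))
```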

2.3 Safe Driving Behaviour Amazon, one of the leaders in applying AI in the workplace, shows that privacy and safety can conflict. The company applies algorithmic management to its truck and parcel delivery drivers, among others. The Amazon vehicles are equipped with the Driveri system, which consists of sensors, cameras and computer vision software. The system monitors 16 different safety topics, such as whether the driver keeps a safe distance from other vehicles, whether the driver exceeds the speed limit, and whether the driver stops at stop signs. If this is not the case, the driver is automatically instructed accordingly and the bad driving behaviour is reported to the (human)


manager of the driver. The Driveri system not only monitors the vehicle’s surroundings, but also the driver. The system checks whether the driver is wearing his seatbelt and can even recognise when the driver is yawning. If that is the case, the system will instruct the driver to take a 15-min break [20]. There are no hard figures publicly available about the effectiveness of the Driveri system. Amazon undoubtedly knows these figures and will probably use the technology because management can see the effectiveness in the data. Whereas the system provides the management with more insight, for the drivers it is a major intrusion into their working day. After all, it is as if their boss is sitting next to them in the car all day, faultlessly noting and storing every little mistake. It is therefore not surprising that drivers are dissatisfied with the system. It is seen as a violation of their privacy and as a sign that Amazon does not trust the drivers. However, drivers have no choice: those who do not agree to the terms of the Driveri system cannot work for Amazon [20].

3 Guidelines for Responsible Algorithmic Management in Safety Amazon uses the Driveri system not only to obtain information about safety, but also to collect information about the worker’s performance, and to manage them. In addition, the employee is evaluated by his manager on the basis of the information collected. This is a typical example of algorithmic management, in which it is not used purely for safety, but to maintain general employer standards, automate management and achieve higher productivity.

3.1 The Conflicting Goals of Health and Productivity When algorithmic management focuses on achieving maximum labour productivity, there is a danger productivity will be gained at the expense of workers’ health. By means of algorithmic management, productivity can be measured and adjusted accordingly, by setting higher goals, increasing the pace of work or minimising rest periods in the work process. This is very demanding on workers. Amazon, for example, calls its warehouse employees “industrial athletes” and recommends that they live like an athlete in training so that they can cope with the heavy workload [21]. In practice, it leads to workers experiencing very high work intensity, stress and excessive exhaustion [10]. The high work rate also results in many physical complaints acquired at work: in the US, Amazon employees suffer twice as many serious physical complaints as in the rest of the industry [22]. In the Netherlands, at the post and logistics company PostNL, the high work pace leads to complaints


in the back, shoulder and wrists [23]. These health effects are endorsed in recent scientific research [7–10]. Algorithmic management can protect the health of workers, but it can also drive workers to such a high workload that it causes them to suffer health problems and increase the risk of accidents. Using algorithmic management for greater labour productivity is undesirable. Place the health of employees at the heart of company policy and only allow algorithmic management if it complies with the first rule of thumb, namely: algorithmic management must be used only for safety purposes and not for other company goals.

3.2 Fighting the System In addition to increasing the workload to dangerous levels, algorithmic management can also create great resistance among employees. As the example of Amazon’s Driveri system shows, employees feel distrusted and their privacy violated [20]. This is exemplary of the feelings of employees who are being monitored by algorithmic management [24]. In order to regain autonomy at work, workers start to thwart or even deliberately sabotage the system. There are many examples of this; Uber drivers have all kinds of tricks to manipulate the Uber app so that they do not have to drive long rides or rides with drunk bar patrons as passengers [25]. Journalists judged by an algorithm learned to manipulate the data provided so that the system gave them a high score [26]. American police officers and lawyers fought the algorithmic assessment of their work by blocking data collection and purposefully producing more data if necessary. They felt that the algorithm simplified their work and micro-managed them [27]. The message is clear: workers do not want to be monitored every moment of the day, not even through an automated system. Even if the aim is to increase labour safety, it creates resistance, as can be seen with Amazon’s Driveri system. This is disastrous, as the support and trust of the shop floor is essential for increasing safety. Algorithmic management, no matter how smart and ingenious, is no substitute for employee commitment and participation in a culture of safe work. If employees have a negative attitude towards the system, they will avoid, ignore or even sabotage it. As a result, safety rules will not be respected and unnecessary risks will be taken. The aim of employees then becomes to be as little affected by the safety system as possible instead of working safely. In order to respect employees and prevent resistance, the second rule of thumb must be adhered to: algorithmic management may never be used to judge or discipline employees individually. Even if they break safety rules. This can be easily complied with. Make sure that the insights of the algorithmic management system cannot be traced back to individual employees. With anonymous data, you still learn where, when and how unsafe situations occur, just not from whom. That is enough to improve safety, but not to punish employees if they do not behave in an exemplary manner. This way, employees do not feel spied upon, but their safety can be improved.
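One practical reading of the second rule of thumb is to aggregate safety events by place, time and type before they are stored or reported, discarding worker identities. The sketch below illustrates that "where, when and what, but not who" principle with invented event records; it is not a complete anonymisation scheme, since very small groups could still be re-identifiable.

```python
# Hedged sketch: aggregating safety events without keeping worker identities.
from collections import Counter

raw_events = [
    {"worker_id": "A17", "zone": "dock 3", "hour": 14, "type": "no_helmet"},
    {"worker_id": "B02", "zone": "dock 3", "hour": 14, "type": "no_helmet"},
    {"worker_id": "A17", "zone": "stairs", "hour": 9,  "type": "hands_off_rail"},
]

def anonymise(events):
    """Keep where, when and what happened; drop who it was."""
    return Counter((e["zone"], e["hour"], e["type"]) for e in events)

for (zone, hour, event_type), count in anonymise(raw_events).items():
    print(f"{zone} @ {hour}:00 - {event_type}: {count}")
```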


3.3 The Desired and Mandatory Transparency The first two rules of thumb, to only use algorithmic management for the purpose of occupational safety, and not to use it for individual assessment or discipline, go beyond what the law requires. The third rule of thumb, that the system must be clear, transparent and understandable to all workers to whom it is applied, also goes beyond the law for now, but it will most likely become a legal requirement in the coming years. At least, for platform companies. The European Commission made a proposal to improve the working conditions of platform workers. This includes requirements for the regulation of algorithmic management for platform workers. This is the first time that algorithmic management is explicitly mentioned in European legislation. The proposal states: “Algorithmic management is a relatively new and–apart from EU data protection rules–largely unregulated phenomenon in the platform economy that poses challenges to both workers and the self-employed working through digital labour platforms” [28]. The proposal focuses on transparency in the use of algorithmic management: algorithms that monitor or control platform workers must be transparent to platform workers and platform workers have the right to challenge automatic decisions made by algorithms [28]. The proposal for a new platform law has led to a significant change in the draft Artificial Intelligence Regulation of the European Commission, which is already well advanced in the legislative process. Algorithmic management systems that work with employees’ personal data are now classified as high-risk, resulting in a number of strict requirements. Algorithmic management systems must be thoroughly tested and documented, must operate on a high-quality dataset (without bias in the data) and must have human oversight that monitors the system [29]. The new platform law only seeks to regulate algorithmic management for platform workers. This is unfortunate, because algorithmic management is not only found in platform companies, and is also used by traditional employers. Every employee to whom algorithmic management is applied should have insight into what happens to their data and why certain decisions are made, just as employees should expect managers to explain important decisions that affect them. This ties in closely with the issue of trust and resistance to the system. A system that cannot be understood is difficult to trust. Therefore, the AI system must be well documented, and efforts must be made to make the model understandable, through explainable AI.

3.4 What Laws Apply Now The new Platform Act and the adjusted Artificial Intelligence Act will have a major impact on the regulation of algorithmic management in the future. But there are also laws that have an impact now.


Although algorithmic management is not (explicitly) included, the General Data Protection Regulation (GDPR) is of great importance to algorithmic management. After all, when algorithmic management is applied, the risk of privacy violations of employees is lurking. The requirements that are set for, for example, camera surveillance of employees, are considerable. There must be a legitimate interest and a necessity for camera surveillance. And the employer must perform a privacy test and a data protection impact assessment (DPIA). The Working Conditions Act also provides starting points to be cautious with the deployment of algorithmic management. If we know that these kinds of systems can lead to, for example, increased work stress for employees, then this is contrary to Sect. 3 of the Working Conditions Act, which states that the employer may not reasonably organise the work in such a way as to endanger the safety and health of employees. If a company plans to introduce systems aimed at monitoring employees on presence, behaviour or performance, the works council has a right of consent (art. 27, paragraph 1, under l. of the Dutch Works Councils Act).

4 Conclusion The current and future regulation of algorithmic management therefore does provide starting points, but also leaves a lot of room for companies to push the boundaries and design systems that are harmful to the health of employees. It is a new technique, on which lawmakers are just starting to focus and on which case law is scarce. Therefore, for the time being, many difficult choices regarding the enhancement of occupational safety through algorithmic management are on the plate of employers, occupational health and safety experts and works councils. We advise them to apply the three statements discussed when they want to use an algorithmic management system to increase safety. The use of algorithmic management is justified if it:

1. Is purely deployed for safety purposes, and not for other business purposes;
2. Can never be used to judge or discipline employees individually; and
3. The system is clear, transparent and understandable to all employees to whom it is applied.

These recommendations are consistent with a human-centred view of labour. The use of algorithmic management is not value-free. The human aspect of work is increasingly overridden by production pressures, administrative processes and “controls” that imply that management essentially does not trust employees; that employees do not share the same values and drivers as business owners. This creates the risk that employees are increasingly treated as interchangeable resources without recognition of their personality, creativity, innovation and any variety or diversity with added value for work in general and safety in particular. The lack of humanity that often occurs in algorithmic management systems can manifest itself in the form of disrespect (no time to go to the toilet), discrimination


(over-representation of certain groups as so-called low performers due to one-sided– biased–datasets), a lack of diversity, excessive work pressure and unrealistic performance expectations (ever-increasing targets with risks for safety and health). Such practices result in a growing distrust of algorithmic management systems among employees. Prevent this and involve employees in the plans and preconditions for the introduction and implementation of algorithmic management. It is important to build on shared trust instead of institutionalised, algorithmic distrust. Indeed, research by Interaction Associates in 2015 shows that companies with high trust are 2½ times more likely to be high-performing organisations than those with low trust [30]. This requires a positive approach to employees in relation to health and safety [31]. This means not standardising and quantifying work as much as possible, but rather creating space for workers autonomy under typically varying conditions. Algorithmic management systems that enhance safety certainly have a role to play here. They can inform both management and workers about safety opportunities through analysis, automate safety administration, and automatically intervene or sound warnings at critical moments, for example if a worker comes too close to a dangerous machine. Algorithmic management ideally complements worker participation and workers’ valuable knowledge of health and safety issues. While algorithms can detect certain unknown patterns, only workers understand the daily ‘real’ work (‘work as done’). Leaning solely on algorithmic management runs the risk of steering towards a paper reality (“work as imagined”) without taking into account the usually complex interaction between man, machine, organisation and work processes that ultimately determine occupational safety [32, 33].

5 Discussion In the coming years, many new policies on algorithmic management will appear. The new Platform Act shows that algorithmic management is seen by Brussels policymakers as a large and far-reaching phenomenon that will affect the working lives of many employees. Nevertheless, the proposed regulation of algorithmic management under the Platform Act is inadequate. Under this law, algorithmic management is regulated only for platform companies, but not for traditional employers. This would mean that the vast majority of the workforce would not have the rights to view and challenge algorithmic management systems. Ideally, these rights could be extended to all workers, through national or EU legislation. In addition, the high workload often associated with algorithmic management raises serious concerns about the health and safety of workers. In principle, a dangerously high workload is already prohibited under the Dutch Working Conditions Act, just like it is in many other nations. It should not make any difference whether employees are steered by an algorithm or a human manager to work at dangerous intensities. Nevertheless, explicit attention to these dangers in algorithmic management is desirable. The quantification of productivity very easily plays into the hands of overly high production targets, so that an excessively high workload seems to


be an inherent danger of algorithmic management. What will require attention is the dilemma that potentially arises from the deployment of such systems: does a supposed (or perhaps even proven) improvement in safety outweigh a reduction in the quality of work with ditto consequences for health? It will therefore be necessary to examine whether the employer has adequately substantiated these considerations and adequately monitors and adjusts the impact of the deployment of the system. Occupational health and safety experts, works council members and trade unions (in connection with the collective labour agreement) also play an important role in this discussion. Algorithmic management has hardly been studied compared to other work domains, certainly in the Netherlands. The first wave of studies show that there are dangers in the areas of health, autonomy, job satisfaction and that it leads to resistance among many employees. There is also much that is unclear. For example, it is not known how many employers are currently applying forms of algorithmic management or what proportion of the Dutch working population is already struggling with the phenomenon. However, the first signs are worrying. For example, in a study among its own members, the Dutch labour union FNV found that 14% of employees working from home are monitored by their manager using software to check whether they are really at work [34]. In addition to the exact scope of algorithmic management, more research is also needed into the experiences that employees have with this technology, to find out how the innovations can be embedded in such a way that they actually strengthen employees in their work and their safety. Reliability engineering and predictive based safety techniques can be used in algorithmic safety management and help to keep workers safe and reduce accidents. In fact, it will become harder to define what the exact differences are between predictive based safety and algorithmic (safety) management. One main difference is that in algorithmic management the interaction with the worker is of central importance, but research on how workers would like to interact with algorithmic management systems is absent. This factor will be crucial to successful, worker-friendly, algorithmic safety management.

References 1. Delfanti, A.: Machinic dispossession and augmented despotism: digital work in an Amazon warehouse. New Media Soc. 23(1), 39–55 (2021) 2. AICS. Intelligent Safety Systems: EHS Management Service-AICS (asus.com) (2023) 3. Zoomer, T., Otten, B.: Het algoritme de baas. Het algoritme de baas | Wiardi Beckman Stichting (wbs.nl) 4. Gattie, T.: Predictive-based safety: data, analytics and AI take safety programs to the next level (2019). https://www.newmetrix.com/ai-in-construction-blog/predictive-based-saf ety-takes-safety-to-next-level 5. Schultz, G.: Advanced and predictive analytics in safety: are they worth the investment? (2013). https://www.ehstoday.com/safety/article/21915659/advanced-and-predictiveanalytics-in-safety-are-they-worth-the-investment


6. Hall, S., Gettie, T.: Predictive-based safety: Reshaping the landscape of insuring construction projects. Predictive-based safety: Reshaping the landscape of insuring construction projects (axaxl.com) (2021) 7. Möhlmann, M., Zalmanson, L.: Hands on the wheel: Navigating algorithmic management and Uber drivers, in Autonomy. In: International conference on information systems (ICIS), pp. 10–13. Seoul South Korea (2017) 8. Wood, A.J., Graham, M., Lehdonvirta, V., Hjorth, I.: Good gig, bad gig: autonomy and algorithmic control in the global gig economy. Work Employ Soc. 33(1), 56–75 (2019) 9. Veen, A., Barratt, T., Goods, C.: Platform-capital’s ‘appetite’ for control: A labour process analysis of food-delivery work in Australia. Work, Employment and Society (2020). https:// doi.org/10.1177/0950017019836911 10. Wood, A. J.: Algorithmic management consequences for work organisation and working conditions (No. 2021/07). In: JRC Working Papers Series on Labour, Education and Technology (2021) 11. Eurofound.: Automation, digitisation and platforms: implications for work and employment. Automation, digitisation and platforms: Implications for work and employment (europa.eu) (2018) 12. Fang, W., Love, P.E., Luo, H., Ding, L.: Computer vision for behaviour-based safety in construction: a review and future directions. Adv. Eng. Inform. 43, 100980 (2020) 13. Inspectie SZW.: Monitor arbeidsongevallen en klachten arbeidsomstandigheden (2020). https://www.inspectieszw.nl/publicaties/rapporten/2020/05/14/monitor-arbeidsongevallenen-klachten-arbeidsomstandigheden 14. Committee for European Construction Equipment: Digitalising the construction sector. In: Unlocking the Potential of Data with a Value Chain Approach (2019). 15. Baldini, G., Barboni, M., Bono, F., Delipetrev, B., Duch Brown, N., Fernandez Macias, E., Nepelski, D.: digital transformation in transport, construction, energy, government and public administration (2019) 16. European Construction Sector Observatory: Digitalisation in the construction sector: analytical report (2021). https://ec.europa.eu/docsroom/documents/45547/attachments/1/translations/en/ renditions/pdf 17. Wu, J., Cai, N., Chen, W., Wang, H., Wang, G.: Automatic detection of hardhats worn by construction personnel: a deep learning approach and benchmark dataset. Autom. Constr. 106, 102894 (2019) 18. Fang, W., Ding, L., Love, P.E., Luo, H., Li, H., Pena-Mora, F., Zhou, C.: Computer vision applications in construction safety assurance. Autom. Constr. 110, 103013 (2020) 19. Intenseye.: AI powered workplace safety. In: Intenseye: AI for Workplace Safety–Intenseye (2023) 20. Palmer, A.: Amazon is using AI-equipped cameras in delivery vans and some drivers are concerned about privacy. CNBC (2021). https://www.cnbc.com/2021/02/03/amazon-using-aiequipped-cameras-in-delivery-vans.html 21. Clark, M.: Amazon’s newest euphemism for overworked employees is ‘industrial athlete’ (2021). https://www.theverge.com/2021/6/2/22465357/amazon-industrial-athlete-warehouseworker-wellness-pamphlet 22. Gartenberg, C.: Amazon’s serious injury rate at warehouses was still nearly double the rest of the industry in 2020 (2021). https://www.theverge.com/2021/6/1/22463132/amazon-injuryrate-warehouses-osha-data-report 23. Bremmer, D.: Het ‘ziekmakende’ werkregime bij PostNL: elke 4 seconden een pakketje sorteren. Algemeen Dagblad (2018) 24. Mateescu, A., Nguyen, A.: Explainer: algorithmic management in the workplace (2019). 
https:// datasociety.net/wp-content/uploads/2019/02/DS_Algorithmic_Management_Explainer.pdf 25. Lee, M.K., Kusbit, D., Metsky, E., Dabbish, L.: Working with machines: the impact of algorithmic, data-driven management on human workers. In: Proceedings of the 33rd Annual ACM SIGCHI Conference, Seoul, South Korea, pp. 1603–1612. ACM Press, New York (2015).

Algorithmic Management and Occupational Safety: The End Does …

187

26. Christin, A.: Algorithms in practice: comparing web journalism and criminal justice. Big Data Soc. 4(2), 2053951717718855 (2017) 27. Brayne, S., Christin, A.: Technologies of crime prediction: the reception of algorithms in policing and criminal courts. Soc. Probl. 68(3), 608–624 (2021) 28. European Commission.: Directive of the European parliament and the council on improving working conditions in platform work (2021a). https://ec.europa.eu/commission/presscorner/ detail/nl/ip_21_6605 29. European Commission.: Regulation of the European parliament and of the council: laying down harmonized rules on the artificial intelligence and amending certain union legislative acts (2021b). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52021PC0206 30. Interaction Associates.: The Little Book of Big Trust (2015). Retrieved from: Interaction Associates|Work Better Together. 31. Dekker, S.: Safety differently: human factors for a New Era. Collegiate Aviation Rev. 34(2), 107 (2016) 32. Hollnagel, E., Leonhardt, J., Shorrock. S., Licu, T.: From Safety-I to Safety-II. A White Paper. Brussels: EUROCONTROL Network Manager (2013) 33. Van Kampen, J., Van der Beek, D., Groeneweg, J.: The value of safety indicators. SPE Econ & Mgmt 6, 131–140 (2014) 34. NOS.: ‘Gluurapparatuur’ in trek door thuiswerken, vakbonden bezorgd. ‘Gluurapparatuur’ in trek door thuiswerken, vakbonden bezorgd (nos.nl) (2021)

Technologies and Solutions for Smart Home and Smart Office

Andriy Luntovskyy, Mykola Beshley, Dietbert Guetter, and Halyna Beshley

Abstract This work is dedicated to Smart Office and Smart Home design: a brief survey of the protocols, platforms, and best practices in use, considered against criteria such as price, ease of configuration and manageability, data security and privacy, and energy efficiency. The practical problems of IoT and IIoT component compatibility, as well as cloud independence, are discussed too. The authors mostly favor open-source and cloud-free solutions for Smart Home and established commercial platforms with advanced security for Smart Office, as shown in the presented case studies. Blockchaining of IoT and IIoT contributes to compulsoriness and commitment in the decentralized world of "smart things" at home and in office rooms. However, secure smart devices can only be achieved by combining known crypto-technologies; through a step-by-step provision of different blockchain-based platforms, the declared protection goals can be reached. The main features of IoT, Edge, and Cloud computing technologies are considered. An intelligent IoT system has been developed to collect data from a specific location and transmit it to an Edge device for on-site processing. These data are transferred to the cloud system for further processing or remote management, if needed. The solution aims to reduce the cost of maintaining the system by reducing the volume of messages sent to the cloud platform, which is quite expensive and a weighty factor in using the system, and to provide remote management without buying a fixed server part. An algorithm for predicting temperature values in the server room of a smart office has been proposed and implemented in the designed IoT system, based on a Raspberry Pi 4 and an LSTM model. Using this solution in practice will increase the reliability of server equipment through early prediction, warning, and taking measures in case of a temperature rise.

Keywords IoT and IIoT · Smart technologies · Radio and contactless sensing · Smart home · Smart office · HVAC (Heating, Ventilation, Air-conditioning) · Blockchain · CIDN · Energy efficiency · Energy harvesting · LSTM

1 Motivation and the Aims of the Work

The concepts of Smart Home and Smart Office regularly rely on the following design approaches and targets: cross-layered design (PHY-MAC-NWK-SEC-APL), Low-Duty-Cycle operation, data security, and energy efficiency [1–7]. The IoT technologies listed in Table 1, or their combinations and modifications, are usually deployed to reach the above-mentioned goals. IoT and IIoT devices mostly use the M2M (machine-to-machine) communication style with uplink traffic domination. The following further parameters and distinguishing features are required [1–10]:

(1) Low to medium data rates;
(2) Good network coverage;
(3) Interoperability with 4G/5G mobile radio [11];
(4) Advanced security, which can be based on Blockchain and collaborative intrusion detection networks (CIDN) [1–10, 12, 13].

Table 1 IoT technologies (radio/contactless) for smart home and smart office

Type | Title | Frequency | Purpose
Radio sensors, WAN | SigFox | 868 MHz (EU), 902 MHz (USA) | Low-power wide-area network
Radio sensors, WAN | LoRa | 2.4 GHz, 868/915 MHz, 433 MHz, 169 MHz | Low-power wide-area network
Radio sensors, WAN | NarrowBand-IoT | 700–2100 MHz | 4G mobile network
Radio sensors, WAN | LTE-CatM | 700–2100 MHz | 4G mobile network
Radio sensors, LAN | WiFi5 IEEE 802.11ac | ISM 2400/5000 MHz | Wireless LAN
Radio sensors, LAN | WiFi6 IEEE 802.11ax | ISM 2400/5000 MHz | Wireless LAN
Radio sensors, PAN | ZigBee IEEE 802.15.4 | 900/2400 MHz | Low-power personal-area network
Radio sensors, PAN | 6LoWPAN | ISM 2400/5000 MHz | Low-power personal-area network
Radio sensors, PAN | Bluetooth IEEE 802.15.1 | ISM 2400/5000 MHz | Low-power personal-area network
Radio sensors, PAN | EnOcean | 868 MHz (EU), 928 MHz (Japan), 902 MHz (USA) | Low-power personal-area network (energy harvesting)
Contactless sensors (tags, labels, chip cards) | RFID | LW 125–134 kHz, SW 13.56 MHz, UHF 865–869 MHz (EU), UHF 950 MHz (US/Asia), SHF 2.45/5.8 GHz | Multiple standards
Contactless sensors (tags, labels, chip cards) | NFC | 13.56 MHz | Standardized RFID, ISO 14443/ISO 15693

Modern IoT technologies provide the ability to exchange information in various volumes and over various distances, from a few meters up to tens of kilometers. As the need to transfer information of varying volume and quality-of-service requirements grows, combining these technologies to build smart infrastructures becomes urgent. The convergence of IoT technologies for developing smart infrastructures is depicted in Fig. 1.

Fig. 1 IoT technologies convergence for developing smart infrastructures

Each technology has its own parameters and can therefore serve only a limited range of tasks. The primary parameters are throughput and transmission distance; the remaining parameters are of secondary importance and primarily affect the ability of the sensors to form a stable network structure that satisfies the quality-of-service requirements. Narrow-band technologies are often physically combined in a single device. Such a combination provides additional flexibility in data transmission because alternative delivery routes exist: if one device has exchanged information with another device using one technology, it can forward that information to a third device using another technology. The key point in this method of information transfer, however, is the compatibility of the technologies, that is, whether the overall quality of service allows data to be delivered to the destination on time and either minimally distorted or equivalent to the original. This condition imposes some limitations when combining different technologies, for example, on the formation of a local wireless network structure in which each device (a link of the local network) can make its own adjustments to the quality of service.

Data is transmitted over wireless networks for convenience: there is no binding to a permanent location, so devices can move around and send data to the destination without much obstruction. For the static portion of the sensors there is no acute need to rebuild the network as there is when devices move. Nevertheless, one common problem applies to both static and dynamic devices (sensors): the use of spectrum, which can be either licensed or unlicensed.
Licensed spectrum provides more QoS (Quality of Service) options than unlicensed spectrum. When transmitting data in unlicensed spectrum, there is almost no guarantee that the data will reach its destination on time, which matters for real-time critical services. Licensed spectrum ensures better service quality by regulating the use of radio frequencies and controlling quality standards: licensed operators have access to specific frequency ranges and must adhere to established norms and rules, which allows them to provide greater network stability and reliability. Unlicensed spectrum may suffer more interference and signal loss, which can negatively affect service quality. Therefore, depending on the application area and the type of services to be provided within smart systems, an important task is to select the optimal IoT technologies and solutions for their construction.


2 Challenges for Smart Home and Smart Office

The works [5–7, 12] examine multiple case studies and answer the key questions about secured, blockchain-based, and energy-efficient Smart Home and Smart Office solutions. The following features are considered:

(1) Small energy consumption and energy efficiency;
(2) Wide interoperability with 5G and Beyond (6G), WSN, RFID, NFC, robotics, and wearables;
(3) New efficient communication models with decentralization: M2M communication style, fog-based models, and P2P instead of the conventional client–server model, both cloud-centric and fog-decentralized.

In short, the slogan is "Energy-Efficient and Blockchained IoT Systems" (refer to Fig. 2). Energy efficiency via the so-called "Low Duty Cycle" principle is provided under a holistic multilayered approach [1–7]: wireless sensor piconets are, in general, developed uniformly, with optimization of energy consumption across all layers of the OSI reference model, top-down and across. Minimal energy consumption is foreseen only for the "Idle" and "Sleep" modes, whereas high energy consumption is typical for "Transmitting", "Receiving", "Computing", and state transitions. The ratio a for the duty percentage is:

a = Tduty/Toverall = Tduty/(Tduty + Tsleep)    (1)

Mostly sleeping sensors are therefore energy-efficient. The reference value for the duty cycle a is about 7–10%.
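To make Eq. (1) concrete, the following minimal Python sketch computes the duty cycle and the resulting average current draw of a sensor node; the timing and current figures are illustrative assumptions, not measurements from the chapter.

```python
# Minimal sketch: duty cycle (Eq. 1) and average current of a sensor node.
# The timing and current figures below are illustrative assumptions only.

def duty_cycle(t_duty: float, t_sleep: float) -> float:
    """a = Tduty / (Tduty + Tsleep)."""
    return t_duty / (t_duty + t_sleep)

def average_current(a: float, i_active_ma: float, i_sleep_ma: float) -> float:
    """Weighted average current draw in mA for a given duty cycle a."""
    return a * i_active_ma + (1.0 - a) * i_sleep_ma

if __name__ == "__main__":
    t_duty, t_sleep = 2.0, 28.0            # assumed seconds awake vs. asleep per cycle
    a = duty_cycle(t_duty, t_sleep)
    i_avg = average_current(a, i_active_ma=20.0, i_sleep_ma=0.01)  # assumed currents
    print(f"duty cycle a = {a:.2%}, average current = {i_avg:.3f} mA")
```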

Fig. 2 Challenges for IoT [5, 6]


3 System Integrators for Smart Office

In the world of IoT and IIoT, there are multiple platforms to support the integration of the existing sensor hardware (i.e., smart things). The system integrators for IoT and Smart Home possess proven know-how and, as a rule, use the following cloud-based or cloud-centric platforms [1–3, 9]:

• IBM Watson IoT Platform;
• Microsoft Azure IoT Hub and Win for IoT;
• Google Cloud IoT;
• Amazon AWS IoT;
• SAP Internet of Things, SAP Leonardo IoT, SAP Edge Service.

There are many further cloud-free commercial and freeware integrators too. They are more relevant for Smart Home (refer to Sect. 4). The advanced platforms guarantee integrated connectivity and interoperability and address the following challenges for Smart Office (refer to Fig. 2):

• For IoT devices, NB-IoT, EnOcean, and LoRaWAN are the most interesting technologies, with low power and long-range communication;
• At the edge of a modern IoT network, 4G, 5G and Beyond, Wi-Fi, Ethernet, and LTE-CatM can still be used [14]. These technologies are augmented via RFID/NFC, refer to [1–7].

On the other hand, multiple open-source software solutions and platforms for IoT device integration can be mentioned, such as Robot OS, OPC UA, RabbitMQ, Mosquitto, and Automation ML tools, which are based on the above-listed application protocols [3]. These can be taxonomized by their universality and the supported communication protocols: from the simplest tools and frameworks up to whole integration platforms. The next distinguishing feature is so-called advanced security.

3.1 Secure IoT Platforms for Smart Office

Taken a step further, the step-by-step provision of different blockchain-based platforms and solutions can help reach the desired energy and data efficiency as well as the protection goals. As a reference, an SAP-based IoT integration platform can be mentioned (Fig. 3). Nowadays, blockchain platforms play an increasing role (refer to Sect. 5). More and more hyper-scalers offer their own cloud solutions:

• IBM Blockchain;
• Microsoft Azure Blockchain;
• Oracle Blockchain Cloud Service.


Fig. 3 SAP-based IoT integration [5, 6]

They can be used as an entry point to connect with IoT and other enterprise solutions to create complete business processes. Besides the enterprise-grade platforms, there are also existing blockchain technologies like Ethereum, Hyperledger Fabric, Quorum, Stellar, Ripple, Bitcoin, and others, with their corresponding platforms; however, they are mainly used as open networks outside of business scopes [5, 13]. Multiple case studies and best practices are provided in [1–7, 12], inter alia:

(a) Monitoring and management via combined RFID/Wi-Fi solutions, blockchain-based supply chain management;
(b) Star without gateway (GW): NB-IoT, LTE-CatM;
(c) Star with GW: EnOcean;
(d) Mesh with GW: EnOcean;
(e) LoRaWAN with GW;
(f) Microcell-based energy-efficient hierarchical WSN/Wi-Fi system;
(g) GW-based WSN/Wi-Fi with layered infrastructure;
(h) Annual costs for WSN or Wi-Fi, etc.

Let us consider the following two scenarios in detail:

• Scenario 1: Automation sensors via NB-IoT;
• Scenario 2: Energy-efficient EnOcean sensor constellation.

3.2 Scenario 1: Automation Sensors via NB-IoT

The results of the simulation, the automated position constellation of the sensors, and manual power monitoring are given below (refer to Figs. 5 and 6). The simulation algorithms use empirical propagation models adapted for indoor and outdoor scenarios as well as LOS, multiwall, and NLOS environments [1–10, 12, 15–18], using frequencies from F = 868 to 2400 MHz and diverse material attenuation factors for the walls and building sections. The favoring factors for a star constellation for NB-IoT (refer to Fig. 4) are listed in Table 2.
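The empirical propagation models mentioned above can be illustrated with a minimal log-distance/multiwall sketch in Python; the path-loss exponent, per-wall attenuation values, and transmit power below are illustrative assumptions, not the parameters used in the chapter's simulations.

```python
import math

# Minimal sketch of a log-distance/multiwall path-loss estimate of the kind used
# in empirical indoor propagation models. Exponent, wall losses, and Tx power
# below are illustrative assumptions only.

def free_space_path_loss_db(distance_m: float, freq_mhz: float) -> float:
    """Free-space path loss in dB for a distance in metres and frequency in MHz."""
    return 20 * math.log10(distance_m) + 20 * math.log10(freq_mhz) - 27.55

def multiwall_path_loss_db(distance_m, freq_mhz, exponent=3.0, walls=()):
    """Log-distance model plus per-wall attenuation (multiwall extension)."""
    pl_1m = free_space_path_loss_db(1.0, freq_mhz)   # reference loss at 1 m
    return pl_1m + 10 * exponent * math.log10(distance_m) + sum(walls)

if __name__ == "__main__":
    # An 868 MHz link over 40 m through two concrete walls (assumed 12 dB each)
    loss = multiwall_path_loss_db(40.0, 868.0, exponent=3.0, walls=(12.0, 12.0))
    rx_dbm = 14.0 - loss                             # assumed 14 dBm transmit power
    print(f"path loss = {loss:.1f} dB, received power = {rx_dbm:.1f} dBm")
```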


Fig. 4 Scenario 1 on a star constellation for NB-IoT [6]

Fig. 5 Scenario 2 on mesh with GW based on EnOcean [6, 8]

QoS and data security must be provided by the 4G/5G mobile providers within the framework of an SLA (service level agreement). The sensors can monitor the following parameters for office rooms: CO2 concentration, humidity, and temperature (refer to Table 3).


Fig. 6 Scenario 3 for smart home based on mesh Wi-Fi with 3D-roaming

Table 2 The favoring factors for EnOcean and NB-IoT constellations [6]

Favoring factors for EnOcean | Favoring factors for NB-IoT
Star and mesh topology with energy harvesting | Star topology for NB-IoT
A large number of sensors within a small area | A small number of sensors within a large area
Time-critical applications, such as a workplace booking system | Applications that do not require a live view
Medium- to long-term planned use of the sensor network | Requirement for rapid establishment of a sensor network
Square office complexes without tubular high-building sections | Low transmission frequency of the sensors, low duty cycle

Table 3 Comparison of EnOcean and NB-IoT sensor functionality [6, 8, 9]

WSN type | Sensor title | Vendor | Deployment and measured parameters
NB-IoT/LTE Cat NB1 | NB-IoT comfort sensor | IM Buildings | CO2, humidity, temperature
EnOcean | EnOcean wireless CO2 sensor | Pressac | CO2, humidity, temperature


3.3 Scenario 2: Energy-Efficient EnOcean Sensor Constellation

Different topologies are possible [6, 8]:

• An EnOcean star with a gateway (GW) and energy harvesting at F = 868 and 2400 MHz;
• An EnOcean mesh with GW and energy harvesting at F = 700–2100 MHz;
• A star without GW as an alternative solution via NB-IoT or LTE-CatM.

The favoring factors for a star constellation for EnOcean (refer to Fig. 5) are listed in Table 2. A further energy-efficient solution is a sensor constellation aimed at the best positions for solar energy harvesting [6, 8]: the mesh of EnOcean sensors communicates with the presented one-hop GW and then with the cloud, with dedicated application servers for data acquisition, processing, and database retrieval. This rather requires a commercial cloud-centric integration platform. The comparison as well as the favoring factors for the EnOcean and NB-IoT constellations are grouped in Tables 2 and 3, respectively.

4 Platforms for Easy Smart Home Integration

As already stated, there is a wide spectrum of communication technologies that are practically relevant for IoT device integration (refer to Table 1). LAN and Wi-Fi deployment is relevant too for data flows based on "GW-to-routers" and "GW-to-complex-devices" connectivity, for instance PCs, notebooks, MFPs, NAS, smart TVs, radios, etc. As a rule, the Wi-Fi mesh networks use the 3D-roaming protocol IEEE 802.11s (Fig. 6). The presented building with multiple floors provides a "smart" Wi-Fi mesh network with 3D-roaming for the listed networking devices. Furthermore, the following standards can be used for the mobile IoT boards in an intranet (refer to Sect. 1, Table 1) for connecting small and simple devices: ZigBee (e.g., for Alexa GWs for LED lamps, subwoofers, mini-cameras, digital cinemas, etc.) and its multiple clones such as ConBee II [9, 18], Bluetooth (BT), as well as Z-Wave, EnOcean, and Homematic IP, which are established on the market. An IoT GW supports protocols such as Z-Wave, Wi-Fi, ZigBee, and Bluetooth. The GWs of multiple vendors should be compatible with each other. Voice control can be provided via Amazon Alexa or Google Home; a Philips Hue Bridge also enables simple control by voice via Alexa. However, using license-free ISM bands leads to coexistence problems between Wi-Fi, ZigBee, and Bluetooth [8–10, 15]. The compatibility between diverse network technologies is an essential problem [19, 20].


Fig. 7 Scenario 4 for smart home (HVAC, lighting) based on further sensing protocols [2, 15]

For instance, the choice of a concrete technology and supporting platform [9, 10, 15–18] is problematic because multiple companies do not fully implement the up-to-date ZigBee 3.0 standard, and therefore the integrators have to work around a lot of compatibility problems [19, 20]. The Smart Home based on further sensor protocols, i.e., BT, controls HVAC (Heating, Ventilation, Air-Conditioning) and lighting, as shown in Fig. 7. The designers frequently face the following problems:

• System interoperability;
• Confusion and obtrusiveness of different user hardware and software interfaces;
• Lack of data security and privacy due to proprietary software or external cloud-based data storage (theft of data, outflow of data to companies);
• Requirement of IoT connections for reliable control.

The concept of a mobile IoT board for testing and learning purposes using ZigBee is depicted in Fig. 5. The ZigBee bridge provides the passage for the traffic from IoT devices to the LAN or Wi-Fi and then to the Internet. As best practices for the design platform alternatives, three variants are considered below:

• Home Assistant v5 [10];
• Tuya IoT Development Platform [15, 16];
• Azure IoT Hub [17].

The discussed platforms are largely open-source and have the following advantages and disadvantages regarding:

• Cloud-centricity;
• Data security;
• Language support and manageability.

Let us consider them in more detail.


4.1 Tuya IoT Development Platform

Tuya Inc. is a Chinese software company [15] founded in 2014 and focused on developing AI and IoT solutions. The Tuya Smart platform provides multiple solutions related to Smart Home too. Tuya provides a cloud-based platform that connects multiple IoT devices. The vendor offers various inexpensive solutions in a wide range of Smart Home areas, and therefore its hardware was selected as the basis here. The products of other companies are often compatible and can also be integrated, e.g., LED lamps from Philips; however, the compatibility must be tested separately for each type of IoT device. In Fig. 8, a workflow for a Tuya cloud-based solution is depicted. Unfortunately, a number of such solutions are neither sufficiently secure nor uncritical with respect to privacy. The most important details of the hack of the Smart Home and IoT platform from the Chinese vendor Tuya have been published in [16]. Provisioning is the process of initializing and preparing the IoT network to provide its set of services to the users. During provisioning, an IoT device listens to broadcast packets without being on the Wi-Fi itself. Subsequently, it communicates unencrypted with Tuya servers in the cloud, and it is then easy to intercept this traffic.

4.2 Home Assistant

The author and founder of the concept [10] is Paulus Schoutsen, together with the Home Assistant (HA) Core Team and Community (2013). According to his own apt statement: "A good home automation never annoys, but is missed when it is not working." The integration platform runs as a software appliance or virtual appliance (Linux) and is written in the programming language Python 3.9.

Fig. 8 Tuya cloud-based solution (based on [15])


Its category of use is Smart Home and IoT under the free license of the Apache Software Foundation. HA v5 is nowadays a free and open-source home automation software package designed to be the centralized management system in a Smart Home. Written in Python too, it focuses on local control and privacy within an intranet. It supports a wide range of IoT devices through modular plugins for various IoT sensor technologies, systems, and integrators' services.

4.3 Azure IoT Hub

Azure IoT Hub [17] is a part of Microsoft's Azure portal and a managed service hosted in the cloud. Its most important advantage is simple and fast scalability. PaaS (Platform as a Service) is used as the cloud service model. Azure IoT Hub offers reliable and secure two-way communication between IoT devices and applications with a wide range of functions. The clients are divided into devices and services. The implementation can be done with the specialized SDKs for Java, Node.js, Python, .NET, and C#, as well as with standard protocols such as [1–7, 12]:

• MQTT (Message Queuing Telemetry Transport);
• AMQP (Advanced Message Queuing Protocol);
• REST (Representational State Transfer) over HTTPS.

Smart applications have become not only cloud-centric but also container- and microservice-based. The main tasks of each IoT system are to collect data from sensors and send them to a message broker in the cloud through a specific protocol. The broker then passes the data to queues that other applications use to transform the data; after transformation, the data are stored in a database as a time-series representation. Before deploying an architecture for a smart system, it is important to pay attention to the following components:

• Data format;
• Data transfer protocol;
• Broker for message processing and GW;
• Data processing layer;
• Data storage;
• Visualization and monitoring tool.
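As an illustration of the device-side SDKs listed above, the following minimal Python sketch sends one telemetry message to an IoT hub using the azure-iot-device package; the connection string and payload fields are placeholders, and the sketch is a generic example rather than the chapter's actual device code.

```python
import json

from azure.iot.device import IoTHubDeviceClient, Message  # pip install azure-iot-device

# Minimal sketch: send one telemetry message to an Azure IoT Hub.
# The connection string and field names below are placeholders (assumptions).
CONNECTION_STRING = "HostName=<your-hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

payload = {"temperature": 22.4, "humidity": 41.0}
msg = Message(json.dumps(payload))
msg.content_type = "application/json"
msg.content_encoding = "utf-8"
client.send_message(msg)        # one telemetry message to the hub

client.disconnect()
```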

In IoT systems, data transmission between devices and the cloud is a major cost driver. Especially when devices are connected through a mobile network, data transfer can be quite costly; moreover, when it comes to a large number of devices, even the traffic to the cloud platform affects the costs. Therefore, it is important to choose a data format with the best payload-to-overhead ratio.


The most common formats currently available are:

• XML (Extensible Markup Language);
• JSON (JavaScript Object Notation);
• Byte stream.

Figures 9 and 10 depict a comparison of the three formats of data representation, which shows that JSON outperforms XML and the binary representation in all compared factors [18]. It is worth noting that JSON is currently the most popular format among information systems; it is popular due to its simplicity, lightness, and the libraries available for many programming languages. Based on the comparison results, the JSON format is best suited for transferring data from sensors to the message broker. The data transmission protocol is also an important component affecting the latency between the message broker and IoT devices.

Fig. 9 Comparison by message size

Fig. 10 Comparison of serialization and deserialization time
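As a small illustration of the size comparison in Figs. 9 and 10, the following Python snippet serializes the same sensor reading as JSON and as XML and prints the message sizes; the field names are illustrative assumptions.

```python
import json

# Minimal sketch: one sensor reading serialized as JSON and as XML, to show
# the kind of size comparison in Figs. 9 and 10. Field names are assumptions.
reading = {"deviceId": "office-42", "temperature": 22.4, "humidity": 41.0}

json_msg = json.dumps(reading, separators=(",", ":")).encode("utf-8")
xml_msg = (
    "<reading><deviceId>office-42</deviceId>"
    "<temperature>22.4</temperature><humidity>41.0</humidity></reading>"
).encode("utf-8")

print(f"JSON: {len(json_msg)} bytes, XML: {len(xml_msg)} bytes")
```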


Fig. 11 Comparison of HTTP and MQTT by transmission time

Protocols are characterized by the following parameters:

• Transmission;
• IoT device energy consumption.

HTTP and MQTT are the protocols most used in IoT systems, so these two protocols are compared. Data from a study by the IoT cloud provider Flespi is used for the comparison. Based on Fig. 11, MQTT is about 25 times faster than HTTP. This is because MQTT constantly reuses the same connection to the server, whereas HTTP opens a new connection each time; another reason is the overhead of HTTP messages. Figure 12 compares HTTP and MQTT by the amount of transferred data and the transfer time for 1000 messages: MQTT is about 20 times faster and requires about 50 times less traffic, which makes it cost-effective. Another advantage of MQTT is that, unlike HTTP, it is completely asynchronous, which further reduces latency. MQTT can also be used to subscribe IoT devices to administrative topics, making it possible to administer all devices in the network. Based on this research and these comparisons, MQTT is used as the protocol for sending data to the message broker in this system.
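As a concrete illustration of this choice, the following minimal Python sketch publishes a single JSON-encoded sensor reading to an MQTT broker using the paho-mqtt library; the broker address, topic name, and payload fields are illustrative assumptions, not the system's actual configuration.

```python
import json
import time

import paho.mqtt.publish as publish  # pip install paho-mqtt

# Minimal sketch: publish one JSON-encoded sensor reading over MQTT.
# Broker host, topic, and field names below are illustrative assumptions.
BROKER_HOST = "broker.example.local"
TOPIC = "office/serverroom/telemetry"

reading = {
    "deviceId": "office-42",
    "timestamp": int(time.time()),
    "temperature": 22.4,
    "humidity": 41.0,
}

publish.single(
    TOPIC,
    payload=json.dumps(reading),
    qos=1,                 # at-least-once delivery to the broker
    hostname=BROKER_HOST,
    port=1883,
)
```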

4.4 Heterogeneous Automation Example for Smart Home

Osram, Amazon, Bosch, ConBee/ZigBee and CC2531 (sticks from Dresden Elektronik [13]), Samsung, Tuya, Xiaomi, and Aqara devices can be used together in a heterogeneous automation example for Smart Home. Still, compatibility problems between the different manufacturers remain, despite the same protocols [1–7, 20, 21]. The presented example is intended for learning purposes.


Fig. 12 Comparison of MQTT and HTTP by time and traffic

Fig. 13 A box with multiple built-in IoT devices (own construction of Dietbert Guetter)

The hardware is deployed and used for an appropriate IoT course at the Saxon Academy of Studies, BA Dresden University of Cooperative Education [20, 21]. The IoT device programming is easy: a simple menu-driven smartphone interface based only on "IF–THEN–ELSE" rules [20, 21]; some Python-like languages can be used alternatively. A box with multiple built-in IoT devices is depicted in Fig. 13. The top plate for the mobile (battery-based) IoT devices is shown in Fig. 14. A preliminary inventory list of selected compatible components is shown in Table 4; in line with the learning goals, the list was deliberately kept price-sensitive [20, 21].


Fig. 14 The top for the mobile IoT devices (battery-based, own construction of Dietbert Guetter)

Table 4 Inventory list for selected compatible components

Vendor | Trade mark | Functionality
Pearl | 7Links | ZigBee GW (Tuya compatible)
Pearl | 7Links | Wi-Fi security camera
Pearl | Revolt | ZigBee radiator thermostat
Pearl | VisorTech | RF GW
Pearl | VisorTech | RF remote control
Lidl | SilverCrest | ZigBee intermediate plug
Lidl | SilverCrest | ZigBee movement sensor
Lidl | SilverCrest | ZigBee door/window closing contact + alarm
Lidl | Livarno | ZigBee LED-lamp (color)
TP-Link | – | Router (TL-WR710N)
ELV | – | Siren (122,772)
Ikea | TradFri | ZigBee LED-lamp

4.5 Testing of MAKS PRO System Based on LoRaWAN for Deployment of Smart Homes and Smart Offices

We describe in detail the functionality and technical characteristics of the MAKS PRO system for a smart home or office on the basis of experimental studies (Fig. 15). The MAKS PRO testbed is deployed at Lviv Polytechnic National University for students who study and research IoT systems, networks, and communications in general; practical studies allowed a better assessment of its properties. For communication with sensors and detectors, the MAKS PRO system uses LoRaWAN technology operating at 868.0–868.6 MHz with channel redundancy and a range of up to 2000 m in open space.


Fig. 15 MAKS PRO LoRaWAN-based wireless system for smart homes and smart offices deployment

The radio communication is secure, two-way, and encrypted with a 256-bit key. The MAKS PRO wireless system for smart homes or offices has proven to be a reliable and flexible system that is easy to use, install, and configure, with an intuitive interface, reliable communication, and many security functions. We also noted the long lifetime of the devices, the placement of temperature sensors in most detectors, and the well-realized interaction of the functional modules. The system can be configured via the Web application and via the smartphone application. A typical wireless system for a Smart Home/Office includes:

• a main unit with a Wi-Fi/GPRS/LTE/Ethernet module or a GW for collecting information from sensors using LoRaWAN (MAKS PRO WI-FI);
• motion and glass-break sensors (MAKS PIR Combi);
• a high-decibel siren or alarm (MAKS Siren);
• water leakage sensors (MAKS Water);
• magnetic contact sensors (MAKS WDC mini);
• smoke detectors (MAKS Smoke).

We also researched whether LoRaWAN technology really allows the sensors to communicate with the system GW over long distances. The signal level was measured as a function of the distance to the system GW using the MAKS PRO testbed and a smartphone with the MAKS Setup app installed (Fig. 16). As a result, at distances of more than 200 m in a built-up environment there is no communication between the system GW and the sensors and detectors; in such an environment, it is reasonable to install sensors and detectors at a distance of no more than 60 m from the main system GW.


Fig. 16 MAKS PRO LoRaWAN-testbed

Note that the allowable minimum signal strength is −95 dBm. Since the signal level from different devices at the same point did not differ significantly, we can assume that all sensors use the same transceiver module or modules of the same power (Fig. 17). A practical solution to exploit the system's potential is a modification that allows devices to connect to the smart system GW using Wi-Fi technology, as LoRaWAN cannot provide sufficient transmission speed to broadcast a video stream between the system devices. Currently, connecting video surveillance devices requires connecting the video camera to the Internet; the concept here connects the video camera directly to the alarm system. Since systems with LTE modules and an Ethernet connection are commercially available, the video surveillance system can be used in combination with the MAKS PRO system without a Wi-Fi router. Another significant advantage of this idea is the ability to communicate with the Internet using LTE technology only. A system in this configuration relies on the availability of 4G communication and coverage in areas where there may not yet be high-speed cable Internet. For example, a smart system may need to be installed in a house in the countryside where there is no wired Internet, but where there is 4G coverage and satisfactory signal strength. There is no need for cable Internet because the house is rarely visited, but the owners worry about the safety of valuable things. In this case, it is more reasonable to use the system as a self-sufficient network that controls all available sensors and video surveillance devices and connects directly to the base station with a 4G connection [22].


Fig. 17 Graph of dependence of the device signal strength via LoRaWAN technology on the distance to the system GW (dBm is a signal strength; l is the distance between sensors and GW)

The monthly payment for a 4G communication package is cheaper than paying for a wired Internet package. One can also compare the price of bringing cable Internet to sparsely populated areas with the price of buying a starter package from one of the mobile operators: the 4G option saves considerable money.

5 Advanced Security for IoT and IIoT

The assignment of the protection goals [1–4] to the security mechanisms known for IoT and IIoT solutions, as well as for Smart Home and Smart Office, is given in Table 5.

Table 5 Protection goals and assigned mechanisms

Protection goal | Security mechanism | Example
Confidence | Encryption | AES
Compulsoriness | PKI or blockchain | RSA, hash
Authorized resource access | Authentication | Login
Integrity | Secured data, message authentication code | CRC, hash
Blocking of unauthorized accesses to the networks | Firewalls and collaborative intrusion detection | Firewalls, IDS, CIDN


Conventional security is provided with the Public Key Infrastructure (PKI) method. This method is relevant for bilateral communication models with a third party, like the conventional "client–server–certification authority". Aimed at multi-lateral communication in sensor networks (peer-to-peer), the above-mentioned PKI method must be augmented via Blockchain (BC), despite its complex as well as resource- and energy-consuming character [1–7]. Workflow decentralization in IoT and IIoT scenarios requires so-called Advanced Security too. Advanced Security for IoT devices is supplied via BC technology as a continuously expandable list of data records (hashes and transactions) in individually chained blocks [3–5, 19], aimed at secured transaction processing. Modern IoT applications and mobile apps are complex nowadays and consist of multiple communicating parts. BC, in combination with CIDN (Collaborative Intrusion Detection Networks), provides better security for IoT solutions [1–5, 14]. The potential benefits of so-called "Blockchained Smart Apps" are service decentralization and choreography, compulsoriness of the workflow steps for critical applications, and multi-lateral communication models like, e.g., M2M (machine-to-machine). The Blockchain can be used to store collected sensor data in a secure, transparent, and tamper-proof manner. It can be used within a company and in a collaboration of multiple companies [3–7, 12, 19] too. The compulsoriness of the workflow steps for multilateral communication is required here!

The Blockchain is a decentralized network that consists of interconnected nodes in which data and information are redundantly stored. Behind every node there is usually a unique participant in the network. Transactions between participants are always carried out directly, as in a peer-to-peer network; an intermediary is no longer required. As the name suggests, the Blockchain consists of blocks that are connected in numerical order. Each of the blocks holds a collection of transactions and meta-information. Once a block has reached its maximum number of transactions, all its content is used to generate a hash code. This hash is then used in the meta-information of the next block, and so on, which leads to a connecting chain for all blocks within a network (refer to Fig. 18). Every node in the network has a copy of the complete Blockchain, and the hash of every block can be checked for validity. When a transaction in a block is altered, the newly generated hash code will not match the existing one; the same applies to all following blocks. There are different options for how a blockchain can be set up: it can be public, so that everyone can participate by creating their own node, or private, where only a few parties and nodes are present. With a blockchain, different kinds of security aspects can be achieved [1–5, 13, 21]:

• Authentication: Signatures and encryption can be used, as mentioned before;
• Access control: In a private blockchain, granular access policies can be defined for different participants;
• Confidentiality: Data is only accessible from within the network;
• Integrity: Transactions that are saved in a block cannot be deleted and are nearly tamper-proof;
• Transparency: Every participant has access to the complete Blockchain.
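To make the hash-chaining idea described above concrete, the following minimal Python sketch links blocks by storing each block's hash in the meta-information of the next one and verifies the chain; it is a toy illustration of the linking principle only, not the consensus, networking, or smart-contract logic of a real blockchain platform.

```python
import hashlib
import json

# Toy sketch of hash chaining: each block stores the hash of its predecessor,
# so altering any earlier transaction invalidates all following hashes.
# Real platforms add consensus, P2P replication, signatures, etc.

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev_hash,
                  "transactions": transactions})

def chain_is_valid(chain: list) -> bool:
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain: list = []
append_block(chain, [{"sensor": "office-42", "temp": 22.4}])
append_block(chain, [{"sensor": "office-42", "temp": 22.9}])
print(chain_is_valid(chain))               # True
chain[0]["transactions"][0]["temp"] = 30   # tamper with an earlier block
print(chain_is_valid(chain))               # False
```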


Fig. 18 Advanced security with BC and a smart contracting workflow

Smart contracting (SC) provides more security via the combined AES/RSA/hash method that underlies BC and has the following further advantages (refer to Fig. 18):

• Decentralized processing of contracts based on BC;
• Mapping of the contracts as executable source code;
• Compulsoriness and trustworthiness through transparency;
• "Open Execution" instead of just "Open Source";
• Legal security without an intermediary (jurist).

A private BC enforces the compulsoriness of the workflow steps' execution and can be used to support IoT devices and apps.

6 Designing a Unique IoT System Using Edge/Cloud Computing and Artificial Intelligence

The basic idea behind IoT and cloud computing is to improve the efficiency of everyday tasks without compromising the quality of the data stored or transmitted. Because the relationship is mutual, the two services effectively complement each other.


The IoT becomes the data source, and the cloud becomes the final destination for storing it. Over the years, we will see many changes; some will be gradual, and others will be more rapid. In addition, companies such as Amazon AWS, Google, and Microsoft will become the undisputed leaders in cloud IoT services, making this market even more profitable. However, the ability to perform critical processes at the local network level is often necessary to solve access network problems. Thus, this work compares two existing IoT models: the traditional cloud IoT model and the Edge-Cloud-IoT model. Using a three-tiered edge-IoT architecture capable of optimally utilizing the computing capabilities of each of the three layers is an effective measure to reduce power consumption, improve end-to-end latency, and minimize operational costs in applications with critical latency requirements.

The main goal of the work is to create a smart temperature control system using an Edge device that will improve system performance and provide new features for owners. The NodeMCU board takes care of all information gathering, collecting temperature and humidity data from the environment. The collected information is transferred to the cloud environment, which can be accessed through a visual interface to view the transmitted data. It is also possible to connect a web page through which the data can be observed almost in real time; its disadvantage is a slight delay. The next step is to receive notifications through the cloud platform; in the purely cloud-based case, the message volume is much higher because communication is direct and all received data goes straight to the cloud. These data can be further processed in the cloud, but this may not be profitable for owners of smart factories, offices, or homes. Therefore, the following sections show how connecting an Edge device in the form of a Raspberry Pi 4 helps reduce the load on the cloud platform, saving user costs. This is especially relevant for scaling: when there are many such sensors in production, the Edge device can take over some of the data processing and decide what really needs to be sent directly to the cloud.

Finally, we make our IoT system intelligent by implementing elements of artificial intelligence. Specifically, we demonstrate the process of temperature forecasting by implementing a Long Short-Term Memory (LSTM) model in the IoT system. One advantage of such a system is that data collection, analysis, and temperature forecasting are carried out on the Raspberry Pi 4 platform, which operates without the use of cloud computing. Temperature forecasting using machine learning can help improve efficiency and ensure the safety of critical systems in a smart home or office, such as server rooms or air-conditioning systems. Such systems can operate 24/7, and without sufficient control, dangerous situations can arise, leading to system failure and significant losses. Temperature forecasting can help prevent system problems and allow measures to be taken in advance. For example, if the forecasting model shows that the system temperature will rise to critical levels in the next few hours, measures can be taken to lower the temperature, such as increasing air circulation or replacing system components. In addition, temperature forecasting can help plan the maintenance and replacement of system components.


If the forecasting model shows that the system temperature will rise to critical levels after some time in operation, maintenance or replacement of some system components can be planned in advance to prevent system failure.

6.1 Configuring Data Collection and Analysis for the Designed IoT System

First, an interface was implemented that demonstrates all the processes of collecting information from the sensors; it provides visual confirmation of the data collection from the NodeMCU. This was done using Visual Studio (Fig. 19). The next step is to connect and configure Microsoft Azure, which acts as the server part of the project. This part is quite important for the project, and Azure offers some ready-made solutions for IoT technologies. After that, we can proceed to connect the NodeMCU to collect data about the environment (Fig. 20). This board has firmware that interprets commands of the Lua programming language, which can be used to issue a number of commands to the board both in the terminal and in flash memory and to call them for execution. Another advantage of this board is that it can be used with the Arduino integrated development environment, an application written in Java that contains a code editor, a compiler, and a module for transferring firmware to the board. The C++ language is used, supplemented by various libraries. Libraries for the various sensors that are suitable for the Arduino board itself can be used; they do not require special libraries or files, and the work takes place in the same mode as with the boards for which this environment is intended (Figs. 20 and 21).

Fig. 19 Visual part of the system


Fig. 20 Azure settings

Fig. 21 IoT system scheme (DHT 11/22, power source, NodeMCU, web page, LED)

As mentioned above, this device has access to a Wi-Fi network, so it can work autonomously if it has a power source. Ideally, it can be a small case to which a power source is connected, either mains power or a power supply unit. Since this board is very economical in terms of energy consumption, this is a good solution. Next, we use the Node-RED environment, which allows the user to design various kinds of solutions, in particular for this option (Fig. 22). Here we do not just illustrate the functional diagram but actually connect the device and synchronize the cloud environment with the proposed blocks. From here, we can see the data coming into the visual interface, with specific values of temperature and humidity being transmitted (Fig. 23). As can be seen, we obtain two indicators, which correspond to the conditions in the room where the study was carried out. We can also refer to the integrated analytics of Microsoft Azure, where the obtained results can be seen, namely 47 messages sent over a certain period of time (Fig. 24). This is the number of messages during the operation of the device, that is, all messages were sent to the cloud.


Fig. 22 System operation scheme in the node-red environment

Fig. 23 Visual display of the incoming data

This data can then be worked with, analyzed, and used in the future. Depending on the application of the system, this can be data not only about temperature and humidity but also other necessary data that can be transmitted and then used to improve the system. The next step in this work is to investigate the operation of the system using the Edge device, which takes part in the collection and analysis of information (Fig. 25). It can perform exactly the actions that the system owner is interested in; there is almost unlimited room for action here. In this case, the Edge device collects information from the device, namely the temperature. When the temperature is above or below the set point, it sends the information to the cloud, and an e-mail alert is sent to a certain address. This demonstrates the effectiveness of the device, which sends to the cloud only the necessary information and not all the information collected from the sensors.


Fig. 24 Number of messages received in the cloud

Fig. 25 IoT system scheme with edge computing (DHT 11/22, power source, NodeMCU, Pi4B edge device, cloud, web page, LED)

This solution is very useful with large volumes of data, when dozens or even hundreds of sensors are located in one place. Often it is not necessary to transmit all the information to the cloud; it is better to analyze it on-site, at the devices collecting this information, and, if necessary, to notify a person to make certain changes or to control these parameters fully automatically where possible. There can also often be problems with the Internet connection; in that case, the Edge device can take over the function of collecting all the information and, when the connection is restored, transfer the necessary part of the data to the cloud. Otherwise, the data would simply not be transferred anywhere. The next step is therefore to develop a functional diagram of the device and implement a real system (Fig. 26).
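The edge-side filtering just described can be sketched in a few lines of Python; the chapter's actual logic is built as Node-RED flows, so the temperature thresholds and the send_to_cloud/send_email_alert stubs below are illustrative assumptions only.

```python
# Minimal sketch of the edge filtering logic described above (the chapter's
# implementation uses Node-RED; thresholds and helper stubs here are assumed).

TEMP_LOW, TEMP_HIGH = 18.0, 27.0   # assumed acceptable range in °C

def send_to_cloud(reading: dict) -> None:
    ...  # e.g., publish the reading to the cloud broker (stub)

def send_email_alert(reading: dict) -> None:
    ...  # e.g., notify the administrator by e-mail (stub)

def handle_reading(reading: dict, local_log: list) -> None:
    """Keep normal readings locally; forward and alert only on anomalies."""
    local_log.append(reading)                    # everything stays at the edge
    if not TEMP_LOW <= reading["temperature"] <= TEMP_HIGH:
        send_to_cloud(reading)                   # only anomalies reach the cloud
        send_email_alert(reading)

log: list = []
handle_reading({"temperature": 22.5}, log)       # stored locally only
handle_reading({"temperature": 31.2}, log)       # forwarded + alert
```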


Fig. 26 Real scheme of the IoT system

After everything has been connected, we can start testing the processing of data collected from the sensors, as before. The edge device, a Raspberry Pi 4, has the ability to connect to the Internet and a memory unit; it also acts as a mini-computer and can safely perform all the functions required for this study (Fig. 27). Let us also turn to Node-RED, where the Edge device is likewise connected and configured.

Fig. 27 System operation scheme using the edge device


6.2 IoT System Testing

After starting the system, we can again turn to the interface where it is possible to monitor the collection of information (Fig. 28). As we can see, for some time the temperature in the room is stable and the system works without problems: notifications are received, but they are not sent to the cloud. However, when we change the temperature by putting something hot on the sensor, the indicator deviates from the norm and we receive a notification about it. If we look at the analytics of the cloud platform, we can see changes in the number of connected devices and the number of received messages over a certain time (Fig. 29). As can be seen from Fig. 30, the number of messages has decreased significantly compared to the first study, which means that the use of the cloud platform has decreased significantly. This makes it possible to save money within the limitations of Microsoft Azure and also gives room for developing the system in accordance with the interests of production: it is possible to configure the collection and analysis of data for a specific period and extract useful information about the development of production capacities. An e-mail notification about the temperature change was also received (Fig. 31). If other parameters are monitored by the system, notifications can likewise signal, for example, that a product is ready or that a situation has arisen which the automated system cannot handle, so that a person steps in to perform the necessary procedures.

Fig. 28 Visual display of abnormal temperature with a message “Temperature is not okay”


Fig. 29 Number of connected devices

Fig. 30 The number of messages sent

Fig. 31 E-mail alerts


6.3 Testing an Intelligent IoT System for Temperature Forecasting in the Smart Office Server Room

The next step in the study was to implement the LSTM algorithm in the IoT system for temperature data prediction. Overall, the process of implementing a forecasting system based on temperature data involves collecting and preprocessing the data, splitting it into training and testing sets, normalizing the data, building and training an LSTM model, testing the model using the testing dataset, and deploying the forecasting system in a real-time setting. The essence of the study is to show the possible behavior of the system in case of an abnormal or critical excess of the temperature indicators (Fig. 33), which can cause a number of problems if the cause is not eliminated in time. The forecasting algorithm in this particular study is customized in such a way that an abnormal temperature increase can be detected and predicted in advance. Such a situation can be caused by many factors, for example a failure in the ventilation or cooling system, which requires immediate human intervention. To demonstrate the practical application of the developed system, we conducted research in the office's server room. It is known that the temperature in a data center should be in the range of 10–27 °C, with an optimal value of about 20–23 °C; if the temperature exceeds these limits, it can lead to equipment failure and inefficient operation of the entire system. To avoid this, we installed our IoT system in the server room and measured the temperature every hour for two months, between February 2023 and March 2023, to form a dataset. Considering that predicting temperature using an LSTM model is a rather complex task, providing pseudocode in this case is not practical. Instead, we describe a general algorithm for temperature forecasting using an LSTM model based on the following steps:

Step 1: Collect and prepare a temperature dataset that will be used to train the model. We created our own dataset that contains real hourly temperature data from the server room of an office from February 2023 to March 2023, consisting of 1400 h of observations (Fig. 32).

Step 2: Split the dataset into training and testing samples. For training, we use 95% of the dataset sequences and test on the last 5%.

Step 3: Conduct preliminary data processing, which may include steps such as normalization, standardization, and filling in missing data. For better alignment and prevention of discrepancies in learning, we standardized the training data to have zero mean and unit variance. To forecast future values of the time series, the response variables are defined as the training sequences with values shifted by one time step; in other words, the LSTM model is trained to predict the value of the next time step based on the current time step. The predictors, on the other hand, are defined as the training sequences without the final time step.


Fig. 32 Temperature dataset in the server room of the office from February 2023 to March 2023

This approach allows the LSTM network to learn the patterns and relationships within the time-series data and make accurate predictions for future time steps.

Step 4: Build an LSTM model with the necessary number of layers and neurons, the number of training epochs, and the learning rate. Given the specificity of the task and the resource constraints of the Raspberry Pi 4, we used the optimal number of layers and neurons in the LSTM model; in this case, one layer with 200 hidden neurons.

Step 5: Train the model on the training sample, using learning parameters such as the solver, the number of epochs, the gradient threshold, the initial learning rate, and the learning-rate decay. We train the network for 200 epochs using the Adam optimization method. Adam keeps exponentially weighted moving averages of the gradients and of the squared gradients from previous steps, allowing moment magnitudes adapted to each parameter in the network. To prevent exploding gradients, we set the gradient threshold to 1. We also set the initial learning rate to 0.004 and decrease the learning rate after 100 epochs by multiplying it by a factor of 0.1.

Step 6: Test the model on the testing sample and evaluate its accuracy. To evaluate the accuracy of the LSTM model on the testing dataset, we used the predict-and-update-state function to generate temperature predictions for the testing dataset. A plot of the training time series with the predicted temperature values is depicted in Fig. 33.


Fig. 33 Plotting a training time series with predicted temperature values

After receiving the forecasts (Fig. 34), we compared them with the actual temperature values using the RMSE (Root Mean Square Error) metric (Fig. 35). If the RMSE value is low, the LSTM model is good at predicting future temperature; if the RMSE value is high, the model may require additional tuning or other methods to achieve better performance. With the aforementioned training parameters, we achieved a fairly accurate model whose predictions deviate from the actual values by about 0.45 °C on average, i.e., RMSE = 0.45092.
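For readers who want to reproduce a comparable pipeline in Python, the following Keras sketch mirrors the described configuration (one LSTM layer with 200 hidden units, Adam with initial learning rate 0.004, gradient clipping at 1, a learning-rate drop after 100 epochs, standardized data, one-step-shifted targets). The chapter's own model appears to be built with a different toolchain, and the placeholder series stands in for the 1400-hour dataset, so this is a sketch under stated assumptions rather than the authors' implementation.

```python
import numpy as np
from tensorflow import keras  # pip install tensorflow

# Sketch of the described LSTM setup; `series` is a placeholder for the
# 1400-hour server-room temperature dataset (not the chapter's real data).
series = np.sin(np.linspace(0, 60, 1400)) * 3 + 22      # placeholder data (°C)
split = int(len(series) * 0.95)
train = series[:split]

mu, sigma = train.mean(), train.std()                    # standardize: zero mean,
train_std = (train - mu) / sigma                         # unit variance

x = train_std[:-1].reshape(1, -1, 1)                     # predictors
y = train_std[1:].reshape(1, -1, 1)                      # responses shifted by one step

model = keras.Sequential([
    keras.layers.Input(shape=(None, 1)),
    keras.layers.LSTM(200, return_sequences=True),       # 200 hidden units
    keras.layers.Dense(1),
])
drop_lr = keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr * 0.1 if epoch == 100 else lr)  # decay after 100 epochs
model.compile(loss="mse",
              optimizer=keras.optimizers.Adam(learning_rate=0.004, clipnorm=1.0))
model.fit(x, y, epochs=200, callbacks=[drop_lr], verbose=0)

# One-step-ahead forecasts over the held-out 5% and RMSE against the test data.
test = (series[split - 1:] - mu) / sigma
pred_std = model.predict(test[:-1].reshape(1, -1, 1), verbose=0)[0, :, 0]
pred = pred_std * sigma + mu
rmse = float(np.sqrt(np.mean((pred - series[split:]) ** 2)))
print(f"RMSE on the test window: {rmse:.3f} °C")
```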

Fig. 34 Comparison of predicted temperature values with test data of observed temperature values


Fig. 35 Calculation of the RMSE

Step 7: Using the trained model, make temperature predictions based on new data. We used the trained model in practice to forecast the temperature in the server room of the office for the next day. As can be seen, the IoT system predicts a temperature rise during the next day (Fig. 36).

Step 8: Analyze the forecasting results and take action if a critical temperature is detected. We immediately notify about possible critical temperature conditions so that the cooling system can be checked or other measures taken to ensure safe equipment operation. It is important to respond to critical situations promptly to prevent possible accidents and equipment damage. As a preventive measure, our system sent an e-mail notification to the system administrator that a critical value is predicted to be reached at 22:00.

Fig. 36 Forecast of temperature growth in the server room over the next day using the proposed AI-based IoT system


In the future, these processes can be automated so that, when an abnormal process is detected, an algorithm is launched to check the components that could potentially get out of control.

7 Conclusion

The given work contains a brief survey of the used protocols and platforms and is illustrated with some best practices for Smart Home and Smart Office. The distinguishing features and practical problems for both deployment areas are examined; IoT component compatibility as well as cloud independence are considered. This work can be positioned as a work in progress. Energy efficiency is considered here as one of the most important issues for IoT. Furthermore, the security aspects of IoT are examined: conventional PKI and advanced blockchain-based compulsoriness are compared. In all considered scenarios, multi-lateral security for IoT devices can be guaranteed under the use of Blockchain. The resource intensity of blockchain-based applications is by far compensated by a number of unquestionable advantages provided by blockchain-based IoT. The authors favor open-source and cloud-free software solutions because of the better manageability of data security and privacy. By using edge computing, local users' data privacy is higher than in cloud and fog computing because the users' data remains at the lowest level and is easy to manage and protect from intruders.

The interface of the developed smart IoT platform and the possibility of using it for predicting abnormal temperature values are shown in detail. The feasibility of using edge computing in comparison with a system without it is analyzed; the result was a more optimal use of cloud computing when an edge device takes over part of the computation and data analysis. The study yielded a very satisfactory result for the use of such devices. The greatest benefit will be obtained in medium and large enterprises, where there are many sensors and indicators whose data need to be collected and transmitted to the cloud. Therefore, the solution not only saves money when using cloud technologies but also brings a substantial security benefit. The algorithm for predicting the temperature values in the smart office server room was proposed and implemented using an LSTM model in our IoT system. Its operation is represented by receiving a message about an abnormal temperature increase and about exceeding the limits beyond which it can be dangerous for certain devices of the system or for the product itself. Such algorithms can be of different natures depending on the scope of the application: it can be forecasting to speed up manufacturing processes, or data collection and analysis for other needs. The functionality of the remote control can also be expanded, adding the ability to automatically check and eliminate problems in a particular case.


Acknowledgements This research was supported by the Ukrainian projects: No. 0223U002453 “Development the innovative methods and models of designing the industry-oriented information and communication systems for upgrading the digital industrial infrastructures”, and No. 0123U101692 “Strategic directions, methods, and means of digitalization and intellectualization of energy systems using modern information and communication technologies”.
